for loop - Python Document Comparison - returning ALL words NOT IN other document
I'm trying to create a "translation comparison" program that reads and compares two documents, and then returns all of the words in one document that aren't in the other document. Right now, my program only returns the first instance of a word in 'file1' not being in 'file2'. This is for a beginner class, so I'm trying to avoid using obscure internal methods, even if that means less efficient code. This is what I have so far...
def translation_comparison():
    import re
    file1 = open("desktop/file1.txt","r")
    file2 = open("desktop/file2.txt","r")
    text1 = file1.read()
    text2 = file2.read()
    text1 = re.findall(r'\w+',text1)
    text2 = re.findall(r'\w+',text2)
    for item in text2:
        if item not in text1:
            return item
You can do this:
def translation_comparison():
    import re
    file1 = open("text1.txt","r")
    file2 = open("text2.txt","r")
    text1 = file1.read()
    text2 = file2.read()
    # split each document into a list of words
    text1 = re.findall(r'\w+',text1)
    text2 = re.findall(r'\w+',text2)
    # added lines below
    # drop duplicate words from each list
    text1 = list(set(text1))
    text2 = list(set(text2))
    # remove every word from text1 that also appears in text2
    for word in text2:
        if word in text1:
            text1.remove(word)
    return text1
Take a look starting at the comment. First, we take the set of the list of words in each document. This leaves us with a list of unique words, in case there are duplicates. Next, we loop through each word in the second text, and if the word exists in the first text as well, we remove it from the list of words in the first text. At the end, we'll be left with the words in text1 that are not in text2. We return that list at the end, which contains those words.
Let me know if that makes sense, or if you have any questions.
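The loop in the question stops at the first mismatch because `return` leaves the function as soon as it runs. As a minimal sketch of an alternative that stays close to that loop, the version below collects the words in a list and returns only after the loop finishes. The function name `words_not_in_other` and the choice to pass strings instead of file paths are assumptions made for illustration, not part of the original code.

```python
import re

def words_not_in_other(text1, text2):
    """Return the words of text1 that never appear in text2, in first-seen order."""
    words1 = re.findall(r'\w+', text1)
    words2 = re.findall(r'\w+', text2)
    missing = []
    for word in words1:
        # Append instead of returning, so the loop keeps going,
        # and skip words that were already recorded.
        if word not in words2 and word not in missing:
            missing.append(word)
    return missing

print(words_not_in_other("the cat sat on the mat", "the dog sat on the rug"))
# prints ['cat', 'mat']
```

Unlike the set-based version, this keeps the words in the order they first appear in the first text, at the cost of slower lookups on long documents.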
Edit: As per the suggestion from @blckknght, a simpler way is to use set subtraction, as follows:
def translation_comparison():
    import re
    file1 = open("text1.txt","r")
    file2 = open("text2.txt","r")
    text1 = file1.read()
    text2 = file2.read()
    text1 = re.findall(r'\w+',text1)
    text2 = re.findall(r'\w+',text2)
    return list(set(text1) - set(text2))
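To see what set subtraction does on its own, here is a small sketch with hand-written sets (the words are made up for illustration). It shows that the direction of the subtraction matters, and it sorts the results because a set has no guaranteed order.

```python
words1 = {"the", "cat", "sat"}
words2 = {"the", "dog", "sat"}

# Words that are in the first set but not in the second.
only_in_first = words1 - words2
# Swapping the operands gives the opposite comparison.
only_in_second = words2 - words1

print(sorted(only_in_first))   # prints ['cat']
print(sorted(only_in_second))  # prints ['dog']
```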
Also note that this considers the same word capitalized differently (ex: The vs the) to be separate words. Although this is a simple fix with a basic list comprehension: text1 = [x.lower() for x in text1], and text2 = [x.lower() for x in text2].
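Putting the lowercasing step together with the set subtraction might look like the sketch below. The function name `compare_ignoring_case` and the use of plain strings in place of files are assumptions for the sake of a short, runnable example; the result is sorted only to make the output predictable.

```python
import re

def compare_ignoring_case(text1, text2):
    """Return the words of text1 not found in text2, ignoring capitalization."""
    # Lowercase every word so that "The" and "the" compare as equal.
    words1 = [x.lower() for x in re.findall(r'\w+', text1)]
    words2 = [x.lower() for x in re.findall(r'\w+', text2)]
    return sorted(set(words1) - set(words2))

print(compare_ignoring_case("The cat sat", "the dog sat"))
# prints ['cat']
```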