for loop - Python Document Comparison - returning ALL words NOT IN other document

- April 15, 2010

I'm trying to create a "translation comparison" program that reads and compares two documents and returns all the words in one document that aren't in the other document. Right now, my program only returns the first instance of a word in 'file1' not being in 'file2'. This is for a beginner class, so I'm trying to avoid using obscure internal methods, even if that means less efficient code. This is what I have so far...

    def translation_comparison():
        import re
        file1 = open("desktop/file1.txt","r")
        file2 = open("desktop/file2.txt","r")
        text1 = file1.read()
        text2 = file2.read()
        text1 = re.findall(r'\w+',text1)
        text2 = re.findall(r'\w+',text2)
        for item in text2:
            if item not in text1:
                return item

You can do this:

    def translation_comparison():
        import re
        file1 = open("text1.txt","r")
        file2 = open("text2.txt","r")
        text1 = file1.read()
        text2 = file2.read()
        text1 = re.findall(r'\w+',text1)
        text2 = re.findall(r'\w+',text2)
        # added lines below
        text1 = list(set(text1))
        text2 = list(set(text2))
        for word in text2:
            if word in text1:
                text1.remove(word)
        return text1

Take a look starting at the comment. First, we take the set of the list of words in each document. This leaves us with a list of unique words, in case there are duplicates. Next, we loop through each word in the second text, and if that word exists in the first text as well, we remove it from the list of words in the first text. At the end, we're left with only the words in text1 that are not in text2. We return that list at the end.
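To see those steps without needing any files on disk, here is a minimal sketch that runs the same loop on two in-memory strings. The function name `words_only_in_first` and the sample sentences are made up for illustration, and the result is sorted only so the output order is predictable.

```python
import re

def words_only_in_first(text1, text2):
    # Split each text into words, then drop duplicates via set().
    words1 = list(set(re.findall(r'\w+', text1)))
    words2 = list(set(re.findall(r'\w+', text2)))
    # Remove every word from the first list that also appears in the second.
    for word in words2:
        if word in words1:
            words1.remove(word)
    # Sets are unordered, so sort to get a stable result.
    return sorted(words1)

# Hypothetical sample data standing in for the two documents.
sample1 = "the quick brown fox jumps over the lazy dog"
sample2 = "the slow brown dog sleeps"
print(words_only_in_first(sample1, sample2))
# ['fox', 'jumps', 'lazy', 'over', 'quick']
```

Note that the loop iterates over the second list while removing from the first, so it never modifies the list it is looping over.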

Let me know if this makes sense, or if you have any questions.

Edit: Per the suggestion from @blckknght, a simpler way is to use set subtraction, as follows:

    def translation_comparison():
        import re
        file1 = open("text1.txt","r")
        file2 = open("text2.txt","r")
        text1 = file1.read()
        text2 = file2.read()
        text1 = re.findall(r'\w+',text1)
        text2 = re.findall(r'\w+',text2)
        return list(set(text1) - set(text2))

Also note that this considers the same word capitalized differently (e.g., The vs. the) to be separate words. That is a simple fix with a basic list comprehension, though: text1 = [x.lower() for x in text1] and text2 = [x.lower() for x in text2].
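Putting the lowercasing and the set subtraction together might look roughly like the sketch below. The helper names (`unique_words`, `compare_texts`, `compare_files`) are hypothetical, and the `with` statement is an addition that is not in the code above; it just makes sure both files get closed.

```python
import re

def unique_words(text):
    # Lowercase every word so "The" and "the" count as the same word.
    return {word.lower() for word in re.findall(r'\w+', text)}

def compare_texts(text1, text2):
    # Set subtraction: words in text1 that do not appear in text2.
    return sorted(unique_words(text1) - unique_words(text2))

def compare_files(path1, path2):
    # Assumes both paths point to readable text files.
    with open(path1, "r") as file1, open(path2, "r") as file2:
        return compare_texts(file1.read(), file2.read())

print(compare_texts("The cat sat", "the dog sat"))
# ['cat']
```

Calling `compare_files("text1.txt", "text2.txt")` would then return the words found only in the first file, ignoring case.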
