How to compare the most common words in one text file against another text file

Time: 2015-08-09 11:34:44

Tags: python-2.7

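I have two text files. From textfile1, I selected the 50 most common words. Now I want to search for these 50 words in the second file and see how often each one occurs there.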

import re
from collections import Counter

# Read the first file and keep only words longer than 3 characters.
with open('textfile1.text', 'r') as read_file:
    words = re.findall(r'\w+', read_file.read())
word_long = [w for w in words if len(w) > 3]
# most_common(50) returns a list of (word, count) tuples.
list1 = Counter(word_long).most_common(50)

# Do the same for the second file, but keep the full Counter.
with open('textfile2.txt', 'r') as read_file1:
    words2 = re.findall(r'\w+', read_file1.read())
word_long1 = [w for w in words2 if len(w) > 3]
c = Counter(word_long1)

# For each of the top 50 words from file 1, print its count in file 2.
for w, count in list1:
    print("{0} {1}".format(w, c.get(w, 0)))

1 answer:

Answer 0 (score: 1)

Using dictionaries may help here. Counter.most_common() returns a list of (word, count) tuples, which you can convert to a dict:

file1_common_words = dict(Counter(all_words_in_file1).most_common(50))
file2_common_words = dict(Counter(all_words_in_file2).most_common(50))
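
As a small illustration of that conversion, here is a sketch using a made-up word list (sample_words is hypothetical and not taken from the question):

```python
from collections import Counter

# Hypothetical sample data, used only for illustration.
sample_words = ["spam", "eggs", "spam", "ham", "spam", "eggs"]

# most_common(n) returns a list of (word, count) tuples, highest count first.
top_two = Counter(sample_words).most_common(2)

# dict() turns that list of pairs into a word -> count mapping.
top_two_dict = dict(top_two)

print(top_two)
print(top_two_dict)
```

The dict form makes it cheap to look up a single word's count by name.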

Then, for each word in file1_common_words, you can look it up in file2_common_words to get its count in file 2:

for (word, count) in file1_common_words.items():
    try: 
        count_in_file2 = file2_common_words[word]
    except KeyError: 
        # if the word is not present in file2_common_words,
        # then its count is 0.
        count_in_file2 = 0 
    print("{0}\t{1}\t{2}".format(word, count, count_in_file2))

This will output lines in the following format:

<most_common_word_1>    <count_in_file1>    <count_in_file2>
<most_common_word_2>    <count_in_file1>    <count_in_file2>
...
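
Note that looking words up in a top-50 dict reports 0 for any word that occurs in file 2 but falls outside its 50 most common. If the actual count in file 2 is wanted, one possible variation is to keep the full Counter for the second file. The following is only a sketch under that assumption: the helper names count_long_words and compare_top_words and the sample texts are hypothetical, and the strings stand in for the contents read from the two files.

```python
import re
from collections import Counter


def count_long_words(text, min_len=4):
    """Count words of at least min_len characters (same as len(w) > 3)."""
    return Counter(w for w in re.findall(r"\w+", text) if len(w) >= min_len)


def compare_top_words(text1, text2, n=50):
    """Return (word, count_in_text1, count_in_text2) for the top n words of text1."""
    counts1 = count_long_words(text1)
    counts2 = count_long_words(text2)  # full counts, not only the top n
    # A Counter returns 0 for missing keys, so no try/except is needed.
    return [(word, count, counts2[word])
            for word, count in counts1.most_common(n)]


# Hypothetical sample texts standing in for the contents of the two files.
text_a = "apple apple apple banana banana cherry"
text_b = "banana banana banana banana apple date"

rows = compare_top_words(text_a, text_b, n=2)
for word, in_a, in_b in rows:
    print("{0}\t{1}\t{2}".format(word, in_a, in_b))
```

With real files, the two strings would come from open(...).read() as in the question.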