Most common words shared between 2 files using Python

Time: 2015-06-05 07:48:43

Tags: python python-2.7

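I'm new to Python and am trying to write a script that finds the most common words shared between 2 files. I can find the most common words in each file separately, but I don't know how to get, say, the top 5 words that are common to both files. I need the words that appear in both files, ranked so that those with the highest frequency across the two files come first.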

import re
from collections import Counter

# Read test3.txt, strip punctuation and lowercase the text.
finalLineLower = ''
with open("test3.txt", "r") as hfFile:
    for line in hfFile:
        # The hyphen is escaped so that ")-=" is not read as a character range.
        finalLine = re.sub(r'[,.<;:)\-=!>_(?"]', '', line)
        finalLineLower += finalLine.lower()
words1 = finalLineLower.split()  # split once, after the whole file is read

# Lines of test2.txt (not used below).
with open('test2.txt', 'r') as f:
    sWords = [line.strip() for line in f]

# Same processing for test4.txt.
finalLineLower1 = ''
with open("test4.txt", "r") as tsFile:
    for line in tsFile:
        finalLine = re.sub(r'[,.<;:)\-=!>_(?"]', '', line)
        finalLineLower1 += finalLine.lower()
words = finalLineLower1.split()

mc = Counter(words).most_common()
mc2 = Counter(words1).most_common()

print(len(mc))
print(len(mc2))

Sample test3 and test4 files are shown below. TEST3:

Essays are generally scholarly pieces of writing giving the author's own argument, but the definition is vague, overlapping with those of an article, a pamphlet and a short story.

TEST4:

Essays are generally scholarly pieces of writing giving the author's own argument, but the definition is vague, overlapping with those of an article, a pamphlet and a short story.

Essays can consist of a number of elements, including: literary criticism, political manifestos, learned arguments, observations of daily life, recollections, and reflections of the author. Almost all modern essays are written in prose, but works in verse have been dubbed essays (e.g. Alexander Pope's An Essay on Criticism and An Essay on Man). While brevity usually defines an essay, voluminous works like John Locke's An Essay Concerning Human Understanding and Thomas Malthus's An Essay on the Principle of Population are counterexamples. In some countries (e.g., the United States and Canada), essays have become a major part of formal education. Secondary students are taught structured essay formats to improve their writing skills, and admission essays are often used by universities in selecting applicants and, in the humanities and social sciences, as a way of assessing the performance of students during final exams.

2 answers:

Answer 0 (score: 2)

You can simply take the intersection of the two Counter objects with the & operator:

mc = Counter(words)
mc2 = Counter(words1)
total = mc & mc2
mos = total.most_common(N)  # N is how many words you want, e.g. 5

Example:

>>> d1={'a':5,'f':2,'c':1,'h':2,'t':4}
>>> d2={'a':3,'b':2,'e':1,'h':5,'t':6}
>>> c1=Counter(d1)
>>> c2=Counter(d2)
>>> t=c1&c2
>>> t
Counter({'t': 4, 'a': 3, 'h': 2})
>>> t.most_common(2)
[('t', 4), ('a', 3)]

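Note, however, that & keeps the minimum of the two counts for each word. You can also use the union |, which keeps the maximum count, together with a simple dict comprehension to get the maximum count of each common word: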

>>> m=c1|c2
>>> m
Counter({'t': 6, 'a': 5, 'h': 5, 'b': 2, 'f': 2, 'c': 1, 'e': 1})
>>> max_counts={i:j for i,j in m.items() if i in t}
>>> max_counts
{'a': 5, 'h': 5, 't': 6}

Finally, if you want the sum of the counts of the common words, you can add the counters together:

>>> s=Counter(max_counts)+t
>>> s
Counter({'t': 10, 'a': 8, 'h': 7})
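
Applied to the setup in the question, the intersection approach might look like the sketch below. The helpers `tokenize` and `top_common_words` are hypothetical names introduced only for illustration, the regex is a simplified stand-in for the question's punctuation stripping, and short in-memory strings stand in for the contents of the files.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def top_common_words(text_a, text_b, n=5):
    """Top n words present in both texts, ranked by the smaller of the two counts."""
    shared = Counter(tokenize(text_a)) & Counter(tokenize(text_b))
    return shared.most_common(n)

# Stand-ins for the contents of the two files.
text_a = "the cat sat on the mat the end"
text_b = "the dog and the cat and the bird and the fish"

print(top_common_words(text_a, text_b, n=2))  # [('the', 3), ('cat', 1)]
```

Ranking by the smaller count means a word only scores highly if it is frequent in both texts, which is one reasonable reading of the question.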

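Answer 1 (score: 1)

The question is ambiguous.

You might be asking for the most common words across both files combined, so that, for example, a word that appears 10000 times in file1 and once in file2 counts as appearing 10001 times. In that case: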

mc = Counter(words) + Counter(words1) # or Counter(chain(words, words1))
mos = mc.most_common(5)
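
As a quick check that the two spellings mentioned in the comment agree, here is a small sketch on made-up word lists (the `chain` version needs an import from `itertools`):

```python
from collections import Counter
from itertools import chain

# Toy word lists standing in for the words read from the two files.
words = ["a", "b", "a", "c"]
words1 = ["a", "a", "c", "d"]

summed = Counter(words) + Counter(words1)   # add the per-file counts
chained = Counter(chain(words, words1))     # count the concatenated words

print(summed == chained)        # True
print(summed.most_common(2))    # [('a', 4), ('c', 2)]
```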

Or you might be asking for the most common words in either file that also appear at least once in the other file:

mc = Counter(words)
mc1 = Counter(words1)
mcmerged = Counter({word: max(mc[word], mc1[word]) for word in mc if word in mc1})
mos = mcmerged.most_common(5)

Or the most common words across both files together, but only those that appear at least once in each file:

mc = Counter(words)
mc1 = Counter(words1)
mcmerged = Counter({word: mc[word] + mc1[word] for word in mc if word in mc1})
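
To see how the three interpretations differ, here is a sketch that runs them on the same made-up counts. The variable names `shared`, `combined`, `larger` and `summed` are introduced only for this illustration.

```python
from collections import Counter

# Made-up counts standing in for Counter(words) and Counter(words1).
mc = Counter({"essay": 10, "the": 4, "prose": 1})
mc1 = Counter({"essay": 1, "the": 6, "verse": 3})

# Words that appear in both counters.
shared = [word for word in mc if word in mc1]

# 1. Every word, counts added together.
combined = mc + mc1
# 2. Shared words only, keeping the larger of the two counts.
larger = Counter({w: max(mc[w], mc1[w]) for w in shared})
# 3. Shared words only, counts added together.
summed = Counter({w: mc[w] + mc1[w] for w in shared})

print(combined.most_common(3))  # [('essay', 11), ('the', 10), ('verse', 3)]
print(larger.most_common(2))    # [('essay', 10), ('the', 6)]
print(summed.most_common(2))    # [('essay', 11), ('the', 10)]
```

Note that "verse" shows up only under the first interpretation, because it never occurs in the first counter.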

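There are probably other ways to interpret it. If you can state the rule in unambiguous English, translating it into Python should be easy; if you can't, it will be impossible.

From your comments, it sounds like you haven't actually read the code in this answer, and are trying to use your mc = Counter(words).most_common() in place of the mc = Counter(words), mc = Counter(words) + Counter(words1), etc. in this answer. When you call most_common() on a Counter, you get back a list, not a Counter. Just... don't do that; use the actual code.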
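
A minimal sketch of that pitfall, using made-up word lists: the lists returned by most_common() cannot be combined with &, so keep the Counter objects, combine them, and call most_common() only at the very end.

```python
from collections import Counter

# Toy word lists standing in for the words read from the two files.
words = ["a", "b", "a"]
words1 = ["a", "c"]

as_list = Counter(words).most_common()
print(type(as_list).__name__)  # list

# Trying to intersect two lists fails, because & is not defined for lists.
try:
    as_list & Counter(words1).most_common()
    failed = False
except TypeError:
    failed = True
print(failed)  # True

# Combine the Counter objects first, then ask for the most common words.
common = (Counter(words) & Counter(words1)).most_common(5)
print(common)  # [('a', 1)]
```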