Question

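I have a list of about 300 words and a large amount of text that I want to scan to find out how many times each word appears.

I am using the re module in Python: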

for word in list_word:
    search = re.compile(r"""(\s|,)(%s).?(\s|,|\.|\))""" % word)
    occurrences = search.subn("", text)[1]

But I would like to know whether there is a more efficient or more elegant way to do this?

Answer 1

If you have a large amount of text, I would not use regular expressions in this case; I would simply split the text:

words = {"this": 0, "that": 0}
for w in text.split():
  if w in words:
    words[w] += 1

The words dictionary then gives you the frequency of each word.
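One caveat with the loop above: text.split() leaves punctuation attached, so "this," would not be counted as "this". Below is a minimal sketch of the same single-pass idea using collections.Counter and a simple regex tokenizer. The helper name count_words, the \w+ token pattern, and the lowercasing are assumptions of this sketch, not part of the original answer, and the wanted words are assumed to be lowercase already.

```python
import re
from collections import Counter

def count_words(text, wanted):
    """Count how often each word in `wanted` occurs in `text` (hypothetical helper)."""
    wanted = set(wanted)  # set membership tests are O(1) on average
    # Assumption: a "word" is a run of letters, digits or underscores,
    # compared case-insensitively.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tok for tok in tokens if tok in wanted)
    # Counter returns 0 for missing keys, so unseen words are reported as 0.
    return {word: counts[word] for word in wanted}
```

The text is scanned once regardless of how many words are in the list, which is the main advantage over running one regex per word.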

Answer 2

Try removing all punctuation from the text and then splitting on whitespace. Then simply do

for word in list_word:
    occurrences = strippedText.count(word)

Or, if you are using Python 3.0, I think you can do:

occurrences = {word: strippedText.count(word) for word in list_word}
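The stripping step is not shown above, so here is one possible sketch of it for Python 3, using str.translate with string.punctuation. The helper name strip_and_split and the sample text are made up for illustration. Note that deleting punctuation outright joins neighbours such as "a,b" into "ab", and that calling .count() once per word scans the whole list once per word.

```python
import string

def strip_and_split(text):
    """Hypothetical helper: drop ASCII punctuation, then split on whitespace."""
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()

text = "this, that (and this). Not thistle!"
list_word = ["this", "that"]

stripped_words = strip_and_split(text)
# list.count matches whole list items, so "thistle" is not counted as "this".
occurrences = {word: stripped_words.count(word) for word in list_word}
```

With roughly 300 words this means about 300 passes over the list, so for very large texts a single-pass dictionary count may be preferable.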

Answer 3

Googling "python frequency" gave me this page as the first result: http://www.daniweb.com/code/snippet216747.html

It seems to be what you are looking for.

Answer 4

You could also split the text into words and search the resulting list.

Answer 5

Regular expressions may not be what you want. Python has many built-in string operations that are faster, and I believe .count() will do what you need.

http://docs.python.org/library/stdtypes.html#string-methods
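One thing to keep in mind: on a string, .count() counts non-overlapping substring matches, not whole words. The small sketch below (the sample sentence is made up) contrasts it with counting in the split list.

```python
text = "the other day the weather was fine"

# str.count matches substrings, so "other" and "weather" also contribute.
substring_hits = text.count("the")

# list.count on the split text only matches whole words.
whole_word_hits = text.split().count("the")

print(substring_hits, whole_word_hits)
```

If whole-word counts are what you need, split the text first or count in a list or dictionary of words.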

Answer 6

If Python is not a must, you can use awk

$ cat file
word1
word2
word3
word4

$ cat file1
blah1 blah2 word1 word4 blah3 word2
junk1 junk2 word2 word1 junk3
blah4 blah5 word3 word6 end

$ awk 'FNR==NR{w[$1];next} {for(i=1;i<=NF;i++) a[$i]++}END{for(i in w){ if(i in a) print i,a[i] } } ' file file1
word1 2
word2 2
word3 1
word4 1

Answer 7

It sounds like the Natural Language Toolkit might have what you need.

http://www.nltk.org/

Answer 8

Maybe you could adapt my multisearch generator function.

from itertools import islice
testline = "Sentence 1.  Sentence 2?  Sentence 3!  Sentence 4.  Sentence 5."
def multis(search_sequence,text,start=0):
    """ multisearch by given search sequence values from text, starting from position start
        yielding tuples of text before sequence item and found sequence item"""
    x=''
    for ch in text[start:]:
        if ch in search_sequence:
            if x: yield (x,ch)
            else: yield ch
            x=''
        else:
            x+=ch
    else:
        if x: yield x

# split the first two sentences by the dot/question/exclamation.
two_sentences = list(islice(multis('.?!',testline),2)) ## must save the result of generation
print("result of split: ", two_sentences)

print('\n'.join(sentence.strip()+sep for sentence,sep in two_sentences))

Python: best/efficient way to find a list of words in a text?
