Question

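I want to find the number of times that the words in a bag of words occur in an article. I am not interested in the frequency of each individual word, only in the total number of times any of them is found in the article. I have to analyse hundreds of articles as I retrieve them from the internet. My algorithm takes a long time, since each article is about 800 words.

Here is what I do (where amount is the number of times the words were found in a single article, article is a string containing all the words that make up the article content, and I use NLTK to tokenize):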

bag_of_words = tokenize(bag_of_words)
tokenized_article = tokenize(article)

occurrences = [word for word in tokenized_article
                    if word in bag_of_words]

amount = len(occurrences)

where tokenized_article looks like:

[u'sarajevo', u'bosnia', u'herzegovi', u'war', ...]

and so does bag_of_words.

I was wondering whether there is a more efficient/faster way of doing this, using NLTK or lambda functions, for instance.

Answer 1

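I suggest using a set for the words you are counting. A set has a constant-time membership test, so it is faster than a list (which has a linear-time membership test).

For example:

bag_of_words_set = set(bag_of_words)  # build the set once, not once per word

occurrences = [word for word in tokenized_article
                    if word in bag_of_words_set]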

amount = len(occurrences)
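
A minimal self-contained sketch of the same idea, for anyone who wants to run it. The helper name `count_occurrences` and the sample tokens are made up for illustration, and plain lists stand in for the NLTK output.

```python
def count_occurrences(tokens, bag_of_words):
    """Return how many tokens are members of bag_of_words."""
    bag = set(bag_of_words)  # built once; each lookup is O(1) on average
    # A generator avoids building the intermediate list of matches.
    return sum(1 for word in tokens if word in bag)


# Hypothetical sample data standing in for a tokenized article.
tokenized_article = ["sarajevo", "bosnia", "war", "peace", "war", "bosnia"]
bag_of_words = ["war", "bosnia"]

amount = count_occurrences(tokenized_article, bag_of_words)
print(amount)  # 4
```

Summing over a generator gives the same number as `len` of the list comprehension, without keeping the matched words around.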

Some timing tests (using an artificially created list, repeated ten times):

In [4]: words = s.split(' ') * 10

In [5]: len(words)
Out[5]: 1060

In [6]: to_match = ['NTLK', 'all', 'long', 'I']

In [9]: def f():
   ...:     return len([word for word in words if word in to_match])

In [13]: timeit(f, number = 10000)
Out[13]: 1.0613768100738525

In [14]: set_match = set(to_match)

In [15]: def g():
    ...:     return len([word for word in words if word in set_match])

In [18]: timeit(g, number = 10000)
Out[18]: 0.6921310424804688
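
The session above does not show how `s` was defined, so here is a rough standalone version of the same comparison. The sample sentence and the function names are assumptions for illustration; absolute timings will differ by machine, so none are quoted.

```python
import timeit

# Hypothetical stand-in for the text used above (the original `s` is not shown).
s = "I have been using a tokenizer for a long time and all of it works"
words = s.split(" ") * 10
to_match = ["all", "long", "I"]
set_match = set(to_match)


def count_with_list():
    return len([word for word in words if word in to_match])


def count_with_set():
    return len([word for word in words if word in set_match])


# Both variants must agree before their timings are worth comparing.
assert count_with_list() == count_with_set()

list_time = timeit.timeit(count_with_list, number=1000)
set_time = timeit.timeit(count_with_set, number=1000)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

With only a handful of words to match, the gap is modest. The list lookup is linear in the number of words to match, so the difference should grow as the bag gets larger.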

Some other tests:

In [22]: p = re.compile('|'.join(set_match))

In [23]: p
Out[23]: re.compile(r'I|all|NTLK|long')

In [24]: p = re.compile('|'.join(set_match))

In [28]: def h():
    ...:     return len(filter(p.match, words))

In [29]: timeit(h, number = 10000)
Out[29]: 2.2606470584869385
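
Two caveats about the regex variant, sketched below. `len(filter(...))` only works on Python 2, where `filter` returns a list. Also, `p.match` anchors only at the start of the token, so a pattern like `I|all|long` also matches tokens that merely begin with one of those words. The sample tokens are hypothetical, and `fullmatch` needs Python 3.4 or later.

```python
import re

set_match = {"all", "long", "I"}
words = ["I", "Iceland", "all", "allow", "long", "longer"]  # hypothetical tokens

# Escape each word so regex metacharacters in a token are matched literally.
p = re.compile("|".join(re.escape(w) for w in set_match))

prefix_hits = sum(1 for w in words if p.match(w))      # anchors at the start only
exact_hits = sum(1 for w in words if p.fullmatch(w))   # whole token must match
set_hits = sum(1 for w in words if w in set_match)

print(prefix_hits, exact_hits, set_hits)  # 6 3 3
```

Only the `fullmatch` count agrees with the set membership test, so the regex approach needs that extra care on top of being slower in the timing above.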

Answer 2

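Use a set for membership testing.

Another approach is to count the occurrences of each distinct word first, and then add up the counts of the words that are in the bag. This helps when the article contains a fair number of repeated words and is not very short. Say an article contains the same word 10 times: we now check membership for it only once instead of 10 times.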

from collections import Counter

# check: the tokenized article; words: the set of words to look for
def f():
    return sum(c for word, c in Counter(check).items() if word in words)
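
A self-contained sketch of the Counter approach, with made-up sample data and a hypothetical helper name `count_occurrences`:

```python
from collections import Counter


def count_occurrences(tokens, bag_of_words):
    bag = set(bag_of_words)
    # Counter collapses repeated tokens, so each distinct token is tested once.
    return sum(count for word, count in Counter(tokens).items() if word in bag)


tokenized_article = ["the", "war", "the", "peace", "the", "war"]  # hypothetical
print(count_occurrences(tokenized_article, ["the", "war"]))  # 5
```

Building the Counter still hashes every token once, so whether this beats the plain set test likely depends on how repetitive the articles are. It is worth timing on real data.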

Answer 3

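If you don't want the counts, it's not a "bag of words" anymore, but a set of words. So if that is really the case, convert your document to a set.

Avoid for loops and lambda functions, in particular nested ones. They require a lot of interpreter work and are slow. Instead, try to use optimized calls such as intersection (for performance, libraries such as numpy are also very good, because they do the work in low-level C / Fortran / Cython code).

That is: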

count = len(bag_of_words_set.intersection( set(tokenized_article) ))

where bag_of_words_set is the set of words you are interested in.
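
One thing to keep in mind is that the intersection gives the number of distinct bag words present, not the total number of occurrences the question asks about. A small sketch with hypothetical data shows the difference:

```python
tokenized_article = ["war", "bosnia", "war", "peace", "war"]  # hypothetical
bag_of_words_set = {"war", "bosnia", "treaty"}

# Distinct bag words that appear at least once in the article.
distinct = len(bag_of_words_set.intersection(tokenized_article))

# Total occurrences, as asked for in the question.
total = sum(1 for w in tokenized_article if w in bag_of_words_set)

print(distinct, total)  # 2 4
```

`set.intersection` accepts any iterable, so the explicit `set(...)` around the article is optional.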

If you want a classic word count, use collections.Counter:

from collections import Counter
counter = Counter()
...
counter.update(tokenized_article)

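This counts all words, including those that are not in your list. You can try the following instead, but it may turn out to be slower because of the loop: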

bag_of_words_set = set(bag_of_words)
...
for w in tokenized_article:
    if w in bag_of_words_set:  # use a set, not a list!
        counter[w] += 1

A bit more complex, but potentially faster, is to use two Counters: one for the running total, and one for the current document.

doc_counter.clear()
doc_counter.update(tokenized_article)
for w in list(doc_counter.keys()):  # copy the keys so entries can be deleted while looping
    if w not in bag_of_words_set:
        del doc_counter[w]
counter.update(doc_counter)  # untested.

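Using a counter for the document is beneficial if you have many duplicate unwanted words, because it saves you the repeated lookups. It is also better for multithreaded operation (easier synchronization).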
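
A runnable sketch of the two-counter idea over several documents. The document list and the name `wanted` are made up for illustration, and a dict comprehension is used as one way to filter without deleting entries during iteration.

```python
from collections import Counter

bag_of_words_set = {"war", "bosnia"}
# Hypothetical tokenized documents.
documents = [
    ["war", "bosnia", "war", "peace"],
    ["peace", "peace", "bosnia"],
]

counter = Counter()  # running totals across all documents
for tokens in documents:
    doc_counter = Counter(tokens)
    # Keep only the bag words; one lookup per distinct token.
    wanted = {w: c for w, c in doc_counter.items() if w in bag_of_words_set}
    counter.update(wanted)

total = sum(counter.values())
print(counter["war"], counter["bosnia"], total)  # 2 2 4
```

If only the grand total matters, `sum(counter.values())` is all that is needed at the end. The per-word counts come along for free.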

Fastest way to count a list of words in an article using Python
