Question

I'm using nltk to process text data. When I want to remove stopwords, I usually use the following code.

text_clean = [w for w in text if w.lower() not in stopwords]

But this code always takes too long. (Maybe my data is too large...)
Is there any way to reduce the time? Thanks.

Answer 1

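Try converting stopwords to a set. With a list, your approach is O(n*m), where n is the number of words in the text and m is the number of stopwords; with a set, it is O(n + m). Let's compare the two approaches, list vs. set: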

import timeit
from nltk.corpus import stopwords


def list_clean(text):
    stop_words = stopwords.words('english')
    return [w for w in text if w.lower() not in stop_words]


def set_clean(text):
    set_stop_words = set(stopwords.words('english'))
    return [w for w in text if w.lower() not in set_stop_words]

text = ['the', 'cat', 'is', 'on', 'the', 'table', 'that', 'is', 'in', 'some', 'room'] * 100000

if __name__ == "__main__":
    print(timeit.timeit('list_clean(text)', 'from __main__ import text,list_clean', number=5))
    print(timeit.timeit('set_clean(text)', 'from __main__ import text,set_clean', number=5))

Output

7.6629380420199595
0.8327891009976156

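In the code above, list_clean removes stopwords using a list and set_clean removes them using a set. The first time corresponds to list_clean and the second to set_clean. For this example, set_clean is nearly 10 times faster.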
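
If the cleaning step runs many times, it may also help to build the set once and reuse it rather than rebuilding it on every call. The following is only a minimal sketch of that idea: make_cleaner and sample_stop_words are hypothetical names introduced for illustration, and the tiny word list stands in for stopwords.words('english') so the snippet runs without NLTK data.

```python
def make_cleaner(stop_words):
    # Build the lookup set once; lowercase it so the comparison is case-insensitive.
    stop_set = frozenset(w.lower() for w in stop_words)

    def clean(tokens):
        # Each membership test against the set takes constant time on average.
        return [w for w in tokens if w.lower() not in stop_set]

    return clean


# Hypothetical stand-in list; in practice you could pass stopwords.words('english').
sample_stop_words = ["the", "is", "on", "that", "in", "some"]
clean = make_cleaner(sample_stop_words)

result = clean(["The", "cat", "is", "on", "the", "table"])
print(result)  # ['cat', 'table']
```

The closure keeps the set alive between calls, so only the first call to make_cleaner pays the cost of building it.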

Update

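O(n*m) and O(n + m) are examples of big O notation, a theoretical way of measuring the efficiency of an algorithm. Roughly speaking, the faster the expression grows, the less efficient the algorithm; here O(n*m) grows faster than O(n + m), so list_clean is theoretically less efficient than set_clean. These figures come from the fact that searching a list of m stopwords takes O(m) time, while a lookup in a set takes constant time on average, usually written O(1). The extra m in O(n + m) is the one-time cost of building the set.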
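
To see the difference in lookup cost in isolation, here is a small sketch that times a single membership test against a list and against a set. The 200-entry word list is made up for illustration (it is not the NLTK list), and the absolute timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for a stopword collection of about 200 entries.
stop_list = ["word%d" % i for i in range(200)]
stop_set = set(stop_list)

# A word that is absent forces the list search to scan every entry.
probe = "cat"

list_time = timeit.timeit(lambda: probe in stop_list, number=100_000)
set_time = timeit.timeit(lambda: probe in stop_set, number=100_000)

print(f"list lookup: {list_time:.4f}s")
print(f"set lookup:  {set_time:.4f}s")
```

Both containers give the same answer for every lookup; only the time per lookup differs, and the gap widens as the list grows.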
