Question

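I have a small script that extracts unique words based on several conditions, and checking those conditions takes far too long.

This is probably because it checks against a big dictionary and also applies a stemmer to every token.

The conditions are:

The token is not in a selected dictionary
The token is longer than 1 character
The token is not in a fixed set of punctuation
The token is not a pure number
The token does not end with "'s"

Is there a faster way to implement multiple-condition checking? Any Python-based solution is acceptable, even one that uses subprocess or cython, or calls a C/C++ implementation.

Bear in mind that in reality there are many more conditions, and the dictionary has up to 100,000 entries. I've done something like the code below, and chaining multiple conditions is slow even when using yield.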

import string
from nltk.stem import PorterStemmer

porter = PorterStemmer()

dictionary = ['apple', 'pear', 'orange', 'water', 'eat', 'the', 'with', 'an', 'pie', 'full', 'of', 'water', 'at', 'lake', 'on', 'wednesday', 'plus', 'and', 'many', 'more', 'word']

text = "PEAR eats the Orange, with an Apple's MX2000 full of water - h20 - at Lake 0129 on wednesday."

def extract(txt, dic):
    for i in txt.split():
        _i = i.strip().strip(string.punctuation).lower()
        if _i not in dic and len(_i) > 1 and not _i.isdigit() \
        and porter.stem(_i) not in dic and not i.endswith("'s"):
            yield _i

for i in extract(text, dictionary):
    print(i)

[OUT]

mx2000
h20

Answer 1

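Two things off the top of my head:

Change the dictionary to a set (as @Alfe suggested). Given the size of your data, this will definitely help with speed.
Since the comparison stops as soon as one rule is false, you can reorder the tests so that the fastest and/or most discriminating rules run first. The best order in this case isn't immediately obvious to me. Experiment with it.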
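
For illustration, here is a minimal sketch that combines both suggestions. It is not a benchmarked solution. The names extract_fast, stem, and toy_stem are made up for this example: toy_stem only strips a trailing "s" so that the snippet runs without NLTK, and in the question's setup you would presumably pass porter.stem instead. The order of the checks is one plausible guess, with the cheap string tests first and the stemmer last.

```python
import string


def extract_fast(txt, dic, stem=None):
    """Yield tokens that pass every filter, running the cheap checks first."""
    # A set gives average O(1) membership tests; a list is scanned linearly.
    words = dic if isinstance(dic, (set, frozenset)) else set(dic)
    for raw in txt.split():
        if raw.endswith("'s"):              # cheap check on the raw token
            continue
        tok = raw.strip(string.punctuation).lower()
        if len(tok) <= 1 or tok.isdigit():  # cheap checks, no lookup needed
            continue
        if tok in words:                    # set lookup
            continue
        # Stemming is assumed to be the most expensive step, so it runs last.
        if stem is not None and stem(tok) in words:
            continue
        yield tok


def toy_stem(tok):
    """Hypothetical stand-in for a real stemmer: drops one trailing 's'."""
    return tok[:-1] if tok.endswith("s") else tok


dictionary = {'apple', 'pear', 'orange', 'water', 'eat', 'the', 'with', 'an',
              'pie', 'full', 'of', 'at', 'lake', 'on', 'wednesday', 'plus',
              'and', 'many', 'more', 'word'}

text = ("PEAR eats the Orange, with an Apple's MX2000 full of water "
        "- h20 - at Lake 0129 on wednesday.")

result = list(extract_fast(text, dictionary, stem=toy_stem))
print(result)
```

Build the set once and reuse it across calls rather than converting the list every time. To follow the "experiment with it" advice, the standard timeit module can time different orderings of the checks on your real data.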
