Question

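I have 100 million sentences, and for each one I want to check whether it contains any of 6,000 smaller sentences (matching whole words only). So far, my code is: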

smaller_sentences = [...]
for large_sentence in file:
    for small_sentence in smaller_sentences:
        if ((' ' + small_sentence + ' ') in large_sentence
                or large_sentence.startswith(small_sentence + ' ')
                or large_sentence.endswith(' ' + small_sentence)):
            outfile.write(large_sentence)
            break

But this code runs very slowly. Do you know a faster way?

Answer 1

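It is hard to say much without knowing more about the domain (word and sentence lengths), how often you read, write, and query, and the details around the algorithm.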

That said, as a first step you can reorder your condition.

The current version checks the whole string (slow), then the head (fast), then the tail (fast):

((' ' + small_sentence + ' ') in large_sentence
        or large_sentence.startswith(small_sentence + ' ')
        or large_sentence.endswith(' ' + small_sentence))

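This version checks the head (fast), then the tail (fast), and only then the whole string (slow). It makes no real difference in the Big-O sense, but if you know that matches are more likely to occur at the start or end of a string, it may gain some speed: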

(large_sentence.startswith(small_sentence + ' ')
        or large_sentence.endswith(' ' + small_sentence)
        or (' ' + small_sentence + ' ') in large_sentence)
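For reference, the reordered condition could be wrapped in a small helper along these lines. This is only a sketch: the names `contains_phrase` and `first_match` are made up for illustration, and stripping the trailing newline is an assumption about how the lines are read from the file (without it, the tail check would never succeed on lines that end in `\n`). Like the original condition, it does not treat a large sentence that is exactly equal to the small one as a match.

```python
def contains_phrase(large_sentence, small_sentence):
    """Return True if small_sentence occurs in large_sentence on word boundaries."""
    # Assumption: lines read from a file end with a newline, so strip it first;
    # otherwise the endswith() check below can never succeed.
    text = large_sentence.rstrip('\n')
    return (
        text.startswith(small_sentence + ' ')      # cheap: looks only at the head
        or text.endswith(' ' + small_sentence)     # cheap: looks only at the tail
        or (' ' + small_sentence + ' ') in text    # expensive: scans the whole string
    )


def first_match(large_sentence, smaller_sentences):
    """Return the first small sentence found in large_sentence, or None."""
    for small_sentence in smaller_sentences:
        if contains_phrase(large_sentence, small_sentence):
            return small_sentence
    return None
```

Because `or` short-circuits, the full scan only runs when both cheap checks fail, which is where any gain from the reordering would come from.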
