Question

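As the title says, I have a list of words such as stopWords = ["the", "and", "with", etc...], and I receive text like "kill the fox and dog". I want the output to be "kill fox dog", produced very efficiently and quickly. How can I do this? (I know I can iterate with a for loop, but that is not very efficient.)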

Answer 1

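The most important improvement is to make stopWords a set. That makes each lookup very fast (constant time on average, instead of a scan of the whole list).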

stopWords = set(["the", "and", "with", etc...])
" ".join(word for word in msg.split() if word not in stopWords)

If you just want to know whether the text contains any stop word:

if any(word in stopWords for word in msg.split()):
    ...
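
For reference, here is a self-contained sketch of the two snippets above. The stop-word list in the answer is elided, so a three-word example list is assumed here, and the function names remove_stop_words and contains_stop_word are made up for illustration:

```python
# Example stop-word set; the answer's full list is elided, so this is an assumption.
STOP_WORDS = frozenset(["the", "and", "with"])

def remove_stop_words(msg, stop_words=STOP_WORDS):
    """Return msg with every whitespace-separated stop word dropped."""
    return " ".join(word for word in msg.split() if word not in stop_words)

def contains_stop_word(msg, stop_words=STOP_WORDS):
    """Return True as soon as one word of msg is found in stop_words."""
    return any(word in stop_words for word in msg.split())

print(remove_stop_words("kill the fox and dog"))   # kill fox dog
print(contains_stop_word("kill the fox and dog"))  # True
print(contains_stop_word("kill fox dog"))          # False
```

Note that the comparison is case-sensitive, so "The" would not be removed unless the words are lowercased first.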

Answer 2

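In Python, the fastest approach is to make your "stop words" a set instead of a list and check membership directly with "x in stopwords". The set structure is designed to make exactly this kind of operation fast.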

See the set documentation
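
A rough way to see the difference for yourself is to time the same membership test against a list and a set. This is only a sketch: the word list is synthetic, the sizes are arbitrary, and the absolute numbers will vary by machine.

```python
import timeit

# Synthetic vocabulary (an assumption for illustration, not real stop words).
words = [f"word{i}" for i in range(10_000)]
as_list = words
as_set = set(words)
probe = "word9999"  # last element, so the list has to scan everything

list_time = timeit.timeit(lambda: probe in as_list, number=1_000)
set_time = timeit.timeit(lambda: probe in as_set, number=1_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

The list lookup compares the probe against each element in turn, while the set lookup hashes the probe once, which is why the gap grows with the size of the stop-word collection.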

Answer 3

Use a list comprehension:

stopWords = ["the", "and", "with"]
msg = "kill the fox and the dog"

' '.join([w for w in msg.split() if w not in stopWords])

This gives:

'kill fox dog'

Answer 4

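Put the original word list into a dictionary.
Iterate over the characters of the given string, using spaces as the word delimiters. Look up each word in the dictionary.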
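
The description above could be sketched roughly as follows. The function name remove_stop_words_scan is hypothetical, and only the space character is treated as a delimiter, as the answer describes:

```python
def remove_stop_words_scan(text, stop_words):
    """Scan text character by character and drop words found in the dictionary."""
    lookup = dict.fromkeys(stop_words, True)  # dictionary keyed by stop word
    kept = []      # words that survive the filter
    current = []   # characters of the word currently being read
    for ch in text + " ":  # the trailing space flushes the last word
        if ch == " ":
            if current:
                word = "".join(current)
                if word not in lookup:
                    kept.append(word)
                current = []
        else:
            current.append(ch)
    return " ".join(kept)

print(remove_stop_words_scan("kill the fox and the dog", ["the", "and", "with"]))
# kill fox dog
```

In Python the manual scan is rarely worth it: str.split() plus a set performs the same lookup with less code, and the splitting runs in C.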

Answer 5

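Put your stop words in a set() (as others have suggested), accumulate your other words into a working set, then simply take the set difference with working = working - stopWords ... to get a working set with all of the stop words filtered out. Or, to just check whether any such words are present, use a conditional. For example: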

stopWords = set('the a an and'.split())
working   = set('this is a test of the one working set dude'.split())
# If removing the stop words leaves the set unchanged, it contained none
if working == working - stopWords:
    print("The working set contains no stop words")
else:
    print("Actually, it does")
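
One caveat worth illustrating: a set has no order and no duplicates, so the set difference cannot rebuild the original sentence. The sketch below (variable names are illustrative) contrasts the set difference, the built-in set.isdisjoint check, and an order-preserving filter:

```python
stop_words = set("the a an and".split())
words = "this is a test of the one working set dude".split()

# Set difference removes the stop words but discards order and duplicates.
remaining = set(words) - stop_words

# isdisjoint answers "are there no stop words?" without building a new set.
has_stop_words = not stop_words.isdisjoint(words)

# Filtering the list against the set keeps order and repeated words.
filtered = [w for w in words if w not in stop_words]

print(has_stop_words)  # True
print(filtered)
# ['this', 'is', 'test', 'of', 'one', 'working', 'set', 'dude']
```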

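There are actually more efficient data structures, such as a trie, which could be used for a large, relatively dense set of stop words. You can find trie modules for Python, though I didn't see any written as binary (C) extensions, and I wonder where the crossover point would be between a trie implemented in pure Python and Python's built-in set() support. (It could be a good case for Cython, though.)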

In fact, I see that someone has tackled that question separately here: SO: How do I create a fixed length mutable array of python objects in cython.

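Of course, ultimately, you should write the simple set-based version, test and profile it, and then, if necessary, try the trie and Cython-trie variants as possible improvements.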
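
For anyone who wants to try the comparison, a minimal pure-Python trie built from nested dicts might look like the sketch below. It is not taken from any particular trie module; build_trie, in_trie and the END sentinel are names invented for this example.

```python
END = object()  # sentinel key marking the end of a stored word

def build_trie(words):
    """Build a trie as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def in_trie(trie, word):
    """Return True only if word was stored as a complete entry."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node

trie = build_trie(["the", "and", "with"])
msg = "kill the fox and the dog"
print(" ".join(w for w in msg.split() if not in_trie(trie, w)))
# kill fox dog
```

Each lookup here walks one dict per character in interpreted code, so it is plausible that the built-in set stays faster for typical stop-word lists; profiling both on real data is the only way to know.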

Answer 6

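As an alternative, you can combine your list into a single regular expression and replace each stop word, together with its surrounding whitespace, with a single space.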

import re
stopWords = ["the", "and", "with"]
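text = "Kill the fox and dog"  # named text to avoid shadowing the built-in input()
# Build an alternation in which each stop word must have whitespace on both sides
pattern = "\\s{:s}\\s".format("\\s|\\s".join(stopWords))
print(pattern)
print(re.sub(pattern, " ", text))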

This will output:

\sthe\s|\sand\s|\swith\s
Kill fox dog
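
Because the pattern above requires whitespace on both sides of a stop word, it misses a stop word at the very start or end of the text, and it also misses the second of two adjacent stop words (the first match has already consumed the space they share). A possible variant, offered as a sketch rather than as the answer's own method, uses word boundaries and re.escape, then collapses the leftover whitespace:

```python
import re

stop_words = ["the", "and", "with"]
text = "the fox and the dog"

# \b matches at word boundaries without consuming the neighbouring spaces.
pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, stop_words))))

without = pattern.sub(" ", text)
result = " ".join(without.split())  # collapse runs of whitespace and trim the ends
print(result)  # fox dog
```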

If I have a list of words, how can I efficiently check whether a string contains any of the words in the list?

6 answers: