Question

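I have a huge string:

The Dormouse's story. Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well....badword...

I have a list of about 400 bad words: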

bad_words = ["badword", "badword1", ....]

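What is the most efficient way to check whether the text contains a bad word from the list?

I could loop over both the text and the list, like:

for word in huge_string.split():
    for bw in bad_words:
        if bw in word:
            print("bad word is inside text")

But that seems to me like something from the 90s.

Update: the bad words are single words.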

Answer 1

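Converting your text into a set of words and computing its intersection with the set of bad words gives you amortized constant-time lookups per word: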

text  = "The Dormouse's story. Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well....badword..."

badwords = set(["badword", "badword1", ....])

textwords = set(text.split())
for badword in badwords.intersection(textwords):
    print("The bad word '{}' was found in the text".format(badword))

Answer 2

There is no need to extract all the words of the text; you can check directly whether one string is contained in another, for example:

In [1]: 'bad word' in 'do not say bad words!'
Out[1]: True

So you can do this:

for bad_word in bad_words_list:
    if bad_word in huge_string:
        print("BAD!!")

Answer 3

Something like this:

st = set(s.split())

bad_words = ["badword", "badword1"]
any(bad in st for bad in bad_words)

Or, if you want to see which words matched:

st = set(s.split())

bad_words = {"badword", "badword1"}
print(st.intersection(bad_words))

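If your sentence ends in badword. or badword!, the set approach will fail; you would then have to check every word in the string and test whether any bad word is equal to it or is a substring of it.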

st = s.split()
any(bad in word for word in st for bad in bad_words)
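A middle ground between the two approaches is to strip punctuation from each word before building the set. The sketch below is only an illustration: find_bad_words is a made-up helper name, and it assumes that ASCII punctuation at the start or end of a word is the only thing getting in the way.

```python
import string

def find_bad_words(text, bad_words):
    """Return the bad words that appear in text as whole words."""
    # Strip leading/trailing punctuation so "badword." and "badword!"
    # are both reduced to "badword" before the set lookup.
    words = {w.strip(string.punctuation) for w in text.split()}
    return words & set(bad_words)

s = "this sentence ends with a badword!"
print(find_bad_words(s, ["badword", "badword1"]))  # {'badword'}
```

This keeps the whole-word semantics of the set version, but it does not split tokens that are glued together by punctuation, such as well....badword.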

Answer 4

You can use any:

To test whether any of bad_words occurs as a prefix/suffix (or anywhere inside a word):

>>> bad_words = ["badword", "badword1"]
>>> text ="some text with badwords or not"
>>> any(i in text for i in bad_words)
True
>>> text ="some text with words or not"
>>> any(i in text for i in bad_words)
False

This checks whether any of the bad_words' items is in text, using a "substring" test.

To test for an exact match:

>>> text ="some text with badwords or not"
>>> any(i in text.split() for i in bad_words)
False
>>> text ="some text with badword or not"
>>> any(i in text.split() for i in bad_words)
True

This checks whether any of the bad_words' items is in text.split(), i.e. whether it is an exact item.
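Note that the exact-match form above calls text.split() once per bad word and searches a list each time. Here is a small sketch of the same two checks with the split hoisted out and stored in a set; the function names are invented for illustration.

```python
def contains_substring(text, bad_words):
    # True if any bad word appears anywhere in text, even inside a longer word.
    return any(bw in text for bw in bad_words)

def contains_exact_word(text, bad_words):
    # Split once and use a set so each membership test is O(1) on average.
    words = set(text.split())
    return any(bw in words for bw in bad_words)

bad_words = ["badword", "badword1"]
print(contains_substring("some text with badwords or not", bad_words))   # True
print(contains_exact_word("some text with badwords or not", bad_words))  # False
print(contains_exact_word("some text with badword or not", bad_words))   # True
```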

Answer 5

s is the long string. Use the & operator or the set.intersection method.

In [123]: set(s.split()) & set(bad_words)
Out[123]: {'badword'}

In [124]: bool(set(s.split()) & set(bad_words))
Out[124]: True

Even better, use set.isdisjoint. It short-circuits as soon as a match is found.

In [127]: bad_words = set(bad_words)

In [128]: not bad_words.isdisjoint(s.split())
Out[128]: True

In [129]: not bad_words.isdisjoint('for bar spam'.split())
Out[129]: False
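Since isdisjoint accepts any iterable, the words can also be produced lazily instead of building the full list with s.split() first. A rough sketch, assuming a "word" is a run of \w characters and that matching should ignore case (neither is stated in the question); has_bad_word is a hypothetical name:

```python
import re

def has_bad_word(text, bad_words):
    """Return True if any whole word of text is in the set bad_words."""
    # re.finditer yields matches one at a time, so the text is only
    # tokenized as far as isdisjoint actually consumes the generator.
    tokens = (m.group().lower() for m in re.finditer(r"\w+", text))
    return not bad_words.isdisjoint(tokens)

bad_words = {"badword", "badword1"}
print(has_bad_word("at the bottom of a well....badword...", bad_words))
print(has_bad_word("for bar spam", bad_words))
```

How early the scan stops is an implementation detail of the interpreter, so treat the laziness as an optimization rather than a guarantee.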

Answer 6

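In addition to all the excellent answers, the "for now, whole words" clause in your comment points in the direction of regular expressions.

You may want to build a compound expression like bad|otherbad|yetanother: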

import re

r = re.compile("|".join(map(re.escape, badwords)))  # escape regex metacharacters
r.search(text)
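If only whole words should match, the alternation can be wrapped in word boundaries. This is a sketch rather than a definitive implementation: compile_bad_word_pattern is a hypothetical helper, the case-insensitive flag is an assumption, and the list is assumed to be non-empty.

```python
import re

def compile_bad_word_pattern(bad_words):
    # re.escape guards against bad words containing regex metacharacters;
    # \b...\b restricts matches to whole words.
    # Assumes bad_words is non-empty (an empty alternation matches "").
    alternation = "|".join(re.escape(bw) for bw in bad_words)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

pattern = compile_bad_word_pattern(["badword", "badword1"])
print(pattern.findall("a badword, then badwords, then BADWORD1!"))
# ['badword', 'BADWORD1']
```

Compiling the pattern once and reusing it means the text is scanned once per search rather than once per bad word.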

Answer 7

I would use the filter function:

filter(lambda s : s in bad_words_list, huge_string.split())
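In Python 3, filter returns a lazy iterator, so the result has to be consumed (for example with list) before it can be inspected. A minimal sketch with made-up sample data, which also converts the list to a set so each lookup is O(1) on average:

```python
bad_words_list = ["badword", "badword1"]          # sample data for illustration
huge_string = "some text with badword or not"     # sample data for illustration

bad_words = set(bad_words_list)  # set membership is O(1) on average
found = list(filter(lambda w: w in bad_words, huge_string.split()))
print(found)  # ['badword']
```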

Answer 8

s = " a string with bad word"
text = s.split()

if any(bad_word in text for bad_word in ('bad', 'bad2')):
    print("bad word found")

python - Efficient way to check whether part of a string is in a list