Question

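I have to check whether the elements of a given list appear in a text. I can do this when an element is a single word, but when it contains multiple words, as shown below, I cannot get the expected result.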

text="what is the price of wheat and White Pepper?"

words=['wheat','White Pepper','rice','pepper']

Expected output=['wheat','White Pepper']

I tried the following, but did not get the expected output. Can anyone help me?

>>> output=[word for word in words if word in text]

>>> print output

>>> ['rice', 'White Pepper', 'wheat']

Here 'rice' is matched because it is a substring of the word 'price'.

If I use nltk or any other tokenizer, it splits 'White Pepper' into 'White' and 'Pepper':

>>> from nltk import word_tokenize

>>> n_words=word_tokenize(text)

>>> print n_words

>>> ['what', 'is', 'the', 'price', 'of', 'wheat', 'and', 'White', 'Pepper', '?']

>>> output=[word for word in words if word in n_words]
>>> print output

>>> ['wheat']

Answer 1

You can use a regular expression with word boundaries:

import re

text="what is the price of wheat and White Pepper?"

words=['wheat','White Pepper','rice','pepper']

output=[word for word in words if re.search(r"\b{}\b".format(word),text)]

print(output)

Result:

['wheat', 'White Pepper']

You can optimize the search by building the regular expression up front (courtesy of Jon Clements):

output = re.findall(r'\b(?:{})\b'.format('|'.join(sorted(words, key=len, reverse=True))), text)

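The sort is needed so that the longest strings are tried first. Regex escaping is probably unnecessary here, because the words contain only letters and spaces.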
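If the word list could ever contain characters that have a special meaning in a regex (an assumption that goes beyond the example in the question), a minimal sketch would escape each word with re.escape before joining and compile the pattern once. The helper name build_pattern is hypothetical and only used for illustration.

```python
import re

def build_pattern(words):
    # Longest first, so 'White Pepper' is tried before any shorter overlapping entry.
    ordered = sorted(words, key=len, reverse=True)
    # re.escape neutralizes characters such as '.' that have a regex meaning.
    alternatives = "|".join(re.escape(w) for w in ordered)
    # The non-capturing group puts a word boundary on both sides of every alternative.
    return re.compile(r"\b(?:{})\b".format(alternatives))

text = "what is the price of wheat and White Pepper?"
words = ['wheat', 'White Pepper', 'rice', 'pepper']

pattern = build_pattern(words)
output = pattern.findall(text)
print(output)
```

Note that findall returns matches in the order they appear in the text, and a word that occurs several times is returned several times.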

Answer 2

Here is how I would do it.

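def findWord(candidates, text):
    # Collect every candidate whose first occurrence starts at the beginning of a word.
    words = []
    for candidate in candidates:
        index = text.find(candidate)  # -1 when the candidate is absent
        if index != -1:
            # Skip matches that begin mid-word, e.g. 'rice' inside 'price'.
            if index != 0 and text[index - 1] != " ":
                continue
            words.append(candidate)
    return words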

The string method find returns -1 if the substring is not present. For 'White Pepper' it returns 31, because that is the index where it starts.

For the test case you provided, this returns ['wheat', 'White Pepper'].
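The function above only looks at the character before the first occurrence, so it would accept 'wheat' inside 'wheaten' and would miss 'rice' in "price of rice". A stricter variant, sketched here under the assumption that a word boundary means the edge of the text or a non-alphanumeric character, checks both sides and keeps looking at later occurrences. The name find_whole_words is hypothetical.

```python
def find_whole_words(candidates, text):
    """Return the candidates that occur in text as whole words or phrases."""
    found = []
    for candidate in candidates:
        start = text.find(candidate)
        while start != -1:
            end = start + len(candidate)
            # A boundary is the edge of the text or a non-alphanumeric character.
            left_ok = start == 0 or not text[start - 1].isalnum()
            right_ok = end == len(text) or not text[end].isalnum()
            if left_ok and right_ok:
                found.append(candidate)
                break
            # Try the next occurrence instead of giving up after the first one.
            start = text.find(candidate, start + 1)
    return found

text = "what is the price of wheat and White Pepper?"
words = ['wheat', 'White Pepper', 'rice', 'pepper']
print(find_whole_words(words, text))
```

Like the original, this comparison is case-sensitive, so 'pepper' does not match 'Pepper'.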

How can I use Python to check whether the elements of a given list are in a text?
