I am trying to match items in one list (single words) against items in a second list (full sentences). Here is my code:
tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
for word in tokens:
    for line in sentences:
        if word in line:
            print(word, line)
The problem is that my code matches substrings, so when looking for 'Python' it also returns 'Pythons'; likewise it returns 'Funny' when I only want sentences containing the word 'Fun'.
I tried adding spaces around the words in the list, but that is not an ideal solution, because the sentences may contain punctuation, in which case the code returns no match.
Desired output:
- Time, Time is High
- Fun, That's Fun!
- Python, Python is Nice
Answer 0 (score: 0)
Retrieving "Fun!" for 'Fun' while not retrieving "Pythons" for 'Python' is not that easy (it takes a few more lines of code). It can certainly be done, but your rules don't seem entirely clear to me. Have a look at this:
tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
print([(word, phrase) for phrase in sentences for word in tokens if word in phrase.split()])
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]
Below you get exactly the same result, except with plain for loops instead of a list comprehension; it may help you follow what the code above is doing.
a = []
for phrase in sentences:
    words_in_phrase = phrase.split()
    for words in tokens:
        if words in words_in_phrase:
            a.append((words, phrase))
print(a)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]
What happens here is that the code returns each string it finds together with the phrase it found it in. It does this by taking the phrases in the sentences list and splitting them on spaces. So "Pythons" and "Python" are not equal, which is what you want, but neither are "Fun!" and "Fun". The comparison is also case-sensitive.
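If the match should also tolerate trailing punctuation and letter case (so that "Fun!" matches 'Fun'), one possible extension of the split-based approach is to strip punctuation from each split word and compare case-insensitively. This is my own sketch, not part of the answer above:

```python
import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

# strip punctuation from both ends of each split word, then compare case-insensitively
matches = [
    (word, phrase)
    for phrase in sentences
    for word in tokens
    if word.casefold() in (w.strip(string.punctuation).casefold() for w in phrase.split())
]
print(matches)
```

With this variant, "That's Fun!" now matches 'Fun', while 'Pythons' still does not match 'Python' because stripping punctuation does not remove the trailing 's'.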
Answer 1 (score: 0)
You might want to use a dynamically built regular expression, i.e. for 'Python' the regex would look like r'\bPython\b'. '\b' is a word boundary.
import re

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
for word in tokens:
    # raw string is required: without the r prefix, '\b' is a backspace character
    regexp = re.compile(r'\b' + word + r'\b')
    for line in sentences:
        # search() scans the whole line; match() would only test its beginning
        if regexp.search(line):
            print(word, line)
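A quick illustration (mine, not part of the answer) of why the raw string matters here: in a plain string literal, '\b' is the backspace character, not the regex word-boundary assertion, so the pattern would silently fail to match.

```python
import re

# '\b' in a plain string is one character (backspace, ASCII 8);
# r'\b' is two characters (backslash + 'b'), which re interprets as a word boundary
print(len('\b'))    # 1
print(len(r'\b'))   # 2

# the raw-string pattern matches whole words only
pattern = re.compile(r'\bFun\b')
print(bool(pattern.search("That's Fun!")))   # True  ('!' is a boundary)
print(bool(pattern.search("Who's Funny")))   # False ('Funny' continues with word characters)
```

If the tokens could ever contain regex metacharacters, wrapping them in re.escape() before concatenation would keep the pattern safe.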
Answer 2 (score: 0)
Since you need exact matches, it is better to use == instead of in.
import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
for word in tokens:
    for line in sentences:
        for wrd in line.split():
            # strip() removes punctuation from both ends of wrd
            if wrd.strip(string.punctuation) == word:
                print(word, line)
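To see what str.strip(string.punctuation) does on its own (my illustration): it removes punctuation characters from both ends of a word but leaves interior punctuation intact.

```python
import string

print("Fun!".strip(string.punctuation))    # Fun
print("'Time'".strip(string.punctuation))  # Time
print("Who's".strip(string.punctuation))   # Who's (inner apostrophe is kept)
```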
Answer 3 (score: 0)
Tokenizing the sentence is better than splitting on whitespace, because tokenization separates out the punctuation.
例如:
sentence = 'this is a test.'
>>> 'test' in 'this is a test.'.split(' ')
False
>>> nltk.word_tokenize('this is a test.')
['this', 'is', 'a', 'test', '.']
Code:
import nltk

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
for sentence in sentences:
    for token in tokens:
        if token in nltk.word_tokenize(sentence):
            print(token, sentence)