Matching the shortest substring in a list with a for loop

Time: 2016-06-30 12:42:45

Tags: python string

I'm trying to match the items in one list (single words) against the items in a second list (full sentences). Here is my code:

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        if word in line:
            print(word, line)

The problem is that my code matches substrings, so when looking for sentences in which 'Python' occurs I also get 'Pythons'; likewise, I get 'Funny' when I only want the sentences containing the word 'Fun'.

I tried adding spaces around the words in the list, but that isn't an ideal solution, because the sentences may contain punctuation, in which case the code returns no match.

Desired output:
  - Time, Time is High
  - Fun, That's Fun!
  - Python, Python is Nice

4 answers:

Answer 0: (score: 0)

It's not that easy (it takes a few more lines of code) to retrieve "Fun!" for Fun while at the same time not retrieving "Pythons" for Python. It can certainly be done, but your rule doesn't seem entirely clear to me. Have a look at this:

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

print([(word, phrase) for phrase in sentences for word in tokens if word in phrase.split()])
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]

Below you get exactly the same thing, except that instead of a list comprehension you use a good old for loop. It may make the code above easier to understand.

a = []
for phrase in sentences:
    words_in_phrase = phrase.split()
    for words in tokens:
        if words in words_in_phrase:
            a.append((words, phrase))
print(a)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]

What happens here is that the code returns each token it finds along with the phrase it found it in. It does this by taking the phrases in the sentences list and splitting them on whitespace. So "Pythons" and "Python" are not the same, as you wanted, but neither are "Fun!" and "Fun". The comparison is also case-sensitive.
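Not part of the original answer, but if case shouldn't matter and trailing punctuation should be ignored, a sketch that strips punctuation and lowercases each word before comparing:

```python
import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

# Strip punctuation from both ends of each word and compare lowercased
# forms, so "Fun!" matches "Fun" while "Pythons" still does not match "Python".
matches = [
    (word, phrase)
    for phrase in sentences
    for word in tokens
    if word.lower() in (w.strip(string.punctuation).lower() for w in phrase.split())
]
print(matches)
```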

Answer 1: (score: 0)

You may want to use a dynamically generated regular expression, i.e. for 'Python' the regex would look like r'\bPython\b'. \b is a word boundary; re.search finds the pattern anywhere in the line:

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

import re
for word in tokens:
    regexp = re.compile(r'\b' + word + r'\b')
    for line in sentences:
        if regexp.search(line):
            print(word, line)
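One pitfall worth checking (my addition): the \b must be written in a raw string, because in a plain string '\b' is the backspace character (\x08), so the pattern silently never matches ordinary text:

```python
import re

# '\b' in a plain string is backspace, not a word boundary.
print(repr('\b'))                                  # '\x08'
print(re.search('\bFun\b', "That's Fun!"))         # None: looks for backspaces
print(re.search(r'\bFun\b', "That's Fun!").group())  # 'Fun': raw string works
```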

Answer 2: (score: 0)

Since you want exact matches, it is better to use == instead of in:

import string

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        for wrd in line.split():
            if wrd.strip(string.punctuation) == word:  # strip removes any punctuation from both ends of wrd
                print(word, line)
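The same check can be condensed with any() (my sketch, not from the original answer):

```python
import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

# A token pairs with a sentence only when it equals some word of the
# sentence after punctuation is stripped from that word's ends.
pairs = [
    (word, line)
    for word in tokens
    for line in sentences
    if any(w.strip(string.punctuation) == word for w in line.split())
]
print(pairs)
```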

Answer 3: (score: 0)

It's better to tokenize the sentence than to split it on whitespace, because tokenizing will separate the punctuation.

For example:

sentence = 'this is a test.'
>>> 'test' in 'this is a test.'.split(' ')
False
>>> nltk.word_tokenize('this is a test.')
['this', 'is', 'a', 'test', '.']

Code:

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
import nltk
for sentence in sentences:
    for token in tokens:
        if token in nltk.word_tokenize(sentence):
            print(token, sentence)
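Note that nltk.word_tokenize needs the NLTK tokenizer data to have been downloaded first (nltk.download('punkt')). If NLTK isn't available, a rough stand-in (my sketch, cruder than a real tokenizer) is to pull out runs of word characters with a regex:

```python
import re

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for sentence in sentences:
    # \w+ extracts runs of word characters, dropping punctuation,
    # so "Fun!" yields "Fun" -- no external dependency needed.
    words = re.findall(r'\w+', sentence)
    for token in tokens:
        if token in words:
            print(token, sentence)
```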