Matching the shortest substring in a list with a for loop

Time: 2016-06-30 12:42:45

Tags: python string

I'm trying to match the items in one list (single words) against the items in a second list (full sentences). Here is my code:

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        if word in line:
            print(word, line)

The problem is that my code matches substrings, so when looking for sentences in which 'Python' occurs I also get 'Pythons'; likewise, I get 'Funny' when I only want the sentences containing the word 'Fun'.

I tried adding spaces around the words in the list, but that isn't an ideal solution, because the sentences may contain punctuation, in which case the code returns no match.

Desired output:
  - Time, Time is High
  - Fun, That's Fun!
  - Python, Python is Nice

4 answers:

Answer 0: (score: 0)

It's not that easy (it takes a few more lines of code) to retrieve "Fun!" for Fun while at the same time not retrieving "Pythons" for Python. It can certainly be done, but your rule doesn't seem entirely clear to me. Have a look at this:

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

print([(word, phrase) for phrase in sentences for word in tokens if word in phrase.split()])
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]

Below you get exactly the same thing, except that instead of a list comprehension you use a good old for loop. It may make the code above easier to understand.

a = []
for phrase in sentences:
    words_in_phrase = phrase.split()
    for words in tokens:
        if words in words_in_phrase:
            a.append((words, phrase))
print(a)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]

What happens here is that the code returns each token it finds along with the phrase it found it in. It does this by taking the phrases in the sentences list and splitting them on whitespace. So "Pythons" and "Python" are not the same, as you wanted, but neither are "Fun!" and "Fun". The comparison is also case-sensitive.
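Not part of the original answer, but if case shouldn't matter and trailing punctuation should be ignored, a sketch that strips punctuation and lowercases each word before comparing:

```python
import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

# Strip punctuation from both ends of each word and compare lowercased
# forms, so "Fun!" matches "Fun" while "Pythons" still does not match "Python".
matches = [
    (word, phrase)
    for phrase in sentences
    for word in tokens
    if word.lower() in (w.strip(string.punctuation).lower() for w in phrase.split())
]
print(matches)
```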

Answer 1: (score: 0)

You may want to use a dynamically generated regular expression, i.e. for 'Python' the regex would look like r'\bPython\b'. \b is a word boundary; re.search finds the pattern anywhere in the line:

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

import re
for word in tokens:
    regexp = re.compile(r'\b' + word + r'\b')
    for line in sentences:
        if regexp.search(line):
            print(word, line)
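One pitfall worth checking (my addition): the \b must be written in a raw string, because in a plain string '\b' is the backspace character (\x08), so the pattern silently never matches ordinary text:

```python
import re

# '\b' in a plain string is backspace, not a word boundary.
print(repr('\b'))                                  # '\x08'
print(re.search('\bFun\b', "That's Fun!"))         # None: looks for backspaces
print(re.search(r'\bFun\b', "That's Fun!").group())  # 'Fun': raw string works
```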

Answer 2: (score: 0)

Since you want exact matches, it is better to use == instead of in:

import string

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        for wrd in line.split():
            if wrd.strip(string.punctuation) == word:  # strip removes any punctuation from both ends of wrd
                print(word, line)
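The same check can be condensed with any() (my sketch, not from the original answer):

```python
import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

# A token pairs with a sentence only when it equals some word of the
# sentence after punctuation is stripped from that word's ends.
pairs = [
    (word, line)
    for word in tokens
    for line in sentences
    if any(w.strip(string.punctuation) == word for w in line.split())
]
print(pairs)
```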

Answer 3: (score: 0)

It's better to tokenize the sentence than to split it on whitespace, because tokenizing will separate the punctuation.

For example:

sentence = 'this is a test.'
>>> 'test' in 'this is a test.'.split(' ')
False
>>> nltk.word_tokenize('this is a test.')
['this', 'is', 'a', 'test', '.']

Code:

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
import nltk
for sentence in sentences:
    for token in tokens:
        if token in nltk.word_tokenize(sentence):
            print(token, sentence)
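Note that nltk.word_tokenize needs the NLTK tokenizer data to have been downloaded first (nltk.download('punkt')). If NLTK isn't available, a rough stand-in (my sketch, cruder than a real tokenizer) is to pull out runs of word characters with a regex:

```python
import re

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for sentence in sentences:
    # \w+ extracts runs of word characters, dropping punctuation,
    # so "Fun!" yields "Fun" -- no external dependency needed.
    words = re.findall(r'\w+', sentence)
    for token in tokens:
        if token in words:
            print(token, sentence)
```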