Question

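I am trying to find keywords in sentences that are stored as a list of lists. The outer list contains the sentences, and each inner list contains the words of one sentence. I want to iterate over every word in every sentence, look for the defined keywords, and return the values where they are found.

This is what my token_sentences looks like.

I looked for help in this post: How to iterate through a list of lists in python? However, I am getting an empty list as the result.

Here is the code I wrote.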

 import nltk
 from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize

 text = "MDCT SCAN OF THE CHEST:     HISTORY: Follow-up LUL nodule.   TECHNIQUES: Non-enhanced and contrast-enhanced MDCT scans were performed with a slice thickness of 2 mm.   COMPARISON: Chest CT dated on 01/05/2018, 05/02/207, 28/09/2016, 25/02/2016, and 21/11/2015.     FINDINGS:   Lung parenchyma: There is further increased size and solid component of part-solid nodule associated with internal bubbly lucency and pleural tagging at apicoposterior segment of the LUL (SE 3; IM 38-50), now measuring about 2.9x1.7 cm in greatest transaxial dimension (previously size 2.5x1.3 cm in 2015). Also further increased size of two ground-glass nodules at apicoposterior segment of the LUL (SE 3; IM 37), and superior segment of the LLL (SE 3; IM 58), now measuring about 1 cm (previously size 0.4 cm in 2015), and 1.1 cm (previously size 0.7 cm in 2015) in greatest transaxial dimension, respectively."  

 tokenizer_words = TweetTokenizer()
 tokens_sentences = [tokenizer_words.tokenize(t) for t in 
 nltk.sent_tokenize(text)]

 nodule_keywords = ["nodules","nodule"]
 count_nodule =[]
 def GetNodule(sentence, keyword_list):
     s1 = sentence.split(' ')
     return [i for i in  s1 if i in keyword_list]

 for sub_list in tokens_sentences:
     result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)
     count_nodule.append(result_calcified_nod)

However, I get an empty list as the result in the variable count_nodule.

Here are the values of the first two rows of "token_sentences".

token_sentences = [['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':', 'Follow-up', 'LUL', 'nodule', '.'],['TECHNIQUES', ':', 'Non-enhanced', 'and', 'contrast-enhanced', 'MDCT', 'scans', 'were', 'performed', 'with', 'a', 'slice', 'thickness', 'of', '2', 'mm', '.']]

Please help me figure out what I am doing wrong!

Answer 1

The error is here:

for sub_list in tokens_sentences:
    result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)

You are looping over each sub_list in tokens_sentences, but you pass only the first word, sub_list[0], to GetNodule.
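To see why the result comes back empty, it may help to trace a single call by hand. The sketch below uses the first tokenized sentence from the question and does not need NLTK; the names `first_word`, `pieces`, and `matches` are illustrative only.

```python
sub_list = ['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':',
            'Follow-up', 'LUL', 'nodule', '.']
nodule_keywords = ["nodules", "nodule"]

first_word = sub_list[0]          # 'MDCT': a single string, not the sentence
pieces = first_word.split(' ')    # ['MDCT']: there are no spaces to split on
matches = [w for w in pieces if w in nodule_keywords]

print(pieces, matches)
```

Only the first token is ever compared against the keywords, so 'nodule' later in the sentence is never seen.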

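This type of error is fairly common and somewhat hard to catch, because Python code that expects a list of strings will happily accept a single string and iterate over its individual characters if you call it incorrectly. If you want to be defensive, it may be a good idea to add something like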

assert not all(len(x)==1 for x in sentence)
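As a rough sketch of where such a check could live, the function below places the same assert at the top of a keyword filter. The name `get_nodule` and the sample inputs are assumptions for illustration. Note that this heuristic would also reject an empty list or a sentence made up entirely of one-character tokens, and that asserts are skipped when Python runs with `-O`.

```python
def get_nodule(sentence, keyword_list):
    # A plain string iterates as one-character items, so this check fails for it.
    assert not all(len(x) == 1 for x in sentence), "looks like a string, not a token list"
    return [w for w in sentence if w in keyword_list]

tokens = ['Follow-up', 'LUL', 'nodule', '.']
print(get_nodule(tokens, ["nodules", "nodule"]))   # a token list passes the check

try:
    get_nodule("nodule", ["nodules", "nodule"])    # a bare string trips the assert
except AssertionError as exc:
    print("caught:", exc)
```

If the heuristic feels too loose, an explicit `isinstance(sentence, str)` test is another option.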

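Of course, as @dyz points out in their answer, if you expect sentence to already be a list of words, there is no need to split anything inside the function. Just loop over the sentence.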

return [w for w in sentence if w in keyword_list]

As an aside, you probably want to extend the final result with the list result_calcified_nod rather than append it.
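The difference between the two can be seen in a small sketch. The `per_sentence` data here is made up for illustration and does not come from the question's text.

```python
# Hypothetical per-sentence matches, as a keyword filter might return them.
per_sentence = [['nodule'], [], ['nodules', 'nodule']]

appended, extended = [], []
for matches in per_sentence:
    appended.append(matches)   # keeps one inner list per sentence
    extended.extend(matches)   # flattens everything into one list of keywords

print(appended)
print(extended)
```

Which one fits depends on whether you need to know which sentence each match came from (append) or only the overall list of matches (extend).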

Answer 2

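You need to remove s1 = sentence.split(' ') from GetNodule, because sentence has already been tokenized (it is already a list).
Remove the [0] from GetNodule(sub_list[0], nodule_keywords). I am not sure why you would pass only the first word of each sentence to GetNodule!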
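Putting both changes together, a minimal corrected version might look like the following. To stay self-contained it uses the two tokenized sentences shown in the question instead of calling NLTK, so it covers only that sample data.

```python
tokens_sentences = [
    ['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':',
     'Follow-up', 'LUL', 'nodule', '.'],
    ['TECHNIQUES', ':', 'Non-enhanced', 'and', 'contrast-enhanced', 'MDCT',
     'scans', 'were', 'performed', 'with', 'a', 'slice', 'thickness', 'of',
     '2', 'mm', '.'],
]
nodule_keywords = ["nodules", "nodule"]

def GetNodule(sentence, keyword_list):
    # sentence is already a list of tokens, so no split is needed
    return [w for w in sentence if w in keyword_list]

count_nodule = []
for sub_list in tokens_sentences:
    # pass the whole token list, not sub_list[0]
    count_nodule.append(GetNodule(sub_list, nodule_keywords))

print(count_nodule)
```

With append, count_nodule holds one list of matches per sentence; the second sentence contains no keyword, so its entry is empty.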

Why does this iteration over a list of lists not work?
