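I am trying to find keywords in sentences that are stored as a list of lists. The outer list contains the sentences, and each inner list contains the words of one sentence. I want to iterate over every word in every sentence, look for the defined keywords, and return the values wherever they are found.
I looked for help in this post: How to iterate through a list of lists in python? However, I am getting an empty list.
Here is the code I wrote.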
import nltk
from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize
text = "MDCT SCAN OF THE CHEST: HISTORY: Follow-up LUL nodule. TECHNIQUES: Non-enhanced and contrast-enhanced MDCT scans were performed with a slice thickness of 2 mm. COMPARISON: Chest CT dated on 01/05/2018, 05/02/207, 28/09/2016, 25/02/2016, and 21/11/2015. FINDINGS: Lung parenchyma: There is further increased size and solid component of part-solid nodule associated with internal bubbly lucency and pleural tagging at apicoposterior segment of the LUL (SE 3; IM 38-50), now measuring about 2.9x1.7 cm in greatest transaxial dimension (previously size 2.5x1.3 cm in 2015). Also further increased size of two ground-glass nodules at apicoposterior segment of the LUL (SE 3; IM 37), and superior segment of the LLL (SE 3; IM 58), now measuring about 1 cm (previously size 0.4 cm in 2015), and 1.1 cm (previously size 0.7 cm in 2015) in greatest transaxial dimension, respectively."
tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in nltk.sent_tokenize(text)]
nodule_keywords = ["nodules", "nodule"]
count_nodule = []

def GetNodule(sentence, keyword_list):
    s1 = sentence.split(' ')
    return [i for i in s1 if i in keyword_list]

for sub_list in tokens_sentences:
    result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)
    count_nodule.append(result_calcified_nod)
However, the variable count_nodule ends up holding only empty lists.
Here are the first two rows of tokens_sentences.
tokens_sentences = [['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':', 'Follow-up', 'LUL', 'nodule', '.'],['TECHNIQUES', ':', 'Non-enhanced', 'and', 'contrast-enhanced', 'MDCT', 'scans', 'were', 'performed', 'with', 'a', 'slice', 'thickness', 'of', '2', 'mm', '.']]
Please help me figure out what I am doing wrong!
Answer 0: (score: 2)
The error is here:
for sub_list in tokens_sentences:
    result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)
You are looping over each sub_list in tokens_sentences, but you pass only the first word, sub_list[0], to GetNodule.
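This kind of mistake is fairly common and somewhat hard to catch, because Python code that expects a list of strings will happily accept a single string and iterate over its individual characters if you call it incorrectly. If you want to be defensive, it may be a good idea to add something like
assert not all(len(x)==1 for x in sentence)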
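To illustrate how that check behaves, here is a minimal sketch. The helper name get_nodule_checked and the sample tokens are assumptions made for this example and are not part of the question's code. Note that the assertion is only a heuristic: it would also fire on an empty list or on a sentence made up entirely of one-character tokens.

```python
def get_nodule_checked(sentence, keyword_list):
    # Hypothetical variant of GetNodule with the defensive check added.
    # A bare string iterates character by character, so every element has
    # length 1; a real token list normally contains at least one longer token.
    assert not all(len(x) == 1 for x in sentence), "expected a list of words"
    return [w for w in sentence if w in keyword_list]


tokens = ['Follow-up', 'LUL', 'nodule', '.']
keywords = ["nodules", "nodule"]

# Passing the whole token list works as intended.
print(get_nodule_checked(tokens, keywords))

# Passing a single word (as sub_list[0] does) now fails loudly
# instead of quietly returning an empty list.
try:
    get_nodule_checked(tokens[0], keywords)
except AssertionError as exc:
    print("caught:", exc)
```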
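Of course, as @dyz points out in their answer, if you expect sentence to already be a list of words, there is no need to split anything inside the function. Just iterate over the sentence.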
return [w for w in sentence if w in keyword_list]
As an aside, you probably want to extend the final result with the list result_calcified_nod rather than append it.
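The difference between the two can be sketched as follows. The per_sentence values are made-up matches used only for illustration: append keeps one sub-list per sentence, while extend flattens everything into a single list of hits.

```python
# Made-up per-sentence matches, standing in for successive GetNodule results.
per_sentence = [['nodule'], [], ['nodule', 'nodules']]

appended = []
extended = []
for matches in per_sentence:
    appended.append(matches)   # adds the list itself as one element
    extended.extend(matches)   # adds each match individually

print(appended)  # [['nodule'], [], ['nodule', 'nodules']]
print(extended)  # ['nodule', 'nodule', 'nodules']
```

Which one is right depends on whether you need to know which sentence each match came from (append) or only the overall list of matches (extend).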
Answer 1: (score: 2)
You need to remove s1 = sentence.split(' ') from GetNodule, because sentence is already tokenized (it is already a List).
Remove [0] from GetNodule(sub_list[0], nodule_keywords). I am not sure why you are passing the first word of each sentence to GetNodule!
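Putting both changes together, a minimal sketch might look like the following. To stay self-contained it uses the two tokenized sentences shown in the question instead of calling NLTK, and it keeps the per-sentence structure by appending; these are choices made for the example rather than requirements.

```python
# The first two tokenized sentences, copied from the question.
tokens_sentences = [
    ['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':',
     'Follow-up', 'LUL', 'nodule', '.'],
    ['TECHNIQUES', ':', 'Non-enhanced', 'and', 'contrast-enhanced', 'MDCT',
     'scans', 'were', 'performed', 'with', 'a', 'slice', 'thickness', 'of',
     '2', 'mm', '.'],
]
nodule_keywords = ["nodules", "nodule"]


def GetNodule(sentence, keyword_list):
    # sentence is already a list of tokens, so no split() is needed.
    return [w for w in sentence if w in keyword_list]


count_nodule = []
for sub_list in tokens_sentences:
    # Pass the whole token list, not just its first element.
    count_nodule.append(GetNodule(sub_list, nodule_keywords))

print(count_nodule)  # [['nodule'], []]
```

The comparison is case-sensitive, so a token such as 'Nodule' would not match unless the tokens are lowercased first.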