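I am trying to use nltk.pos_tag() to extract nouns from a list-of-lists text sequence. I can extract all the nouns from the nltk.pos_tag() output, but only without preserving the nesting of the list sequence. How can I do this while keeping the list structure? Any help is much appreciated.
Here, "list-of-lists text sequence" means a collection of tokenized words grouped into separate lists, for example: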
[[('icosmos','JJ'),('cosmology','NN'),('calculator','NN'),('with','IN'),('graph','JJ')],[('generation','NN'),('the','DT'),('expanding','VBG'),('universe','JJ')],[('american','JJ'),('institute','NN')]]
The output should look like this:
[['cosmology','calculator'],['generation'],['institute']]
Here is what I tried:
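import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

def function1():
    # tokenized_raw_data (the raw text) is not shown in the question
    tokens_sentences = sent_tokenize(tokenized_raw_data.lower())
    unfiltered_tokens = [[word for word in word_tokenize(sentence)] for sentence in tokens_sentences]
    # keep only alphabetic tokens, one list per sentence
    word_list = []
    for i in range(len(unfiltered_tokens)):
        word_list.append([])
    for i in range(len(unfiltered_tokens)):
        for word in unfiltered_tokens[i]:
            if word[:].isalpha():
                word_list[i].append(word[:])
    # tag each sentence separately, so tagged_tokens is a list of lists of (word, tag)
    tagged_tokens = []
    for token in word_list:
        tagged_tokens.append(nltk.pos_tag(token))
    # this is the step that does not work: it treats tagged_tokens as a flat list of pairs
    nouns_tagged = [(word, tag) for word, tag in tagged_tokens
                    if tag.startswith('NN') or tag.startswith('NNPS')]
    print(nouns_tagged)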
If I instead use the code below after building the tagged_tokens list, all the nouns end up in one flat list, which is not what I need.
only_tagged_nouns = []
for sentence in tagged_tokens:
    for word, pos in sentence:
        if pos == 'NN' or pos == 'NNPS':
            only_tagged_nouns.append(word)
Answer 0 (score: 2)
You can do it like this:
words = [[('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN'), ('with', 'IN'), ('graph', 'JJ')], [('generation', 'NN'), ('the', 'DT'), ('expanding', 'VBG'), ('universe', 'JJ')], [('american', 'JJ'), ('institute', 'NN')]]
new_list = []
for i in words:
    temp = [j[0] for j in i if j[1].startswith("NN")]
    new_list.append(temp)
print(new_list)
Output:
[['cosmology', 'calculator'], ['generation'], ['institute']]
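A side note on the tag test: startswith("NN") matches all four Penn Treebank noun tags (NN, NNS, NNP, NNPS), whereas the exact comparison in the question (pos == 'NN' or pos == 'NNPS') would skip NNS and NNP. Below is a minimal sketch of the same per-sentence filtering with the tag set spelled out. The helper name nouns_by_sentence, the NOUN_TAGS constant, and the sample data are made up for illustration and are not from the question.

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def nouns_by_sentence(tagged_sentences):
    """Return one list of nouns per input sentence, preserving the nesting."""
    return [
        [word for word, tag in sentence if tag in NOUN_TAGS]
        for sentence in tagged_sentences
    ]


# Hypothetical sample data in the same shape as the question's input.
sample = [
    [("galaxies", "NNS"), ("are", "VBP"), ("distant", "JJ")],
    [("the", "DT"), ("telescope", "NN")],
    [("very", "RB"), ("far", "RB")],
]
result = nouns_by_sentence(sample)
print(result)  # [['galaxies'], ['telescope'], []]
```

A sentence without nouns yields an empty list rather than being dropped, so the output stays aligned with the input sentences.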
Answer 1 (score: 2)
A one-line solution using a list comprehension:
inputList = [[('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN'), ('with', 'IN'), ('graph', 'JJ')], [('generation', 'NN'), ('the', 'DT'), ('expanding', 'VBG'), ('universe', 'JJ')], [('american', 'JJ'), ('institute', 'NN')]]
[[k[0] for k in j if k[1].startswith("NN")] for j in inputList]
#[['cosmology', 'calculator'], ['generation'], ['institute']]
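To connect this back to the code in the question, here is a rough sketch of where the nested filtering could go in the tokenize, filter, and tag pipeline. The function name extract_nouns and the toy tagger are hypothetical. The tagger is passed in as a parameter only so that the example runs without NLTK data; in the question's setup you would presumably pass nltk.pos_tag instead.

```python
def extract_nouns(token_lists, tagger):
    """Tag each sentence separately and keep its nouns in its own sublist.

    token_lists: list of sentences, each a list of string tokens.
    tagger: callable mapping a list of words to a list of (word, tag) pairs,
            for example nltk.pos_tag.
    """
    nouns = []
    for tokens in token_lists:
        # Same filter as in the question: keep alphabetic tokens only.
        words = [tok for tok in tokens if tok.isalpha()]
        tagged = tagger(words)
        # Filter inside the loop so each sentence produces its own list.
        nouns.append([word for word, tag in tagged if tag.startswith("NN")])
    return nouns


# Toy stand-in for a real tagger (an assumption for this demo only).
TOY_TAGS = {"cosmology": "NN", "calculator": "NN", "with": "IN",
            "american": "JJ", "institute": "NN"}


def toy_tagger(words):
    return [(w, TOY_TAGS.get(w, "JJ")) for w in words]


sentences = [["cosmology", "calculator", "with", "42"], ["american", "institute"]]
print(extract_nouns(sentences, toy_tagger))
# [['cosmology', 'calculator'], ['institute']]
```

The point is that the noun filter runs once per sentence, inside the loop, instead of once over the whole tagged_tokens list.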