Question

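I am trying to extract nouns from a list-of-lists text sequence using nltk.pos_tag(). I can extract all the nouns from the nltk.pos_tag() output, but only by losing the list structure of the sequence. How can I do this while keeping the nouns grouped by their original list? Any help is much appreciated.

Here, a list-of-lists text sequence means a collection of tokenized words grouped into separate lists.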

[[('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN'), ('with', 'IN'), ('graph', 'JJ')], [('generation', 'NN'), ('the', 'DT'), ('expanding', 'VBG'), ('universe', 'JJ')], [('american', 'JJ'), ('institute', 'NN')]]

The output should look like this:

[['cosmology', 'calculator'], ['generation'], ['institute']]

Here is what I tried:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

def function1():
    # tokenized_raw_data is assumed to be a string defined elsewhere
    tokens_sentences = sent_tokenize(tokenized_raw_data.lower())
    unfiltered_tokens = [word_tokenize(sentence) for sentence in tokens_sentences]
    # keep only alphabetic tokens, one list per sentence
    word_list = [[word for word in sentence if word.isalpha()]
                 for sentence in unfiltered_tokens]
    tagged_tokens = []
    for token in word_list:
        tagged_tokens.append(nltk.pos_tag(token))
    # this step does not work: tagged_tokens is a list of lists,
    # not a flat list of (word, tag) pairs
    nouns_tagged = [(word, tag) for word, tag in tagged_tokens
            if tag.startswith('NN') or tag.startswith('NNPS')]
    print(nouns_tagged)

If I use the code below in the original function after building the tagged_tokens list, the output comes out as a single flat list, which is not what I need.

only_tagged_nouns = []
for sentence in tagged_tokens:
    for word, pos in sentence:
        if (pos == 'NN' or pos == 'NNPS'):
            only_tagged_nouns.append(word)

Answer 1

You can do it like this:

words = [[('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN'), ('with', 'IN'), ('graph', 'JJ')], [('generation', 'NN'), ('the', 'DT'), ('expanding', 'VBG'), ('universe', 'JJ')], [('american', 'JJ'), ('institute', 'NN')]]

new_list = []
for i in words:
    temp = [j[0] for j in i if j[1].startswith("NN")]
    new_list.append(temp)

print(new_list)

Output

[['cosmology', 'calculator'], ['generation'], ['institute']]
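
One detail about the `startswith("NN")` test used above: the Penn Treebank tagset that nltk.pos_tag() uses by default has four noun tags (NN, NNS, NNP, NNPS), and all of them begin with "NN". The equality check in the question's second snippet (`pos == 'NN' or pos == 'NNPS'`) would skip NNS and NNP. A small sketch of the difference, using hand-written tags rather than real tagger output:

```python
# Hand-written sample (an assumption for illustration, not real pos_tag output).
sample = [[("galaxies", "NNS"), ("Hubble", "NNP"), ("orbit", "NN"), ("fast", "RB")]]

# The prefix test keeps every tag that starts with "NN": NN, NNS, NNP, NNPS.
by_prefix = [[word for word, tag in sentence if tag.startswith("NN")]
             for sentence in sample]

# The equality test keeps only the two tags it names.
by_equality = [[word for word, tag in sentence if tag == "NN" or tag == "NNPS"]
               for sentence in sample]

print(by_prefix)    # [['galaxies', 'Hubble', 'orbit']]
print(by_equality)  # [['orbit']]
```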

Answer 2

A list comprehension gives a one-line solution:

inputList = [[('icosmos', 'JJ'), ('cosmology', 'NN'), ('calculator', 'NN'), ('with', 'IN'), ('graph', 'JJ')], [('generation', 'NN'), ('the', 'DT'), ('expanding', 'VBG'), ('universe', 'JJ')], [('american', 'JJ'), ('institute', 'NN')]]

[[k[0] for k in j if k[1].startswith("NN")] for j in inputList]

#[['cosmology', 'calculator'], ['generation'], ['institute']]
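
To connect this back to the question's pipeline, the same comprehension can be applied right after each sentence is tagged. The sketch below is only illustrative: `nouns_per_sentence` and `fake_tagger` are hypothetical names, and the stand-in tagger uses hand-written tags so the example runs without NLTK data. In the question's code, `nltk.pos_tag` would presumably be passed as `tagger` instead.

```python
def nouns_per_sentence(tokenized_sentences, tagger):
    """Tag each sentence separately and keep only its nouns.

    tagger is any callable that maps a list of words to (word, tag) pairs.
    """
    result = []
    for words in tokenized_sentences:
        tagged = tagger(words)
        # One inner list per sentence, so the original grouping is preserved.
        result.append([word for word, tag in tagged if tag.startswith("NN")])
    return result


def fake_tagger(words):
    """Stand-in tagger with hand-written tags (an assumption, not NLTK)."""
    lookup = {"cosmology": "NN", "calculator": "NN", "institute": "NN"}
    return [(word, lookup.get(word, "JJ")) for word in words]


sentences = [["icosmos", "cosmology", "calculator"], ["expanding"], ["american", "institute"]]
nested = nouns_per_sentence(sentences, fake_tagger)
print(nested)  # [['cosmology', 'calculator'], [], ['institute']]
```

A sentence with no nouns yields an empty list rather than being dropped, so the output stays aligned index by index with the input sentences.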

Extracting only the nouns from a list of lists of pos_tag sequences?
