How can I keep only the nouns in a wordlist? python NLTK

Time: 2016-10-21 02:52:57

Tags: python nltk text-processing wordnet pos-tagger

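I have a wordlist that contains many subjects. The subjects were extracted automatically from sentences. I want to keep only the noun in each subject. As you can see, some subjects include an adjective or other modifier, which I want to remove.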

from nltk.corpus import wordnet as wn

wordlist = ['country','all','middle','various drinks','few people','its reputation','German Embassy','many elections']
returnlist = []
for word in wordlist:
    # Keep the entry if WordNet lists at least one noun sense for it
    for syn in wn.synsets(word):
        if syn.pos() == 'n':
            returnlist.append(word)
            break
print(returnlist)

The code above gives:

['country','it',  'middle']

However, the result I want should look like this:

   wordlist=['country','it', 'middle','drinks','people','reputation','German Embassy','elections']

How can I do this?

2 answers:

Answer 0: (score: 2)

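First, your list comes from text that was not tokenized well, so I tokenized it again. I then looked up the part-of-speech tag of every word and kept the nouns, that is, the words whose tag is one of the NN tags: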

>>> import nltk
>>> text = ' '.join(wordlist).lower()
>>> tokens = nltk.word_tokenize(text)
>>> tags = nltk.pos_tag(tokens)
>>> nouns = [word for word, pos in tags if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
>>> nouns
['country', 'drinks', 'people', 'Embassy', 'elections']
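
The tagging approach joins every phrase into one token stream, so a multi-word subject such as 'German Embassy' is no longer kept together. A minimal sketch of a per-phrase variant is below. `keep_nouns`, `stub_tagger`, and `HAND_TAGS` are hypothetical names introduced for illustration, and the hand-written tags only stand in for a real tagger so that the example runs without NLTK data; a real tagger may assign different tags.

```python
# Hypothetical helper: keep only the noun tokens inside each phrase.
# `tagger` is any callable that maps a list of tokens to (token, tag) pairs;
# tags are assumed to follow the Penn Treebank scheme (NN, NNS, NNP, NNPS).
def keep_nouns(phrases, tagger):
    result = []
    for phrase in phrases:
        tagged = tagger(phrase.split())
        nouns = [tok for tok, tag in tagged if tag.startswith('NN')]
        if nouns:  # skip phrases that contain no noun at all
            result.append(' '.join(nouns))
    return result

# Hand-assigned tags: an assumption for illustration, not real tagger output.
HAND_TAGS = {
    'country': 'NN', 'all': 'DT', 'middle': 'NN', 'various': 'JJ',
    'drinks': 'NNS', 'few': 'JJ', 'people': 'NNS', 'its': 'PRP$',
    'reputation': 'NN', 'German': 'NNP', 'Embassy': 'NNP',
    'many': 'JJ', 'elections': 'NNS',
}

def stub_tagger(tokens):
    # Unknown tokens default to 'NN' in this stand-in.
    return [(tok, HAND_TAGS.get(tok, 'NN')) for tok in tokens]

wordlist = ['country', 'all', 'middle', 'various drinks', 'few people',
            'its reputation', 'German Embassy', 'many elections']
nouns = keep_nouns(wordlist, stub_tagger)
print(nouns)
```

With NLTK installed and its tagger data downloaded, passing `nltk.pos_tag` as `tagger` should work the same way, although the tags it gives to short phrases without sentence context may not match the hand-written ones.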

Answer 1: (score: 0)

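adjectives = ['many', 'any', 'few', 'some', 'various']  # ... extend as needed
wordlist = ['country','all','middle','various drinks','few people','its reputation','German Embassy','many elections']
returnlist = []
for word in wordlist:
    # Drop whole words found in the modifier list (case-insensitive) instead of
    # replacing substrings, so 'any' cannot match inside a longer word
    kept = [w for w in word.split() if w.lower() not in adjectives]
    returnlist.append(' '.join(kept))
print(returnlist)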
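
One thing to watch when removing modifiers from a fixed list: `str.replace` matches substrings, so 'any' would also be removed from inside a word such as 'company'. The sketch below contrasts that with a whole-word match built on `re` word boundaries. `strip_substrings`, `strip_whole_words`, and the sample phrase 'company news' are hypothetical and only serve to illustrate the difference.

```python
import re

adjectives = ['many', 'any', 'few', 'some', 'various']

def strip_substrings(phrase):
    # Substring replacement: also removes 'any' from inside longer words.
    for adj in adjectives:
        phrase = phrase.replace(adj, '').strip()
    return phrase

# One alternation pattern that matches a listed modifier only as a whole word,
# together with any whitespace that follows it.
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, adjectives)) + r')\b\s*',
    re.IGNORECASE,
)

def strip_whole_words(phrase):
    return pattern.sub('', phrase).strip()

print(strip_substrings('company news'))     # 'comp news'
print(strip_whole_words('company news'))    # 'company news'
print(strip_whole_words('many elections'))  # 'elections'
```

Either way, a fixed list only covers the modifiers it names; entries such as 'its' or 'all' pass through unless they are added.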