FreqDist with nltk: ValueError: too many values to unpack

Asked: 2013-11-14 10:31:07

Tags: python-2.7 nltk frequency-distribution

I have been trying to find the frequency distribution of the nouns in a given sentence. If I do this:

import nltk

text = "This ball is blue, small and extraordinary. Like no other ball."
text=text.lower()
token_text= nltk.word_tokenize(text)
tagged_sent = nltk.pos_tag(token_text)
nouns= []
for word,pos in tagged_sent:
    if pos in ['NN',"NNP","NNS"]:
        nouns.append(word)
freq_nouns=nltk.FreqDist(nouns)
print freq_nouns

It treats "ball" and "ball." as separate words. So I went ahead and tokenized the text into sentences before tokenizing the words:

text = "This ball is blue, small and extraordinary. Like no other ball."
text=text.lower()
sentences = nltk.sent_tokenize(text)                        
words = [nltk.word_tokenize(sent)for sent in sentences]    
tagged_sent = [nltk.pos_tag(sent)for sent in words]
nouns= []
for word,pos in tagged_sent:
    if pos in ['NN',"NNP","NNS"]:
        nouns.append(word)
freq_nouns=nltk.FreqDist(nouns)
print freq_nouns

It gives the following error:

Traceback (most recent call last):
  File "C:\beautifulsoup4-4.3.2\Trial.py", line 19, in <module>
    for word,pos in tagged_sent:
ValueError: too many values to unpack

What am I doing wrong? Please help.

1 Answer:

Answer 0: (score: 3)

You were so close!

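In this case, because of your list comprehension tagged_sent = [nltk.pos_tag(sent)for sent in words], you have changed tagged_sent from a list of tuples into a list of lists of tuples. The loop for word,pos in tagged_sent now receives a whole sentence (a list of several tuples) on each iteration, and that cannot be unpacked into just two names.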
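
To make the failure concrete, here is a minimal sketch that reproduces it without NLTK. The tagged data is hard-coded from the tagger output shown in this answer, and the names flat, nested, flat_pairs and error_message are made up for illustration. It is written for Python 3, where the error message carries an extra "(expected 2)" style suffix that Python 2.7 does not print.

```python
# Hard-coded tagger output (copied from this answer), so the sketch
# runs without NLTK installed. All variable names are illustrative.
flat = [('this', 'DT'), ('ball', 'NN'), ('is', 'VBZ')]
nested = [
    [('this', 'DT'), ('ball', 'NN'), ('is', 'VBZ'), ('blue', 'JJ'),
     (',', ','), ('small', 'JJ'), ('and', 'CC'), ('extraordinary', 'JJ'),
     ('.', '.')],
    [('like', 'IN'), ('no', 'DT'), ('other', 'JJ'), ('ball', 'NN'),
     ('.', '.')],
]

# Flat list: every element is a 2-tuple, so two-name unpacking works.
flat_pairs = [(word, pos) for word, pos in flat]

# Nested list: every element is a whole sentence (a list of 9 or 5 tuples),
# so unpacking it into two names raises ValueError.
try:
    for word, pos in nested:
        pass
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```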

Here is what you can do to discover what kind of object you have:

>>> type(tagged_sent), len(tagged_sent)
(<type 'list'>, 2)

This shows that you have a list; in this case, a list of 2 sentences. You can inspect one of those sentences further:

>>> type(tagged_sent[0]), len(tagged_sent[0])
(<type 'list'>, 9)

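You can see that the first sentence is another list, containing 9 items. What does one of those items look like? Let's look at the first item of the first list: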

>>> tagged_sent[0][0]
('this', 'DT')

If you're curious to see the whole object, as I often am, you can ask the pprint (pretty-print) module to display it more readably:

>>> from pprint import pprint
>>> pprint(tagged_sent)
[[('this', 'DT'),
  ('ball', 'NN'),
  ('is', 'VBZ'),
  ('blue', 'JJ'),
  (',', ','),
  ('small', 'JJ'),
  ('and', 'CC'),
  ('extraordinary', 'JJ'),
  ('.', '.')],
 [('like', 'IN'), ('no', 'DT'), ('other', 'JJ'), ('ball', 'NN'), ('.', '.')]]

So, the long answer is that your code needs to iterate over the new second layer of lists, like this:

nouns= []
for sentence in tagged_sent:
    for word,pos in sentence:
        if pos in ['NN',"NNP","NNS"]:
            nouns.append(word)
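
As an alternative to the nested loop, the sentences can be flattened first. The following is only a sketch, not part of the original answer: it uses itertools.chain.from_iterable from the standard library, hard-codes the tagged output so it runs without NLTK, and introduces NOUN_TAGS as an illustrative name.

```python
from itertools import chain

# Hard-coded copy of tagged_sent from this answer (no NLTK needed to run).
tagged_sent = [
    [('this', 'DT'), ('ball', 'NN'), ('is', 'VBZ'), ('blue', 'JJ'),
     (',', ','), ('small', 'JJ'), ('and', 'CC'), ('extraordinary', 'JJ'),
     ('.', '.')],
    [('like', 'IN'), ('no', 'DT'), ('other', 'JJ'), ('ball', 'NN'),
     ('.', '.')],
]

# Illustrative constant: the same three tags the question filters on.
NOUN_TAGS = {'NN', 'NNP', 'NNS'}

# chain.from_iterable yields the (word, pos) tuples of every sentence in
# order, so a single loop level with two-name unpacking is enough.
nouns = [word for word, pos in chain.from_iterable(tagged_sent)
         if pos in NOUN_TAGS]

print(nouns)
```

Both versions produce the same list; the nested loop is arguably easier to read, while the flattened form is handy when the sentence boundaries no longer matter.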

Of course, this just returns a list of non-unique items, which looks like this:

>>> nouns
['ball', 'ball']
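
Since the question's goal was a frequency distribution, this list can be handed to the question's own call, nltk.FreqDist(nouns). The sketch below uses collections.Counter from the standard library as a stand-in so that it runs without NLTK; treating it as equivalent for simple counting is an assumption made here for illustration.

```python
from collections import Counter

# The noun list produced by the corrected loop in this answer.
nouns = ['ball', 'ball']

# Counter is used here as a stand-in for nltk.FreqDist (an assumption for
# illustration): it maps each distinct item to its number of occurrences.
freq_nouns = Counter(nouns)

# Both occurrences are now counted under the single key 'ball'.
print(freq_nouns.most_common())
```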

You can make this list unique in a number of ways, but a quick one is to use the set() data structure, like this:

>>> unique_nouns = list(set(nouns))
>>> print unique_nouns
['ball']
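
One caveat with set() is that it does not keep the original order of the items. If order matters, a small helper along these lines may do. This is a sketch only: unique_in_order and the sample word list are made up for illustration and do not come from the question.

```python
def unique_in_order(items):
    """Return the distinct items, keeping the order of first appearance."""
    seen = set()      # items already emitted
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

# Hypothetical sample data with repeats, to make the ordering visible.
sample = ['ball', 'sky', 'ball', 'tree', 'sky']
print(unique_in_order(sample))
```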

If you're interested in other ways to uniquify a list, see this slightly dated but very useful write-up: http://www.peterbe.com/plog/uniqifiers-benchmark