Following this example: Twitter data mining with Python and Gephi: Case synthetic biology
The CSV loads into a DataFrame, df, with two columns, 'Country' and 'Responses':

Country     Responses
Italy       "Lorem ipsum..."
Italy       "Lorem ipsum..."
France      "Lorem ipsum..."
Germany     "Lorem ipsum..."
I can get through steps 1 and 2, but step 3 fails with:
TypeError: unhashable type: 'list'
I believe this is because I am working in a DataFrame and made this (possibly wrong) modification:

Original example:
# split into words
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(tweets)
My code:
# split into words
tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
My full code:
import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.corpus import brown

df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

words = df['tokenized_sents']

# remove 100 most common words based on the Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]
Out: ['the', ',', '.', 'of', 'and', ...]
# keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]
TypeError: unhashable type: 'list'
There are many questions here about unhashable lists, but none that I understand to be quite the same as mine. Any suggestions? Thanks.
TRACEBACK
TypeError Traceback (most recent call last)
<ipython-input-164-a0d17b850b10> in <module>()
1 #keep only most common words
----> 2 fdist = FreqDist(words)
3 mostcommon = fdist.most_common(100)
4 mclist = []
5 for i in range(len(mostcommon)):
/home/*******/anaconda3/envs/*******/lib/python3.5/site-packages/nltk/probability.py in __init__(self, samples)
104 :type samples: Sequence
105 """
--> 106 Counter.__init__(self, samples)
107
108 def N(self):
/home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in __init__(*args, **kwds)
521 raise TypeError('expected at most 1 arguments, got %d' % len(args))
522 super(Counter, self).__init__()
--> 523 self.update(*args, **kwds)
524
525 def __missing__(self, key):
/home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in update(*args, **kwds)
608 super(Counter, self).update(iterable) # fast path when counter is empty
609 else:
--> 610 _count_elements(self, iterable)
611 if kwds:
612 self.update(kwds)
TypeError: unhashable type: 'list'
Answer (score: 1)
The FreqDist function takes an iterable of hashable objects (meant to be strings, but it probably works with anything). The error you are getting is because you passed in an iterable of lists. As you suspected, this happened because of the change you made:
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
If I understand the Pandas apply function documentation correctly, that line applies the nltk.word_tokenize function to the 'Responses' Series. word_tokenize returns a list of words, so each cell of df['tokenized_sents'] ends up holding a list rather than a string.
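To see this concretely, here is a minimal sketch with a toy two-row frame standing in for the real data (it assumes the NLTK 'punkt' models are available via nltk.download('punkt')):

import nltk
import pandas as pd
from nltk.probability import FreqDist

# toy frame standing in for the real CSV
df = pd.DataFrame({'Responses': ['Lorem ipsum dolor', 'sit amet']})
tokens = df['Responses'].apply(nltk.word_tokenize)
print(tokens.iloc[0])  # ['Lorem', 'ipsum', 'dolor'] -- each cell is a list
FreqDist(tokens)       # raises TypeError: unhashable type: 'list'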
As a solution, simply concatenate the lists together before passing them to FreqDist, like this:
allWords = []
for wordList in words:
    allWords += wordList
FreqDist(allWords)
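Equivalently, the flattening can be done in one step with itertools.chain (a minor variation on the loop above, not part of the original answer):

from itertools import chain

# flatten the Series of token lists into one flat list of words
allWords = list(chain.from_iterable(words))
fdist = FreqDist(allWords)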
Here is a fuller revision that does what you want. If you only need to identify the second set of 100 words, note that mclist will hold those values the second time through:
df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

lists = df['tokenized_sents']
words = []
for wordList in lists:
    words += wordList

# remove 100 most common words based on the Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]
Out: ['the', ',', '.', 'of', 'and', ...]
# keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
# mclist now contains the second-most-common set of 100 words
words = [w for w in words if w in mclist]
# this will keep ALL occurrences of the words in mclist
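As a side note beyond the original answer: membership tests against a plain list are O(n) per word, and the (word, count) pairs returned by most_common can be unpacked directly, so the last filtering step can be tightened like this:

# unpack just the words from the (word, count) pairs
mclist = [w for w, count in fdist.most_common(100)]
# a set makes each membership test O(1)
mcset = set(mclist)
words = [w for w in words if w in mcset]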