Pandas NLTK tokenization "unhashable type: 'list'"

Date: 2016-07-29 20:26:57

Tags: python pandas nltk

Following this example: Twitter data mining with Python and Gephi: Case synthetic biology

CSV to: df['Country', 'Responses']

'Country'
Italy
Italy
France
Germany

'Responses' 
"Lorem ipsum..."
"Lorem ipsum..."
"Lorem ipsum..."
"Lorem ipsum..."
  1. Tokenize the text in 'Responses'
  2. Remove the 100 most common words (based on brown.corpus)
  3. Identify the 100 most common remaining words
  4. I can get through steps 1 and 2, but get an error on step 3:

    TypeError: unhashable type: 'list'
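For context, the error comes from Counter, which FreqDist subclasses: every element it counts must be hashable, and lists are mutable and therefore unhashable. A minimal illustration (the sample tokens are made up):

```python
from collections import Counter

# Strings are hashable, so counting them works fine:
ok = Counter(['the', 'of', 'the'])

# Lists are mutable and therefore unhashable, so this fails the same way:
try:
    Counter([['the', 'of'], ['the']])
    raised = False
except TypeError as err:
    raised = True
    message = str(err)
```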
    

    I believe it is because I am working in a DataFrame and made this (probably wrong) modification:

    Original example:

    #divide to words
    tokenizer = RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(tweets)
    

    My code:

    #divide to words
    tokenizer = RegexpTokenizer(r'\w+')
    df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
    

    My full code:

    import pandas as pd
    import nltk
    from nltk.tokenize import RegexpTokenizer
    from nltk.probability import FreqDist
    from nltk.corpus import brown

    df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)
    
    tokenizer = RegexpTokenizer(r'\w+')
    df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
    
    words =  df['tokenized_sents']
    
    #remove 100 most common words based on Brown corpus
    fdist = FreqDist(brown.words())
    mostcommon = fdist.most_common(100)
    mclist = []
    for i in range(len(mostcommon)):
        mclist.append(mostcommon[i][0])
    words = [w for w in words if w not in mclist]
    
    Out: ['the',
     ',',
     '.',
     'of',
     'and',
    ...]
    
    #keep only most common words
    fdist = FreqDist(words)
    mostcommon = fdist.most_common(100)
    mclist = []
    for i in range(len(mostcommon)):
        mclist.append(mostcommon[i][0])
    words = [w for w in words if w not in mclist]
    
    TypeError: unhashable type: 'list'
    

    There are many questions about unhashable lists, but none quite the same as mine, as far as I understand. Any suggestions? Thanks.

    TRACEBACK

    TypeError                                 Traceback (most recent call last)
    <ipython-input-164-a0d17b850b10> in <module>()
      1 #keep only most common words
    ----> 2 fdist = FreqDist(words)
      3 mostcommon = fdist.most_common(100)
      4 mclist = []
      5 for i in range(len(mostcommon)):
    
    /home/*******/anaconda3/envs/*******/lib/python3.5/site-packages/nltk/probability.py in __init__(self, samples)
        104         :type samples: Sequence
        105         """
    --> 106         Counter.__init__(self, samples)
        107 
        108     def N(self):
    
    /home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in __init__(*args, **kwds)
        521             raise TypeError('expected at most 1 arguments, got %d' % len(args))
        522         super(Counter, self).__init__()
    --> 523         self.update(*args, **kwds)
        524 
        525     def __missing__(self, key):
    
    /home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in update(*args, **kwds)
        608                     super(Counter, self).update(iterable) # fast path when counter is empty
        609             else:
    --> 610                 _count_elements(self, iterable)
        611         if kwds:
        612             self.update(kwds)
    
    TypeError: unhashable type: 'list'
    

1 Answer:

Answer 0 (score: 1)

The FreqDist function takes an iterable of hashable objects (intended to be strings, but it probably works with anything). The error you are getting is because you pass in an iterable of lists. As you suspected, this is due to the change you made:

df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

If I understand the Pandas apply function documentation correctly, that line applies the nltk.word_tokenize function to a Series. word_tokenize returns a list of words.
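You can see this with a toy DataFrame; here str.split stands in for nltk.word_tokenize (both return one list of tokens per row, which is all that matters for the error):

```python
import pandas as pd

# A toy frame standing in for the CSV (the real data has more rows/columns)
df = pd.DataFrame({'Responses': ['Lorem ipsum dolor', 'ipsum dolor sit']})

# str.split stands in for nltk.word_tokenize; both return one list per row,
# so the resulting column holds lists, not strings
df['tokenized_sents'] = df['Responses'].apply(str.split)
```

Feeding that column of lists straight into FreqDist is what raises the TypeError.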

As a solution, simply add the lists together before trying to apply FreqDist, like so:

allWords = []
for wordList in words:
    allWords += wordList
FreqDist(allWords)
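Equivalently, the flattening can be done with itertools instead of repeated list concatenation; a sketch with made-up token lists:

```python
from itertools import chain

# Sample token lists, one per response (made up for illustration)
words = [['the', 'quick', 'fox'], ['the', 'lazy', 'dog']]

# Flatten in one pass instead of growing a list with +=
allWords = list(chain.from_iterable(words))
```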

Here is a more complete revision that should do what you want. If you only need to identify the second set of 100 words, note that mclist will hold them the second time around.

import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.corpus import brown

df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

lists =  df['tokenized_sents']
words = []
for wordList in lists:
    words += wordList

#remove 100 most common words based on Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

Out: ['the',
 ',',
 '.',
 'of',
 'and',
...]

#keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
# mclist contains second-most common set of 100 words
words = [w for w in words if w in mclist]
# this will keep ALL occurrences of the words in mclist
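The two mclist loops above can also be condensed: most_common already gives (word, count) pairs, and converting the kept words to a set makes the final membership test O(1) per word. A sketch with made-up data (2 stands in for the 100 in the answer):

```python
from collections import Counter

# Made-up flattened token list; 2 stands in for the 100 in the answer
words = ['apple', 'pear', 'apple', 'plum', 'pear', 'apple', 'fig']

top = Counter(words).most_common(2)
keep = {w for w, _ in top}                 # set membership is O(1)

# Keeps ALL occurrences of the most common words, as in the answer
filtered = [w for w in words if w in keep]
```

(FreqDist is a Counter subclass, so the same most_common call works on it directly.)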