Python - Text mining - TypeError: __hash__ method should return an integer

Time: 2016-06-24 09:15:20

Tags: python machine-learning text-mining

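I am working on a classification problem in Python. To be honest, I am not very good at Python yet, so I have been stuck on the same problem for a long time and I don't know how to solve it. I hope you can help me :).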

Here is my code:

tableau = pandas.DataFrame({'Exigence':exigence,'Résumé':resume})    

df2, targets = encode_target(tableau,"Exigence")
features = list(df2.columns[:4])

for line in resume:
   terms = prep.ngram_tokenizer(text=line)
   mx.add_doc(doc_id='some-unique-identifier',
              doc_class=df2["Target"],
              doc_terms=terms,
              frequency=True,
              do_padding=True)

I get this error:

objects are mutable, thus they cannot be hashed
Traceback (most recent call last):

  File "<ipython-input-9-072e9c71917a>", line 7, in <module>
    do_padding=True)

  File "C:\Users\nouguierc\AppData\Local\Continuum\Anaconda3\lib\site-packages\irlib\matrix.py", line 222, in add_doc
    if doc_class in self.classes:

TypeError: __hash__ method should return an integer

When I go to line 222 of matrix.py, I see this:

    if doc_class in self.classes:
        self.classes[doc_class].add(my_doc_terms)

The function containing these lines is:

def add_doc(self, doc_id = '', doc_class='', doc_terms=[], 
            frequency=False, do_padding=False):
    ''' Add new document to our matrix:
        doc_id: Identifier for the document, eg. file name, url, etc. 
        doc_class: You might need this in classification.
        doc_terms: List of terms you got after tokenizing the document.
        frequency: If true, term occurences is incremented by one.
                    Else, occurences is only 0 or 1 (a la Bernoulli)
        do_padding: Boolean. Check do_padding() for more info.
    ''' 
    # Update list of terms if new term seen.
    # And document (row) with its associated data.
    my_doc_terms = SuperList()
    for term in doc_terms:
        term_idx = self.terms.unique_append(term)
        #my_doc_terms.insert_after_padding(self.terms.index(term))
        if frequency:
            my_doc_terms.increment_after_padding(term_idx,1)
        else:
            my_doc_terms.insert_after_padding(term_idx,1)
    self.docs.append({  'id': doc_id, 
                        'class': doc_class, 
                        'terms': my_doc_terms})
    # Update list of document classes if new class seen.
    # self.classes.unique_append(doc_class)
    if doc_class in self.classes:
        self.classes[doc_class].add(my_doc_terms)
    else:
        self.classes[doc_class] = my_doc_terms
    if do_padding: 
        self.do_padding()

What do you think of my problem?

1 answer:

Answer 0: (score: 0)

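You are passing an object as doc_class. Check what df2['Target'] returns: it is probably a pandas Series, i.e. the whole column rather than the label of one document. add_doc evaluates `doc_class in self.classes`, which has to hash doc_class, and a Series is mutable and cannot be hashed. Convert it to a single string (one value per document) and pass that instead.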
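To make the point concrete, here is a minimal sketch of the difference between passing the whole column and passing one label per document. It does not use irlib: the plain dict `classes` only stands in for `self.classes`, the sample `exigence` and `resume` values are made up, and the way the "Target" column is built is an assumption about what `encode_target` does, since that function is not shown in the question.

```python
import pandas as pd

# Hypothetical stand-in data; the real `exigence` and `resume` lists
# are not shown in the question.
exigence = ["high", "low", "high"]
resume = ["first summary text", "second summary text", "third summary text"]
df2 = pd.DataFrame({"Exigence": exigence, "Résumé": resume})

# Assumption: encode_target adds an integer "Target" column, one code per label.
df2["Target"] = df2["Exigence"].astype("category").cat.codes

classes = {}  # plays the role of self.classes inside add_doc

# Passing the whole column reproduces the failure: the membership test has to
# hash the Series, and pandas refuses to hash mutable objects.
try:
    _ = df2["Target"] in classes
    whole_column_ok = True
except TypeError:
    whole_column_ok = False

# Passing one scalar label per row works, because a plain int is hashable.
seen = []
for doc_id, (label, line) in enumerate(zip(df2["Target"], resume)):
    label = int(label)  # make sure the key is a plain Python int
    if label in classes:
        classes[label].append(line)
    else:
        classes[label] = [line]
    seen.append((doc_id, label))
```

In the question's loop, the same idea would mean pairing each line of `resume` with its own entry of `df2["Target"]` and passing that single value as `doc_class`. Giving each document a distinct `doc_id` instead of the same literal string is probably also what was intended.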