TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') for KNN text classification in Python

Asked: 2019-11-27 09:16:18

Tags: python scikit-learn text-classification knn

I have the following code:

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
from csv import reader,writer
import operator as op
import string
from sklearn import neighbors

#Read data from corpus
r = reader(open('one100words.csv','r'))
abstract_list = []
score_list = []
institute_list = []
row_count = 0
for row in list(r)[1:]:
    institute,score,abstract = row[0], row[1], row[2]
    if len(abstract.split()) > 0:
      institute_list.append(institute)
      score = float(score)
      if score >= 3.2:
          score = 3.2
      elif (score >= 3.0 and score < 3.2):
          score = 3.0
      elif (score >= 2.8 and score < 3.0):
          score = 2.8
      elif (score >= 2.5 and score < 2.8):
          score = 2.5
      elif (score >= 2.2 and score < 2.5):
          score = 2.2
      elif (score >= 2.0 and score < 2.2):
          score = 2.0
      elif (score >= 1.5 and score < 2.0):
          score = 1.5
      elif (score >= 1.0 and score < 1.5):
          score = 1.0
      score_list.append(score)
      # Strip punctuation; str.translate needs a table built by str.maketrans
      abstract = abstract.translate(str.maketrans('', '', string.punctuation)).lower()
      abstract_list.append(abstract)
      row_count = row_count + 1

print("Total processed data: ", row_count)

#Vectorize (TF-IDF, ngrams 1-4, no stop words) using sklearn -->
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,4),
                     min_df = 0, stop_words = 'english', sublinear_tf=True)
response = vectorizer.fit_transform(abstract_list)
classes = score_list
feature_names = vectorizer.get_feature_names()

#clf = neighbors.KNeighborsClassifier(n_neighbors=1)
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(response, classes)

abstract_input = "Originality: First to combine optimization and first order query processing in relational learning. Significance: Markov Logic was considered to be impractical for most NLP applications. This work changed this, and is the basis of a software (2000+ downloads) used throughout the world. E.g., Henry Kautz, inventor of the MaxWalkSAT (MWS) algorithm, used our algorithm in place of his own MWS. James Allen, inventor of Allen's Temporal Interval Algebra used our tool for temporal reasoning. Rigour: Theoretical analysis proves optimality (or epsilon-optimality). For two real-world models  orders of magnitude reduction in runtime and memory usage at no loss of accuracy."

predicted = clf.predict([[abstract_input]])

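I am trying to do text classification with KNN (the KNeighborsClassifier line is currently commented out in favor of LinearRegression). However, the last line of the code raises the following error:

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'

The full traceback is: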

Traceback (most recent call last):
  File "knn-custom.py", line 58, in <module>
    predicted = clf.predict([[abstract_input]])
  File "/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 256, in predict
    return self._decision_function(X)
  File "/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 241, in _decision_function
    dense_output=True) + self.intercept_
  File "/anaconda3/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 140, in safe_sparse_dot
    return np.dot(a, b)
TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'

Does anyone know how to fix this error?

Thanks.

P.S. There is an answer for a similar error, but that solution does not apply to this case. Existing answer: Numpy.dot TypeError: Cannot cast array data from dtype('float64') to dtype('S32') according to the rule 'safe'
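
A minimal sketch of one likely cause, under stated assumptions: the traceback ends in np.dot, which suggests that clf.predict received the raw string rather than numeric features. The model was fitted on the TF-IDF matrix returned by vectorizer.fit_transform, so new text would presumably need to pass through vectorizer.transform first. The toy corpus, scores, and names below (toy_abstracts, toy_scores, new_abstract) are hypothetical stand-ins for the CSV data and are not taken from the original script.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Hypothetical toy data standing in for abstract_list and score_list.
toy_abstracts = [
    "first order query processing in relational learning",
    "markov logic for natural language processing applications",
    "temporal reasoning with interval algebra",
    "runtime and memory reduction with no loss of accuracy",
]
toy_scores = [3.2, 3.0, 2.5, 2.0]

# Fit the vectorizer on the training text and keep the fitted object around.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), sublinear_tf=True)
features = vectorizer.fit_transform(toy_abstracts)

clf = LinearRegression()
clf.fit(features, toy_scores)

new_abstract = "optimization and query processing for markov logic"

# transform() (not fit_transform()) reuses the fitted vocabulary, so the new
# row has the same number of columns as the training matrix.
new_features = vectorizer.transform([new_abstract])

# predict() now receives numeric features instead of a list of strings.
predicted = clf.predict(new_features)
print(predicted)
```

In the original script the corresponding change would be clf.predict(vectorizer.transform([abstract_input])). The same would apply with KNeighborsClassifier, although a classifier expects discrete labels, so scikit-learn may reject float bucket values as continuous targets; casting the buckets to strings is one option.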

0 answers:

No answers yet.