Error when classifying text with an SVM

Date: 2018-08-17 11:37:02

Tags: python python-3.x pandas scikit-learn

I am trying to apply a text classification algorithm, but unfortunately I get an error:

import sklearn
import numpy as np
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
import pandas as pd
import pandas

dataset = pd.read_csv('train.csv', encoding = 'utf-8')
data = dataset['data']
labels = dataset['label']

X_train, X_test, y_train, y_test = train_test_split (data.data, labels.target, test_size = 0.2, random_state = 0)


vecteur = CountVectorizer()
X_train_counts = vecteur.fit_transform(X_train)

tfidf = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, y_train)

#SVM
clf = svm.SVC(kernel = 'linear', C = 10).fit(X_train, y_train)
print(clf.score(X_test, y_test))

I get the following error:

Traceback (most recent call last):
  File "bayes_classif.py", line 22, in <module>
    dataset = pd.read_csv('train.csv', encoding = 'utf-8')
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 72, saw 3

My data:

data, label
bought noon <product> provence <product> shop givors moment <price> bad surprise <time> made account price <price> catalog expect part minimum refund difference wait read brief delay, refund

parcel ordered friend n still not arrive possible destination send back pay pretty unhappy act gift birth <date> status parcel n not moved weird think lost stolen share quickly solutions can send gift both time good <time>, call

ordered <product> coat recovered calais city europe shops n not used assemble parties up <time> thing done <time> bad surprise parties not aligned correctly can see photo can exchange made refund man, annulation

note <time> important traces rust articles come to buy acting carrying elements going outside extremely disappointed wish to return together immediately full refund indicate procedure sabrina beillevaire <phone_numbers>, refund

note <time> important traces rust articles come to buy acts acting bearing elements going outside extremely disappointed wish to return together immediately full refund indicate procedure <phone_numbers>, annulation

request refund box jewelry arrived completely broken box n not protected free delivery directly packaging plastic item fragile cardboard box <product> interior shot cover cardboard torn corners <product> completely broken, call

1 answer:

Answer 0 (score: 1)

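Could you try to reproduce the same error with clean code? Your code contains several errors and unnecessary lines. We also need a sample of your data to help reproduce the error; otherwise we will not be able to help.

Here is what I think you are trying to do. Please try running it with your data and tell us whether you still get the same error: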

import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

dataset = pd.DataFrame({'data':['A first sentence','And a second sentence','Another one','Yet another line','And a last one'],
                    'label':[1,0,0,1,1]})
data = dataset['data']
labels = dataset['label']

X_train, X_test, y_train, y_test = train_test_split (data, labels, test_size = 0.2, random_state = 0)


vecteur = CountVectorizer()
tfidf = TfidfTransformer()

X_train_counts = vecteur.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)
X_test_tfidf = tfidf.transform(vecteur.transform(X_test))

clf = svm.SVC(kernel = 'linear', C = 10).fit(X_train_tfidf, y_train)
print(clf.score(X_test_tfidf, y_test))
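
The same steps can also be chained in a single scikit-learn Pipeline, so that the test data automatically goes through the transformations fitted on the training data. This is only a minimal sketch on the toy sentences from the example above, not the asker's real data; it assumes TfidfVectorizer as a stand-in for CountVectorizer followed by TfidfTransformer, and the step names 'tfidf' and 'svm' are arbitrary labels chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data copied from the example above; swap in the real 'data' and 'label' columns.
texts = ['A first sentence', 'And a second sentence', 'Another one',
         'Yet another line', 'And a last one']
labels = [1, 0, 0, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

# TfidfVectorizer combines token counting and tf-idf weighting in one step.
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svm', SVC(kernel='linear', C=10)),
])

# fit() learns the vocabulary and the classifier from the training split only;
# score() reuses the fitted vectorizer on the test split before predicting.
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

With only five sentences the printed accuracy is not meaningful; the point is that raw text goes in and the pipeline handles the vectorizing consistently for both splits.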

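Edit:

Based on your data, the error is probably caused by commas inside the text of your csv file, which break the pandas parser. You can use the error_bad_lines parameter of read_csv to tell pandas to skip these lines. Here is a short example: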

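import io
import pandas as pd

temp = u"""data, label
A first working line, refund
a second ok line, call
last line with an inside comma: , character which makes it bug, call"""
# io.StringIO replaces pd.compat.StringIO, which newer pandas releases no longer provide.
# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; use on_bad_lines='skip' there.
df = pd.read_csv(io.StringIO(temp), error_bad_lines=False)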
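
Skipping the bad lines discards those training examples. If the label is always the last field and never contains a comma itself (an assumption based on the sample data shown in the question), the rows can instead be kept by splitting each line on its last comma only. The sketch below uses only the standard library; split_on_last_comma is a hypothetical helper name, not a pandas function.

```python
def split_on_last_comma(raw_text):
    """Parse 'text, label' lines, treating only the LAST comma as the separator.

    Assumes the first line is a header and the label contains no comma.
    """
    rows = []
    lines = raw_text.splitlines()
    for line in lines[1:]:  # skip the header line
        if not line.strip():
            continue  # ignore blank lines
        text, sep, label = line.rpartition(',')
        if not sep:
            continue  # no comma at all, so there is no label to extract
        rows.append((text.strip(), label.strip()))
    return rows


temp = """data, label
A first working line, refund
a second ok line, call
last line with an inside comma: , character which makes it bug, call"""

rows = split_on_last_comma(temp)
for text, label in rows:
    print(label, '|', text)
```

The resulting list of tuples can then be turned into a DataFrame with pd.DataFrame(rows, columns=['data', 'label']), so the line with the inner comma is kept rather than dropped.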