I am trying to classify a sample corpus with a Naive Bayes classifier. The sample corpus is shown below (stored in myfile.csv):
"Text";"label"
“There be no significant perinephric collection";"label1”
“There be also fluid collection”;”label2”
“No discrete epidural collection or abscess be see";"label1”
“This be highly suggestive of epidural abscess”;”label2”
“No feature of spondylodiscitis be see”;”label1”
“At the level of l2 l3 there be loculated epidural fluid collection”;”label2”
The code for the classifier is as follows:
# libraries for dataset preparation, feature engineering, model training
import pandas as pd
import csv
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
#Data preparation
data = pd.read_csv(open('myfile.csv'), sep=';', quoting=csv.QUOTE_NONE)
# Creating Bag of Words
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data)
print(X_train_counts.shape)
#From occurrences to frequencies
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)
#Training a classifier
clf = MultinomialNB().fit(X_train_tfidf, data['label'])
#Predicting with the classifier
docs_new = ['there is no spondylodiscitis', 'there is a large fluid collection']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, data['label']))
Whenever I try to run the prediction, I get the following error:
KeyError: 'label'
Where am I going wrong?
Answer 0: (score: 1)
When in doubt, load the code in a REPL or a debugger. The parts elided as ... below are not relevant to your question.
import pandas as pd
import csv
...
data = pd.read_csv(open('myfile.csv'), sep=';', quoting=csv.QUOTE_NONE)
import pdb; pdb.set_trace()
...
Now we can query the data object interactively:
(Pdb) data.keys()
Index(['"Text"', '"label"'], dtype='object')
(Pdb) data['"label"']
0 "label1”
1 ”label2”
2 "label1”
3 ”label2”
4 ”label1”
5 ”label2”
Name: "label", dtype: object
(Pdb) data["label"]
*** KeyError: 'label'
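Note that the keys are '"Text"' and '"label"', not "Text" and "label". Because the file is read with quoting=csv.QUOTE_NONE, the quote characters are treated as ordinary data and become part of the column names. You therefore cannot write data["label"]; doing so raises the KeyError you saw. You have to write data['"label"'] instead.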
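If you would rather keep the short key, one option is to strip the quote characters from the column names after loading. The following is only a minimal sketch: the in-memory `sample` string is a hypothetical stand-in for myfile.csv and assumes plain ASCII double quotes.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for myfile.csv, using plain ASCII quotes.
sample = '"Text";"label"\n"There be also fluid collection";"label2"\n'

data = pd.read_csv(io.StringIO(sample), sep=';', quoting=csv.QUOTE_NONE)
print(list(data.columns))  # the quote characters are part of each name

# Strip the surrounding quote characters from every column name.
data.columns = data.columns.str.strip('"')
print(list(data.columns))

first_label = data['label'].iloc[0]  # the short key works now
```

Note that this only renames the columns. The cell values were read with QUOTE_NONE as well, so they still carry their quote characters.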
Answer 1: (score: 0)
Your data appears to be quoted, so why are you specifying QUOTE_NONE here?
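To illustrate the point, here is a small sketch of what happens when the quoting argument is simply left out, so pandas falls back to its default quote handling. The `sample` string is a hypothetical stand-in for myfile.csv and assumes every field uses plain ASCII double quotes; a file with typographic (curly) quotes would not be parsed this way.

```python
import io

import pandas as pd

# Hypothetical stand-in for myfile.csv, using plain ASCII quotes.
sample = (
    '"Text";"label"\n'
    '"There be no significant perinephric collection";"label1"\n'
    '"There be also fluid collection";"label2"\n'
)

# No quoting argument: the default treats " as a quote character
# and removes it from both the header and the values.
data = pd.read_csv(io.StringIO(sample), sep=';')

print(data.columns.tolist())
print(data['label'].tolist())
```

Under these assumptions both the column names and the label values come back without quote characters, so data['label'] can be used directly.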
Answer 2: (score: 0)
If you want to be able to access the pandas column as data['label'], the first line of your file should be:
Text;label
rather than:
"Text";"label"
With the quoted header, you have to index the label column like this:
data['"label"']
which does not look good.
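As a rough check of this suggestion, the sketch below reads a file whose header is unquoted while keeping quoting=csv.QUOTE_NONE from the question. The `sample` string is hypothetical and assumes plain ASCII quotes in the data rows.

```python
import csv
import io

import pandas as pd

# Hypothetical file contents: header unquoted, data rows left as they were.
sample = (
    'Text;label\n'
    '"There be no significant perinephric collection";"label1"\n'
    '"There be also fluid collection";"label2"\n'
)

data = pd.read_csv(io.StringIO(sample), sep=';', quoting=csv.QUOTE_NONE)

labels = data['label']  # no KeyError with the unquoted header
print(labels.tolist())
```

The lookup succeeds, but as long as QUOTE_NONE is in effect the label values themselves still include their quote characters.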