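I applied an SVM to my dataset. The dataset is multi-label, meaning each observation has more than one label.
During KFold cross-validation, a `not in index` error is raised.
It reports that indices 601 through 6007 are `not in index` (I have data samples 1 ... 6008).
Here is my code: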
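import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# stop_words is not defined in this snippet; it is assumed to be set earlier
df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR','WD','EF','INF','SSI','DI','others']
X = df[['sentences']]
y = df[['ADR','WD','EF','INF','SSI','DI','others']]
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    SVC_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words=stop_words)),
        ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
    ])
    for category in categories:
        print('... Processing {} '.format(category))
        # train the model on the training sentences and this label's column
        SVC_pipeline.fit(X_train['sentences'], y_train[category])
        prediction = SVC_pipeline.predict(X_test['sentences'])
        # score against the true labels in y_test (X_test holds no label columns)
        print('SVM Linear Test accuracy is {} '.format(accuracy_score(y_test[category], prediction)))
        print('SVM Linear f1 measurement is {} '.format(f1_score(y_test[category], prediction, average='weighted')))
        # pair each test sentence with its predicted 0/1 value for this label
        print([{X_test['sentences'].iloc[i]: prediction[i]} for i in range(len(prediction))])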
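Actually, I don't know how to apply KFold cross-validation so that I get the F1 score and accuracy for each label separately. I have looked at this and this, but they did not show me how to apply it to my case.
For reproducibility, here is a small sample of the DataFrame. The last seven columns are my labels (ADR, WD, ...):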
,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1
Update
When I did what Vivek Kumar suggested, it raised this error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 5408]
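in the classifier part. Do you have any idea how to resolve it?
There are a few questions on Stack Overflow about this error, and they suggest that I need to reshape the training data. I tried that as well, but without success (link). Thanks :)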
Answer 0 (score: 9)
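`train_index` and `test_index` are integer positions based on the number of rows, but plain pandas indexing does not work that way. Indexing a DataFrame as `X[train_index]` looks the integers up among the column labels, which is where the `not in index` error comes from. Newer versions of pandas are also stricter about how you slice or select data from a DataFrame.
You need to use `.iloc` to access rows by position. More information is available here.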
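To see the difference in isolation, here is a minimal sketch on a made-up three-row frame. The data and the names `positions`, `label_lookup_failed` and `by_position` are illustrative only and do not come from the question.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the question's data (hypothetical values).
X = pd.DataFrame({"sentences": ["first", "second", "third"]})
positions = np.array([0, 2])  # the kind of integer array KFold.split yields

# Plain [] with a list-like of integers is looked up among the column labels,
# so it fails here because the only column is "sentences".
try:
    X[positions]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# .iloc selects rows purely by integer position.
by_position = X.iloc[positions]

print(label_lookup_failed)
print(by_position["sentences"].tolist())
```

The same reasoning applies to `y`: `y.iloc[train_index]` picks rows, whereas `y[train_index]` would search the label column names.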
This is what you need:
for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    ...
    ...
    # TfidfVectorizer doesn't work with a DataFrame,
    # because iterating a DataFrame gives the column names, not the actual data.
    # So specify the column name explicitly to get the sentences.
    SVC_pipeline.fit(X_train['sentences'], y_train[category])
    prediction = SVC_pipeline.predict(X_test['sentences'])
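Putting the pieces together, the following is a rough, self-contained sketch of per-label scoring across folds. It rests on several assumptions: the eight sentences and two label columns are invented toy data, `stop_words` is left out, `OneVsRestClassifier` is dropped because each label column is already binary, and the names `scores`, `mean_scores` and `n_docs_from_frame` are illustrative. It also shows where the `[1, 5408]` mismatch comes from: passing the whole DataFrame to the vectorizer yields a single "document", namely the column name.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 8 sentences and two of the question's label columns.
df = pd.DataFrame({
    "sentences": [
        "hair loss and weight gain", "stopped the drug last week",
        "memory loss after a month", "tapering off slowly",
        "severe weight gain", "withdrawal was rough",
        "hair loss got worse", "quit cold turkey",
    ],
    "ADR": [1, 0, 1, 0, 1, 0, 1, 0],
    "WD":  [0, 1, 0, 1, 0, 1, 0, 1],
})
categories = ["ADR", "WD"]
X = df[["sentences"]]
y = df[categories]

# Iterating a DataFrame yields its column names, so the vectorizer
# sees one "document" instead of eight.
n_docs_from_frame = TfidfVectorizer().fit_transform(X).shape[0]

kf = KFold(n_splits=4)
scores = {category: {"accuracy": [], "f1": []} for category in categories}

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    for category in categories:
        # Fresh pipeline per label and fold, fitted on the text column only.
        pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
        pipeline.fit(X_train["sentences"], y_train[category])
        prediction = pipeline.predict(X_test["sentences"])
        scores[category]["accuracy"].append(
            accuracy_score(y_test[category], prediction))
        scores[category]["f1"].append(
            f1_score(y_test[category], prediction, average="weighted"))

# Average each metric over the folds, separately for every label.
mean_scores = {
    category: {metric: sum(vals) / len(vals) for metric, vals in per_label.items()}
    for category, per_label in scores.items()
}
for category, result in mean_scores.items():
    print(category, result)
```

On real data you would probably also want to shuffle the folds and check that every training fold contains both classes for each label, since `LinearSVC` cannot be fitted on a single class.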