
Time: 2018-08-15 05:31:48

Tags: python machine-learning scikit-learn multilabel-classification


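I have implemented SVM classification and it works. (I am interested in the accuracy for each class, so I applied OneVsRestClassifier to each class, as you can see in the code.)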


For example, this is the data passed to the model for prediction:

0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,1
5,I have no idea when this will end.,1,0,0,0,0,0,1
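
To make the column layout of these rows explicit, here is a minimal sketch that loads them with pandas. The column names are assumptions: `sentences` and the seven category names come from the code in the question, while `id` is a hypothetical name for the leading row number. The real file may also carry a header row, which this sketch does not assume.

```python
import io

import pandas as pd

# The six sample rows from the question, copied verbatim.
RAW = '''\
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,1
5,I have no idea when this will end.,1,0,0,0,0,0,1
'''

categories = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']

# Assumed layout: row id, sentence text, then one 0/1 column per category.
columns = ['id', 'sentences'] + categories

df = pd.read_csv(io.StringIO(RAW), header=None, names=columns, index_col='id')

# Number of positive examples per category in this sample.
print(df[categories].sum())
```

Each sentence can carry several labels at once (row 4 is both WD and others), which is why the targets are stored as seven separate 0/1 columns.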


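I know we can do this using label binarization from the scikit-learn library.

The problem is that the input parameter of fit_transform, as documented, differs from the target data I prepared and passed to the SVM classifier, so I don't know how to approach it.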
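
The mismatch can be illustrated with a small sketch. It assumes the binarizer in question is `MultiLabelBinarizer` (the question does not say which class it means). `fit_transform` takes one collection of label names per sample, whereas the CSV already holds the resulting 0/1 matrix; `inverse_transform` goes in the opposite direction, from 0/1 rows back to names.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

categories = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']

# Passing `classes` fixes the column order to match the CSV columns.
mlb = MultiLabelBinarizer(classes=categories)

# fit_transform expects label names per sample, not 0/1 columns.
label_sets = [('ADR',), ('others',), ('WD', 'others')]
indicator = mlb.fit_transform(label_sets)
print(indicator)

# The CSV targets are already in indicator form, so the useful direction
# is the inverse: map 0/1 rows back to category names.
rows = np.array([
    [0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0],
])
names = [tuple(str(n) for n in labels) for labels in mlb.inverse_transform(rows)]
print(names)
```

Because the binarizer is fitted with an explicit `classes` list, the label sets passed to `fit_transform` only need to be valid category names; they do not have to cover every category.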


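import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR','WD','EF','INF','SSI','DI','others']
stop_words = 'english'  # assumption: the original stop_words definition is not shown

train,test = train_test_split(df,random_state=42,test_size=0.3,shuffle=True)
X_train = train.sentences
X_test = test.sentences

SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])

for category in categories:
    print('... Processing {} '.format(category))
    # Fit on the current category's 0/1 column; the pipeline must be fitted before predict
    SVC_pipeline.fit(X_train, train[category])
    prediction = SVC_pipeline.predict(X_test)
    print('SVM Linear Test accuracy is {} '.format(accuracy_score(test[category], prediction)))
    print('SVM Linear f1 measurement is {} '.format(f1_score(test[category], prediction, average='weighted')))
    print("\n")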


1 answer:

Answer 0 (score: 1)


import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR','WD','EF','INF','SSI','DI','others']

train,test = train_test_split(df,random_state=42,test_size=0.3,shuffle=True)
X_train = train.sentences
X_test = test.sentences

SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=[])),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])

for category in categories:
    print('... Processing {} '.format(category))
    # Fit on the current category's 0/1 column before predicting
    SVC_pipeline.fit(X_train, train[category])
    prediction = SVC_pipeline.predict(X_test)
    # prediction holds 0/1 for this category, so categories[prediction[i]]
    # evaluates to categories[0] ('ADR') or categories[1] ('WD')
    print([{X_test.iloc[i]: categories[prediction[i]]} for i in range(len(prediction))])

    print('SVM Linear Test accuracy is {} '.format(accuracy_score(test[category], prediction)))
    print('SVM Linear f1 measurement is {} '.format(f1_score(test[category], prediction, average='weighted')))
    print("\n")


... Processing ADR 
[{'extreme weight gain, short-term memory loss, hair loss.': 'ADR'}, {'I am detoxing from Lexapro now.': 'ADR'}]
SVM Linear Test accuracy is 0.5 
SVM Linear f1 measurement is 0.3333333333333333 

... Processing WD 
[{'extreme weight gain, short-term memory loss, hair loss.': 'ADR'}, {'I am detoxing from Lexapro now.': 'ADR'}]
SVM Linear Test accuracy is 1.0 
SVM Linear f1 measurement is 1.0 
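
An alternative worth considering is to fit a single OneVsRestClassifier on all 0/1 columns at once and decode each predicted row into category names. The sketch below is an illustration under stated assumptions, not the answer's method: it uses the six sample sentences from the question, restricts the targets to the three categories that have positive examples in that sample (ADR, WD, others), and scores on the training rows only to show the shapes involved. The helper `decode` is a hypothetical name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Subset of the question's categories that occur in the sample rows.
categories = ['ADR', 'WD', 'others']

sentences = [
    "extreme weight gain, short-term memory loss, hair loss.",
    "I am detoxing from Lexapro now.",
    "I slowly cut my dosage over several months and took vitamin supplements to help.",
    "I am now 10 days completely off and OMG is it rough.",
    "I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",
    "I have no idea when this will end.",
]

# 0/1 targets for ADR, WD, others, copied from the sample rows.
Y = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
])

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])

# One fit trains one binary LinearSVC per column of Y.
pipeline.fit(sentences, Y)
pred = pipeline.predict(sentences)  # 0/1 matrix, one column per category


def decode(row):
    """Return the category names whose flag is 1 in a 0/1 row (hypothetical helper)."""
    return [name for name, flag in zip(categories, row) if flag == 1]


predicted_names = {text: decode(row) for text, row in zip(sentences, pred)}

# Per-category accuracy, computed column by column.
per_category_accuracy = {
    name: accuracy_score(Y[:, j], pred[:, j]) for j, name in enumerate(categories)
}

for text, names in predicted_names.items():
    print(text, '->', names)
print(per_category_accuracy)
```

With the real data, the same pattern would be fitted on `train[categories]` and scored on `test[categories]`, which keeps the per-category accuracy while producing one row of label names per sentence.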
