Difference between OneVsRestClassifier and applying a classifier to each label separately in multi-label classification

Time: 2018-09-08 19:27:34

Tags: machine-learning text-classification multilabel-classification scikit-multilearn

My goal is multi-label classification. I have read this link, which covers exactly what I am looking for. However, it is more conceptual than based on implemented code.

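I have two pieces of code that use a linear SVM. In the first I apply OneVsRestClassifier; in the other I build each classifier separately. My results differ widely in terms of F1, recall, and precision, even though, according to the link above, there should not be much difference!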

I would like to know what I am doing wrong!
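The expectation that the two setups should agree can be checked on toy data. The following is a minimal sketch, assuming synthetic data from `make_multilabel_classification` (not the dataset from this question) and a fixed `random_state`, that compares the predictions of `OneVsRestClassifier(LinearSVC())` with those of one `LinearSVC` fitted per label column:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic multi-label data (an assumption; not the dataset from the question).
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)
X_train, X_test = X[:150], X[150:]
Y_train = Y[:150]

# One wrapper that fits one binary LinearSVC per label column.
ovr = OneVsRestClassifier(LinearSVC(random_state=0))
ovr_pred = ovr.fit(X_train, Y_train).predict(X_test)

# The same thing done by hand: one independent LinearSVC per column.
manual_pred = np.column_stack([
    LinearSVC(random_state=0).fit(X_train, Y_train[:, j]).predict(X_test)
    for j in range(Y_train.shape[1])
])

same_predictions = np.array_equal(ovr_pred, manual_pred)
print("identical predictions:", same_predictions)
```

If the predictions match on toy data like this, a large gap in reported scores is more likely to come from how the scores are computed than from the models themselves.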

First approach, using OneVsRestClassifier:

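import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv('finalupdatedothers.csv')
X = df.sentences
dfy = df[['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']]
stop_words = stopwords.words('english')
# TF-IDF features feeding one LinearSVC per label column
classifier = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X, dfy):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = dfy.iloc[train_index], dfy.iloc[test_index]

# Fit and scoring run after the loop, so only the last fold's split is used.
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
# y_test is the full label matrix, so these are multi-label weighted averages.
print("SVM f-measure " + str(f1_score(y_test, predicted, average='weighted')))
print("SVM precision" + str(precision_score(y_test, predicted, average='weighted')))
print("SVM recall" + str(recall_score(y_test, predicted, average='weighted')))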

With this approach I got the following results:

SVM f-measure 0.6396653428191672
SVM precision0.7153314849944064
SVM recall0.5955056179775281
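Note that in the code as posted, the fit and the scoring sit after the KFold loop, so the numbers come from the last split only. A minimal sketch of scoring every fold and averaging, assuming synthetic data and 5 splits purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real features and label matrix (an assumption).
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)

kf = KFold(n_splits=5)
fold_scores = []
for train_index, test_index in kf.split(X):
    # Fit and score inside the loop so that every fold contributes a score.
    model = OneVsRestClassifier(LinearSVC(random_state=0))
    model.fit(X[train_index], Y[train_index])
    predicted = model.predict(X[test_index])
    fold_scores.append(f1_score(Y[test_index], predicted, average='weighted'))

mean_f1 = float(np.mean(fold_scores))
print("per-fold F1:", np.round(fold_scores, 3))
print("mean F1:", round(mean_f1, 3))
```

This applies equally to the per-label code; it affects how stable the reported numbers are, not which of the two setups scores higher.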

Second approach, where I apply the classifier to each label separately:

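import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']
X = df.sentences
y = df[categories]
stop_words = stopwords.words('english')
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# As in the first approach, everything below uses only the last fold's split.
SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', LinearSVC()),
])

for category in categories:
    print('... Processing {} '.format(category))
    # Train and score one binary classifier on this label column alone.
    SVC_pipeline.fit(X_train, y_train[category])
    prediction = SVC_pipeline.predict(X_test)
    # Each target here is a single 0/1 column, so 'weighted' averages over classes 0 and 1.
    print('SVM Linear f1 measurement is {} '.format(f1_score(y_test[category], prediction, average='weighted')))
    print("SVM precision" + str(precision_score(y_test[category], prediction, average='weighted')))
    print("SVM recall" + str(recall_score(y_test[category], prediction, average='weighted')))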

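I learned the above approach from this link. With this approach I get good results for every category: the minimum is 77 and the maximum is 96, so the results are good overall.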
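One possible factor, offered as an assumption rather than something confirmed on the real data, is that `average='weighted'` does not mean the same thing in the two calls. On a label matrix it averages the positive-class score of each label, weighted by how many positives that label has. On a single 0/1 column it averages over the classes 0 and 1, so a well-predicted majority of zeros raises the score. A minimal sketch with two hypothetical label columns:

```python
import numpy as np
from sklearn.metrics import f1_score

# Two hypothetical label columns, 10 samples each; positives are the minority.
y_true = np.array([
    [1, 0], [1, 0], [0, 1], [0, 1], [0, 1],
    [0, 1], [0, 0], [0, 0], [0, 0], [0, 0],
])
y_pred = np.array([
    [1, 0], [0, 0], [0, 1], [0, 1], [0, 0],
    [0, 0], [0, 0], [0, 0], [0, 0], [0, 0],
])

# Scoring the whole label matrix at once, as in the OneVsRestClassifier code.
multilabel_weighted = f1_score(y_true, y_pred, average='weighted')

# Scoring each column on its own with the same argument, as in the per-label code.
per_column_weighted = [
    f1_score(y_true[:, j], y_pred[:, j], average='weighted')
    for j in range(y_true.shape[1])
]

# Positive-class F1 per column, which is what the multi-label average builds on.
per_column_positive = [
    f1_score(y_true[:, j], y_pred[:, j], average='binary')
    for j in range(y_true.shape[1])
]

print("multi-label weighted F1:", multilabel_weighted)
print("per-column weighted F1: ", per_column_weighted)
print("per-column positive F1: ", per_column_positive)
```

Here the predictions are identical in both cases, yet the per-column weighted scores come out higher than the multi-label weighted score. Scoring each column with `average='binary'` puts the per-label numbers on the same footing as the multi-label average.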

What is the reason? Here is my data:

id,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1
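For anyone reproducing this, a minimal sketch that loads the rows shown above from an in-memory string (the full `finalupdatedothers.csv` is not available here) and counts positives per label:

```python
import io

import pandas as pd

# The six sample rows from the question, embedded as a string.
sample_csv = '''id,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1
'''

sample = pd.read_csv(io.StringIO(sample_csv))
label_columns = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']

# Number of positive rows per label, and number of labels set on each row.
label_counts = sample[label_columns].sum()
labels_per_row = sample[label_columns].sum(axis=1)

print(label_counts)
print(labels_per_row.tolist())
```

In this excerpt every row carries exactly one label and most labels have no positives at all; whether the full file is similarly imbalanced cannot be told from the sample.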

0 answers:

No answers