Multi-label classification using Meka Java

Time: 2018-10-18 13:21:04

Tags: java multilabel-classification

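Can anyone point me to complete documentation for classifying a multi-label dataset with Meka from Java code? I need to train on 80% of the data first and then test on the remaining 20%. How can I do this with Meka? This is what my dataset looks like; the first six attributes are the class labels: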

     @attribute IS_PROTECTION_binarized {0,1}
     @attribute IS_PRICING_binarized {0,1}
     @attribute IS_ERROR_binarized {0,1}
     @attribute IS_USAGE_binarized {0,1}
     @attribute IS_COMPATIBILITY_binarized {0,1}
     @attribute IS_RESOURCES_binarized {0,1}
     @attribute text string

     @data
     0,0,1,0,1,0,'keeps crashing since i upgraded my android this game keeps crashing'
     0,0,0,0,0,0,'addictive i first became a fan of this game when i got an app that u had to earn coins to unlock diffrent colored lights how u got coins was to play games and it just happened tbat one of the mini games was this kind of game'
     0,1,0,0,0,0,'ad free port of the original open source game'

1 answer:

Answer 0 (score: 0)

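You can use scikit-multilearn (a Python library): its LabelPowerset class does the job, and you only need to pick a base multi-class classifier. You will probably need to preprocess the text attribute, though, so using a pipeline may be important.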

from skmultilearn.problem_transform import LabelPowerset
from sklearn.ensemble import RandomForestClassifier

# initialize LabelPowerset multi-label classifier with a RandomForest
classifier = LabelPowerset(
    classifier = RandomForestClassifier(n_estimators=100),
    require_dense = [False, True]
)

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)
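
To make the idea behind the label powerset transformation more concrete, here is a minimal pure-Python sketch of it: every distinct combination of the six binary labels is treated as one class of an ordinary multi-class problem. This is an illustration of the technique, not scikit-multilearn's actual implementation; the helper names `to_powerset_classes` and `from_powerset_classes` are hypothetical, and the fourth label row is an assumed extra example added to show a repeated combination.

```python
# Label rows from the three @data lines in the question (six binary labels each).
label_rows = [
    (0, 0, 1, 0, 1, 0),
    (0, 0, 0, 0, 0, 0),
    (0, 1, 0, 0, 0, 0),
    (0, 0, 1, 0, 1, 0),  # assumed extra row repeating the first combination
]


def to_powerset_classes(rows):
    """Map each distinct label combination to a single integer class id."""
    combo_to_class = {}
    class_ids = []
    for row in rows:
        combo = tuple(row)
        if combo not in combo_to_class:
            # First time this combination is seen: give it the next free id.
            combo_to_class[combo] = len(combo_to_class)
        class_ids.append(combo_to_class[combo])
    return class_ids, combo_to_class


def from_powerset_classes(class_ids, combo_to_class):
    """Turn predicted class ids back into label combinations."""
    class_to_combo = {cid: combo for combo, cid in combo_to_class.items()}
    return [class_to_combo[cid] for cid in class_ids]


class_ids, mapping = to_powerset_classes(label_rows)
restored = from_powerset_classes(class_ids, mapping)
print(class_ids)
print(restored == label_rows)
```

The base classifier (the random forest above) is then trained on the single class id per row, which is why any ordinary multi-class classifier can be plugged in.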

The pipeline would look like this:

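from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# turn raw text into token counts, reweight them with TF-IDF,
# then feed the features to the multi-label classifier defined above
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', classifier),
])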
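
The question also asks for an 80% / 20% train/test split, which the snippets above take for granted through `X_train`, `y_train` and `X_test`. A rough sketch of such a split using only the standard library might look like the following. The helper `split_80_20`, the seed, and the toy `texts` / `labels` data are all assumptions made for illustration; scikit-learn's `train_test_split` with `test_size=0.2` covers the same need.

```python
import random


def split_80_20(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle the row indices once, then cut them into train and test parts."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [samples[i] for i in train_idx]
    X_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy stand-ins for the 'text' attribute and the six binary label columns.
texts = [f"review {i}" for i in range(10)]
labels = [(i % 2, 0, 0, 0, 0, 0) for i in range(10)]

X_train, X_test, y_train, y_test = split_80_20(texts, labels)
print(len(X_train), len(X_test))
```

The training part would then go to `pipeline.fit` and the held-out part to `pipeline.predict`; the label rows may need converting to an array or sparse matrix first, depending on what the classifier accepts.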