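Can anyone point me to complete documentation for classifying a multi-label dataset using MEKA from Java code? I need to train on 80% of the data first and then test on the remaining 20%. How can I do this with MEKA? This is what my dataset looks like; the first six attributes are the class labels: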
@attribute IS_PROTECTION_binarized {0,1}
@attribute IS_PRICING_binarized {0,1}
@attribute IS_ERROR_binarized {0,1}
@attribute IS_USAGE_binarized {0,1}
@attribute IS_COMPATIBILITY_binarized {0,1}
@attribute IS_RESOURCES_binarized {0,1}
@attribute text string
@data
0,0,1,0,1,0,'keeps crashing since i upgraded my android this game keeps crashing'
0,0,0,0,0,0,'addictive i first became a fan of this game when i got an app that u had to earn coins to unlock diffrent colored lights how u got coins was to play games and it just happened tbat one of the mini games was this kind of game'
0,1,0,0,0,0,'ad free port of the original open source game'
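Answer 0 (score: 0)
You can use scikit-multilearn: the LabelPowerset class will do the job, and you only need to pick a base multi-class classifier. You will probably have to transform the text attribute into numeric features first, though, so using a pipeline may be important.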
from skmultilearn.problem_transform import LabelPowerset
from sklearn.ensemble import RandomForestClassifier
# initialize LabelPowerset multi-label classifier with a RandomForest
classifier = LabelPowerset(
    classifier=RandomForestClassifier(n_estimators=100),
    require_dense=[False, True]
)
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
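To make the idea behind the label powerset transformation more concrete, here is a minimal pure-Python sketch. It is not scikit-multilearn's implementation; the helper names `to_powerset_classes` and `from_powerset_classes` are made up for illustration. Each distinct combination of the six labels is treated as one class of an ordinary multi-class problem, and predicted class ids are mapped back to label vectors afterwards.

```python
def to_powerset_classes(label_rows):
    """Map each distinct label combination to a single integer class id."""
    combo_to_class = {}
    class_ids = []
    for row in label_rows:
        combo = tuple(row)
        if combo not in combo_to_class:
            # first time this combination is seen: assign the next free id
            combo_to_class[combo] = len(combo_to_class)
        class_ids.append(combo_to_class[combo])
    return class_ids, combo_to_class


def from_powerset_classes(class_ids, combo_to_class):
    """Turn class ids back into the original label vectors."""
    class_to_combo = {cid: combo for combo, cid in combo_to_class.items()}
    return [list(class_to_combo[cid]) for cid in class_ids]


# Label columns of the three rows shown in the question.
y = [
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]

class_ids, mapping = to_powerset_classes(y)
restored = from_powerset_classes(class_ids, mapping)
print(class_ids)
print(restored == y)
```

One consequence of this design is that a label combination never seen during training cannot be predicted, which is worth keeping in mind with six labels and a small dataset.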
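The pipeline would look like this:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
# turn raw text into TF-IDF features, then hand them to the multi-label classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', classifier),
])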
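For the 80%/20% part of the question, a rough sketch with scikit-learn's `train_test_split` could look like the following. It makes several assumptions: the texts and labels are the three rows from the question typed in by hand (loading the ARFF file is not shown), and `RandomForestClassifier` is used directly as the `'clf'` step because it accepts a binary label matrix, so this is a swap for the LabelPowerset wrapper rather than the same method. The wrapped classifier from above should be usable in that step instead.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# The three rows from the question: review text and six binary labels.
texts = [
    "keeps crashing since i upgraded my android this game keeps crashing",
    "addictive i first became a fan of this game when i got an app that u "
    "had to earn coins to unlock diffrent colored lights how u got coins "
    "was to play games and it just happened tbat one of the mini games "
    "was this kind of game",
    "ad free port of the original open source game",
]
labels = [
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0
)

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # assumption: plain random forest in place of the LabelPowerset wrapper
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Fit on the 80% portion, predict one label vector per held-out text.
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions.shape)
```

With only three rows the split is purely illustrative; on the real dataset the same calls apply unchanged, and the predictions can be compared against `y_test` with a multi-label metric of your choice.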