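I am using the Reuters-21578 dataset for single-label text classification, but by default the dataset is multi-label. Many researchers have removed the multi-label instances from the dataset, and their instance counts per Reuters category are quite different from mine. How can I remove all instances that belong to more than one category? Can I use Weka or RapidMiner to identify the multi-label instances in the dataset?
Example: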
Input Dataset = {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}
Labels = {acq, earn, grain, corn}
Classification Results =
x1, x2, x3 = acq
x4, x5 = earn
x6, x7, x8 = grain
x9 = grain, corn
x10 = grain, acq

Output Dataset (what I want) =
output dataset = {x1, x2, x3, x4, x5, x6, x7, x8}
output labels = {acq, earn, grain, corn}
Classification Results =
x1, x2, x3 = acq
x4, x5 = earn
x6, x7, x8 = grain

OR {This is what I assume I have achieved with PolynomiaByBinomial Operator}
output dataset = {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}
output labels = {acq, earn, grain, corn}
Classification Results =
x1, x2, x3 = acq
x4, x5 = earn
x6, x7, x8, x9, x10 = grain
x9 = grain
x10 = grain
Thanks in advance.
Answer 0 (score: 0)
The simplest approach is to decompose the dataset into binary problems. For example, if you have the dataset
text1: science
text2: sports, politics
split it into 3 datasets, one per class:
dataset1 (science): text1:true, text2:false
dataset2 (sports): text1:false, text2:true
dataset3 (politics): text1:false, text2:true
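The split above can be sketched in a few lines of Python. This is only an illustration, not code from the original answer: the function name `to_binary_datasets` and the dict-of-sets input format are assumptions chosen for the example.

```python
def to_binary_datasets(labelled_docs):
    """Build one true/false dataset per label from multi-label data.

    labelled_docs maps a document id to the set of labels it carries
    (a hypothetical input format chosen for this sketch).
    """
    # Collect every label that occurs anywhere in the data.
    all_labels = sorted(
        {label for labels in labelled_docs.values() for label in labels}
    )
    # For each label, mark every document True if it carries that label.
    return {
        label: {doc: label in labels for doc, labels in labelled_docs.items()}
        for label in all_labels
    }


docs = {"text1": {"science"}, "text2": {"sports", "politics"}}
binary = to_binary_datasets(docs)
for label, dataset in binary.items():
    print(label, dataset)
```

Each entry of `binary` can then serve as the training labels for one binary classifier.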
Create 3 binary classifiers, one per class, train each on its corresponding dataset, and merge the results.
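If the goal is instead the first output described in the question (dropping every multi-label instance), a plain filter may be enough. The following is a minimal sketch, assuming the labels have already been loaded into a dict that maps each document id to its list of categories; the name `keep_single_label` and that input format are hypothetical and not tied to Weka, RapidMiner, or any Reuters-21578 loader.

```python
def keep_single_label(labelled_docs):
    """Keep only documents that carry exactly one label.

    Returns a dict mapping each surviving document id to its single label.
    Documents with zero labels or more than one label are dropped.
    """
    return {
        doc: labels[0]
        for doc, labels in labelled_docs.items()
        if len(labels) == 1
    }


# The toy data from the question: x9 and x10 are multi-label.
data = {
    "x1": ["acq"], "x2": ["acq"], "x3": ["acq"],
    "x4": ["earn"], "x5": ["earn"],
    "x6": ["grain"], "x7": ["grain"], "x8": ["grain"],
    "x9": ["grain", "corn"],
    "x10": ["grain", "acq"],
}
single = keep_single_label(data)
print(sorted(single, key=lambda d: int(d[1:])))
```

Note that filtering this way also removes any category that only ever appears alongside another one (here, corn), which may be one reason per-category counts differ between studies.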