Convert a multi-label dataset to single-label?

Asked: 2014-11-26 08:55:20

Tags: machine-learning weka data-mining rapidminer text-classification

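I am using the Reuters-21578 dataset for single-label text classification, but by default the dataset is multi-label. Many researchers have removed the multi-label instances from the dataset, and the per-category instance counts they report for the Reuters categories differ considerably from mine. How can I remove all instances that belong to more than one category? Can I use Weka or RapidMiner to identify the multi-label instances in the dataset?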

Example:


    Input Dataset = {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}
    Labels = {acq, earn, grain, corn}


    Classification Results = 

    x1, x2, x3 = acq
    x4, x5 = earn
    x6, x7, x8 = grain
    x9 = grain, corn
    x10 = grain, acq

    Output Dataset (what I want) = 
    output dataset = {x1, x2, x3, x4, x5, x6, x7, x8}
    output labels = {acq, earn, grain, corn}

    Classification Results = 

    x1, x2, x3 = acq
    x4, x5 = earn
    x6, x7, x8 = grain

    **OR**
    {This is what I assume I have achieved with the Polynominal by Binominal Classification operator}
    output dataset = {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}
    output labels = {acq, earn, grain, corn}
    Classification Results = 

    x1, x2, x3 = acq
    x4, x5 = earn
    x6, x7, x8, x9, x10 = grain
    x9 = grain
    x10 = grain

Thanks in advance.

1 answer:

Answer 0 (score: 0)

The simplest approach is to decompose the dataset into binary problems. For example, if you have the dataset:

text1: science
text2: sports, politics

split it into 3 datasets, one per label:

dataset1 (science): text1:true, text2:false
dataset2 (sports): text1:false, text2:true
dataset3 (politics): text1:false, text2:true
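The split above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `docs` dictionary and the `binarize` helper are hypothetical names, and the labels are assumed to have already been read into Python sets. It does not use Weka or RapidMiner.

```python
# Hypothetical toy data mirroring the example above: document id -> set of labels.
docs = {
    "text1": {"science"},
    "text2": {"sports", "politics"},
}


def binarize(docs):
    """Build one binary dataset per label: label -> {doc_id: True/False}."""
    # Collect every label that occurs anywhere in the data.
    labels = sorted(set().union(*docs.values()))
    return {
        label: {doc_id: label in doc_labels for doc_id, doc_labels in docs.items()}
        for label in labels
    }


binary_datasets = binarize(docs)
for label, dataset in binary_datasets.items():
    print(label, dataset)
```

Each entry of `binary_datasets` then plays the role of one of the three datasets listed above.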

Create 3 binary classifiers, one per class, train each on its corresponding dataset, and merge the results.
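If the goal is instead to drop the multi-label instances, as in the first desired output in the question, a filter along these lines might be enough. It is a minimal sketch in plain Python rather than a Weka or RapidMiner recipe; the `instances` dictionary and the `keep_single_label` helper are hypothetical, and the labels are assumed to have already been extracted from the dataset into Python lists.

```python
# Hypothetical toy data mirroring the question's example: instance id -> list of labels.
instances = {
    "x1": ["acq"], "x2": ["acq"], "x3": ["acq"],
    "x4": ["earn"], "x5": ["earn"],
    "x6": ["grain"], "x7": ["grain"], "x8": ["grain"],
    "x9": ["grain", "corn"],
    "x10": ["grain", "acq"],
}


def keep_single_label(instances):
    """Return {instance_id: label} for instances that have exactly one label."""
    return {
        instance_id: labels[0]
        for instance_id, labels in instances.items()
        if len(labels) == 1
    }


single = keep_single_label(instances)
multi = sorted(set(instances) - set(single))
print("kept:", single)
print("dropped (multi-label):", multi)
```

Here `multi` also answers the identification part of the question, since it lists the instances that carry more than one label.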