How can I represent dependency triples in the .arff file format?

Time: 2015-03-17 06:17:12

Tags: nlp weka feature-extraction arff

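I'm developing a classifier for text categorisation using the Weka Java library. I've extracted a number of features using Stanford's CoreNLP package, including a dependency parse of the text, which returns a string "(rel,head,mod)".

I want to use the dependency triples returned by this as classification features, but I can't work out how to represent them properly in an ARFF file. Basically, I'm stumped: each instance has an arbitrary number of dependency triples, so I can't define them explicitly as attributes, e.g.: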

@attribute entityCount numeric
@attribute depTriple_1 string
@attribute depTriple_2 string
.
.
@attribute depTriple_n string

Is there a particular way to handle this? I've been searching all day but haven't found anything yet.

Many thanks for reading.

2 answers:

Answer 0 (score: 0)

Taken from the Weka Wiki:

import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

/**
 * Generates a little ARFF file with different attribute types.
 *
 * @author FracPete
 */
public class SO_Test {
  public static void main(String[] args) throws Exception {
    FastVector      atts;
    FastVector      attsRel;
    FastVector      attVals;
    FastVector      attValsRel;
    Instances       data;
    Instances       dataRel;
    double[]        vals;
    double[]        valsRel;
    int             i;

    // 1. set up attributes
    atts = new FastVector();
    // - numeric
    atts.addElement(new Attribute("att1"));
    // - nominal
    attVals = new FastVector();
    for (i = 0; i < 5; i++)
      attVals.addElement("val" + (i+1));
    atts.addElement(new Attribute("att2", attVals));
    // - string
    atts.addElement(new Attribute("att3", (FastVector) null));
    // - date
    atts.addElement(new Attribute("att4", "yyyy-MM-dd"));
    // - relational
    attsRel = new FastVector();
    // -- numeric
    attsRel.addElement(new Attribute("att5.1"));
    // -- nominal
    attValsRel = new FastVector();
    for (i = 0; i < 5; i++)
      attValsRel.addElement("val5." + (i+1));
    attsRel.addElement(new Attribute("att5.2", attValsRel));
    dataRel = new Instances("att5", attsRel, 0);
    atts.addElement(new Attribute("att5", dataRel, 0));

    // 2. create Instances object
    data = new Instances("MyRelation", atts, 0);

    // 3. fill with data
    // first instance
    vals = new double[data.numAttributes()];
    // - numeric
    vals[0] = Math.PI;
    // - nominal
    vals[1] = attVals.indexOf("val3");
    // - string
    vals[2] = data.attribute(2).addStringValue("This is a string!");
    // - date
    vals[3] = data.attribute(3).parseDate("2001-11-09");
    // - relational
    dataRel = new Instances(data.attribute(4).relation(), 0);
    // -- first instance
    valsRel = new double[2];
    valsRel[0] = Math.PI + 1;
    valsRel[1] = attValsRel.indexOf("val5.3");
    dataRel.add(new Instance(1.0, valsRel));
    // -- second instance
    valsRel = new double[2];
    valsRel[0] = Math.PI + 2;
    valsRel[1] = attValsRel.indexOf("val5.2");
    dataRel.add(new Instance(1.0, valsRel));
    vals[4] = data.attribute(4).addRelation(dataRel);
    // add
    data.add(new Instance(1.0, vals));

    // second instance
    vals = new double[data.numAttributes()];  // important: needs NEW array!
    // - numeric
    vals[0] = Math.E;
    // - nominal
    vals[1] = attVals.indexOf("val1");
    // - string
    vals[2] = data.attribute(2).addStringValue("And another one!");
    // - date
    vals[3] = data.attribute(3).parseDate("2000-12-01");
    // - relational
    dataRel = new Instances(data.attribute(4).relation(), 0);
    // -- first instance
    valsRel = new double[2];
    valsRel[0] = Math.E + 1;
    valsRel[1] = attValsRel.indexOf("val5.4");
    dataRel.add(new Instance(1.0, valsRel));
    // -- second instance
    valsRel = new double[2];
    valsRel[0] = Math.E + 2;
    valsRel[1] = attValsRel.indexOf("val5.1");
    dataRel.add(new Instance(1.0, valsRel));
    vals[4] = data.attribute(4).addRelation(dataRel);
    // add
    data.add(new Instance(1.0, vals));

    // 4. output data
    System.out.println(data);
  }
}

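What your problem calls for specifically is a "relational" attribute. The snippet above already handles relational attributes of this kind.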
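To make the target format concrete, here is a minimal sketch of what the ARFF text for a relational attribute might look like when written by hand, without Weka. The attribute names (`entityCount`, `depTriples`, `depTriplet`, `postTxt`) are borrowed from this thread; the class `RelationalArffSketch` and its methods are hypothetical, and the quoting rules are an assumption about the format rather than a copy of Weka's writer. Weka may quote or escape values differently, so printing a real `Instances` object is the way to confirm.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch: ARFF text for one relational (multi-instance) attribute. */
class RelationalArffSketch {

    /** Wraps a value in single quotes, escaping backslashes, quotes and newlines. */
    static String quote(String s) {
        return "'" + s.replace("\\", "\\\\")
                      .replace("'", "\\'")
                      .replace("\n", "\\n") + "'";
    }

    /** Header with a numeric, a relational and a string attribute. */
    static String header() {
        return String.join("\n",
            "@relation MyName",
            "",
            "@attribute entityCount numeric",
            "@attribute depTriples relational",
            "  @attribute depTriplet string",
            "@end depTriples",
            "@attribute postTxt string",
            "",
            "@data");
    }

    /**
     * One data row. The whole bag of triples becomes a single quoted value:
     * each inner row is quoted, rows are joined by a newline, and the outer
     * quoting then escapes that newline and the inner quotes.
     */
    static String row(int entityCount, List<String> triples, String text) {
        String bag = triples.stream()
            .map(RelationalArffSketch::quote)
            .collect(Collectors.joining("\n"));
        return entityCount + "," + quote(bag) + "," + quote(text);
    }

    public static void main(String[] args) {
        System.out.println(header());
        System.out.println(row(2,
            Arrays.asList("(nsubj,sat,cat)", "(det,cat,the)"),
            "the cat sat"));
    }
}
```

Because the bag is one value, each instance can carry any number of triples without changing the header.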

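Answer 1 (score: 0)

OK, I did it! I'm just posting this as an answer in case anyone else has a similar problem. Previously I had been following the guide on the Weka Wiki (as posted by Rushdi), but I ran into a lot of trouble because the guide creates the instances of the relational attribute statically, whereas I needed to declare an arbitrary number of them dynamically. So I re-evaluated how I was generating the attributes and got it working with slight changes to the guide above: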

    //1. Set up attributes
    FastVector atts;
    FastVector relAtts;
    Instances relData;
    atts = new FastVector();

    //Entity Count - numeric
    atts.addElement(new Attribute("entityCount"));

    //Dependencies - Relational (Multi-Instance)
    relAtts = new FastVector();


    relAtts.addElement(new Attribute("depTriplet", (FastVector) null));
    relData = new Instances("depTriples", relAtts, 0);


    atts.addElement(new Attribute("depTriples", relData, 0));
    atts.addElement(new Attribute("postTxt", (FastVector) null));

    //2. Create Instances Object
    Instances trainSet = new Instances("MyName", atts, 0);

    /* 3. Fill with data:
       Loop through text docs to extract features 
          and generate instance for train set */
    //Holds the relational attribute instances
    Instances relAttData;

    for(Object doc: docList) {
        List<String> depTripleList = getDepTriples(doc);
        int entCount = getEntityCount(doc);
        String pt = getText(doc);

        //Create instance to be added to training set
        Instance tInst = new Instance(trainSet.numAttributes());

        //Entity count
        tInst.setValue( (Attribute) atts.elementAt(0), entCount);

        //Generate Instances for relational attribute
        relAttData = new Instances(trainSet.attribute(1).relation(), 0);

        //For each deplist entry, create an instance and add it to dataset
        for(String depTriple: depTripleList) {
             Instance relAttInst = new Instance(1);
             relAttInst.setDataset(relAttData);

             relAttInst.setValue(0, depTriple);

             relAttData.add(relAttInst);
        }

        //Add relational attribute (now filled with a number of Instances of attributes) to the main Instance
        tInst.setValue( (Attribute) atts.elementAt(1), trainSet.attribute(1).addRelation(relAttData));

        //Finally, add the instance to the training set
        trainSet.add(tInst);
    }

    //4. Output data
    System.out.println(trainSet);

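I realise this could probably be done differently, but it works for my situation. Please bear in mind that this is not my actual code, but excerpts from several parts stitched together to demonstrate the process used to solve the problem.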
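One possible variation: since each triple arrives as a single string of the form "(rel,head,mod)", it could be split into its three parts, which might then be stored as three inner attributes of the relational attribute instead of one string. The following is only a sketch under that assumption. `DepTripleParser` and `parse` are hypothetical names, not part of Weka or CoreNLP, and the naive comma split would misread a triple whose head token is itself a comma.

```java
import java.util.Arrays;

/** Hypothetical helper: splits a "(rel,head,mod)" string into its three parts. */
class DepTripleParser {

    /** Returns {rel, head, mod}; throws if the input does not have that shape. */
    static String[] parse(String triple) {
        String t = triple.trim();
        if (t.length() < 2 || t.charAt(0) != '(' || t.charAt(t.length() - 1) != ')') {
            throw new IllegalArgumentException("Expected \"(rel,head,mod)\" but got: " + triple);
        }
        // A limit of 3 keeps any further commas inside the last field.
        String[] parts = t.substring(1, t.length() - 1).split(",", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected three fields but got: " + triple);
        }
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String s : Arrays.asList("(nsubj,sat,cat)", "(det, cat, the)")) {
            System.out.println(Arrays.toString(parse(s)));
        }
    }
}
```

Whether three inner attributes help depends on the classifier; a single string per triple, as in the answer above, keeps each triple atomic.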