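I am developing a classifier for text categorisation using the Weka Java library. I have extracted a number of features using Stanford's CoreNLP package, including a dependency parse of the text, which returns strings of the form "(rel,head,mod)".
I would like to use the dependency triples returned by the parser as classification features, but I cannot figure out how to represent them properly in an ARFF file. Basically, I'm stumped: each instance has an arbitrary number of dependency triples, so I cannot define them explicitly as attributes, for example: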
@attribute entityCount numeric
@attribute depTriple_1 string
@attribute depTriple_2 string
.
.
@attribute depTriple_n string
Is there a particular way to handle this? I have been searching all day and have not found anything yet.
Thanks very much for reading.
Answer 0 (score: 0)
Taken from the Weka Wiki:
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
/**
* Generates a little ARFF file with different attribute types.
*
* @author FracPete
*/
public class SO_Test {
public static void main(String[] args) throws Exception {
FastVector atts;
FastVector attsRel;
FastVector attVals;
FastVector attValsRel;
Instances data;
Instances dataRel;
double[] vals;
double[] valsRel;
int i;
// 1. set up attributes
atts = new FastVector();
// - numeric
atts.addElement(new Attribute("att1"));
// - nominal
attVals = new FastVector();
for (i = 0; i < 5; i++)
attVals.addElement("val" + (i+1));
atts.addElement(new Attribute("att2", attVals));
// - string
atts.addElement(new Attribute("att3", (FastVector) null));
// - date
atts.addElement(new Attribute("att4", "yyyy-MM-dd"));
// - relational
attsRel = new FastVector();
// -- numeric
attsRel.addElement(new Attribute("att5.1"));
// -- nominal
attValsRel = new FastVector();
for (i = 0; i < 5; i++)
attValsRel.addElement("val5." + (i+1));
attsRel.addElement(new Attribute("att5.2", attValsRel));
dataRel = new Instances("att5", attsRel, 0);
atts.addElement(new Attribute("att5", dataRel, 0));
// 2. create Instances object
data = new Instances("MyRelation", atts, 0);
// 3. fill with data
// first instance
vals = new double[data.numAttributes()];
// - numeric
vals[0] = Math.PI;
// - nominal
vals[1] = attVals.indexOf("val3");
// - string
vals[2] = data.attribute(2).addStringValue("This is a string!");
// - date
vals[3] = data.attribute(3).parseDate("2001-11-09");
// - relational
dataRel = new Instances(data.attribute(4).relation(), 0);
// -- first instance
valsRel = new double[2];
valsRel[0] = Math.PI + 1;
valsRel[1] = attValsRel.indexOf("val5.3");
dataRel.add(new Instance(1.0, valsRel));
// -- second instance
valsRel = new double[2];
valsRel[0] = Math.PI + 2;
valsRel[1] = attValsRel.indexOf("val5.2");
dataRel.add(new Instance(1.0, valsRel));
vals[4] = data.attribute(4).addRelation(dataRel);
// add
data.add(new Instance(1.0, vals));
// second instance
vals = new double[data.numAttributes()]; // important: needs NEW array!
// - numeric
vals[0] = Math.E;
// - nominal
vals[1] = attVals.indexOf("val1");
// - string
vals[2] = data.attribute(2).addStringValue("And another one!");
// - date
vals[3] = data.attribute(3).parseDate("2000-12-01");
// - relational
dataRel = new Instances(data.attribute(4).relation(), 0);
// -- first instance
valsRel = new double[2];
valsRel[0] = Math.E + 1;
valsRel[1] = attValsRel.indexOf("val5.4");
dataRel.add(new Instance(1.0, valsRel));
// -- second instance
valsRel = new double[2];
valsRel[0] = Math.E + 2;
valsRel[1] = attValsRel.indexOf("val5.1");
dataRel.add(new Instance(1.0, valsRel));
vals[4] = data.attribute(4).addRelation(dataRel);
// add
data.add(new Instance(1.0, vals));
// 4. output data
System.out.println(data);
}
}
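Your problem specifically calls for a "relational" attribute. The snippet above already handles relational attributes of this kind (see att5).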
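For the attributes in the question, the ARFF header with a relational attribute would look roughly like what the sketch below prints. It only builds the header text with plain Java (no Weka calls), so the class and method names (RelationalArffSketch, header) are made up for illustration; the attribute names entityCount, depTriples and depTriplet are the ones used elsewhere in this thread.

```java
class RelationalArffSketch {

    /**
     * Builds an ARFF header with one numeric attribute and one relational
     * attribute. The relational attribute wraps a single string attribute,
     * so each instance can carry any number of dependency triples.
     */
    static String header(String relationName) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relationName).append('\n');
        sb.append("@attribute entityCount numeric\n");
        // A relational attribute is opened with "relational" and closed with "@end <name>"
        sb.append("@attribute depTriples relational\n");
        sb.append("  @attribute depTriplet string\n");
        sb.append("@end depTriples\n");
        sb.append("@data\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(header("MyRelation"));
    }
}
```

The nested @attribute lines describe the "inner" dataset, and every outer instance then holds its own small bag of inner instances, which is why the number of triples no longer has to be fixed in the header.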
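Answer 1 (score: 0)
Okay, I got it working! I am posting this as an answer in case anyone else has a similar problem. Previously I was following the guide on the Weka Wiki (posted by Rushdi), but I ran into a lot of trouble because the guide creates static instances of the relational attribute, whereas I needed to declare an arbitrary number of them dynamically. So I re-evaluated how I was generating the attributes, and I managed to get it working with a slight modification of the guide above: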
//1. Set up attributes
FastVector atts;
FastVector relAtts;
Instances relData;
atts = new FastVector();
//Entity Count - numeric
atts.addElement(new Attribute("entityCount"));
//Dependencies - Relational (Multi-Instance)
relAtts = new FastVector();
relAtts.addElement(new Attribute("depTriplet", (FastVector) null));
relData = new Instances("depTriples", relAtts, 0);
atts.addElement(new Attribute("depTriples", relData, 0));
atts.addElement(new Attribute("postTxt", (FastVector) null));
//2. Create Instances Object
Instances trainSet = new Instances("MyName", atts, 0);
/* 3. Fill with data:
Loop through text docs to extract features
and generate instance for train set */
//Holds the relational attribute instances
Instances relAttData;
for(Object doc: docList) {
List<String> depTripleList = getDepTriples(doc);
int entCount = getEntityCount(doc);
String pt = getText(doc);
//Create instance to be added to training set
Instance tInst = new Instance(trainSet.numAttributes());
//Entity count
tInst.setValue( (Attribute) atts.elementAt(0), entCount);
//Generate Instances for relational attribute
relAttData = new Instances(trainSet.attribute(1).relation(), 0);
//For each deplist entry, create an instance and add it to dataset
for(String depTriple: depTripleList) {
Instance relAttInst = new Instance(1);
relAttInst.setDataset(relAttData);
relAttInst.setValue(0, depTriple);
relAttData.add(relAttInst);
}
//Add relational attribute (now filled with one instance per dependency triple) to the main Instance
tInst.setValue( (Attribute) atts.elementAt(1), trainSet.attribute(1).addRelation(relAttData));
//Post text - string (otherwise pt is never used and postTxt stays missing)
tInst.setValue( (Attribute) atts.elementAt(2), pt);
//Finally, add the instance to the training set
trainSet.add(tInst);
}
//4. Output data
System.out.println(trainSet);
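I realise this could probably be done differently, but it works for my case. Keep in mind that this is not my actual code; it is a set of excerpts from several parts spliced together to demonstrate the process I used to solve the problem.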
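One of those different ways might be to store rel, head and mod as three separate attributes inside the relational bag instead of one opaque string. The sketch below covers only the splitting step and is not part of the code above: DepTripleSketch, Triple and parse are hypothetical names, and it assumes every string has exactly the form "(rel,head,mod)" with no commas inside the tokens (a punctuation token such as "," would break it).

```java
class DepTripleSketch {

    /** Hypothetical holder for the three parts of one dependency triple. */
    static final class Triple {
        final String rel;
        final String head;
        final String mod;

        Triple(String rel, String head, String mod) {
            this.rel = rel;
            this.head = head;
            this.mod = mod;
        }
    }

    /** Splits a string of the form "(rel,head,mod)" into its three parts. */
    static Triple parse(String s) {
        String t = s.trim();
        if (t.length() < 2 || t.charAt(0) != '(' || t.charAt(t.length() - 1) != ')') {
            throw new IllegalArgumentException("Not a dependency triple: " + s);
        }
        // Drop the surrounding parentheses, then split on commas (keeping empty parts)
        String[] parts = t.substring(1, t.length() - 1).split(",", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 parts but got " + parts.length + ": " + s);
        }
        return new Triple(parts[0].trim(), parts[1].trim(), parts[2].trim());
    }

    public static void main(String[] args) {
        Triple t = parse("(nsubj,ran,dog)");
        System.out.println(t.rel + " | " + t.head + " | " + t.mod);
    }
}
```

With the parts separated, the inner dataset could declare three attributes rather than one, at the cost of a slightly longer loop when filling relAttData.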