Question

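I'm working on a text classification system and want to use unigrams as features. When building the ARFF file, I declared a string attribute in which I want to list all the words contained in a message, separated by commas. However, Weka tells me it "cannot handle string attributes". I tried using StringToWordVector when defining the relation in the header, but that did not help. How can I get around this? Thanks a lot!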

Answer 1

If your ARFF file is correctly formatted, the following code should help:

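    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    // dataSource: path of your ARFF file
    BufferedReader trainReader = new BufferedReader(new FileReader(dataSource));
    Instances trainInsts = new Instances(trainReader);
    trainReader.close();
    // Use the last attribute as the class attribute
    trainInsts.setClassIndex(trainInsts.numAttributes() - 1);

    // The filter converts string attributes into numeric word attributes
    StringToWordVector STWfilter = new StringToWordVector();
    FilteredClassifier model = new FilteredClassifier();
    model.setFilter(STWfilter);

    // The input format must be set before the filter is applied
    STWfilter.setInputFormat(trainInsts);
    // The converted data
    trainInsts = Filter.useFilter(trainInsts, STWfilter);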
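For the code above to work, the ARFF file itself has to declare the text as a `string` attribute and quote each value, since an unquoted comma in the data section separates attributes. Below is a minimal sketch of how such a file could be generated with plain Java. The class name `ArffStringExample`, the attribute names `text` and `class`, and the labels `spam` and `ham` are illustrative assumptions, not part of the original question. The quoting only handles backslashes and single quotes, so text containing line breaks or other control characters would need additional escaping.

```java
import java.util.List;

public class ArffStringExample {

    // Wrap a value in single quotes for ARFF, escaping backslashes and single quotes.
    // Simplified on purpose: newlines and other control characters are not handled.
    public static String quote(String text) {
        String escaped = text.replace("\\", "\\\\").replace("'", "\\'");
        return "'" + escaped + "'";
    }

    // Build ARFF text with one string attribute and one nominal class attribute.
    // Each row is {message text, class label}; the names and labels are hypothetical.
    public static String buildArff(List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation messages\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {spam,ham}\n\n");
        sb.append("@data\n");
        for (String[] row : rows) {
            sb.append(quote(row[0])).append(',').append(row[1]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildArff(List.of(
                new String[] {"win a free prize now", "spam"},
                new String[] {"let's meet at noon", "ham"})));
    }
}
```

The words of each message are kept in one quoted string separated by spaces rather than commas. StringToWordVector then splits that string into individual word attributes, which gives the unigram features the question asks for.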

How do I specify a string in a Weka file?
