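I am trying to create TF-IDF feature vectors in Weka, similar to the ones produced by TfidfVectorizer in scikit-learn (eventually I would like to get a matrix like the one fit_transform returns).
So far I have only adapted an example I found on the Internet, but the vectors it generates are wrong. I am lost here: I have been searching for a solution, but nothing has worked.
My ARFF file: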
@relation balanceado
@attribute description string
@attribute rate numeric
@data
'The hotel was excellent in all aspects.',5
'overall an 8, breakfast was acceptable',5
'slept like a baby!!! :)',5
'Bad hotel',1
'I will not come back',1
'Horrible experience',1
'Awful experience',1
'Probably the best hotel in Waterville',5
'slept like a baby!!! :)',5
'The worst service',1
My code:
Instances dataset = source.getDataSet();
dataset.setClassIndex(1);
StringToWordVector filter = new StringToWordVector();
filter.setWordsToKeep(1000000);
NGramTokenizer t = new NGramTokenizer();
t.setNGramMaxSize(1);
t.setNGramMinSize(1);
filter.setTokenizer(t);
filter.setTFTransform(true);
filter.setIDFTransform(true);
filter.setLowerCaseTokens(true);
filter.setOutputWordCounts(true);
filter.setStopwords(new File("stopwords/english-stop-words.txt"));
filter.setInputFormat(dataset);
filter.batchFinished();
Instances output = Filter.useFilter(dataset, filter);
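For reference, the StringToWordVector documentation describes the TF transform as log(1 + f_ij) and the IDF transform as f_ij * log(N / df_i), where f_ij is the frequency of word i in document j, N is the number of documents and df_i is the number of documents containing word i. The plain-Java sketch below shows what a single weight would look like if both options are combined. The class name `TfIdfByHand`, the helper `weight`, the natural logarithm, the simple `\W+` tokenization and the order (TF first, then IDF) are assumptions made for illustration, not Weka's actual code.

```java
public class TfIdfByHand {
    // Hypothetical helper: weight of `term` in docs[docIndex],
    // computed as log(1 + tf) * log(N / df) under the assumptions above.
    static double weight(String[] docs, int docIndex, String term) {
        int tf = 0; // occurrences of the term in the chosen document
        int df = 0; // number of documents that contain the term
        for (int i = 0; i < docs.length; i++) {
            int count = 0;
            for (String token : docs[i].toLowerCase().split("\\W+")) {
                if (token.equals(term)) {
                    count++;
                }
            }
            if (i == docIndex) {
                tf = count;
            }
            if (count > 0) {
                df++;
            }
        }
        if (tf == 0) {
            return 0.0; // term absent from the document, so the weight is zero
        }
        return Math.log(1 + tf) * Math.log((double) docs.length / df);
    }

    public static void main(String[] args) {
        // Four of the reviews from the ARFF file, used as a toy corpus.
        String[] docs = {
            "Bad hotel", "Horrible experience", "Awful experience", "The worst service"
        };
        System.out.println(weight(docs, 0, "hotel"));
        System.out.println(weight(docs, 1, "experience"));
    }
}
```

Note that TfidfVectorizer in scikit-learn defaults to a smoothed IDF, ln((1 + n) / (1 + df)) + 1, applied to raw counts and followed by L2 normalization, so its numbers would not be expected to match this weighting exactly.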
My code generates the following output:
@data
{0 5}
{0 5}
{0 5}
{0 1}
{0 1}
{0 1}
{0 1}
{0 5}
{0 5}
{0 1}
The output shows that the frequencies were not computed. Only the class value is picked up and placed after a zero.
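One possible cause, offered as an assumption rather than a confirmed diagnosis: calling batchFinished() right after setInputFormat() closes the first batch before the filter has seen any instance, so the dictionary would be built from an empty batch and every later instance would keep only its non-string attribute. Filter.useFilter already feeds the instances to the filter and closes the batch itself. A minimal sketch without the explicit call follows; the class name `TfIdfSketch` and the helper `toTfIdf` are hypothetical, and the stopword file is left out to keep the example independent of the Weka version.

```java
import weka.core.Instances;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TfIdfSketch {
    // Hypothetical helper: replaces the string attribute of `dataset`
    // with one weighted attribute per word.
    static Instances toTfIdf(Instances dataset) throws Exception {
        StringToWordVector filter = new StringToWordVector();
        filter.setWordsToKeep(1000000);

        // Unigrams only, as in the original code.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(1);
        filter.setTokenizer(tokenizer);

        filter.setTFTransform(true);
        filter.setIDFTransform(true);
        filter.setLowerCaseTokens(true);
        filter.setOutputWordCounts(true);

        filter.setInputFormat(dataset);
        // No explicit batchFinished() here: Filter.useFilter passes every
        // instance to the filter and then finishes the batch on its own.
        return Filter.useFilter(dataset, filter);
    }
}
```

With a dataset loaded as in the question, this would be called as `Instances output = TfIdfSketch.toTfIdf(dataset);` after `dataset.setClassIndex(1);`.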