Delimiters option for WEKA WordTokenizer

Time: 2013-04-06 13:52:58

Tags: weka

I am trying to set the WordTokenizer options on a StringToWordVector filter. Here is my code:

StringToWordVector filter = new StringToWordVector();

//Tokenizer option (letter only)
String tokenizerOption[] = new String[2];
tokenizerOption[0] = "-tokenizer";
tokenizerOption[1] = "weka.core.tokenizers.WordTokenizer -delimiters \r\t\n .,;:\'\"()?!-><#$%&*+/@^_=[]{}|\\`~0123456789";
filter.setOptions(tokenizerOption);
filter.setInputFormat(data);

I then save the filtered instances to an ARFF file. This is the ARFF I get:

@attribute '\n' numeric
@attribute ' ' numeric
@attribute ' a ' numeric

As you can see, \n and the space character are not being treated as delimiters. How can I get them included?

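1 answer:

Answer 0: (score: 3)

I found the answer. Instead of passing the delimiters through the option string, create the tokenizer object and set the delimiters on it directly; see the code below: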

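import weka.core.tokenizers.WordTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

//Make a filter
StringToWordVector filter = new StringToWordVector();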

//Make a tokenizer
WordTokenizer wt = new WordTokenizer();
String delimiters = " \r\t\n.,;:\'\"()?!-><#$\\%&*+/@^_=[]{}|`~0123456789";
wt.setDelimiters(delimiters);
filter.setTokenizer(wt);

//Inform filter about dataset
filter.setInputFormat(data);
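
To sanity-check a delimiter string without involving WEKA at all, a small standalone sketch like the following may help. It assumes that WordTokenizer splits on any single character of the delimiter string, the same way java.util.StringTokenizer does; the class name DelimiterCheck and the sample sentence are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, not part of WEKA.
public class DelimiterCheck {

    // Same delimiter string as in the answer above.
    static final String DELIMITERS =
        " \r\t\n.,;:\'\"()?!-><#$\\%&*+/@^_=[]{}|`~0123456789";

    // Splits text on every character contained in delimiters
    // and returns the remaining non-empty tokens in order.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Spaces, the newline, punctuation and digits are all consumed as delimiters.
        System.out.println(tokenize("Hello, world!\nThis is a test 123.", DELIMITERS));
        // prints [Hello, world, This, is, a, test]
    }
}
```

If whitespace or newlines still show up as tokens here, the delimiter string itself is the problem; if they only show up in the ARFF output, the delimiters are probably being altered on their way into the tokenizer.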