I am trying to set options for the WordTokenizer used by the StringToWordVector filter. Here is my code:
StringToWordVector filter = new StringToWordVector();
//Tokenizer option (letter only)
String tokenizerOption[] = new String[2];
tokenizerOption[0] = "-tokenizer";
tokenizerOption[1] = "weka.core.tokenizers.WordTokenizer -delimiters \r\t\n .,;:\'\"()?!-><#$%&*+/@^_=[]{}|\\`~0123456789";
filter.setOptions(tokenizerOption);
filter.setInputFormat(data);
I then save the filtered instances to an ARFF file. This is what I get:
@attribute '\n' numeric
@attribute ' ' numeric
@attribute ' a ' numeric
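As you can see, \n and the space character are not being treated as delimiters; they show up as attributes instead. How can I get them included in the delimiters?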
Answer 0 (score: 3)
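I found the answer. Instead of passing the delimiters through an option string, configure a WordTokenizer object directly and hand it to the filter: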
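import weka.core.tokenizers.WordTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Create the filter
StringToWordVector filter = new StringToWordVector();
// Create the tokenizer and set its delimiters directly,
// so whitespace characters are not lost in option parsing
WordTokenizer wt = new WordTokenizer();
String delimiters = " \r\t\n.,;:\'\"()?!-><#$\\%&*+/@^_=[]{}|`~0123456789";
wt.setDelimiters(delimiters);
filter.setTokenizer(wt);
// Tell the filter about the dataset structure
filter.setInputFormat(data);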
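If you want to sanity-check the delimiter string without running Weka, here is a minimal sketch that uses only java.util.StringTokenizer. It assumes WordTokenizer splits text in a comparable character-by-character way, which I have not verified against the Weka source here. The class name DelimiterDemo and the method tokenize are made up for this illustration and are not part of Weka.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterDemo {
    // Same delimiter set as in the answer: whitespace, punctuation, and digits.
    static final String DELIMITERS =
        " \r\t\n.,;:\'\"()?!-><#$\\%&*+/@^_=[]{}|`~0123456789";

    // Hypothetical helper: splits text on any character found in `delimiters`.
    // StringTokenizer never returns empty tokens.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, world!\nA 2nd line\twith tabs.";
        // Full delimiter set: only runs of letters survive as tokens.
        System.out.println(tokenize(text, DELIMITERS));
        // Whitespace missing from the delimiters: tokens keep spaces,
        // tabs, and newlines, similar to the attributes in the question.
        System.out.println(tokenize(text, ".,;:!0123456789"));
    }
}
```

With the full delimiter set the tokens contain letters only. With whitespace left out, tokens such as " world" and "\nA " appear, which resembles the odd attribute names in the ARFF above.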