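I want to build shingles from my words, but I don't want any shingle to span punctuation such as a comma or a period. How can I achieve this?
My current chain is: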
TokenStream tokenStream = new StandardTokenizer(LUCENE_VERSION, new StringReader(input));
tokenStream = new StandardFilter( LUCENE_VERSION, tokenStream );
tokenStream = new LowerCaseFilter(LUCENE_VERSION, tokenStream);
tokenStream = new StopFilter(LUCENE_VERSION, tokenStream, EnglishAnalyzer.getDefaultStopSet());
tokenStream = new ShingleFilter( tokenStream, 2 );
When I process the following sentence:
A test sentence, great thing. Considering punctuation would be great, too.
the result is (ignoring single words here):
test sentence; sentence great; great thing; thing considering; considering punctuation;
But I want the following result (again ignoring single words):
test sentence; great thing; considering punctuation;
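Answer 0 (score: 0)
I found a possible solution myself, though I'm fairly sure Lucene offers another (better optimized) way to do this. My solution is to split the string on punctuation before feeding it to Lucene, so each fragment gets its own token stream and no shingle can cross a fragment boundary.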
for(String part : input.split("\\p{Punct}")) {
TokenStream tokenStream = new StandardTokenizer(LUCENE_VERSION, new StringReader(part));
tokenStream = new StandardFilter( LUCENE_VERSION, tokenStream );
tokenStream = new LowerCaseFilter(LUCENE_VERSION, tokenStream);
tokenStream = new StopFilter(LUCENE_VERSION, tokenStream, EnglishAnalyzer.getDefaultStopSet());
tokenStream = new ShingleFilter( tokenStream, 2 );
// do something with tokenStream...
}
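To make the split-then-shingle idea easier to check in isolation, here is a minimal sketch in plain Java that does not use Lucene at all. The class name `PunctuationAwareShingles`, the method `bigrams`, and the tiny stop-word set are hypothetical and exist only for illustration. Unlike the Lucene chain, it splits words on whitespace and simply drops stop words, so its output may differ from what `StandardTokenizer`, `StopFilter`, and `ShingleFilter` produce together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration (no Lucene): split on punctuation, then pair adjacent words.
class PunctuationAwareShingles {

    // Returns two-word shingles that never span a punctuation character.
    // stopWords is a hypothetical, caller-supplied set of lowercase words to drop.
    static List<String> bigrams(String input, Set<String> stopWords) {
        List<String> shingles = new ArrayList<>();
        // Same split as in the answer: every ASCII punctuation character ends a fragment.
        for (String part : input.split("\\p{Punct}")) {
            List<String> words = new ArrayList<>();
            for (String raw : part.trim().split("\\s+")) {
                String word = raw.toLowerCase(Locale.ROOT);
                if (!word.isEmpty() && !stopWords.contains(word)) {
                    words.add(word);
                }
            }
            // Pair each word with its successor inside the same fragment only.
            for (int i = 0; i + 1 < words.size(); i++) {
                shingles.add(words.get(i) + " " + words.get(i + 1));
            }
        }
        return shingles;
    }

    public static void main(String[] args) {
        // Assumed stop words for the demo; not Lucene's default English set.
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "be"));
        String input = "A test sentence, great thing. Considering punctuation would be great, too.";
        for (String shingle : bigrams(input, stopWords)) {
            System.out.println(shingle);
        }
    }
}
```

One caveat with this approach: `\p{Punct}` also matches apostrophes and hyphens, so words like "don't" or "well-known" are cut into separate fragments. If that matters, a narrower pattern such as `[.,;:!?]` may be a better fit.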
If you find another solution, please let me know.