我正试图让Lucene的Highlighter类与来自Solr的WordDelimiterFilter的令牌一起正常工作。它在90%的时间都有效,但如果匹配的文本包含',',例如“1,500”,则输出不正确:
预期:'测试 1,500 这个'
观察:'测试1 1,500 这个'
我目前还不确定是否荧光笔搞乱了重组或者WordDelimiterFilter弄乱了标记化,但有些事情是不开心的。以下是我的pom的相关依赖项:
org.apache.lucene Lucene的核心 2.9.3 罐 编 org.apache.lucene Lucene的,荧光笔 2.9.3 罐 编 org.apache.solr Solr的核心 1.4.0 罐 编
这是一个简单的JUnit测试类,用于演示问题:
package test.lucene;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.Version;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.apache.solr.analysis.WordDelimiterFilterFactory;
import org.junit.Test;
public class HighlighterTester {
private static final String PRE_TAG = "<b>";
private static final String POST_TAG = "</b>";
private static String[] highlightField( Query query, String fieldName, String text )
throws IOException, InvalidTokenOffsetsException {
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter( PRE_TAG, POST_TAG );
Highlighter highlighter = new Highlighter( formatter, new QueryScorer( query, fieldName ) );
highlighter.setTextFragmenter( new SimpleFragmenter( Integer.MAX_VALUE ) );
return highlighter.getBestFragments( getAnalyzer(), fieldName, text, 10 );
}
private static Analyzer getAnalyzer() {
return new Analyzer() {
@Override
public TokenStream tokenStream( String fieldName, Reader reader ) {
// Start with a StandardTokenizer
TokenStream stream = new StandardTokenizerFactory().create( reader );
// Chain on a WordDelimiterFilter
WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory();
HashMap<String, String> arguments = new HashMap<String, String>();
arguments.put( "generateWordParts", "1" );
arguments.put( "generateNumberParts", "1" );
arguments.put( "catenateWords", "1" );
arguments.put( "catenateNumbers", "1" );
arguments.put( "catenateAll", "0" );
wordDelimiterFilterFactory.init( arguments );
return wordDelimiterFilterFactory.create( stream );
}
};
}
@Test
public void TestHighlighter() throws ParseException, IOException, InvalidTokenOffsetsException {
String fieldName = "text";
String text = "test 1,500 this";
String queryString = "1500";
String expected = "test " + PRE_TAG + "1,500" + POST_TAG + " this";
QueryParser parser = new QueryParser( Version.LUCENE_29, fieldName, getAnalyzer() );
Query q = parser.parse( queryString );
String[] observed = highlightField( q, fieldName, text );
for ( int i = 0; i < observed.length; i++ ) {
System.out.println( "\t" + i + ": '" + observed[i] + "'" );
}
if ( observed.length > 0 ) {
System.out.println( "Expected: '" + expected + "'\n" + "Observed: '" + observed[0] + "'" );
assertEquals( expected, observed[0] );
}
else {
assertTrue( "No matches found", false );
}
}
}
任何人都有任何想法或建议吗?
答案 0 :(得分:2)
经过进一步调查,这似乎是Lucene Highlighter代码中的一个错误。正如你在这里看到的那样:
public class TokenGroup {
...
protected boolean isDistinct() {
return offsetAtt.startOffset() >= endOffset;
}
...
代码尝试通过检查起始偏移量是否大于前一个结束偏移量来确定一组令牌是否不同。这个问题说明了这种方法的问题。如果你要通过令牌,你会发现它们如下:
0-4: 'test', 'test'
5-6: '1', '1'
7-10: '500', '500'
5-10: '1500', '1,500'
11-15: 'this', 'this'
从中您可以看到第三个令牌在第二个令牌结束后开始,但第四个令牌与第二个令牌相同。预期的结果将是对令牌2,3和4进行分组,但根据此实现,令牌3被视为与2分开,因此2显示为自身,然后3和4进行分组,留下此结果:
Expected: 'test <b>1,500</b> this'
Observed: 'test 1<b>1,500</b> this'
我不确定这可以在没有2次传递的情况下完成,一次是获取所有索引,另一次是将它们组合起来。另外,我不确定在这个具体案例之外会有什么影响。有没有人有任何想法?
修改
这是我提出的最终源代码。它会正确分组。它似乎比Lucene Highlighter实现简单得多,但不可否认它不能处理不同级别的评分,因为我的应用程序只需要一个是/否来判断文本片段是否突出显示。值得注意的是,我正在使用他们的QueryScorer对文本片段进行评分,这些片段确实具有面向术语而不是面向词语的弱点,这意味着搜索字符串“语法或拼写”最终会突出显示类似于“< b>语法或拼写“作为或最有可能被分析仪丢弃。无论如何,这是我的来源:
public TextFragments<E> getTextFragments( TokenStream tokenStream,
String text,
Scorer scorer )
throws IOException, InvalidTokenOffsetsException {
OffsetAttribute offsetAtt = (OffsetAttribute) tokenStream.addAttribute( OffsetAttribute.class );
TermAttribute termAtt = (TermAttribute) tokenStream.addAttribute( TermAttribute.class );
TokenStream newStream = scorer.init( tokenStream );
if ( newStream != null ) {
tokenStream = newStream;
}
TokenGroups tgs = new TokenGroups();
scorer.startFragment( null );
while ( tokenStream.incrementToken() ) {
tgs.add( offsetAtt.startOffset(), offsetAtt.endOffset(), scorer.getTokenScore() );
if ( log.isTraceEnabled() ) {
log.trace( new StringBuilder()
.append( scorer.getTokenScore() )
.append( " " )
.append( offsetAtt.startOffset() )
.append( "-" )
.append( offsetAtt.endOffset() )
.append( ": '" )
.append( termAtt.term() )
.append( "', '" )
.append( text.substring( offsetAtt.startOffset(), offsetAtt.endOffset() ) )
.append( "'" )
.toString() );
}
}
return tgs.fragment( text );
}
private class TokenGroup {
private int startIndex;
private int endIndex;
private float score;
public TokenGroup( int startIndex, int endIndex, float score ) {
this.startIndex = startIndex;
this.endIndex = endIndex;
this.score = score;
}
}
private class TokenGroups implements Iterable<TokenGroup> {
private List<TokenGroup> tgs;
public TokenGroups() {
tgs = new ArrayList<TokenGroup>();
}
public void add( int startIndex, int endIndex, float score ) {
add( new TokenGroup( startIndex, endIndex, score ) );
}
public void add( TokenGroup tg ) {
for ( int i = tgs.size() - 1; i >= 0; i-- ) {
if ( tg.startIndex < tgs.get( i ).endIndex ) {
tg = merge( tg, tgs.remove( i ) );
}
else {
break;
}
}
tgs.add( tg );
}
private TokenGroup merge( TokenGroup tg1, TokenGroup tg2 ) {
return new TokenGroup( Math.min( tg1.startIndex, tg2.startIndex ),
Math.max( tg1.endIndex, tg2.endIndex ),
Math.max( tg1.score, tg2.score ) );
}
private TextFragments<E> fragment( String text ) {
TextFragments<E> fragments = new TextFragments<E>();
int lastEndIndex = 0;
for ( TokenGroup tg : this ) {
if ( tg.startIndex > lastEndIndex ) {
fragments.add( text.substring( lastEndIndex, tg.startIndex ), textModeNormal );
}
fragments.add(
text.substring( tg.startIndex, tg.endIndex ),
tg.score > 0 ? textModeHighlighted : textModeNormal );
lastEndIndex = tg.endIndex;
}
if ( lastEndIndex < ( text.length() - 1 ) ) {
fragments.add( text.substring( lastEndIndex ), textModeNormal );
}
return fragments;
}
@Override
public Iterator<TokenGroup> iterator() {
return tgs.iterator();
}
}
答案 1 :(得分:0)
这是一个可能的原因。 您的荧光笔需要使用与搜索相同的分析器。 IIUC,您的代码使用默认分析器进行突出显示,即使它使用专门的分析器来解析查询。我相信您需要更改Fragmenter才能使用特定的TokenStream。