How to search for special characters (+ ! \ ? :) in Lucene

Posted: 2011-05-24 08:47:07

Tags: lucene

I want to search for special characters in my index.

I escaped all the special characters in the query string, but when I run the query against the index, Lucene builds the query as +().

So it searches on no fields.

How can I solve this? My index does contain these special characters.

2 Answers:

Answer 0 (score: 11)

If you use StandardAnalyzer, it discards non-alphanumeric characters. Try indexing the same value with WhitespaceAnalyzer and see whether it preserves the characters you need. It may also keep things you don't want: that is when you should consider writing your own Analyzer, which essentially means building a TokenStream stack that performs exactly the kind of processing you need.

For example, SimpleAnalyzer implements the following pipeline:

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
   return new LowerCaseTokenizer(reader);
}

which simply lower-cases the tokens.
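The effect of that pipeline can be sketched in plain Java without Lucene on the classpath (the class name LowerCaseTokenizerSketch is made up for illustration; the real LowerCaseTokenizer also tracks offsets and reuses token buffers):

```java
import java.util.ArrayList;
import java.util.List;

public class LowerCaseTokenizerSketch {
    // Approximates LowerCaseTokenizer: token characters are letters,
    // emitted lower-cased; everything else splits the token stream.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // '+' and '!' are not letters, so they break tokens and are dropped
        System.out.println(tokenize("Foo+Bar! baz")); // [foo, bar, baz]
    }
}
```

This is exactly why special characters vanish with letter-based tokenizers: they are treated as token boundaries, not token content.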

StandardAnalyzer does much more:

/** Constructs a {@link StandardTokenizer} filtered by a {@link
StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = new LowerCaseFilter(result);
    result = new StopFilter(enableStopPositionIncrements, result, stopSet);
    return result;
 }

You can mix and match these and the other components available in org.apache.lucene.analysis, or you can write your own specialized TokenStream instances and wrap them into a processing pipeline with a custom Analyzer.

Another thing to look at is the CharTokenizer you are using. CharTokenizer is an abstract class that specifies the machinery for tokenizing text strings. It is used by some simpler analyzers (but not by StandardAnalyzer). Lucene ships two subclasses: LetterTokenizer and WhitespaceTokenizer. You can create your own subclass that keeps the characters you need and breaks on the ones you don't, by implementing the boolean isTokenChar(char c) method.
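The decision isTokenChar makes can be sketched in plain Java (the class name SpecialCharTokenizerSketch and the kept-character set are illustrative assumptions, not Lucene API; a real subclass would extend CharTokenizer and override isTokenChar):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SpecialCharTokenizerSketch {
    // Hypothetical set of special characters to preserve, chosen for this example
    private static final Set<Character> KEPT = Set.of('+', '!', '?', ':', '\\');

    // Plays the role of CharTokenizer.isTokenChar(char c):
    // true  -> the character stays inside the current token
    // false -> the character ends the current token and is discarded
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || KEPT.contains(c);
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // '?', '+', ':' and '!' survive because isTokenChar accepts them
        System.out.println(tokenize("what? C++ rocks: yes!"));
    }
}
```

With this predicate, "what?" and "C++" come through as single tokens instead of being split and stripped, which is the behavior the question is after.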

Answer 1 (score: 1)

This may no longer matter to the original author, but to be able to search for the special characters you need to:

  1. Create a custom analyzer.
  2. Use it for both indexing and searching.
  3. An example of how it worked for me:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.RAMDirectory;
    import org.junit.Test;
    
    import java.io.IOException;
    
    import static org.hamcrest.Matchers.equalTo;
    import static org.junit.Assert.assertThat;
    
    public class LuceneSpecialCharactersSearchTest {
    
        /**
         * Test that tries to search a string by some substring with each special character separately.
         */
        @Test
        public void testSpecialCharacterSearch() throws Exception {
            // GIVEN
            LuceneSpecialCharactersSearch service = new LuceneSpecialCharactersSearch();
            String[] luceneSpecialCharacters = new String[]{"+", "-", "&&", "||", "!", "(", ")", "{", "}", "[", "]", "^", "\"", "~", "*", "?", ":", "\\"};
    
            // WHEN
            for (String specialCharacter : luceneSpecialCharacters) {
                String actual = service.search("list's special-characters " + specialCharacter);
    
                // THEN
                assertThat(actual, equalTo(LuceneSpecialCharactersSearch.TEXT_WITH_SPECIAL_CHARACTERS));
            }
        }
    
        private static class LuceneSpecialCharactersSearch {
            private static final String TEXT_WITH_SPECIAL_CHARACTERS = "This is the list's of special-characters + - && || ! ( ) { } [ ] ^ \" ~ ? : \\ *";
    
            private final IndexWriter writer;
    
            public LuceneSpecialCharactersSearch() throws Exception {
                Document document = new Document();
                document.add(new TextField("body", TEXT_WITH_SPECIAL_CHARACTERS, Field.Store.YES));
    
                RAMDirectory directory = new RAMDirectory();
                writer = new IndexWriter(directory, new IndexWriterConfig(buildAnalyzer()));
                writer.addDocument(document);
                writer.commit();
            }
    
            public String search(String queryString) throws Exception {
                try (IndexReader reader = DirectoryReader.open(writer, false)) {
                    IndexSearcher searcher = new IndexSearcher(reader);
    
                    String escapedQueryString = QueryParser.escape(queryString).toLowerCase();
    
                    Analyzer analyzer = buildAnalyzer();
                    QueryParser bodyQueryParser = new QueryParser("body", analyzer);
                    bodyQueryParser.setDefaultOperator(QueryParser.Operator.AND);
    
                    Query bodyQuery = bodyQueryParser.parse(escapedQueryString);
                    BooleanQuery query = new BooleanQuery.Builder()
                            .add(new BooleanClause(bodyQuery, BooleanClause.Occur.SHOULD))
                            .build();
                    TopDocs searchResult = searcher.search(query, 1);
    
                    return searcher.doc(searchResult.scoreDocs[0].doc).getField("body").stringValue();
                }
            }
    
            /**
             * Builds the analyzer used for both indexing and searching.
             */
            private static Analyzer buildAnalyzer() throws IOException {
                return CustomAnalyzer.builder()
                        .withTokenizer("whitespace")
                        .addTokenFilter("lowercase")
                        .addTokenFilter("standard")
                        .build();
            }
        }
    }
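The QueryParser.escape call in the search method above is what keeps the special characters from being parsed as query syntax. Its behavior can be sketched in plain Java (LuceneEscapeSketch is a made-up name; the exact set of escaped characters in the real method varies slightly between Lucene versions):

```java
public class LuceneEscapeSketch {
    // Characters with syntactic meaning in the Lucene query language
    private static final String SPECIALS = "+-&|!(){}[]^\"~*?:\\/";

    // Approximates QueryParser.escape: prefix every special character
    // with a backslash so the parser treats it as a literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "a+b:c" becomes "a\+b\:c" - '+' and ':' are now literal text
        System.out.println(escape("a+b:c"));
    }
}
```

Escaping alone is not enough, though: it only works here because the custom whitespace-based analyzer kept those characters in the index in the first place. With StandardAnalyzer the escaped query would still find nothing, which is what the question describes.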