Question

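I am using Apache Lucene to search for strings in a file. What kind of parsing does Lucene use? If I search for Obama, it does not return results for Presobama, but it does return results for #Obama. Can anyone tell me why? I am using TextField.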

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);

        //  Code to create the index
        Directory index = new RAMDirectory();

        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44, analyzer);

        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, finalstep);

        w.close();
        String querystr = search;

        // The "title" arg specifies the default field to use when no field is explicitly specified in the query
        Query q = new QueryParser(Version.LUCENE_44, "title", analyzer).parse(querystr);

        // Searching code
        int hitsPerPage = 10;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

Answer 1

The analyzer dictates how text is split into tokens. You are using StandardAnalyzer.

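StandardAnalyzer generally tries to split the stream into words. The rules it uses are specified in full in Unicode Standard Annex #29, but very roughly speaking, it separates tokens at whitespace and punctuation.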

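So "#Obama" becomes "obama": the "#" is removed during analysis and the text is lowercased. "Presobama" becomes "presobama". The tokenization rules know nothing about the term "presobama" and have no reason to conclude that it should be treated as more than one word, so a search for "obama" does not match it.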
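To make that behavior concrete, here is a minimal plain-Java sketch. It is not Lucene code and does not implement UAX #29; it assumes a much cruder rule (split on anything that is not a letter or digit, then lowercase) purely to show why whole-token matching finds "#Obama" but not "Presobama". The class and method names (`RoughTokenizer`, `tokenize`, `matches`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RoughTokenizer {
    // Crude stand-in for StandardAnalyzer (an assumption, not the real rules):
    // split on runs of non-letter, non-digit characters, then lowercase.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // A term matches only if it equals an entire indexed token;
    // substrings of a token do not count.
    static boolean matches(String document, String query) {
        List<String> docTokens = tokenize(document);
        for (String term : tokenize(query)) {
            if (docTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("#Obama"));              // [obama]
        System.out.println(tokenize("Presobama"));           // [presobama]
        System.out.println(matches("#Obama", "Obama"));      // true
        System.out.println(matches("Presobama", "Obama"));   // false
    }
}
```

The key point is the equality check in `matches`: the index stores the token "presobama" as a single unit, so the query term "obama" has nothing to compare equal to.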

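There are a number of ways to get looser matching. Some possibilities: you could use wildcard queries, index n-grams of the tokens with NGramTokenFilter, or, if you only have a few troublesome terms, specify synonym replacements with SynonymFilter.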
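As a rough illustration of the n-gram option, the following plain-Java sketch generates character n-grams for a token, which is the general idea behind indexing with NGramTokenFilter. It does not use Lucene itself; the class name `NGramSketch` and the gram sizes 3 to 5 are assumptions chosen for the example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class NGramSketch {
    // Collect every lowercase substring of the token whose length is
    // between minGram and maxGram (inclusive).
    static Set<String> ngrams(String token, int minGram, int maxGram) {
        Set<String> grams = new LinkedHashSet<>();
        String t = token.toLowerCase(Locale.ROOT);
        for (int n = minGram; n <= maxGram; n++) {
            for (int i = 0; i + n <= t.length(); i++) {
                grams.add(t.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Hypothetical gram sizes; real settings depend on the data.
        Set<String> indexed = ngrams("Presobama", 3, 5);
        System.out.println(indexed.contains("obama"));     // true
        System.out.println(indexed.contains("president")); // false
    }
}
```

With grams indexed this way, the 5-character query term "obama" equals one of the stored grams of "presobama". The trade-off is a larger index, and query terms longer than the maximum gram size would not match as a single term.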

Searching with Apache Lucene
