Apache Lucene fuzzy search for multi-word phrases

Asked: 2018-03-29 10:39:57

Tags: lucene fuzzy-search

I have the following Apache Lucene 7 application:

StandardAnalyzer standardAnalyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(standardAnalyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document document = new Document();

document.add(new TextField("content", new FileReader("document.txt"))); 
writer.addDocument(document);
writer.close();

IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);

Query fuzzyQuery = new FuzzyQuery(new Term("content", "Company"), 2);

TopDocs results = searcher.search(fuzzyQuery, 5);
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore());

When I use it with:

new FuzzyQuery(new Term("content", "Company"), 2);

the application works correctly and returns the following result:

Hits: 1
Max score:0.35161147

But when I try to search with a multi-word query, for example:

new FuzzyQuery(new Term("content", "Company name"), 2);

it returns the following result:

Hits: 0
Max score:NaN

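This happens even though the phrase Company name is present in the source file document.txt.

How do I use FuzzyQuery correctly in this case so that I can run a fuzzy search on a multi-word phrase?

UPDATED

Following the provided solution, I tested it against the following text: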

Company name: BlueCross BlueShield              Customer Service 
   1-800-521-2227           
                        of Texas                          Preauth-Medical              1-800-441-9188           
                                                          Preauth-MH/CD                1-800-528-7264           
                                                          Blue Card Access             1-800-810-2583     

With the following query:

SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCross"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);

the search works fine:

Hits: 1
Max score:0.5753642

But when I corrupt the search query (for example, changing BlueCross to BlueCros):

SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCros"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);

it stops working and returns:

Hits: 0
Max score:NaN

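2 Answers:

Answer 0: (score: 1)

The problem here is that you are using TextField, which is a tokenized field. For example, your text "Company name is working on something" is effectively split on whitespace (and other delimiters). So even though your text contains Company name, during indexing it becomes the separate tokens Company, name, is, and so on.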
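To make the tokenization point concrete, here is a minimal sketch in plain Java. It is only a rough stand-in for StandardAnalyzer, not Lucene's implementation: it assumes that splitting on non-letter/non-digit characters and lowercasing is close enough for illustration, and it ignores stop-word removal and Unicode segmentation rules. The class and method names (TokenizeSketch, roughTokens) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TokenizeSketch {
    // Rough approximation of analysis: split on anything that is not a
    // letter or digit, then lowercase each piece. This is an assumption
    // for illustration, not what StandardAnalyzer does internally.
    static List<String> roughTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = roughTokens("Company name is working on something");
        System.out.println(tokens);
        // No single indexed term contains a space, so a term-level query
        // for "Company name" has nothing to compare against.
        System.out.println(tokens.contains("company name"));
    }
}
```

Because every indexed term is a single word, a term-level query built from the whole string "Company name" has no term it could match, fuzzy or not.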

In this case, a single-term query such as this FuzzyQuery cannot find what you are looking for. A trick that helps looks like this:

// One fuzzy span clause per word; slop 0 and inOrder=true require the words to be adjacent and in order.
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "some"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "text"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);

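However, I would not strongly recommend this approach, especially if you are under heavy load and plan to search for company names that are 10 terms long. Be aware that such queries can be expensive to execute.

The next problem, with BlueCros, is as follows. The content field is analyzed with StandardAnalyzer, which lowercases terms, so BlueCross in the content field becomes bluecross in the index.

The edit distance between BlueCros and bluecross is 3 (B to b, C to c, and the missing trailing s), which exceeds the maximum of 2 edits allowed by the query. That is why you get no match.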
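The distance of 3 can be checked by hand with a small sketch in plain Java. It uses the classic Levenshtein distance (insertions, deletions, substitutions); FuzzyQuery can additionally count a transposition as one edit, but no transposition is involved in these strings, so the numbers should agree. The class and method names (EditDistanceSketch, levenshtein) are hypothetical.

```java
class EditDistanceSketch {
    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Query term as typed vs. the lowercased indexed term.
        System.out.println(levenshtein("BlueCros", "bluecross"));  // 3
        // Same query term after lowercasing.
        System.out.println(levenshtein("bluecros", "bluecross"));  // 1
        // The original, working query terms are exactly at the limit of 2.
        System.out.println(levenshtein("BlueCross", "bluecross")); // 2
    }
}
```

This also suggests why the first query matched at all: BlueCross and BlueShield are each exactly 2 edits away from their lowercased indexed forms, so the capital letters alone already used up the whole edit budget.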

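A simple fix is to convert the query terms to lowercase with .toLowerCase().

In general, you should prefer to use the same analyzer at query time (that is, while building the query) as at index time.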

Answer 1: (score: 0)

For Lucene.Net it might look like this.

private string _IndexPath = @"Your Index Path";
private Directory _Directory;
private Searcher _IndexSearcher;
private MultiPhraseQuery _MultiPhraseQuery;

_Directory = FSDirectory.Open(_IndexPath);
IndexReader indexReader = IndexReader.Open(_Directory, true);

string field = "Name"; // Your field name
string keyword = "big red fox"; // your search term 
float fuzzy = 0.7f; // minimum similarity, between 0 and 1
using (_IndexSearcher = new IndexSearcher(indexReader))
{
    // "big red fox" to [big,red,fox]
    var keywordSplit = keyword.Split();

    _MultiPhraseQuery = new MultiPhraseQuery();
    FuzzyTermEnum[] _FuzzyTermEnum = new FuzzyTermEnum[keywordSplit.Length];
    Term[] _Term = new Term[keywordSplit.Length];

    for (int i = 0; i < keywordSplit.Length; i++)
    {
        _FuzzyTermEnum[i] = new FuzzyTermEnum(indexReader, new Term(field, keywordSplit[i]),fuzzy);
        _Term[i] = _FuzzyTermEnum[i].Term;
        if (_Term[i] == null)
        {
            _MultiPhraseQuery.Add(new Term(field, keywordSplit[i]));
        }
        else
        {
            _MultiPhraseQuery.Add(_FuzzyTermEnum[i].Term);
        }
    }

    var results = _IndexSearcher.Search(_MultiPhraseQuery, indexReader.MaxDoc);

    foreach (var loopDoc in results.ScoreDocs.OrderByDescending(s => s.Score))
    {
        //YourCode Here
    }
}