Lucene index - single terms and phrase queries

Date: 2012-03-28 13:23:56

Tags: java lucene term phrase

I have read some documentation and built a Lucene index that looks like this:

Documents:

id        1
keyword   foo bar
keyword   john

id        2
keyword   foo

id        3
keyword   john doe
keyword   bar foo
keyword   what the hell

I want to query Lucene in a way that lets me combine single terms and phrases.

Suppose my query is

foo bar

It should return doc ids 1, 2 and 3.

The query

"foo bar"

should return doc id 1.

The query

john

should return doc ids 1 and 3.

The query

john "foo bar"

should return doc id 1.
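For reference, the intended semantics of that last query could also be built programmatically instead of through the query parser; this is only a sketch, assuming the Lucene 3.x API used in the code below and a field named `keyword`:

```java
// Sketch: john "foo bar" as an explicit BooleanQuery (Lucene 3.x).
// Both the single term and the phrase must match the document.
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("keyword", "john")), BooleanClause.Occur.MUST);

// PhraseQuery matches "foo" followed by "bar" - this requires that term
// positions were recorded at index time.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("keyword", "foo"));
phrase.add(new Term("keyword", "bar"));
query.add(phrase, BooleanClause.Occur.MUST);
```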

My implementation in Java does not work, and reading lots of documentation has not helped either.

When I query my index with

"foo bar"

I get 0 hits.

When I query my index with

foo "john doe"

I get back doc ids 1, 2 and 3 (I expected only doc id 3, since the query means foo AND "john doe"). The problem is that "john doe" on its own gives 0 hits, while foo on its own returns 3 hits.

My goal is to combine single terms with phrase terms. What am I doing wrong? I have also played around with the analyzers, with no luck.

My implementation looks like this:

Indexer:

import ...

public class Indexer
{
  private static final Logger LOG = LoggerFactory.getLogger(Indexer.class);

  private final File indexDir;

  private IndexWriter writer;

  public Indexer(File indexDir)
  {
    this.indexDir = indexDir;
    this.writer = null;
  }

  private IndexWriter createIndexWriter()
  {
    try
    {
      Directory dir = FSDirectory.open(indexDir);
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
      IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);
      iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
      iwc.setRAMBufferSizeMB(256.0);
      IndexWriter idx = new IndexWriter(dir, iwc);
      idx.deleteAll();
      return idx;
    } catch (IOException e)
    {
      throw new RuntimeException(String.format("Could not create indexer on directory [%s]", indexDir.getAbsolutePath()), e);
    }
  }

  public void index(TestCaseDescription desc)
  {
    if (writer == null)
      writer = createIndexWriter();

    Document doc = new Document();
    addPathToDoc(desc, doc);
    addLastModifiedToDoc(desc, doc);
    addIdToDoc(desc, doc);
    for (String keyword : desc.getKeywords())
      addKeywordToDoc(doc, keyword);

    updateIndex(doc, desc);
  }

  private void addIdToDoc(TestCaseDescription desc, Document doc)
  {
    Field idField = new Field(LuceneConstants.FIELD_ID, desc.getId(), Field.Store.YES, Field.Index.ANALYZED);
    idField.setIndexOptions(IndexOptions.DOCS_ONLY);
    doc.add(idField);
  }

  private void addKeywordToDoc(Document doc, String keyword)
  {
    Field keywordField = new Field(LuceneConstants.FIELD_KEYWORDS, keyword, Field.Store.YES, Field.Index.ANALYZED);
    keywordField.setIndexOptions(IndexOptions.DOCS_ONLY);
    doc.add(keywordField);
  }

  private void addLastModifiedToDoc(TestCaseDescription desc, Document doc)
  {
    NumericField modifiedField = new NumericField(LuceneConstants.FIELD_LAST_MODIFIED);
    modifiedField.setLongValue(desc.getLastModified());
    doc.add(modifiedField);
  }

  private void addPathToDoc(TestCaseDescription desc, Document doc)
  {
    Field pathField = new Field(LuceneConstants.FIELD_PATH, desc.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
    pathField.setIndexOptions(IndexOptions.DOCS_ONLY);
    doc.add(pathField);
  }

  private void updateIndex(Document doc, TestCaseDescription desc)
  {
    try
    {
      if (writer.getConfig().getOpenMode() == OpenMode.CREATE)
      {
        // New index, so we just add the document (no old document can be there):
        LOG.debug(String.format("Adding testcase [%s] (%s)", desc.getId(), desc.getPath()));
        writer.addDocument(doc);
      } else
      {
        // Existing index (an old copy of this document may have been indexed) so
        // we use updateDocument instead to replace the old one matching the exact
        // path, if present:
        LOG.debug(String.format("Updating testcase [%s] (%s)", desc.getId(), desc.getPath()));
        writer.updateDocument(new Term(LuceneConstants.FIELD_PATH, desc.getPath()), doc);
      }
    } catch (IOException e)
    {
      throw new RuntimeException(String.format("Could not create or update index for testcase [%s] (%s)", desc.getId(),
          desc.getPath()), e);
    }
  }

  public void store()
  {
    try
    {
      writer.close();
    } catch (IOException e)
    {
      throw new RuntimeException(String.format("Could not write index [%s]", writer.getDirectory().toString()), e);
    }
    writer = null;
  }
}

Searcher:

import ...

public class Searcher
{
  private static final Logger LOG = LoggerFactory.getLogger(Searcher.class);

  private final Analyzer analyzer;

  private final QueryParser parser;

  private final File indexDir;

  public Searcher(File indexDir)
  {
    this.indexDir = indexDir;
    analyzer = new StandardAnalyzer(Version.LUCENE_34);
    parser = new QueryParser(Version.LUCENE_34, LuceneConstants.FIELD_KEYWORDS, analyzer);
    parser.setAllowLeadingWildcard(true);
  }

  public List<String> search(String searchString)
  {
    List<String> testCaseIds = new ArrayList<String>();
    try
    {
      IndexSearcher searcher = getIndexSearcher(indexDir);

      Query query = parser.parse(searchString);
      LOG.info("Searching for: " + query.toString(parser.getField()));
      AllDocCollector results = new AllDocCollector();
      searcher.search(query, results);

      LOG.info("Found [{}] hits", results.getHits().size());

      for (ScoreDoc scoreDoc : results.getHits())
      {
        Document doc = searcher.doc(scoreDoc.doc);
        String id = doc.get(LuceneConstants.FIELD_ID);
        testCaseIds.add(id);
      }

      searcher.close();
      return testCaseIds;
    } catch (Exception e)
    {
      throw new RuntimeException(String.format("Could not search index [%s]", indexDir.getAbsolutePath()), e);
    }

  }

  private IndexSearcher getIndexSearcher(File indexDir)
  {
    try
    {
      FSDirectory dir = FSDirectory.open(indexDir);
      return new IndexSearcher(dir);
    } catch (IOException e)
    {
      LOG.error(String.format("Could not open index directory [%s]", indexDir.getAbsolutePath()), e);
      throw new RuntimeException(e);
    }
  }
}

3 Answers:

Answer 0 (score: 3)

Why are you using DOCS_ONLY?! If you index doc ids only, you just have a basic inverted index with term->document mappings, but no proximity information. That is why your phrase queries don't work.
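A possible fix, sketched against the asker's `addKeywordToDoc` (Lucene 3.x): index the keyword field with positions, either by dropping the `setIndexOptions` call entirely (positions are the default for analyzed fields) or by setting the option explicitly:

```java
// Sketch: index keywords with term positions so phrase queries can match.
// Assumes Lucene 3.x and the question's LuceneConstants.
private void addKeywordToDoc(Document doc, String keyword)
{
  Field keywordField = new Field(LuceneConstants.FIELD_KEYWORDS, keyword,
      Field.Store.YES, Field.Index.ANALYZED);
  // Record docs, frequencies AND positions (needed for PhraseQuery);
  // omitting this call would have the same effect, since it is the default.
  keywordField.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
  doc.add(keywordField);
}
```

Note that the index has to be rebuilt after this change - documents already indexed with DOCS_ONLY carry no position data.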

Answer 1 (score: 0)

I think what you roughly want is:

keyword:"foo bar"~1^2 OR keyword:"foo" OR keyword:"bar"

That is: match the phrase "foo bar" and boost it (prefer the complete phrase), or match "foo", or match "bar".

The full query syntax is documented at: http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/queryparsersyntax.html

Edit:

It looks like one thing you're missing is that the default operator is OR. So you probably want to do something like this:

+keyword:john AND +keyword:"foo bar"

The plus sign means "must contain". The AND is placed explicitly so that the document must contain both clauses (rather than the default, which translates to: must contain john OR must contain "foo bar").
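If you want AND behavior without spelling it out in every query, the parser's default operator can be changed; a sketch against the Lucene 3.x QueryParser used in the question's Searcher:

```java
// Sketch: make AND the default operator, so  john "foo bar"
// requires both the term and the phrase (Lucene 3.x).
QueryParser parser = new QueryParser(Version.LUCENE_34,
    LuceneConstants.FIELD_KEYWORDS, analyzer);
parser.setDefaultOperator(QueryParser.Operator.AND);

Query query = parser.parse("john \"foo bar\"");
```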

Answer 2 (score: 0)

The problem was solved by replacing

StandardAnalyzer

with

KeywordAnalyzer

in both the indexer and the searcher.

As far as I could figure out, StandardAnalyzer splits the input text into single words. I replaced it with KeywordAnalyzer, because then the input (which can contain one or more words) is kept as-is. It recognizes a term like

bla foo

as a single keyword.
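The swap described above amounts to something like this (a sketch, Lucene 3.x; KeywordAnalyzer emits the whole field value as one token):

```java
// Sketch: same analyzer on both sides, so each stored keyword
// ("bla foo") is indexed and searched as a single token.
Analyzer analyzer = new KeywordAnalyzer();

// Indexer side (replaces the StandardAnalyzer in createIndexWriter):
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);

// Searcher side (replaces the StandardAnalyzer in the constructor):
QueryParser parser = new QueryParser(Version.LUCENE_34,
    LuceneConstants.FIELD_KEYWORDS, analyzer);
```

The trade-off: with KeywordAnalyzer a query only matches a keyword field value exactly, so partial matches inside a multi-word keyword no longer work.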