I have read some documents and built a Lucene index that looks like this:
id 1
keyword foo bar
keyword john
id 2
keyword foo
id 3
keyword john doe
keyword bar foo
keyword what the hell
I would like to query Lucene in a way that lets me combine single terms and phrases.
Let's say my query is
foo bar
It should return doc ids 1, 2 and 3.
The query
"foo bar"
should return doc id 1.
The query
john
should return doc ids 1 and 3.
The query
john "foo bar"
should return doc id 1.
My implementation in Java does not work, and reading tons of documentation hasn't helped either.
When I query my index with
"foo bar"
I get 0 hits.
When I query my index with
foo "john doe"
I get back doc ids 1, 2 and 3 (I expected only doc id 3, since the query means foo AND "john doe"). The problem is that "john doe" alone returns 0 hits, while foo alone returns 3 hits.
My goal is to combine single terms with phrase terms. What am I doing wrong? I have also played around with the analyzers, without any luck.
My implementation looks like this:
import ...
public class Indexer
{
private static final Logger LOG = LoggerFactory.getLogger(Indexer.class);
private final File indexDir;
private IndexWriter writer;
public Indexer(File indexDir)
{
this.indexDir = indexDir;
this.writer = null;
}
private IndexWriter createIndexWriter()
{
try
{
Directory dir = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
iwc.setRAMBufferSizeMB(256.0);
IndexWriter idx = new IndexWriter(dir, iwc);
idx.deleteAll();
return idx;
} catch (IOException e)
{
throw new RuntimeException(String.format("Could create indexer on directory [%s]", indexDir.getAbsolutePath()), e);
}
}
public void index(TestCaseDescription desc)
{
if (writer == null)
writer = createIndexWriter();
Document doc = new Document();
addPathToDoc(desc, doc);
addLastModifiedToDoc(desc, doc);
addIdToDoc(desc, doc);
for (String keyword : desc.getKeywords())
addKeywordToDoc(doc, keyword);
updateIndex(doc, desc);
}
private void addIdToDoc(TestCaseDescription desc, Document doc)
{
Field idField = new Field(LuceneConstants.FIELD_ID, desc.getId(), Field.Store.YES, Field.Index.ANALYZED);
idField.setIndexOptions(IndexOptions.DOCS_ONLY);
doc.add(idField);
}
private void addKeywordToDoc(Document doc, String keyword)
{
Field keywordField = new Field(LuceneConstants.FIELD_KEYWORDS, keyword, Field.Store.YES, Field.Index.ANALYZED);
keywordField.setIndexOptions(IndexOptions.DOCS_ONLY);
doc.add(keywordField);
}
private void addLastModifiedToDoc(TestCaseDescription desc, Document doc)
{
NumericField modifiedField = new NumericField(LuceneConstants.FIELD_LAST_MODIFIED);
modifiedField.setLongValue(desc.getLastModified());
doc.add(modifiedField);
}
private void addPathToDoc(TestCaseDescription desc, Document doc)
{
Field pathField = new Field(LuceneConstants.FIELD_PATH, desc.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
pathField.setIndexOptions(IndexOptions.DOCS_ONLY);
doc.add(pathField);
}
private void updateIndex(Document doc, TestCaseDescription desc)
{
try
{
if (writer.getConfig().getOpenMode() == OpenMode.CREATE)
{
// New index, so we just add the document (no old document can be there):
LOG.debug(String.format("Adding testcase [%s] (%s)", desc.getId(), desc.getPath()));
writer.addDocument(doc);
} else
{
// Existing index (an old copy of this document may have been indexed) so
// we use updateDocument instead to replace the old one matching the exact
// path, if present:
LOG.debug(String.format("Updating testcase [%s] (%s)", desc.getId(), desc.getPath()));
writer.updateDocument(new Term(LuceneConstants.FIELD_PATH, desc.getPath()), doc);
}
} catch (IOException e)
{
throw new RuntimeException(String.format("Could not create or update index for testcase [%s] (%s)", desc.getId(),
desc.getPath()), e);
}
}
public void store()
{
try
{
writer.close();
} catch (IOException e)
{
throw new RuntimeException(String.format("Could not write index [%s]", writer.getDirectory().toString()));
}
writer = null;
}
}
import ...
public class Searcher
{
private static final Logger LOG = LoggerFactory.getLogger(Searcher.class);
private final Analyzer analyzer;
private final QueryParser parser;
private final File indexDir;
public Searcher(File indexDir)
{
this.indexDir = indexDir;
analyzer = new StandardAnalyzer(Version.LUCENE_34);
parser = new QueryParser(Version.LUCENE_34, LuceneConstants.FIELD_KEYWORDS, analyzer);
parser.setAllowLeadingWildcard(true);
}
public List<String> search(String searchString)
{
List<String> testCaseIds = new ArrayList<String>();
try
{
IndexSearcher searcher = getIndexSearcher(indexDir);
Query query = parser.parse(searchString);
LOG.info("Searching for: " + query.toString(parser.getField()));
AllDocCollector results = new AllDocCollector();
searcher.search(query, results);
LOG.info("Found [{}] hit", results.getHits().size());
for (ScoreDoc scoreDoc : results.getHits())
{
Document doc = searcher.doc(scoreDoc.doc);
String id = doc.get(LuceneConstants.FIELD_ID);
testCaseIds.add(id);
}
searcher.close();
return testCaseIds;
} catch (Exception e)
{
throw new RuntimeException(String.format("Could not search index [%s]", indexDir.getAbsolutePath()), e);
}
}
private IndexSearcher getIndexSearcher(File indexDir)
{
try
{
FSDirectory dir = FSDirectory.open(indexDir);
return new IndexSearcher(dir);
} catch (IOException e)
{
LOG.error(String.format("Could not open index directory [%s]", indexDir.getAbsolutePath()), e);
throw new RuntimeException(e);
}
}
}
Answer 0 (score: 3)
Why are you indexing with DOCS_ONLY?! If you only index doc ids, you only get a basic inverted index with term->document mappings, but no proximity information. That is why your phrase queries don't work.
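A minimal sketch of that fix against the addKeywordToDoc method from the question (assuming Lucene 3.4's FieldInfo.IndexOptions enum, which the question's code already refers to as IndexOptions); the index has to be rebuilt afterwards:
private void addKeywordToDoc(Document doc, String keyword)
{
    Field keywordField = new Field(LuceneConstants.FIELD_KEYWORDS, keyword, Field.Store.YES, Field.Index.ANALYZED);
    // Store frequencies and positions so PhraseQuery / quoted parser phrases can match.
    // Alternatively, simply drop the setIndexOptions call; positions are the default for analyzed fields.
    keywordField.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    doc.add(keywordField);
}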
Answer 1 (score: 0)
I think you roughly want:
keyword:"foo bar"~1^2 OR keyword:"foo" OR keyword:"bar"
That is, match the phrase "foo bar" and boost it (prefer the complete phrase), or match "foo", or match "bar".
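If you would rather build that query in code than through the query parser, a rough programmatic equivalent (just a sketch using stock Lucene 3.4 classes, with the field constant taken from the question) looks like this:
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term(LuceneConstants.FIELD_KEYWORDS, "foo"));
phrase.add(new Term(LuceneConstants.FIELD_KEYWORDS, "bar"));
phrase.setSlop(1);     // ~1: allow one position of slop
phrase.setBoost(2.0f); // ^2: prefer the complete phrase
BooleanQuery query = new BooleanQuery();
query.add(phrase, BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term(LuceneConstants.FIELD_KEYWORDS, "foo")), BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term(LuceneConstants.FIELD_KEYWORDS, "bar")), BooleanClause.Occur.SHOULD);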
The full query syntax is documented at: http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/queryparsersyntax.html
Edit:
It looks like the one thing you are missing is that the default operator is OR. So you probably want to do something like this:
+keyword:john AND +keyword:"foo bar"
The plus sign means "must contain". You put the AND in explicitly so that the document must contain both (instead of the default, which translates to "must contain john OR must contain 'foo bar'").
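Equivalently (a small sketch, reusing the parser from the question's Searcher), you can switch the parser's default operator to AND so that a plain query like john "foo bar" already requires both clauses:
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
Query query = parser.parse("john \"foo bar\""); // now parses as +keyword:john +keyword:"foo bar"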
Answer 2 (score: 0)
The problem was solved by replacing StandardAnalyzer with KeywordAnalyzer, for both the indexer and the searcher.
As far as I could tell, StandardAnalyzer splits the input text into individual words, so I replaced it with KeywordAnalyzer, which keeps the input (which may consist of one or more words) untouched. It then treats a term like
bla foo
as a single keyword.
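A minimal sketch of that swap, reusing the class and constant names from the question (illustrative only, not the poster's exact code):
// In Indexer.createIndexWriter():
Analyzer analyzer = new KeywordAnalyzer(); // emits the whole field value as a single token
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);
// In the Searcher constructor:
analyzer = new KeywordAnalyzer();
parser = new QueryParser(Version.LUCENE_34, LuceneConstants.FIELD_KEYWORDS, analyzer);
Note that the index has to be rebuilt after changing the analyzer, since tokens produced by the old analyzer will not match the new tokenization.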