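I am new to Apache Lucene. I am using Apache Lucene v7.2.1. I need to do a phrase search in a huge file. I first wrote some sample code using PhraseQuery to figure out the phrase search functionality in Lucene, but it does not work. My code is given below: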
public class LuceneExample
{
    private static final String INDEX_DIR = "myIndexDir";

    // function to create index writer
    private static IndexWriter createWriter() throws IOException
    {
        FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        IndexWriter writer = new IndexWriter(dir, config);
        return writer;
    }

    // function to create the index document.
    private static Document createDocument(Integer id, String source, String target)
    {
        Document document = new Document();
        document.add(new StringField("id", id.toString() , Store.YES));
        document.add(new TextField("source", source , Store.YES));
        document.add(new TextField("target", target , Store.YES));
        return document;
    }

    // function to do index search by source
    private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception
    {
        // phrase query build
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        String[] words = source.split(" ");
        int ii = 0;
        for (String word : words) {
            builder.add(new Term("source", word), ii);
            ii = ii + 1;
        }
        PhraseQuery query = builder.build();
        System.out.println(query);
        // phrase search
        TopDocs hits = searcher.search(query, 10);
        return hits;
    }

    public static void main(String[] args) throws Exception
    {
        // TODO Auto-generated method stub
        // create index writer
        IndexWriter writer = createWriter();
        //create documents object
        List<Document> documents = new ArrayList<>();
        String src = "Negotiation Skills are focused on resolving differences for the benefit of an individual or a group , or to satisfy various interests.";
        String tgt = "Modified target : Negotiation Skills are focused on resolving differences for the benefit of an individual or a group, or to satisfy various interests.";
        Document d1 = createDocument(1, src, tgt);
        documents.add(d1);
        src = "This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
        tgt = "Modified target : This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
        Document d2 = createDocument(2, src, tgt);
        documents.add(d2);
        writer.deleteAll();
        // adding documents to index writer
        writer.addDocuments(documents);
        writer.commit();
        writer.close();
        // for index searching
        Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        //Search by source
        TopDocs foundDocs = searchBySource("benefit of an individual", searcher);
        System.out.println("Total Results count :: " + foundDocs.totalHits);
    }
}
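When I search for the string "benefit of an individual" as shown above, the total results count is 0, even though the phrase is present in document 1. It would be great if I could get any help in solving this problem. Thanks in advance.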
Answer 0 (score: 4)
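Let's start with the summary:
Rule: use the same analysis chain at index time and at query time.
Here is an example of simplified and "correct" query processing: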
// function to do index search by source
private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception {
    // build the phrase query from analyzed tokens, using the same analyzer as at index time
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    try (TokenStream tokenStream = new StandardAnalyzer().tokenStream("source", source)) {
        // fetch the attribute once; its value is updated on every incrementToken() call
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            builder.add(new Term("source", charTermAttribute.toString()));
        }
        tokenStream.end();
    }
    // tolerate the position gaps left by removed stop words
    builder.setSlop(2);
    PhraseQuery query = builder.build();
    System.out.println(query);
    // phrase search
    TopDocs hits = searcher.search(query, 10);
    return hits;
}
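For simplicity, we can remove stop words from the standard analyzer by using the constructor that takes an empty stop word list, and everything will be as simple as you expected. You can read more about stop words and phrase queries here.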
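As a minimal sketch of the empty stop word list mentioned above: this assumes Lucene 7.x, where `CharArraySet` lives in `org.apache.lucene.analysis` and `StandardAnalyzer` has a constructor taking a `CharArraySet`. The class `NoStopWordsDemo` and its methods are hypothetical names for illustration, not part of the original post.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class for illustration only.
class NoStopWordsDemo {

    // Assumption: Lucene 7.x package layout. An empty stop set keeps every word.
    static Analyzer newAnalyzer() {
        return new StandardAnalyzer(CharArraySet.EMPTY_SET);
    }

    // Runs the text through the analyzer and collects the resulting terms.
    static List<String> analyze(Analyzer analyzer, String field, String text) {
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(termAtt.toString());
            }
            ts.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    public static void main(String[] args) {
        try (Analyzer analyzer = newAnalyzer()) {
            // "of" and "an" survive analysis because the stop set is empty
            System.out.println(analyze(analyzer, "source", "benefit of an individual"));
        }
    }
}
```

For this to help, the same analyzer would have to be passed to `IndexWriterConfig` when indexing and used again when building the query, and the index would need to be rebuilt after the change.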
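All problems with phrase queries start with stop words. Under the hood, Lucene stores the position of every indexed term in a dedicated part of the index (term positions), and those positions still count the stop words that the analyzer removed. In some cases this is useful, for example to tell "the goal" apart from "goal". A phrase query takes term positions into account. For example, take the text "black and white" with the stop word "and". In this case the Lucene index contains two terms: "black" at position 1 and "white" at position 3. A naive phrase query "black white" should not match anything, because it does not allow a gap between term positions. There are two possible strategies for creating a correct query: allow the gap with a slop factor, or add each term to the phrase builder at its analyzed position.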
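The "black and white" example can be mimicked with a toy model in plain Java. This is only a sketch of the idea: it does not use Lucene or reproduce its index format, the three-word stop list is an assumption chosen for the example, and `TermPositionsDemo`, `index` and `occursAt` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of term positions; not Lucene code.
class TermPositionsDemo {

    // Assumed stop list for the example only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("and", "of", "an"));

    // Maps each kept term to its positions. Stop words are dropped,
    // but they still consume a position, which leaves a gap.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int position = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            position++; // 1-based, as in the text above
            if (!STOP_WORDS.contains(word)) {
                positions.computeIfAbsent(word, k -> new ArrayList<>()).add(position);
            }
        }
        return positions;
    }

    // True if 'second' occurs exactly 'distance' positions after 'first'.
    static boolean occursAt(Map<String, List<Integer>> idx, String first, String second, int distance) {
        List<Integer> secondPositions = idx.getOrDefault(second, Collections.emptyList());
        for (int p : idx.getOrDefault(first, Collections.emptyList())) {
            if (secondPositions.contains(p + distance)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idx = index("black and white");
        System.out.println(idx.get("black"));                    // [1]
        System.out.println(idx.get("white"));                    // [3]
        System.out.println(occursAt(idx, "black", "white", 1));  // false: naive adjacent phrase
        System.out.println(occursAt(idx, "black", "white", 2));  // true: gap left by "and" respected
    }
}
```

The naive query corresponds to a distance of 1 and fails; a query that respects the gap left by "and" corresponds to a distance of 2 and succeeds.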
To create a correct query, you can use the following token attribute during query processing:
PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);
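I used setSlop(2) to simplify the snippet; you can instead set the slop factor based on the query length, or put the correct term positions into the phrase builder. My advice is not to use stop words; you can read about stop words here.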
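One way to put the correct term positions into the phrase builder is sketched below. It accumulates each token's position increment and passes the result to `PhraseQuery.Builder.add(Term, int)`. The class `PositionAwarePhraseQuery` and its `build` method are hypothetical names, and the sketch assumes the analyzer passed in is the same one used at index time.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Hypothetical helper class for illustration only.
class PositionAwarePhraseQuery {

    // Builds a phrase query whose term positions mirror the analyzer's output,
    // so gaps left by removed stop words are preserved instead of covered by slop.
    static PhraseQuery build(Analyzer analyzer, String field, String text) {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();
            int position = -1; // a first increment of 1 places the first token at position 0
            while (ts.incrementToken()) {
                position += posIncAtt.getPositionIncrement();
                builder.add(new Term(field, termAtt.toString()), position);
            }
            ts.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Assumption: the index was built with the same StandardAnalyzer configuration.
        try (Analyzer analyzer = new StandardAnalyzer()) {
            System.out.println(build(analyzer, "source", "benefit of an individual"));
        }
    }
}
```

With this approach no slop is needed for stop word gaps, because a removed stop word shows up as a position increment greater than 1 on the next token. Whether "of" and "an" are removed depends on the analyzer's stop set, which is why the analyzer is a parameter here.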