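I have an application that requires me to index several gigabytes of sentences (about 16 million lines).
Currently my search works as follows.
My search terms usually revolve around a phrase, for example "running in the park". I want to be able to find sentences that are similar to it or that contain parts of the phrase. I do this by constructing smaller phrases:
"running in", "in the park", etc.
Each one is given a weight (longer phrases get a larger weight).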
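To make the sub-phrase idea concrete, here is a minimal sketch of how such weighted sub-phrases could be generated and turned into a Lucene query string. It is an illustration only, not the code used in the application: the class `SubPhraseQueryBuilder` and its methods are hypothetical, the weight is assumed to be simply the number of words, and the input is assumed to contain no characters that need escaping in Lucene query syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene.
class SubPhraseQueryBuilder {

    /** Returns every contiguous sub-phrase of at least two words, longest first. */
    static List<String> subPhrases(String phrase) {
        String[] words = phrase.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int len = words.length; len >= 2; len--) {
            for (int start = 0; start + len <= words.length; start++) {
                result.add(String.join(" ", Arrays.copyOfRange(words, start, start + len)));
            }
        }
        return result;
    }

    /** Joins the sub-phrases into quoted phrase clauses, boosted by word count (assumed weighting). */
    static String toBoostedQuery(String phrase) {
        StringBuilder sb = new StringBuilder();
        for (String sub : subPhrases(phrase)) {
            int weight = sub.split(" ").length;
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('"').append(sub).append('"').append('^').append(weight);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBoostedQuery("running in the park"));
    }
}
```

The resulting string uses the standard `"phrase"^boost` query syntax, so it could be passed to a `QueryParser`; whether word count is a sensible weight depends on the data.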
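Currently I treat each line as a document. A typical search takes a few seconds, and I am wondering whether there is a way to improve the search speed.
Besides that, I also need to know what was matched.
For example: "I was jogging in the park this morning" matches "in the park", and I would like to know how it matched. I know about the Explainer for Lucene search, but is there a simpler way, or are there resources where I can learn how to extract the information I want from Lucene's Explainer?
I am currently using a regex to get the matching terms. It is fast but inaccurate, because Lucene sometimes ignores punctuation and other things, and I cannot handle all the special cases.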
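Answer 0 (score: 2)
The Highlighter, a Lucene "contrib" module, will let you extract what Lucene matched.
Answer 1 (score: 2)
The Highlighter is better than the Explainer and much faster. After highlighting, you can extract the matched phrases from between the tags. See: Java regex to extract text between tags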
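import static org.junit.Assert.assertTrue;
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TextFragment;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Before;
import org.junit.Test;

// Written against the Lucene 3.6 API (see Version.LUCENE_36 below).
public class HighlightDemo {
    Directory directory;
    Analyzer analyzer;
    String[] contents = {"running in the park",
            "I was jogging in the park this morning",
            "running on the road",
            "The famous New York Marathon has its final miles in Central park every year and it's easy to understand why: the park, with a variety of terrain and excellent scenery, is the ultimate runner's dream. With its many paths that range in level of difficulty, Central Park allows a runner to experience clarity and freedom in this picturesque urban oasis."};

    @Before
    public void setUp() throws IOException {
        directory = new RAMDirectory();
        analyzer = new WhitespaceAnalyzer();
        // Index one document per sentence.
        IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < contents.length; i++) {
            Document doc = new Document();
            // Indexed but not stored: the text is looked up in contents[] via the id.
            doc.add(new Field("content", contents[i], Field.Store.NO, Field.Index.ANALYZED));
            // Stored and indexed, so the id can be read back from a hit.
            doc.add(new NumericField("id", Field.Store.YES, true).setIntValue(i));
            writer.addDocument(doc);
        }
        writer.close();
    }

    @Test
    public void test() throws IOException, ParseException, InvalidTokenOffsetsException {
        IndexSearcher s = new IndexSearcher(directory);
        QueryParser parser = new QueryParser(Version.LUCENE_36, "content", analyzer);
        Query query = parser.parse("park");
        TopDocs hits = s.search(query, 10);
        // SimpleHTMLFormatter wraps each matched term in <B>...</B> by default.
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            Document doc = s.doc(hits.scoreDocs[i].doc);
            String text = contents[Integer.parseInt(doc.get("id"))];
            // Re-analyze the original text so the highlighter can locate the matched tokens.
            TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    assertTrue(frag[j].toString().contains("<B>"));
                    assertTrue(frag[j].toString().contains("</B>"));
                    System.out.println(frag[j].toString());
                }
            }
        }
        s.close();
    }
}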
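Once a fragment has been highlighted, the matched text can be pulled out from between the tags with a regex. The following is a minimal sketch, assuming the default `<B>`/`</B>` tags of `SimpleHTMLFormatter`; the class `HighlightExtractor` and its methods are hypothetical names. Because each matched term is typically wrapped separately, `matchedPhrases` also assumes that tagged terms separated only by whitespace belong to one phrase and merges them first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Lucene.
class HighlightExtractor {
    // Non-greedy match of the text between one <B> and the next </B>.
    private static final Pattern TAGGED = Pattern.compile("<B>(.*?)</B>");

    /** Returns each tagged span in the order it appears in the fragment. */
    static List<String> matchedTerms(String fragment) {
        List<String> terms = new ArrayList<>();
        Matcher m = TAGGED.matcher(fragment);
        while (m.find()) {
            terms.add(m.group(1));
        }
        return terms;
    }

    /** Merges tagged terms separated only by whitespace, then extracts them as phrases. */
    static List<String> matchedPhrases(String fragment) {
        String merged = fragment.replaceAll("</B>(\\s+)<B>", "$1");
        return matchedTerms(merged);
    }

    public static void main(String[] args) {
        System.out.println(matchedPhrases("I was jogging <B>in</B> <B>the</B> <B>park</B> this morning"));
    }
}
```

This works on the highlighter's output rather than on the raw text, so the analyzer's handling of punctuation is already reflected in what gets tagged.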
Answer 2 (score: 0)
SpanQueries can help you find where in a sentence the query matched: https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/search/spans/package-summary.html
Using this, you can get the exact positions from a query: How to get the matching spans of a Span Term Query in Lucene 5?