Given a term match in a document, what is the best way to access the words around that match? I've read this article http://searchhub.org//2009/05/26/accessing-words-around-a-positional-match-in-lucene/, but the problem is that the Lucene API has changed completely since that post (2009). Can someone point me to how to do this in a newer version of Lucene, such as Lucene 4.6.1?
Edit:
I've figured it out now. The old postings APIs (TermEnum, TermDocsEnum, TermPositionsEnum) have been removed in favor of the new flexible indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum, DocsEnum, DocsAndPositionsEnum). One big difference is that fields and terms are now enumerated separately: a TermsEnum provides a BytesRef (which wraps a byte[]) for each term within a field, rather than a Term. Another is that when you ask for a Docs/AndPositionsEnum, you now specify the skipDocs explicitly (usually this will be the deleted docs, but in general you can pass any Bits):
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class TermVectorFun {
  public static String[] DOCS = {
    "The quick red fox jumped over the lazy brown dogs.",
    "Mary had a little lamb whose fleece was white as snow.",
    "Moby Dick is a story of a whale and a man obsessed.",
    "The robber wore a black fleece jacket and a baseball cap.",
    "The English Springer Spaniel is the best of all dogs.",
    "The fleece was green and red",
    "History looks fondly upon the story of the golden fleece, but most people don't agree"
  };

  public static void main(String[] args) throws IOException {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    // Index some made-up content
    IndexWriter writer = new IndexWriter(ramDir, config);
    for (int i = 0; i < DOCS.length; i++) {
      Document doc = new Document();
      Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      // Store both position and offset information
      Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();
    // Get a searcher
    DirectoryReader dirReader = DirectoryReader.open(ramDir);
    IndexSearcher searcher = new IndexSearcher(dirReader);
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = results.scoreDocs[i];
      System.out.println("Score Doc: " + scoreDoc);
    }
    IndexReader reader = searcher.getIndexReader();
    // NOTE: this only looks at the first segment; see the note after this listing
    Spans spans = fleeceQ.getSpans(reader.leaves().get(0), null, new LinkedHashMap<Term, TermContext>());
    int window = 2; // get the words within two positions of the match
    while (spans.next()) {
      int start = spans.start() - window;
      int end = spans.end() + window;
      Map<Integer, String> entries = new TreeMap<Integer, String>();
      System.out.println("Doc: " + spans.doc() + " Start: " + start + " End: " + end);
      Fields fields = reader.getTermVectors(spans.doc());
      Terms terms = fields.terms("content");
      TermsEnum termsEnum = terms.iterator(null);
      BytesRef text;
      while ((text = termsEnum.next()) != null) {
        // could store the BytesRef here, but String is easier for this example;
        // utf8ToString() decodes the UTF-8 bytes correctly, unlike new String(bytes, offset, length)
        String s = text.utf8ToString();
        // term vectors are a single-document view, so no liveDocs filter is needed here
        DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
        if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          int i = 0;
          int position = -1;
          while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
            if (position >= start && position <= end) {
              entries.put(position, s);
            }
            i++;
          }
        }
      }
      System.out.println("Entries: " + entries);
    }
  }
}
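One caveat with the listing above: getSpans is only run against the first leaf (reader.leaves().get(0)), so matches in any other segment would be silently missed; it works here only because this tiny RAMDirectory index ends up with a single segment. Below is a minimal sketch of the multi-segment case (my own addition, not part of the original program, assuming an extra import of org.apache.lucene.index.AtomicReaderContext): iterate every leaf, pass the segment's live docs as the Bits/skipDocs argument described in the edit above, and offset segment-local doc IDs by docBase before fetching term vectors.

// A sketch: walk all segments instead of just the first one.
Map<Term, TermContext> termContexts = new LinkedHashMap<Term, TermContext>();
for (AtomicReaderContext leaf : reader.leaves()) {
  // getLiveDocs() is null when the segment has no deletions, which getSpans accepts
  Spans segmentSpans = fleeceQ.getSpans(leaf, leaf.reader().getLiveDocs(), termContexts);
  while (segmentSpans.next()) {
    int globalDoc = leaf.docBase + segmentSpans.doc();
    Fields tvFields = reader.getTermVectors(globalDoc);
    // ...same window extraction as above, using tvFields...
  }
}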
Answer 0 (score: 0)
Use a Highlighter. Highlighter.getBestFragment can be used to get a fragment of the text containing the best match. Something like:
TopDocs docs = searcher.search(query, maxdocs);
Document firstDoc = searcher.doc(docs.scoreDocs[0].doc);
Scorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(scorer);
highlighter.getBestFragment(myAnalyzer, fieldName, firstDoc.get(fieldName));
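Note that, with the question's index as-is, this snippet would fail: the content field is indexed with Field.Store.NO, so firstDoc.get(fieldName) returns null. Below is a minimal sketch, assuming the field is re-indexed with Field.Store.YES and that the lucene-highlighter module is on the classpath (the query, analyzer, and field name reuse the question's setup); alternatively, TokenSources from the same module can rebuild a TokenStream from the stored term vectors, though the raw text must still be supplied from somewhere.

// A sketch, assuming "content" was stored with Field.Store.YES at index time.
// getBestFragment throws IOException and InvalidTokenOffsetsException, so the
// enclosing method must declare or catch them.
TopDocs docs = searcher.search(fleeceQ, 10);
Document firstDoc = searcher.doc(docs.scoreDocs[0].doc);
Highlighter highlighter = new Highlighter(new QueryScorer(fleeceQ));
String fragment = highlighter.getBestFragment(
    new StandardAnalyzer(Version.LUCENE_46), "content", firstDoc.get("content"));
System.out.println(fragment); // the default formatter wraps matches in <B>...</B>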