How does the Lucene 4.3.1 highlighter work? I want to print search results from a document as the word that was searched for plus the eight words that follow it. How can I use the Highlighter class to do that? I have already added complete txt, html, and xml documents to my index, and I now have a search routine to which I could presumably add the highlighter functionality:
String index = "index";
String field = "contents";
String queries = null;
int repeat = 1;
boolean raw = true;         // not sure what raw really does???
String queryString = null;  // keep null, prompt user later for it
int hitsPerPage = 10;       // leave it at 10, go from there later

// need to add all files to same directory
index = "C:\\Users\\plib\\Documents\\index";
repeat = 4;

IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);

BufferedReader in = null;
if (queries != null) {
    in = new BufferedReader(new InputStreamReader(new FileInputStream(queries), "UTF-8"));
} else {
    in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
}
QueryParser parser = new QueryParser(Version.LUCENE_43, field, analyzer);
while (true) {
    if (queries == null && queryString == null) { // prompt the user
        System.out.println("Enter query. 'quit' = quit: ");
    }
    String line = queryString != null ? queryString : in.readLine();
    if (line == null || line.length() == -1) {
        break;
    }
    line = line.trim();
    if (line.length() == 0 || line.equalsIgnoreCase("quit")) {
        break;
    }
    Query query = parser.parse(line);
    System.out.println("Searching for: " + query.toString(field));
    if (repeat > 0) { // repeat & time as benchmark
        Date start = new Date();
        for (int i = 0; i < repeat; i++) {
            searcher.search(query, null, 100);
        }
        Date end = new Date();
        System.out.println("Time: " + (end.getTime() - start.getTime()) + "ms");
    }
    doPagingSearch(in, searcher, query, hitsPerPage, raw, queries == null && queryString == null);
    if (queryString != null) {
        break;
    }
}
reader.close();
Answer 0 (score: 8)
I had the same problem and eventually stumbled upon this post:
http://vnarcher.blogspot.ca/2012/04/highlighting-text-with-lucene.html
The key part is that, as you iterate over your results, you call getHighlightedField on the value of the field you want to highlight.
private String getHighlightedField(Query query, Analyzer analyzer, String fieldName, String fieldValue) throws IOException, InvalidTokenOffsetsException {
    Formatter formatter = new SimpleHTMLFormatter("<span class=\"MatchedText\">", "</span>");
    QueryScorer queryScorer = new QueryScorer(query);
    Highlighter highlighter = new Highlighter(formatter, queryScorer);
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(queryScorer, Integer.MAX_VALUE));
    highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
    return highlighter.getBestFragment(analyzer, fieldName, fieldValue);
}
In this case it assumes that the output will be HTML, and it simply wraps the highlighted text in a <span> with the CSS class MatchedText. You can then define a custom CSS rule for that class to style the highlight however you want.
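For context, here is a minimal usage sketch (not from the linked post) of how getHighlightedField might be called while iterating the hits; the field name "contents" and the hit limit of 10 are assumptions for illustration:

// Sketch only: assumes a searcher, query and analyzer are already set up as in the question.
private void printHighlights(IndexSearcher searcher, Query query, Analyzer analyzer)
        throws IOException, InvalidTokenOffsetsException {
    TopDocs hits = searcher.search(query, 10);    // hit limit chosen arbitrarily
    for (ScoreDoc scoreDoc : hits.scoreDocs) {
        Document doc = searcher.doc(scoreDoc.doc);
        String storedText = doc.get("contents");  // the field must be stored to highlight it
        if (storedText != null) {
            System.out.println(getHighlightedField(query, analyzer, "contents", storedText));
        }
    }
}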
Answer 1 (score: 7)
To get the Lucene highlighter working, you add two fields to the document you are indexing: one field with term vectors enabled and one without. For simplicity, here is a code snippet:
FieldType type = new FieldType();
type.setIndexed(true);
type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
type.setStored(true);
type.setStoreTermVectors(true);
type.setTokenized(true);
type.setStoreTermVectorOffsets(true);

Field field = new Field("content", "This is fragment. Highlters", type);
doc.add(field); // this field has term vectors enabled

// without term vectors enabled
doc.add(new StringField("ncontent", "This is fragment. Highlters", Field.Store.YES));
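As a side note, here is a hedged sketch of the IndexWriter setup that the snippet above assumes; the "INDEXDIRECTORY" path matches the placeholder used in the search method below, and the writer configuration is an assumption rather than part of the original answer:

Directory dir = FSDirectory.open(new File("INDEXDIRECTORY"));  // placeholder path
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42));
IndexWriter writer = new IndexWriter(dir, config);
Document doc = new Document();
// add the "content" field (with term vectors) and the "ncontent" field exactly as shown above
writer.addDocument(doc);
writer.close();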
Once that is done, add the document to the index. Then use the Lucene highlighter as in the method given below (it uses Lucene 4.2; I have not tested it with Lucene 4.3.1):
public void highLighter() throws IOException, ParseException, InvalidTokenOffsetsException {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("INDEXDIRECTORY")));
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
    IndexSearcher searcher = new IndexSearcher(reader);
    QueryParser parser = new QueryParser(Version.LUCENE_42, "content", analyzer);
    Query query = parser.parse("Highlters"); // your search keyword
    TopDocs hits = searcher.search(query, reader.maxDoc());
    System.out.println(hits.totalHits);

    SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
    Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));

    // iterate over the hits, not over every document in the index
    for (int i = 0; i < hits.scoreDocs.length; i++) {
        int id = hits.scoreDocs[i].doc;
        Document doc = searcher.doc(id);

        // field without term vectors: the stored text is re-analyzed
        String text = doc.get("ncontent");
        TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "ncontent", analyzer);
        TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
        for (int j = 0; j < frag.length; j++) {
            if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                System.out.println(frag[j].toString());
            }
        }

        // field with term vectors: the token stream is rebuilt from the stored term vector
        text = doc.get("content");
        tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), hits.scoreDocs[i].doc, "content", analyzer);
        frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);
        for (int j = 0; j < frag.length; j++) {
            if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                System.out.println(frag[j].toString());
            }
        }
        System.out.println("-------------");
    }
    reader.close();
}
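Finally, to get something closer to what the question asks for (the matched word plus roughly the next eight words), note that the Highlighter's fragmenters work on character counts, not word counts. Below is a hedged sketch reusing the query, analyzer and stored text variables from the method above; the 60-character fragment size is an assumption meant to approximate a match plus a handful of following words:

SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
QueryScorer scorer = new QueryScorer(query);
Highlighter shortHighlighter = new Highlighter(formatter, scorer);
shortHighlighter.setTextFragmenter(new SimpleFragmenter(60)); // character-based fragment size, tune to taste
String snippet = shortHighlighter.getBestFragment(analyzer, "content", text);
System.out.println(snippet);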