I have a large document that is made up of sections. Each section has a list of keywords/phrases of interest, and I keep a master list of keywords/phrases stored as a String array. How can I use Solr or Lucene to search each section of the document for all of the keywords and basically tell me which keywords were found? I can't think of any straightforward way to implement this...
Thanks
Answer 0 (score: 1)
Start with the basics to get a program running; you will learn Lucene indexing along the way, which should help with indexing and searching documents that contain fields.
Decide how your data and its fields need to be stored. For example, date fields should be indexed as Field.Index.NOT_ANALYZED rather than Field.Index.ANALYZED.
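As a rough illustration of those two points, here is a minimal indexing sketch, assuming a Lucene 2.9-era API (to match the FieldSelector-based code further down); the field names content, keywords, and created, the sample values, and the index path are made up for this example:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class SectionIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("index")),
                new StandardAnalyzer(),
                true,
                IndexWriter.MaxFieldLength.UNLIMITED);

        // one Document per section of the large document
        Document doc = new Document();
        // free text of the section: analyzed, so individual terms can be searched
        doc.add(new Field("content", "text of this section ...",
                Field.Store.YES, Field.Index.ANALYZED));
        // the section's keywords/phrases: analyzed as well
        doc.add(new Field("keywords", "solr lucene indexing",
                Field.Store.YES, Field.Index.ANALYZED));
        // a date-like value: stored verbatim, NOT_ANALYZED
        doc.add(new Field("created", "20100101",
                Field.Store.YES, Field.Index.NOT_ANALYZED));

        writer.addDocument(doc);
        writer.close();
    }
}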
The next step should be:
// indexmap ==> a HashMap holding your configuration values
// keywordfields ==> your master list of keywords/phrases
// selectfields ==> your document fields (contained in the Lucene index)
String[] keywordfields = indexmap.get("keywordfields").toString().split(",");
String[] selectFields = indexmap.get("indexfields").toString().split(",");
// create a BooleanQuery
BooleanQuery bq = new BooleanQuery();
// iterate over the keyword fields, OR-ing a TermQuery for each one
for (int i = 0; i < keywordfields.length; i++) {
    bq.add(new BooleanClause(new TermQuery(new Term(keywordfields[i], (String) params.get(SEARCH_QUERYSTRING))),
            BooleanClause.Occur.SHOULD));
}
// pass the BooleanQuery to the IndexSearcher
TopDocs topDocs = indexSearcher.search(bq, 1000);
// get a reference to the ScoreDocs
ScoreDoc[] hits = topDocs.scoreDocs;
Map<String, Object> resultMap = new HashMap<String, Object>();
List<Map<String, String>> resultList = new ArrayList<Map<String, String>>();
// iterate the hits
for (ScoreDoc scoreDoc : hits) {
    int docid = scoreDoc.doc;
    // load only the fields we asked for
    FieldSelector fieldselector = new MapFieldSelector(selectFields);
    Document doc = indexSearcher.doc(docid, fieldselector);
    Map<String, String> searchMap = new HashMap<String, String>();
    // get all stored fields for the documents we got
    List<Field> fields = doc.getFields();
    for (Field field : fields) {
        searchMap.put(field.name(), field.stringValue());
        System.out.println("Field Name:" + field.name());
        System.out.println("Field value:" + field.stringValue());
    }
    resultList.add(searchMap);
}
// populate the result map once, after all hits have been processed
resultMap.put(TOTAL_RESULTS, hits.length);
resultMap.put(RS, resultList);
This should be one possible implementation using Lucene =]
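One gap relative to the original question is reporting which of the master-list keywords were actually found. A sketch of one way to do that, reusing the same 2.9-era API (the keywords field name and the helper method are assumptions, not part of the answer above), is to run one TermQuery per keyword and keep those that return at least one hit:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Returns the subset of the master keyword list that occurs in the given field.
public static List<String> findMatchedKeywords(IndexSearcher searcher, String field,
        String[] masterKeywords) throws IOException {
    List<String> found = new ArrayList<String>();
    for (String keyword : masterKeywords) {
        // TermQuery matches the raw indexed term; lower-case to line up with StandardAnalyzer output
        TermQuery tq = new TermQuery(new Term(field, keyword.toLowerCase()));
        TopDocs result = searcher.search(tq, 1);
        if (result.totalHits > 0) {
            found.add(keyword);
        }
    }
    return found;
}

Multi-word phrases from the master list would need a PhraseQuery rather than a plain TermQuery.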
Answer 1 (score: 0)
It sounds like what you need is Lucene's analysis functionality. The heart of this functionality is the Analyzer class. From the documentation:
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
There are many Analyzer classes to choose from, but StandardAnalyzer usually does a good job:
// For each chapter...
Reader reader = ...; // You are responsible for opening a Reader for each chapter
Analyzer analyzer = new StandardAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("", reader);
Token token = new Token();
while ((token = tokenStream.next(token)) != null) {
    String keyword = token.term();
    // You can now do whatever you wish with this keyword
}
You may find that other analyzers do a better job for your content.
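Tying this back to the original question, a possible follow-up (just a sketch under the same pre-3.x token API used above; the method name and the empty field-name trick are assumptions) is to collect each chapter's terms into a Set and then check which of the master keywords appear in it:

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Returns the keywords from the master list that occur in this chapter.
public static List<String> keywordsInChapter(Reader chapter, String[] masterKeywords)
        throws IOException {
    Analyzer analyzer = new StandardAnalyzer();
    TokenStream tokenStream = analyzer.tokenStream("", chapter);

    // collect every term the analyzer produced for this chapter
    Set<String> chapterTerms = new HashSet<String>();
    Token token = new Token();
    while ((token = tokenStream.next(token)) != null) {
        chapterTerms.add(token.term());
    }

    // keep only the master keywords that showed up (StandardAnalyzer lower-cases terms)
    List<String> found = new ArrayList<String>();
    for (String keyword : masterKeywords) {
        if (chapterTerms.contains(keyword.toLowerCase())) {
            found.add(keyword);
        }
    }
    return found;
}

Note that this simple containment check only works for single-word keywords; multi-word phrases are split by the analyzer, so they would need the query-based approach from the other answer (for example a PhraseQuery).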