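I have indexed data from 10 websites in Solr. Now I want to dump each website's data in the following format: [term, frequency of the term in that website, IDF, website]
e.g.: [management,12,145,example.com]
where 12 is the frequency of the term in example.com and 145 is the IDF of the term in the index.
Can I do this with Solr, and if so, how?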
Answer 0: (score: 1)
If you want to measure how different terms are distributed across documents, a histogram is what you want. Check the LukeRequestHandler example.
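To make that concrete, here is a minimal sketch of building a LukeRequestHandler request in Java. The base URL and the field name `content` are assumptions for illustration; `fl` and `numTerms` are the handler's parameters for choosing the field and the number of top terms. As far as I know, the handler reports top terms for the whole index, so it may not give per-website counts on its own.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class LukeUrl {
    // Builds a request URL for Solr's LukeRequestHandler.
    // baseUrl and field are placeholders for your own setup (assumptions).
    static String build(String baseUrl, String field, int numTerms) {
        String encodedField = URLEncoder.encode(field, StandardCharsets.UTF_8);
        return baseUrl + "/admin/luke?fl=" + encodedField + "&numTerms=" + numTerms;
    }

    public static void main(String[] args) {
        // Hypothetical local Solr instance and field name.
        System.out.println(build("http://localhost:8983/solr", "content", 20));
    }
}
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the top terms for that field.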
Answer 1: (score: 0)
Some low-level API:
// Lucene 3.x API (TermDocs was removed in Lucene 4.0)
IndexReader reader = IndexReader.open(directory);
// With no argument the enumerator is unpositioned; call termDocs.seek(term) before iterating
TermDocs termDocs = reader.termDocs();
// TermDocs termDocs = reader.termDocs(term); // if you need docs containing a specific term
while (termDocs.next()) {
    System.out.println("Doc #: " + termDocs.doc());
    System.out.println("Full document: " + reader.document(termDocs.doc()));
    System.out.println("Term frequency: " + termDocs.freq());
}
For tf * idf, see DefaultSimilarity and this question for some comments.
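As a rough sketch of how the requested dump could be put together once you have the per-document frequencies, the plain-Java example below sums frequencies per website and computes IDF with the formula used by the classic DefaultSimilarity, idf = 1 + ln(numDocs / (docFreq + 1)). The class name, the helper methods, and all the sample numbers are hypothetical; in practice the inputs would come from the index reader.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class SiteTermDump {
    // Classic Lucene DefaultSimilarity formula: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Sums per-document term frequencies into per-site totals.
    // docSites[i] is the site of document i, docFreqs[i] the term's frequency in it.
    static Map<String, Integer> sumBySite(String[] docSites, int[] docFreqs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (int i = 0; i < docSites.length; i++) {
            totals.merge(docSites[i], docFreqs[i], Integer::sum);
        }
        return totals;
    }

    // Formats one output row as [term,frequency,IDF,site].
    static String row(String term, int freqInSite, double idfValue, String site) {
        return String.format(Locale.ROOT, "[%s,%d,%.3f,%s]", term, freqInSite, idfValue, site);
    }

    public static void main(String[] args) {
        // Hypothetical data: "management" occurs in three documents on two sites.
        String[] docSites = {"example.com", "example.com", "example.org"};
        int[] docFreqs = {7, 5, 3};
        int numDocs = 1000;            // assumed total number of documents in the index
        int docFreq = docSites.length; // number of documents containing the term
        double idfValue = idf(docFreq, numDocs);
        for (Map.Entry<String, Integer> e : sumBySite(docSites, docFreqs).entrySet()) {
            System.out.println(row("management", e.getValue(), idfValue, e.getKey()));
        }
    }
}
```

Note that IDF here is computed over the whole index, as in the question, while the frequency is summed per website.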