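Questions like this have been asked many times (e.g. here, here, here, ...), but I can't get what I need from those answers, probably just because I don't understand what Lucene means by "term" or "termdoc".
I built a Lucene index: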
var db = new DataClassesDataContext();
var articles = (from article in db.Articles
                orderby article.articleID ascending
                select article).ToList();
var analyzer = new StandardAnalyzer(Version.LUCENE_30);
using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var article in articles)
    {
        var luceneDocument = new Document();
        luceneDocument.Add(new Field("ArticleID", article.articleID.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        luceneDocument.Add(new Field("Title", article.title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        luceneDocument.Add(new Field("Paragraph", article.paragraph, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.AddDocument(luceneDocument);
    }
    Console.WriteLine("Optimizing index.");
    writer.Optimize();
}
This works well, and I can retrieve the term frequency vector for any document. For example,
var titleVector = indexReader.GetTermFreqVector(5001, "Title");
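gives the result {Title: doing/1, healthcare/1, right/1}.
But I want to enumerate the inverted index that maps each word (such as "doing", "healthcare", and "right") to the IDs of the documents whose title contains that word. I want to build a CSV file in which each line is word, ArticleID_1, ArticleID_2, ... , ArticleID_n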
What I have so far doesn't work (it spits out all the terms, from every field):
var terms = indexReader.Terms();
while (terms.Next())
{
    Console.WriteLine(terms.Term.Text);
}
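How do I get a list of all the words that were indexed as terms from the "Title" field of my documents? In other words, how do I restrict the last snippet to terms of the Title field only?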
Answer 0 (score: 1)
Typical: no sooner had I written the question down than I found an answer!
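var terms = indexReader.Terms();
while (terms.Next())
{
    // Terms() enumerates the terms of every field, so keep only the Title terms.
    if (terms.Term.Field == "Title")
    {
        var row = "\"" + terms.Term.Text + "\", ";
        var termDocs = indexReader.TermDocs(terms.Term);
        while (termDocs.Next())
        {
            // Look up the stored ArticleID of each document that contains the term.
            row += indexReader[termDocs.Doc].Get("ArticleID") + ", ";
        }
        // TrimEnd returns a new string, so the result must be assigned back.
        row = row.TrimEnd(new char[] { ',', ' ' });
        titleFile.WriteLine(row);
    }
}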
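The loop above visits the terms of every field and discards the ones that are not Title terms. As a possible refinement, and assuming Lucene.Net 3.0.x, where IndexReader.Terms(Term) returns an enumeration already positioned at the first term at or after the one supplied, the scan can start at the first Title term and stop as soon as the field changes. This is only a sketch: the class TitleIndexExporter and the method WriteInvertedIndex are hypothetical names, not part of Lucene, and the stored "ArticleID" field is taken from the indexing code in the question.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Index;

// Hypothetical helper, not part of Lucene.Net.
static class TitleIndexExporter
{
    // Writes one CSV row per term of `field`: "term", id_1, id_2, ...
    public static void WriteInvertedIndex(IndexReader reader, string field, TextWriter output)
    {
        // Assumption: Terms(Term) is positioned AT the first term >= (field, ""),
        // so the current term is read before Next() is called (hence do/while).
        using (var terms = reader.Terms(new Term(field, "")))
        {
            do
            {
                var term = terms.Term;
                // Terms are ordered by field, then text: a null term or a
                // different field means the requested field is exhausted.
                if (term == null || term.Field != field)
                    break;

                // Quote the term and double any embedded quotes for CSV.
                var cells = new List<string> { "\"" + term.Text.Replace("\"", "\"\"") + "\"" };

                using (var termDocs = reader.TermDocs(term))
                {
                    while (termDocs.Next())
                    {
                        // Read the stored ArticleID of each matching document.
                        cells.Add(reader.Document(termDocs.Doc).Get("ArticleID"));
                    }
                }

                // Joining the cells avoids having to trim a trailing separator.
                output.WriteLine(string.Join(", ", cells));
            } while (terms.Next());
        }
    }
}
```

With the names from the question, it would be called as TitleIndexExporter.WriteInvertedIndex(indexReader, "Title", titleFile), provided titleFile is a TextWriter such as a StreamWriter.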