Get all words from a field in a Lucene index

Asked: 2014-03-20 12:32:47

Tags: lucene lucene.net

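Questions like this have been asked many times before, but I can't get what I need from the existing answers. Possibly I just don't understand what Lucene means by "term" or "termdoc".

I have built a Lucene index: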

var db = new DataClassesDataContext();
var articles = (from article in db.Articles
                orderby article.articleID ascending
                select article).ToList();
var analyzer = new StandardAnalyzer(Version.LUCENE_30);
using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var article in articles)
    {
        var luceneDocument = new Document();
        luceneDocument.Add(new Field("ArticleID", article.articleID.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        luceneDocument.Add(new Field("Title", article.title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        luceneDocument.Add(new Field("Paragraph", article.paragraph, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.AddDocument(luceneDocument);
    }
    Console.WriteLine("Optimizing index.");
    writer.Optimize();
}

This works well, and I can retrieve the term frequency vector for any document. For example,

var titleVector = indexReader.GetTermFreqVector(5001, "Title");

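gives the result {Title: doing/1, healthcare/1, right/1}. But what I want is to enumerate the inverted index, which maps each word (such as "doing", "healthcare" and "right") to the IDs of the documents whose title contains that word. I want to build a CSV file in which each row is word, ArticleID_1, ArticleID_2, ... , ArticleID_n

What I have tried so far does not work (it spits out the terms of every field, not just Title):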
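To make the target format concrete, here is a minimal sketch in plain C# with no Lucene involved. The names `InvertedIndexCsv`, `Invert` and `FormatRow` are hypothetical and only illustrate the shape of the data: a map from each word to the list of ArticleIDs whose title contains it, written out one word per row.

```csharp
using System;
using System.Collections.Generic;

public static class InvertedIndexCsv
{
    // Inverts (ArticleID -> title words) into (word -> ArticleIDs, in input order).
    public static SortedDictionary<string, List<string>> Invert(
        IEnumerable<KeyValuePair<string, string[]>> titleWordsByArticle)
    {
        var index = new SortedDictionary<string, List<string>>(StringComparer.Ordinal);
        foreach (var entry in titleWordsByArticle)
        {
            foreach (var word in entry.Value)
            {
                List<string> ids;
                if (!index.TryGetValue(word, out ids))
                {
                    ids = new List<string>();
                    index[word] = ids;
                }
                // Record each article at most once per word.
                if (!ids.Contains(entry.Key)) ids.Add(entry.Key);
            }
        }
        return index;
    }

    // Formats one row as: word, ArticleID_1, ArticleID_2, ...
    public static string FormatRow(string word, IEnumerable<string> articleIds)
    {
        var cells = new List<string> { word };
        cells.AddRange(articleIds);
        return string.Join(", ", cells);
    }
}
```

For instance, if article 5001 has the title words doing, healthcare, right and a made-up article 5002 has only right, the row for "right" would be `right, 5001, 5002`.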

var terms = indexReader.Terms();
while (terms.Next())
{
    Console.WriteLine(terms.Term.Text);
}

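How do I get a list of all the words that the index uses as terms for the "Title" field of my documents? In other words, how do I restrict the last snippet to terms from the Title field only?

1 Answer:

Answer 0: (score: 1)

Typical: no sooner had I written up the question than I found an answer!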

var terms = indexReader.Terms();
while (terms.Next())
{
    if (terms.Term.Field == "Title")
    {
        var row = "\"" + terms.Term.Text + "\", ";
        var termDocs = indexReader.TermDocs(terms.Term);
        while (termDocs.Next())
        {
            row += indexReader[termDocs.Doc].Get("ArticleID") + ", ";
        }
        // Strings are immutable: TrimEnd returns a new string, so assign the result back.
        row = row.TrimEnd(new char[] { ',', ' ' });
        titleFile.WriteLine(row);
    }
}
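
The loop above visits every term in the index and discards those from other fields. Because Lucene orders terms by field name first and then by term text, a possible refinement is to seek straight to the first "Title" term and stop as soon as the field changes. The following is a sketch under the assumption of Lucene.Net 3.0.3, where `IndexReader.Terms(Term)`, `TermEnum.Term` and `TermDocs.Doc` are available as used here; the class name `TitleTermExport` and the method `WriteInvertedIndex` are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Index;

public static class TitleTermExport
{
    // Writes one CSV row per term of the given field: "term", id1, id2, ...
    public static void WriteInvertedIndex(IndexReader reader, string field, TextWriter output)
    {
        // Terms(Term) returns an enumerator already positioned ON the first term
        // at or after the given one, so read the current term before calling Next().
        using (var terms = reader.Terms(new Term(field, "")))
        {
            do
            {
                var term = terms.Term;
                // A null term or a different field means this field's terms are exhausted.
                if (term == null || term.Field != field) break;

                var cells = new List<string> { "\"" + term.Text + "\"" };
                using (var termDocs = reader.TermDocs(term))
                {
                    // Look up the stored ArticleID of every document containing the term.
                    while (termDocs.Next())
                        cells.Add(reader.Document(termDocs.Doc).Get("ArticleID"));
                }
                // Joining the cells avoids the trailing separator that had to be trimmed.
                output.WriteLine(string.Join(", ", cells));
            } while (terms.Next());
        }
    }
}
```

With the reader and file from the snippets above it would be called as `TitleTermExport.WriteInvertedIndex(indexReader, "Title", titleFile);`. Note that the terms are the analyzed forms, so with `StandardAnalyzer` they are lowercased and stop words are absent.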