Lucene is not indexing some terms in documents

Asked: 2017-05-31 08:54:54

Tags: lucene lucene.net pylucene

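I have been trying to use Lucene to index our code database. Unfortunately, some terms are omitted from the index. For example, in the string below I can search for anything except "version-number":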

version-number "cAELimpts.spl SCOPE-PAY:10.1.10 25nov2013kw101730 Setup EMployee field if missing"

I tried implementing this with both Lucene.NET 3.1 and pylucene 6.2.0, with the same result.

Here are some details of my Lucene.NET implementation:

using (var writer = new IndexWriter(FSDirectory.Open(INDEX_DIR), new CustomAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED))
{
  Console.Out.WriteLine("Indexing to directory '" + INDEX_DIR + "'...");
  IndexDirectory(writer, docDir);
  Console.Out.WriteLine("Optimizing...");
  writer.Optimize();
  writer.Commit();
}

The CustomAnalyzer class:

public sealed class CustomAnalyzer : Analyzer
{
    public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
    {
        return new LowerCaseFilter(new CustomTokenizer(reader));
    }
}

And finally, the CustomTokenizer class:

public class CustomTokenizer : CharTokenizer
{
    public CustomTokenizer(TextReader input) : base(input)
    {
    }

    public CustomTokenizer(AttributeFactory factory, TextReader input) : base(factory, input)
    {
    }

    public CustomTokenizer(AttributeSource source, TextReader input) : base(source, input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        return System.Char.IsLetterOrDigit(c) || c == '_' || c == '-' ;
    }
}
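As a sanity check on the tokenizer rule above, here is a minimal sketch in plain Python (not the Lucene or pylucene API) of what `IsTokenChar` plus `LowerCaseFilter` would be expected to produce for the sample string. The `tokenize` helper is hypothetical, and `str.isalnum()` is only an approximation of `System.Char.IsLetterOrDigit`.

```python
# Plain-Python approximation of CustomTokenizer + LowerCaseFilter.
# Assumption: str.isalnum() stands in for System.Char.IsLetterOrDigit.
SAMPLE = ('version-number "cAELimpts.spl SCOPE-PAY:10.1.10 '
          '25nov2013kw101730 Setup EMployee field if missing"')


def tokenize(text):
    """Split on every character that is not a letter, digit, '_' or '-'."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum() or ch in "_-":
            current.append(ch.lower())  # LowerCaseFilter step
        elif current:
            tokens.append("".join(current))
            current = []
    if current:  # flush the last token
        tokens.append("".join(current))
    return tokens


tokens = tokenize(SAMPLE)
print(tokens)
```

Under these assumptions the hyphen is a token character, so `version-number` comes out as a single lowercase token rather than being split or dropped. That suggests the tokenizer itself is not what hides the term.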

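It looks as though "version-number" and some other terms are not being indexed because they are present in 99% of the documents. Could that be the cause of the problem?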

Edit: As requested, the FileDocument class:

public static class FileDocument
{
    public static Document Document(FileInfo f)
    {

        // make a new, empty document
        Document doc = new Document();

        doc.Add(new Field("path", f.FullName, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Pass the full timestamp; LastWriteTime.Millisecond is only the 0-999 millisecond component
        doc.Add(new Field("modified", DateTools.DateToString(f.LastWriteTime, DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("contents", new StreamReader(f.FullName, System.Text.Encoding.Default)));

        // return the document
        return doc;
    }
}

1 answer:

Answer 0: (score: 0)

I think I was being an idiot. I was limiting the number of hits to 500 and then applying filters to the hits that were found. The hits came back in the order in which the documents had been indexed. So when I was looking for something near the end of the index, it would tell me that nothing was found. In fact, it had retrieved the expected 500 items, but all of them had then been filtered out.
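To illustrate the pitfall, here is a small simulation in plain Python (again, not Lucene code). The toy `docs` list, the `wanted` filter, and both helper functions are hypothetical stand-ins. The only assumption carried over from the post is that hits come back in index order and are capped at 500 before filtering.

```python
from itertools import islice


def limit_then_filter(docs, term, keep, limit):
    """Cap the hits first, then filter: the approach that hid the results."""
    hits = [d for d in docs if term in d["terms"]][:limit]
    return [d for d in hits if keep(d)]


def filter_then_limit(docs, term, keep, limit):
    """Filter while matching, and cap only the surviving hits."""
    matches = (d for d in docs if term in d["terms"] and keep(d))
    return list(islice(matches, limit))


# Toy index: every document contains the term, as in the question,
# but only the last one indexed passes the filter.
docs = [{"path": f"old/file{i}.spl", "terms": {"version-number"}}
        for i in range(999)]
docs.append({"path": "new/target.spl", "terms": {"version-number"}})


def wanted(doc):
    return doc["path"].startswith("new/")


missed = limit_then_filter(docs, "version-number", wanted, limit=500)
found = filter_then_limit(docs, "version-number", wanted, limit=500)
print(len(missed), len(found))  # prints "0 1"
```

Because the term occurs in nearly every document, the first 500 hits are all from early in the index, and filtering them afterwards leaves nothing. In Lucene itself the usual remedy is to hand the filter to the search call, or to add it as a required query clause, so that the hit limit applies only to documents that already pass it.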