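Example from the book Tika in Action: Lucene StandardAnalyzer does not work

Time: 2014-03-30 12:59:17

Tags: java lucene apache-tika

First of all, I am a complete newbie to both Tika and Lucene. I am working through the examples in the book Tika in Action. Chapter 5 gives this example: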

package tikatest01;

import java.io.File;
import org.apache.tika.Tika;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;

public class LuceneIndexer {

    private final Tika tika;
    private final IndexWriter writer;

    public LuceneIndexer(Tika tika, IndexWriter writer) {
        this.tika = tika;
        this.writer = writer;
    }

    public void indexDocument(File file) throws Exception {
        Document document = new Document();
        document.add(new Field(
            "filename", file.getName(),
            Store.YES, Index.ANALYZED));
        document.add(new Field(
            "fulltext", tika.parseToString(file),
            Store.NO, Index.ANALYZED));
        writer.addDocument(document);
    }
}

And this main method:

package tikatest01;

import java.io.File;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.tika.Tika;

public class TikaTest01 {

    public static void main(String[] args) throws Exception {

        String filename = "C:\\testdoc.pdf";
        File file = new File(filename);

        IndexWriter writer = new IndexWriter(
            new SimpleFSDirectory(file),
            new StandardAnalyzer(Version.LUCENE_30), 
            MaxFieldLength.UNLIMITED);
        try {
            LuceneIndexer indexer = new LuceneIndexer(new Tika(), writer);
            indexer.indexDocument(file);
            } 
        finally {
            writer.close();
            }
    }
}

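I have added the libraries tika-app-1.5.jar, lucene-core-4.7.0.jar and lucene-analyzers-common-4.7.0.jar to the project.

Questions:

Field.Index is deprecated in the current version of Lucene. What should I use instead?

MaxFieldLength cannot be found. Am I missing an import?

2 answers:

Answer 0: (score: 3)

For Lucene 4.7, use this code for the indexer: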

package tikatest01;

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.Tika;

public class LuceneIndexer {

    private final Tika tika;
    private final IndexWriter writer;

    public LuceneIndexer(Tika tika, IndexWriter writer) {
        this.tika = tika;
        this.writer = writer;
    }

    public void indexDocument(File file) throws Exception {
        Document document = new Document();
        document.add(new TextField(
                "filename", file.getName(), Store.YES));
        document.add(new TextField(
                "fulltext", tika.parseToString(file), Store.NO));
        writer.addDocument(document);
    }
}

And this code for the main class:

package tikatest01;

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;

public class TikaTest01 {

    public static void main(String[] args) throws Exception {

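        // Directory that will hold the index files (not the document itself).
        File indexDir = new File("C:\\MyTestDir\\");
        // Document to index; the path is the one used in the question.
        File file = new File("C:\\testdoc.pdf");

        IndexWriter writer = new IndexWriter(
            new SimpleFSDirectory(indexDir),
            new IndexWriterConfig(
                Version.LUCENE_47,
                new StandardAnalyzer(Version.LUCENE_47)));
        try {
            LuceneIndexer indexer = new LuceneIndexer(new Tika(), writer);
            // Pass the document, not the index directory, to Tika.
            indexer.indexDocument(file);
        } finally {
            writer.close();
        }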
    }
}
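
As a side note on the two questions, here is a minimal sketch of how the old Field.Index and MaxFieldLength options roughly map onto the Lucene 4.7 API. It assumes lucene-core-4.7.0.jar and lucene-analyzers-common-4.7.0.jar are on the classpath. The class FieldMigrationSketch, its helper methods, the in-memory RAMDirectory and the 10000-token limit are illustrative choices, not part of the book's example. Unlike the indexer above, it stores the filename as a StringField purely to show the counterpart of Index.NOT_ANALYZED. Lucene 4.x applies no field-length limit by default, so MaxFieldLength.UNLIMITED needs no replacement; LimitTokenCountAnalyzer is only needed if a limit is actually wanted.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class FieldMigrationSketch {

    // Hypothetical helper: builds a document from text that was already extracted.
    static Document buildDocument(String filename, String fulltext) {
        Document document = new Document();
        // Old style: new Field(name, value, Store.YES, Index.NOT_ANALYZED)
        // StringField indexes the whole value as one token, without analysis.
        document.add(new StringField("filename", filename, Store.YES));
        // Old style: new Field(name, value, Store.NO, Index.ANALYZED)
        // TextField passes the value through the writer's analyzer.
        document.add(new TextField("fulltext", fulltext, Store.NO));
        return document;
    }

    // Hypothetical helper: indexes one document into an in-memory index and
    // returns the number of documents the writer holds afterwards.
    static int indexOne(String filename, String fulltext, int maxTokens) throws Exception {
        // Wrapping the analyzer takes over the role of a bounded MaxFieldLength:
        // only the first maxTokens tokens of each field are indexed.
        Analyzer analyzer = new LimitTokenCountAnalyzer(
            new StandardAnalyzer(Version.LUCENE_47), maxTokens);
        IndexWriter writer = new IndexWriter(
            new RAMDirectory(),
            new IndexWriterConfig(Version.LUCENE_47, analyzer));
        try {
            writer.addDocument(buildDocument(filename, fulltext));
            return writer.numDocs();
        } finally {
            writer.close();
        }
    }

    public static void main(String[] args) throws Exception {
        int count = indexOne("testdoc.pdf", "some extracted text", 10000);
        System.out.println("Documents in index: " + count);
    }
}
```

In the real indexer, the text passed to buildDocument would come from tika.parseToString(file), and the RAMDirectory would be replaced by the SimpleFSDirectory shown above.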

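Answer 1: (score: 1)

In Lucene 4.7, IndexWriter has no such constructor. Take a look at the API: http://lucene.apache.org/core/4_7_0/core/org/apache/lucene/index/IndexWriter.html

It lists only a constructor with two arguments, so you need to adapt this example to the new Lucene API.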