Arabic Analyzer in Lucene

Time: 2013-11-30 19:43:51

Tags: java lucene

I am trying to index Arabic text files using the ArabicAnalyzer provided by Apache Lucene. The following code shows what I am doing:

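import java.io.File;
import java.io.FileFilter;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.ar.ArabicAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {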
    public static void main(String[] args) throws Exception {

        String indexDir = "E:/workspace/IRThesisCorpusByApacheLucene/indexDir";
        String dataDir = "E:/workspace/IRThesisCorpusByApacheLucene/dataDir";
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
            + (end - start) + " milliseconds");
    }

    private IndexWriter writer;
    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));

        writer = new IndexWriter(dir,new IndexWriterConfig
            (Version.LUCENE_45, new ArabicAnalyzer(Version.LUCENE_45))
        );
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index(String dataDir, FileFilter filter)
            throws Exception {
        System.out.println(" Dir Path :::::"+ new File(dataDir).getAbsolutePath());
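        File[] files = new File(dataDir).listFiles();
        if (files == null) {
            // listFiles() returns null when dataDir is not a readable directory
            throw new IOException("Cannot list files in " + dataDir);
        }
        System.out.println(" Files number :::::"+files.length);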
        for (File f: files) {
            System.out.println(" File is :::::"+f);
            if (!f.isDirectory() &&
                    !f.isHidden() &&
                    f.exists() &&
                    f.canRead() &&
                    (filter == null || filter.accept(f))) {
                indexFile(f);
            }
        }
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase()
                .endsWith(".txt");
        }
    }

    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();

        InputStreamReader reader=new InputStreamReader
                (new FileInputStream(f),"UTF8");
        System.out.println(" Encoding is ::::"+reader.getEncoding());

        doc.add(new TextField("contents",reader ));
        doc.add(new TextField("filename", f.getName(),
                Field.Store.YES));
        doc.add(new TextField("fullpath", f.getCanonicalPath(),
                Field.Store.YES));

        return doc;
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        System.out.println(" In indexFile :::::::: doc is ::"+doc+" writer:::"+writer); 
        writer.addDocument(doc,new ArabicAnalyzer(Version.LUCENE_45));
    }
}

My text file contains:

{سم الله الرحمن الرحيم 
     

اهلاوسهل​​ابكم,ماذابعد   كتبيكتبكاتبمكتوبسيكتب}

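When I run it, I get the following result in the _0.cfs file: [screenshot of _0.cfs]

I get the words, but I also get undefined characters.

What is the problem? Why is the Arabic not displayed correctly?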

1 Answer:

Answer 0: (score: 0)

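You should not look at the .cfs files directly. A .cfs file is a compound index file and is in no way a plain-text document. You are meant to use the Lucene API to search and retrieve data from the index, not to open the files in an editor. If you want to learn more about Lucene's file formats, feel free to look at the codec documentation.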
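
To illustrate why the words show up surrounded by "undefined" characters, here is a minimal, self-contained sketch. It does not use Lucene and does not reproduce Lucene's real file layout: the class name BinaryVersusText and the length-prefixed layout are made up for this example. It only shows the general effect of opening a binary file that embeds UTF-8 text as if it were plain text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration only: this is NOT Lucene's file format.
final class BinaryVersusText {

    // Stores a term count, then each term as a 4-byte length plus its UTF-8 bytes.
    static byte[] writeTerms(List<String> terms) {
        List<byte[]> encoded = new ArrayList<>();
        int size = Integer.BYTES; // room for the term count
        for (String term : terms) {
            byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
            encoded.add(utf8);
            size += Integer.BYTES + utf8.length;
        }
        ByteBuffer buffer = ByteBuffer.allocate(size);
        buffer.putInt(terms.size());
        for (byte[] utf8 : encoded) {
            buffer.putInt(utf8.length);
            buffer.put(utf8);
        }
        return buffer.array();
    }

    // Reads the terms back by following the same layout (the "API" route).
    static List<String> readTerms(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        int count = buffer.getInt();
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            byte[] utf8 = new byte[buffer.getInt()];
            buffer.get(utf8);
            terms.add(new String(utf8, StandardCharsets.UTF_8));
        }
        return terms;
    }

    // What a text editor effectively does: decode every byte as text,
    // including the bytes that are really length fields.
    static String viewAsText(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = writeTerms(Arrays.asList("كتب", "كاتب", "مكتوب"));
        System.out.println("Read through the format: " + readTerms(data));
        System.out.println("Viewed as plain text:    " + viewAsText(data));
    }
}
```

Read through the matching reader, the terms come back intact. Decoded as one block of text, the same bytes still contain the Arabic words, but the length fields appear as control characters between them. The Arabic text in the index is most likely fine for the same reason, and the way to check it is to read the index back through the Lucene API as the answer suggests.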