I am trying to index Arabic-language text files with the ArabicAnalyzer that ships with Apache Lucene. The following code shows what I am doing:
import java.io.File;
import java.io.FileFilter;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.ar.ArabicAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {

    public static void main(String[] args) throws Exception {
        String indexDir = "E:/workspace/IRThesisCorpusByApacheLucene/indexDir";
        String dataDir = "E:/workspace/IRThesisCorpusByApacheLucene/dataDir";

        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        // The ArabicAnalyzer is set as the writer's default analyzer.
        writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_45, new ArabicAnalyzer(Version.LUCENE_45)));
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index(String dataDir, FileFilter filter) throws Exception {
        System.out.println(" Dir Path :::::" + new File(dataDir).getAbsolutePath());
        File[] files = new File(dataDir).listFiles();
        System.out.println(" Files number :::::" + files.length);
        for (File f : files) {
            System.out.println(" File is :::::" + f);
            if (!f.isDirectory() &&
                    !f.isHidden() &&
                    f.exists() &&
                    f.canRead() &&
                    (filter == null || filter.accept(f))) {
                indexFile(f);
            }
        }
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();
        InputStreamReader reader = new InputStreamReader(
                new FileInputStream(f), "UTF8");
        System.out.println(" Encoding is ::::" + reader.getEncoding());
        // A TextField built from a Reader is tokenized and indexed, but not stored.
        doc.add(new TextField("contents", reader));
        doc.add(new TextField("filename", f.getName(), Field.Store.YES));
        doc.add(new TextField("fullpath", f.getCanonicalPath(), Field.Store.YES));
        return doc;
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        System.out.println(" In indexFile :::::::: doc is ::" + doc + " writer:::" + writer);
        // Passing an analyzer here overrides the writer's default analyzer for this document.
        writer.addDocument(doc, new ArabicAnalyzer(Version.LUCENE_45));
    }
}
My text file contains:
{سم الله الرحمن الرحيم
اهلاوسهلابكم,ماذابعد كتبيكتبكاتبمكتوبسيكتب}
When I run it and open the resulting _0.cfs file, I get the words, but also undefined characters.
What is the problem? Why is the Arabic not displayed correctly?
Answer 0 (score: 0):
You should not look at the .cfs file directly. A .cfs file is a compound index file, not a plain-text document. You are meant to search and retrieve data from the index through the Lucene API, not view the raw files in an editor. If you want to learn more about the Lucene file formats, feel free to look at the codec documentation.
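For example, here is a minimal sketch of reading the index back through the search API instead of opening the files. It assumes Lucene 4.5 on the classpath, the index produced by the Indexer above, and a placeholder Arabic query term; note that only the stored fields ("filename", "fullpath") can be retrieved, since "contents" was indexed from a Reader and is not stored.

import java.io.File;

import org.apache.lucene.analysis.ar.ArabicAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {
    public static void main(String[] args) throws Exception {
        String indexDir = "E:/workspace/IRThesisCorpusByApacheLucene/indexDir";

        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(indexDir)));
        IndexSearcher searcher = new IndexSearcher(reader);

        // Parse the query with the same analyzer used at index time,
        // so Arabic terms are normalized and stemmed identically.
        QueryParser parser = new QueryParser(Version.LUCENE_45, "contents",
                new ArabicAnalyzer(Version.LUCENE_45));
        Query query = parser.parse("كتب"); // placeholder query term

        ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;
        for (ScoreDoc hit : hits) {
            Document doc = searcher.doc(hit.doc);
            // Only stored fields come back from the index.
            System.out.println(doc.get("filename") + " -> " + doc.get("fullpath"));
        }
        reader.close();
    }
}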