How to deploy the Carrot2 web app to cluster my own data from a Lucene index

Time: 2016-05-09 12:31:58

Tags: carrot2

When I used the Carrot2 web app to cluster my own data from a Lucene index, the results were not what I expected.

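Problem 1: The result list on the right shows only the file names of the clustered documents, without the matching text passages or file locations. I am not sure what causes this. Perhaps the index I created with Lucene has the wrong format, or my Carrot2 web app project is misconfigured. I hope someone can tell me the answer. [Sorry, I could not attach the picture here; the screenshot shows the problem.]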

Problem 2: My search results are grouped mostly under "Other Topics" rather than under specific topics, which bothers me. I suspect either a problem with the clustering algorithm or that the test data I provided is too small.

When I use the K-means clustering algorithm, the results contain many topics, but the clusters are labeled only with file names rather than with meaningful topic names.

I would be very grateful if anyone could clear up my confusion; any answer would help.

Here is the code I use to create the Lucene index files:


My code for indexing PDF files (partial):

package test2;

import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;
import org.carrot2.source.lucene.SimpleFieldMapper;

import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.FileReader;


//lucene 4.9
public class LuceneDemo2 {
    public static void main(String[] args) throws Exception {
        String indexDir = "D:\\data\\lucene\\odp\\index-all";
        String dataDir = "D:\\data";

        long start = System.currentTimeMillis();
        LuceneDemo2 indexer = new LuceneDemo2(indexDir);

        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir,new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();

        System.out.println("Indexing " + numIndexed + " files took " + (end-start) + " milliseconds.");
    }

    private IndexWriter writer;

    public LuceneDemo2(String indexDir) throws IOException {
        // TODO Auto-generated constructor stub
        Directory directory = FSDirectory.open(new File(indexDir));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,analyzer);
        config.setOpenMode(OpenMode.CREATE);
        writer = new IndexWriter(directory,config);
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index (String dataDir,FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();

        // listFiles() returns null when dataDir is not a readable directory
        if (files == null) return writer.numDocs();
        for(File f: files) {
            if(!f.isDirectory()&&
                !f.isHidden()&&
                f.exists()&&
                f.canRead()&&
                (filter == null || filter.accept(f))) {
                indexFile(f);
            }
        }

        /*
        if(files == null) return writer.numDocs();
        for(int i=0;i<files.length&&files!=null;i++) {
            if(!files[i].isDirectory()&&
                !files[i].isHidden()&&
                files[i].exists()&&
                files[i].canRead()&&
                (filter == null || filter.accept(files[i]))) {
                indexFile(files[i]);
            }
        }
        */
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }   
    }

    private Document getDocument(File f) throws Exception {
        // TODO Auto-generated method stub
        Document document = new Document();
        document.add(new StringField("path",  f.getAbsolutePath(), Field.Store.YES));
        document.add(new LongField("modified", f.lastModified(), Field.Store.NO)); 
        document.add(new TextField("content", new FileReader(f)));
        document.add(new TextField("title", f.getName(), Field.Store.YES));

        return document;
    }

    private void indexFile(File f) throws Exception {
        // TODO Auto-generated method stub
        System.out.println("Indexing "+ f.getCanonicalPath());
        Document document = getDocument(f);
        writer.addDocument(document);
    }   
}

1 Answer:

Answer 0 (score: 1)

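Carrot2 algorithms operate on the raw text of documents, so every content field you want to cluster must be stored (Field.Store.YES). The Reader-based TextField constructor used in the question indexes the text but does not store it. To have your "content" field stored in the index, the simplest solution is to read the content of the corresponding file into a String and then use the String-based constructor of the TextField class.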
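The file-reading half of this suggestion can be sketched with the JDK alone. The helper below is hypothetical (the class name FileContentReader and the method readContent are not from the original post), and it assumes the files are UTF-8 text small enough to hold in memory.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper, not part of the original post.
final class FileContentReader {
    private FileContentReader() {}

    // Reads the whole file into a String.
    // Assumption: the file is UTF-8 encoded and fits in memory.
    static String readContent(File f) throws IOException {
        return new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
    }

    // In the question's getDocument(File f), the Reader-based line
    //     document.add(new TextField("content", new FileReader(f)));
    // would then become the stored, String-based variant:
    //     document.add(new TextField("content", readContent(f), Field.Store.YES));
}
```

With the field stored, the raw text is available to Carrot2 when it reads documents back from the index. For very large files, storing the full content increases index size, so this is a trade-off rather than a free change.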

After re-indexing the content, set Carrot2 to cluster on your "title" and "content" fields; you should then see some meaningful clusters.