I have developed my own indexer in Lucene 5.2.1. I am trying to index a file of roughly 1.5 GB, and I need to do some non-trivial computation on every single document of the collection at indexing time.
The problem is that the whole indexing takes almost 20 minutes! I have followed this very helpful wiki, but it is still far too slow. I have tried increasing the Eclipse heap space and the Java VM memory, but the bottleneck seems to be the hard disk rather than virtual memory (I am using a laptop with 6 GB of RAM and an ordinary hard disk).
I have read this discussion, which suggests either using a RAMDirectory or installing a RAM disk. The problem with a RAM disk is persisting the index to my file system (I do not want to lose the index after a reboot). The problem with RAMDirectory, on the other hand, is that according to the API I should not use it, since my index exceeds "several hundred megabytes"...
Warning: this class is not intended to work with huge indexes. Everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency in multithreaded environments.
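To make the RAM-disk option concrete, this is roughly what I have in mind (the /mnt/ramdisk path is just a placeholder for wherever the RAM disk would be mounted): build the index there, then copy every index file into a persistent FSDirectory once indexing is done:

import java.nio.file.Paths;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;

// open one directory on the RAM disk and one on the regular file system
Directory ramDiskDir = FSDirectory.open(Paths.get("/mnt/ramdisk/review_index"));
Directory persistentDir = FSDirectory.open(Paths.get("review_index"));
// ... run the indexer against ramDiskDir and close its IndexWriter, then:
for (String file : ramDiskDir.listAll()) {
    persistentDir.copyFrom(ramDiskDir, file, file, IOContext.DEFAULT);
}
ramDiskDir.close();
persistentDir.close();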
Here you can find my code:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

//ReviewWrapper, FileUtils and LanguageUtils are my own helper classes
public class ReviewIndexer {

    private JSONParser parser;
    private PerFieldAnalyzerWrapper reviewAnalyzer;
    private IndexWriterConfig iwConfig;
    private IndexWriter indexWriter;

    public ReviewIndexer() throws IOException {
        parser = new JSONParser();
        reviewAnalyzer = new ReviewWrapper().getPFAWrapper();
        iwConfig = new IndexWriterConfig(reviewAnalyzer);
        //change RAM buffer size to speed things up
        //@url https://wiki.apache.org/lucene-java/ImproveIndexingSpeed
        iwConfig.setRAMBufferSizeMB(2048);
        //little speed increase
        iwConfig.setUseCompoundFile(false);
        //iwConfig.setMaxThreadStates(24);
        // Set to overwrite the existing index
        indexWriter = new IndexWriter(FileUtils.openDirectory("review_index"), iwConfig);
    }

    /**
     * Indexes every review.
     * @param file_path : the path of the yelp_academic_dataset_review.json file
     * @throws IOException
     * @return Returns true if everything goes fine.
     */
    public boolean indexReviews(String file_path) throws IOException {
        BufferedReader br;
        try {
            //open the file
            br = new BufferedReader(new FileReader(file_path));
            String line;
            //define fields
            StringField type = new StringField("type", "", Store.YES);
            String reviewtext = "";
            TextField text = new TextField("text", "", Store.YES);
            StringField business_id = new StringField("business_id", "", Store.YES);
            StringField user_id = new StringField("user_id", "", Store.YES);
            LongField stars = new LongField("stars", 0, LanguageUtils.LONG_FIELD_TYPE_STORED_SORTED);
            LongField date = new LongField("date", 0, LanguageUtils.LONG_FIELD_TYPE_STORED_SORTED);
            StringField votes = new StringField("votes", "", Store.YES);
            Date reviewDate;
            JSONObject jsonVotes;
            try {
                indexWriter.deleteAll();
                //scan the file line by line
                //TO-DO: split in chunks and use parallel computation
                while ((line = br.readLine()) != null) {
                    try {
                        JSONObject jsonline = (JSONObject) parser.parse(line);
                        Document review = new Document();
                        //add values to fields
                        type.setStringValue((String) jsonline.get("type"));
                        business_id.setStringValue((String) jsonline.get("business_id"));
                        user_id.setStringValue((String) jsonline.get("user_id"));
                        stars.setLongValue((long) jsonline.get("stars"));
                        reviewtext = (String) jsonline.get("text");
                        //non-trivial function being calculated here
                        text.setStringValue(reviewtext);
                        reviewDate = DateTools.stringToDate((String) jsonline.get("date"));
                        date.setLongValue(reviewDate.getTime());
                        jsonVotes = (JSONObject) jsonline.get("votes");
                        votes.setStringValue(jsonVotes.toJSONString());
                        //add fields to document
                        review.add(type);
                        review.add(business_id);
                        review.add(user_id);
                        review.add(stars);
                        review.add(text);
                        review.add(date);
                        review.add(votes);
                        //write the document to index
                        indexWriter.addDocument(review);
                    } catch (ParseException | java.text.ParseException e) {
                        e.printStackTrace();
                        br.close();
                        return false;
                    }
                }//end of while
            } catch (IOException e) {
                e.printStackTrace();
                br.close();
                return false;
            }
            //close buffer reader and commit changes
            br.close();
            indexWriter.commit();
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
            return false;
        }
        System.out.println("Done.");
        return true;
    }

    public void close() throws IOException {
        indexWriter.close();
    }
}
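For completeness, this is how I invoke the indexer (the dataset path is the one mentioned in the javadoc above):

ReviewIndexer indexer = new ReviewIndexer();
indexer.indexReviews("yelp_academic_dataset_review.json");
indexer.close();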
So what is the best thing to do? Should I build a RAM disk and copy the index to the file system once indexing is done, or should I use RAMDirectory, or maybe something else entirely? Many thanks.
Answer 0 (score: 0)
You can try setMaxThreadStates in the IndexWriterConfig:

iwConfig.setMaxThreadStates(50);
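Note that setMaxThreadStates only pays off when several threads feed the same IndexWriter (it controls how many internal thread states the writer keeps for concurrent indexing threads), while your loop is single-threaded, as the TO-DO comment in it already notes. A rough sketch of the multithreaded pattern follows; the pool size and the parseAndBuildDocument() helper are illustrative, and each task has to build its own Document, Field and JSONParser instances, since sharing the reused ones from the single-threaded version is not thread-safe:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.document.Document;

// replacement for the single-threaded while-loop in indexReviews()
ExecutorService pool = Executors.newFixedThreadPool(4); // pool size illustrative
try (BufferedReader br = new BufferedReader(new FileReader(file_path))) {
    String line;
    while ((line = br.readLine()) != null) {
        final String json = line;
        pool.submit(() -> {
            try {
                // parseAndBuildDocument is a hypothetical helper that creates
                // its own JSONParser and Field instances on every call
                Document review = parseAndBuildDocument(json);
                indexWriter.addDocument(review); // IndexWriter is thread-safe
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }
}
pool.shutdown();
try {
    pool.awaitTermination(1, TimeUnit.HOURS);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
indexWriter.commit();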