Question

我正在尝试为小型语料库创建Term-Document矩阵，以进一步试验LSI。但是，我找不到用Lucene 4.4做的方法。

我知道如何为每个文档获取TermVector，如下所示：

//create boolean query to search for a specific document (not shown)
TopDocs hits = searcher.search(query, 1);    
Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
System.out.println(termVector.size());  //just testing

我认为我可以将所有termVector结合在一起作为矩阵中的列来获取矩阵。但是，不同文档的termVector具有不同的大小。我们不知道如何将0填入termVector。所以，当然，这种方法不起作用。

因此，我想知道是否有人可以告诉我如何使用Lucene 4.4创建Term-Document向量？（如果可能，请显示示例代码）。
如果Lucene不支持此功能，您推荐的其他方式是什么？

非常感谢，

Answer 1

我找到了问题here的解决方案。 Sujit先生给出了非常详细的例子，虽然代码是用较旧版本的Lucene编写的，所以很多东西都要改变。我完成代码后会更新详细信息。这是我的解决方案，适用于Lucene 4.4

public class BuildTermDocumentMatrix {
public BuildTermDocumentMatrix(File index, File corpus) throws IOException{
    reader = DirectoryReader.open(FSDirectory.open(index));
    searcher = new IndexSearcher(reader);
    this.corpus = corpus;
    termIdMap = computeTermIdMap(reader);
}   
/**
*  Map term to a fix integer so that we can build document matrix later.
*  It's used to assign term to specific row in Term-Document matrix
*/
private Map<String, Integer> computeTermIdMap(IndexReader reader) throws IOException {
    Map<String,Integer> termIdMap = new HashMap<String,Integer>();
    int id = 0;
    Fields fields = MultiFields.getFields(reader);
    Terms terms = fields.terms("contents");
    TermsEnum itr = terms.iterator(null);
    BytesRef term = null;
    while ((term = itr.next()) != null) {               
        String termText = term.utf8ToString();              
        if (termIdMap.containsKey(termText))
            continue;
        //System.out.println(termText); 
        termIdMap.put(termText, id++);
    }

    return termIdMap;
}

/**
*  build term-document matrix for the given directory
*/
public RealMatrix buildTermDocumentMatrix () throws IOException {
    //iterate through directory to work with each doc
    int col = 0;
    int numDocs = countDocs(corpus);            //get the number of documents here      
    int numTerms = termIdMap.size();    //total number of terms     
    RealMatrix tdMatrix = new Array2DRowRealMatrix(numTerms, numDocs);

    for (File f : corpus.listFiles()) {
        if (!f.isHidden() && f.canRead()) {
            //I build term document matrix for a subset of corpus so
            //I need to lookup document by path name. 
            //If you build for the whole corpus, just iterate through all documents
            String path = f.getPath();
            BooleanQuery pathQuery = new BooleanQuery();
            pathQuery.add(new TermQuery(new Term("path", path)), BooleanClause.Occur.SHOULD);
            TopDocs hits = searcher.search(pathQuery, 1);

            //get term vector
            Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
            TermsEnum itr = termVector.iterator(null);
            BytesRef term = null;

            //compute term weight
            while ((term = itr.next()) != null) {               
                String termText = term.utf8ToString();              
                int row = termIdMap.get(termText);
                long termFreq = itr.totalTermFreq();
                long docCount = itr.docFreq();
                double weight = computeTfIdfWeight(termFreq, docCount, numDocs);
                tdMatrix.setEntry(row, col, weight);
            }
            col++;
        }
    }       
    return tdMatrix;
}
}

Answer 2

也可以参考此代码。在最新的Lucene版本中，这将非常容易。

Example 15
public void testSparseFreqDoubleArrayConversion() throws Exception {
    Terms fieldTerms = MultiFields.getTerms(index, "text");
    if (fieldTerms != null && fieldTerms.size() != -1) {
      IndexSearcher indexSearcher = new IndexSearcher(index);
      for (ScoreDoc scoreDoc : indexSearcher.search(new MatchAllDocsQuery(), Integer.MAX_VALUE).scoreDocs) {
        Terms docTerms = index.getTermVector(scoreDoc.doc, "text");
        Double[] vector = DocToDoubleVectorUtils.toSparseLocalFreqDoubleArray(docTerms, fieldTerms);
        assertNotNull(vector);
        assertTrue(vector.length > 0);
      }
    }
  }

使用Lucene 4.4生成Term-Document矩阵

2 个答案: