Question

I implemented a program that ranks documents by their TF-IDF similarity score against the user's input.

Here is the program:

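import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Ranking {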

    private static int maxHits = 10;
    private static Connection connect = null;
    private static PreparedStatement preparedStatement = null;
    private static ResultSet resultSet = null;

    public static void main(String[] args) throws Exception {        
        System.out.println("Enter your paper title: ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String paperTitle = null;
        paperTitle = br.readLine(); 

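        Class.forName("com.mysql.jdbc.Driver");
        connect = DriverManager.getConnection("jdbc:mysql://localhost/arnetminer?"
                  + "user=root&password=1234");
        // Bind the title as a parameter instead of concatenating it into the SQL,
        // so a title containing a quote cannot break (or inject into) the query.
        preparedStatement = connect.prepareStatement(
                "SELECT stoppedstemmedtitle FROM arnetminer.new_bigdataset WHERE title = ?");
        preparedStatement.setString(1, paperTitle);
        resultSet = preparedStatement.executeQuery();
        // Stop early if no row matches, rather than failing inside getString().
        if (!resultSet.next()) {
            System.out.println("No paper found with that title.");
            return;
        }
        String stoppedstemmedtitle = resultSet.getString(1);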

        String querystr = args.length > 0 ? args[0] :stoppedstemmedtitle;
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        Query q = new QueryParser(Version.LUCENE_42, "stoppedstemmedtitle", analyzer).parse(querystr);

        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("E:/Lucene/new_bigdataset_index")));        
        IndexSearcher searcher = new IndexSearcher(reader);

        VSMSimilarity vsmSimiliarty = new VSMSimilarity();  
        searcher.setSimilarity(vsmSimiliarty);
        TopDocs hits = searcher.search(q, maxHits);
        ScoreDoc[] scoreDocs = hits.scoreDocs;

        PrintWriter writer = new PrintWriter("E:/Lucene/result/1.txt", "UTF-8");

        // Print each hit as "paperkey,title,year,score" and mirror it to the result file.
        for (int n = 0; n < scoreDocs.length; ++n) {
            ScoreDoc sd = scoreDocs[n];
            System.out.println(sd);
            float score = sd.score;
            int docId = sd.doc;
            Document d = searcher.doc(docId);
            String fileName = d.get("title");
            String year = d.get("pub_year");
            String paperkey = d.get("paperkey");
            System.out.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
            writer.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
        }
        // Release the file, the index reader, and the database connection.
        writer.close();
        reader.close();
        connect.close();

    }


}

and

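import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class VSMSimilarity extends DefaultSimilarity {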

    // Weighting codes
    public boolean doBasic     = true;  // Basic tf-idf
    public boolean doSublinear = false; // Sublinear tf-idf
    public boolean doBoolean   = false; // Boolean

    //Scoring codes
    public boolean doCosine    = true;
    public boolean doOverlap   = false;

    // term frequency in document = measure of how often a term appears in the document
    public float tf(int freq) {     

        return super.tf(freq);
    }

    // inverse document frequency = measure of how often the term appears across the index
    public float idf(int docFreq, int numDocs) {

        // The default behaviour of Lucene is 1 + log (numDocs/(docFreq+1)), which is what we want (default VSM model)
        return super.idf(docFreq, numDocs); 
    }

    // normalization factor so that queries can be compared 
    public float queryNorm(float sumOfSquaredWeights){

        return super.queryNorm(sumOfSquaredWeights);
    }

    // number of terms in the query that were found in the document
    public float coord(int overlap, int maxOverlap) {

        // else: can't get here
        return super.coord(overlap, maxOverlap);
    }

    // Note: this happens at index time, which we don't take advantage of (too many indices!)
    public float computeNorm(String fieldName, FieldInvertState state){

        // else: can't get here
        return super.computeNorm(state);
    }
}

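However, for the exact document that is 100% similar to the input, it does not return a score of 1.

If I enter the following as user input: Logic Based Knowledge Representation, these are the results and TF-IDF scores I get (5.165 for the document that is 100% similar to the input):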

3086,Logic Based Knowledge Representation.,1999,5.165
33586,A Logic for the Representation of Spatial Knowledge.,1991,4.663
328937,Logic Programming for Knowledge Representation.,2007,4.663
219720,Logic for Knowledge Representation.,1984,4.663
487587,Knowledge Representation with Logic Programs.,1997,4.663
806195,Logic Programming as a Representation of Knowledge.,1983,4.663
806833,The Role of Logic in Knowledge Representation.,1983,4.663
744914,Knowledge Representation and Logic Programming.,2002,4.663
1113802,Knowledge Representation in Fuzzy Logic.,1989,4.663
984276,Logic Programming and Knowledge Representation.,1994,4.663

Is this normal, or is something wrong with my TF-IDF implementation?

Thanks a lot!

Answer 1

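First, Lucene already ships a TF-IDF similarity: org.apache.lucene.search.similarities.TFIDFSimilarity

Second:

tf-idf, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

I emphasized "word" because this definition of tf-idf applies directly only to a single-word query. When the query contains several words, the per-word values are combined like this:

One of the simplest ranking functions is computed by summing the tf-idf for each query term.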

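So that is why tf-idf can give you a score greater than 1: the score is a sum over the query terms, not a similarity normalized to the range 0 to 1.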
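To make the summation concrete, here is a minimal sketch in plain Java with no Lucene dependency. It is an illustration only, under several assumptions: the three-document corpus, the class name TfIdfSumDemo, and the helper methods are all hypothetical; tf is taken as the square root of the term count, and idf uses the form quoted in the comment in the question's code. Lucene's own scoring applies further factors, so these numbers are not expected to match the 5.165 above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TfIdfSumDemo {

    // Hypothetical toy corpus; each document is a list of lower-cased title terms.
    static final List<List<String>> CORPUS = Arrays.asList(
            Arrays.asList("logic", "based", "knowledge", "representation"),
            Arrays.asList("logic", "for", "knowledge", "representation"),
            Arrays.asList("fuzzy", "logic"));

    // Number of corpus documents that contain the term.
    static int docFreq(String term) {
        int count = 0;
        for (List<String> doc : CORPUS) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // idf in the form quoted in the question: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(String term) {
        return 1.0 + Math.log((double) CORPUS.size() / (docFreq(term) + 1));
    }

    // Simplified ranking function: sum of tf * idf over the query terms,
    // with tf assumed to be the square root of the raw term count.
    static double score(List<String> query, List<String> doc) {
        double sum = 0.0;
        for (String term : query) {
            int freq = Collections.frequency(doc, term);
            if (freq > 0) {
                sum += Math.sqrt(freq) * idf(term);
            }
        }
        return sum;
    }

    // Raw score of every corpus document for the given query.
    static double[] rawScores(List<String> query) {
        double[] scores = new double[CORPUS.size()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = score(query, CORPUS.get(i));
        }
        return scores;
    }

    // Divide every score by the best one, so the top hit becomes exactly 1.0.
    static double[] normalize(double[] scores) {
        double max = 0.0;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = max > 0.0 ? scores[i] / max : 0.0;
        }
        return out;
    }

    public static void main(String[] args) {
        // The query is identical to the first document.
        List<String> query = CORPUS.get(0);
        double[] raw = rawScores(query);
        double[] norm = normalize(raw);
        for (int i = 0; i < raw.length; i++) {
            System.out.printf("doc %d: raw=%.3f normalized=%.3f%n", i, raw[i], norm[i]);
        }
    }
}
```

In this sketch the exact match gets a raw score above 1 because four per-term values are added together. Dividing by the top score changes only the scale, not the ranking, and the normalized values are relative to one query, so they should not be compared across different queries.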
