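Problem measuring precision and recall in Lucene

Time: 2010-12-01 08:24:51

Tags: java lucene

I need to compute precision and recall values in Lucene, and I am using the following source code to do it: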

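import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.PrintWriter;

import org.apache.lucene.benchmark.quality.Judge;
import org.apache.lucene.benchmark.quality.QualityBenchmark;
import org.apache.lucene.benchmark.quality.QualityQuery;
import org.apache.lucene.benchmark.quality.QualityQueryParser;
import org.apache.lucene.benchmark.quality.QualityStats;
import org.apache.lucene.benchmark.quality.trec.TrecJudge;
import org.apache.lucene.benchmark.quality.trec.TrecTopicsReader;
import org.apache.lucene.benchmark.quality.utils.SimpleQQParser;
import org.apache.lucene.benchmark.quality.utils.SubmissionReport;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PrecisionRecall {

    public static void main(String[] args) throws Throwable {

        File topicsFile = new File("C:/Users/Raden/Documents/lucene/LuceneHibernate/LIA/lia2e/src/lia/benchmark/topics.txt");
        File qrelsFile = new File("C:/Users/Raden/Documents/lucene/LuceneHibernate/LIA/lia2e/src/lia/benchmark/qrels.txt");
        Directory dir = FSDirectory.open(new File("C:/Users/Raden/Documents/myindex"));
        Searcher searcher = new IndexSearcher(dir, true);

        // Field whose stored value is compared with the doc names in qrels.txt
        String docNameField = "filename";

        PrintWriter logger = new PrintWriter(System.out, true);

        // Read the TREC topics as QualityQuery objects
        TrecTopicsReader qReader = new TrecTopicsReader();
        QualityQuery qqs[] = qReader.readQueries(
                new BufferedReader(new FileReader(topicsFile)));

        // Build the judge from the TREC qrels file
        Judge judge = new TrecJudge(new BufferedReader(
                new FileReader(qrelsFile)));

        // Check that the queries and the judgements match
        judge.validateData(qqs, logger);

        // Parser that turns each topic's "title" into a Lucene query on "contents"
        QualityQueryParser qqParser = new SimpleQQParser("title", "contents");

        // Run the benchmark
        QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, docNameField);
        SubmissionReport submitLog = null;
        QualityStats stats[] = qrun.execute(judge, submitLog, logger);

        // Print the averaged precision and recall measures
        QualityStats avg = QualityStats.average(stats);
        avg.log("SUMMARY", 2, logger, "  ");
        dir.close();
    }
}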

Here is the content of the topics file:

<top>
<num> Number: 0
<title> apache source
<desc> Description:
<narr> Narrative:
</top>

Here is the content of the qrels file:

# Format:
#
#       qnum   0   doc-name     is-relevant
#
#

0    0   apache1.0.txt       1
0    0   apache1.1.txt       1
0    0   apache2.0.txt       1

The problem is that when I run the code, the precision and recall values all come out as zero. This is the output I get:

0  -  contents:apache contents:source

0 Stats:
Search Seconds:         0.047
DocName Seconds:        0.039
Num Points:            56.000
Num Good Points:        0.000
Max Good Points:        3.000
Average Precision:      0.000
MRR:                    0.000
Recall:                 0.000
Precision At 1:         0.000
Precision At 2:         0.000
Precision At 3:         0.000
Precision At 4:         0.000
Precision At 5:         0.000
Precision At 6:         0.000
Precision At 7:         0.000
Precision At 8:         0.000
Precision At 9:         0.000
Precision At 10:        0.000
Precision At 11:        0.000
Precision At 12:        0.000
Precision At 13:        0.000
Precision At 14:        0.000
Precision At 15:        0.000
Precision At 16:        0.000
Precision At 17:        0.000
Precision At 18:        0.000
Precision At 19:        0.000
Precision At 20:        0.000



SUMMARY
Search Seconds:         0.047
DocName Seconds:        0.039
Num Points:            56.000
Num Good Points:        0.000
Max Good Points:        3.000
Average Precision:      0.000
MRR:                    0.000
Recall:                 0.000
Precision At 1:         0.000
Precision At 2:         0.000
Precision At 3:         0.000
Precision At 4:         0.000
Precision At 5:         0.000
Precision At 6:         0.000
Precision At 7:         0.000
Precision At 8:         0.000
Precision At 9:         0.000
Precision At 10:        0.000
Precision At 11:        0.000
Precision At 12:        0.000
Precision At 13:        0.000
Precision At 14:        0.000
Precision At 15:        0.000
Precision At 16:        0.000
Precision At 17:        0.000
Precision At 18:        0.000
Precision At 19:        0.000
Precision At 20:        0.000

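Can you tell me what I did wrong to make the precision and recall values zero? And what does it mean when precision and recall are zero? I am doing this because I need to measure the performance of my search engine, and precision and recall are one of the ways I intend to do that.

Thanks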

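2 Answers:

Answer 0: (score: 1)

Precision = 0 means that none of the returned results is correct, i.e. none of them is judged relevant. See, for example, the Wikipedia article on precision and recall.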
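
As a rough illustration of what these zeros mean, the following plain-Java sketch computes precision and recall from a list of returned document names and a set of judged-relevant names. It is not Lucene's implementation; the class name PrecisionRecallSketch, its methods, and the non-matching file names are made up for the example, and document names are assumed to be unique within a result list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only; not part of Lucene's benchmark module. */
final class PrecisionRecallSketch {

    /** Number of retrieved names that are judged relevant ("good points"). */
    static int goodPoints(List<String> retrieved, Set<String> relevant) {
        int good = 0;
        for (String name : retrieved) {
            if (relevant.contains(name)) {
                good++;
            }
        }
        return good;
    }

    /** Fraction of the retrieved documents that are relevant. */
    static double precision(List<String> retrieved, Set<String> relevant) {
        return retrieved.isEmpty() ? 0.0
                : (double) goodPoints(retrieved, relevant) / retrieved.size();
    }

    /** Fraction of the relevant documents that were retrieved. */
    static double recall(List<String> retrieved, Set<String> relevant) {
        return relevant.isEmpty() ? 0.0
                : (double) goodPoints(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        // Relevant names taken from the qrels file in the question.
        Set<String> relevant = new HashSet<>(Arrays.asList(
                "apache1.0.txt", "apache1.1.txt", "apache2.0.txt"));

        // Hypothetical result lists: one with no matching names, one with two.
        List<String> noMatches = Arrays.asList("a.txt", "b.txt");
        List<String> twoMatches = Arrays.asList(
                "apache1.0.txt", "a.txt", "apache2.0.txt", "b.txt");

        System.out.println(precision(noMatches, relevant) + " " + recall(noMatches, relevant));
        System.out.println(precision(twoMatches, relevant) + " " + recall(twoMatches, relevant));
    }
}
```

When none of the returned names is in the judged set, both values are 0. That is the situation in the question's output: 56 points, 0 good points, and 3 relevant documents.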

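I would suggest trying a single query and looking at the results you get. There may be a problem with your tokenizer; perhaps the text is not being processed the same way at index time and at query time, and so on.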

Answer 1: (score: 1)

I think the problem lies in the indexing program. If you look closely at

QualityBenchmark qrun = 
       new QualityBenchmark(qqs, qqParser, searcher, docNameField);

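you will see that the search is run for the queries and that hits are matched by document name (that is, Lucene looks up the value of the field named "filename" in the index).

This means that when you index, you need to create an explicit document field that stores the ID of each .txt file in your corpus (in your case, their names), for example by declaring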

public static final String FIELD_NAME = "filename";

and then

document.add(new TextField(FIELD_NAME, "apache1.0.txt", Field.Store.YES));

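and similarly for the other 2 files. Otherwise the benchmark cannot match the hits against the names listed in the qrels file. I had the same problem, but after I added the new custom field it worked like a charm :-)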
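
To make the point about the missing field concrete, here is a simplified model in plain Java. It does not use the Lucene API: a "document" is just a map of stored field values, and DocNameLookupSketch, goodPoints and hit are hypothetical names. The full-path case is an extra assumption added for illustration, on the reasoning that a stored value has to equal the name in qrels.txt exactly to count.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative model only: a hit is represented by its stored fields. */
final class DocNameLookupSketch {

    /** Counts hits whose stored docNameField value appears in the relevant set. */
    static int goodPoints(List<Map<String, String>> hits, String docNameField,
                          Set<String> relevant) {
        int good = 0;
        for (Map<String, String> storedFields : hits) {
            // null when the field was never stored for this document
            String name = storedFields.get(docNameField);
            if (name != null && relevant.contains(name)) {
                good++;
            }
        }
        return good;
    }

    /** Hypothetical helper: a hit with a single stored field. */
    static Map<String, String> hit(String field, String value) {
        return Collections.singletonMap(field, value);
    }

    public static void main(String[] args) {
        Set<String> relevant = new HashSet<>(Arrays.asList(
                "apache1.0.txt", "apache1.1.txt", "apache2.0.txt"));

        // No "filename" field at all: nothing can be matched.
        List<Map<String, String>> noNameField =
                Arrays.asList(hit("contents", "apache source"));
        // "filename" holds a full path (hypothetical): still no exact match.
        List<Map<String, String>> fullPath =
                Arrays.asList(hit("filename", "C:/docs/apache1.0.txt"));
        // "filename" holds the bare name used in qrels.txt: one good point.
        List<Map<String, String>> bareName =
                Arrays.asList(hit("filename", "apache1.0.txt"));

        System.out.println(goodPoints(noNameField, "filename", relevant));
        System.out.println(goodPoints(fullPath, "filename", relevant));
        System.out.println(goodPoints(bareName, "filename", relevant));
    }
}
```

In this model only the third case produces a good point, which is consistent with the question's output showing many hits but "Num Good Points: 0.000".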

N.B. The format of the two benchmark configuration files is based on the TREC9 format; a sample qrels.txt file can be found at http://trec.nist.gov/data/qrels_eng/ and a sample topics.txt file at http://trec.nist.gov/data/topics_eng/topics.501-550.txt.
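
For reference, the four-column layout used by the question's qrels.txt (qnum, 0, doc-name, is-relevant) can be read with a few lines of plain Java. This is only a sketch and not how Lucene's TrecJudge is implemented; QrelsSketch and parse are hypothetical names, and it assumes that blank lines and lines starting with # are skipped and that any judgement greater than 0 counts as relevant.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only; not Lucene's TrecJudge. */
final class QrelsSketch {

    /** Maps each query number to the set of document names judged relevant. */
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> relevantByQuery = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // assumption: comments and blank lines are ignored
            }
            String[] cols = line.split("\\s+");
            if (cols.length != 4) {
                throw new IllegalArgumentException("Expected 4 columns: " + raw);
            }
            // Columns: qnum, unused 0, doc-name, is-relevant
            if (Integer.parseInt(cols[3]) > 0) {
                relevantByQuery
                        .computeIfAbsent(cols[0], k -> new HashSet<>())
                        .add(cols[2]);
            }
        }
        return relevantByQuery;
    }

    public static void main(String[] args) {
        // The judgement lines from the question's qrels.txt.
        List<String> lines = Arrays.asList(
                "# Format:",
                "#       qnum   0   doc-name     is-relevant",
                "",
                "0    0   apache1.0.txt       1",
                "0    0   apache1.1.txt       1",
                "0    0   apache2.0.txt       1");
        System.out.println(parse(lines));
    }
}
```

The names collected this way are the ones that the stored "filename" values in the index have to match.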