Question

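I have two files that I need to iterate over to compute the precision and recall of my named entity tagger. One file is the gold set and the other is my system's output. I just want to understand how to iterate over the sentences in both files and count the exact and partial matches. I only want to count matches for organizations, persons, and locations. Pseudocode, or just an idea to get me started, would be great.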

File 1: Gold set

Sentence 1:
{ORGANIZATION=[Fulton County Grand Jury]}
Sentence 2:
{ORGANIZATION=[City Executive Committee]}
{LOCATION=[City of Atlanta]}
Sentence 3:
{LOCATION=[Fulton]}
{PERSON=[Superior Court Judge Durwood Pye]}
{PERSON=[Mayor-nominate Ivan Allen Jr.]}
Sentence 4:
Sentence 5:
Sentence 6:
{LOCATION=[Fulton]}
Sentence 7:
{LOCATION=[Fulton County]}
Sentence 8:
Sentence 9:
{ORGANIZATION=[City Purchasing Department]}
Sentence 10:
Sentence 11:
Sentence 12:
{ORGANIZATION=[State Welfare Department]}
Sentence 13:
{LOCATION=[Fulton County]}
{ORGANIZATION=[State Welfare Department]}
{LOCATION=[Fulton County]}

File 2: My output

Sentence 1:
{ORGANIZATION=[Fulton County Grand Jury], DATE=[Friday], LOCATION=[Atlanta]}
Sentence 2:
{ORGANIZATION=[City Executive Committee], LOCATION=[Atlanta]}
Sentence 3:
{ORGANIZATION=[Fulton Superior Court Judge Durwood Pye], DATE=[September October], PERSON=[Ivan Allen Jr.]}
Sentence 4:
Sentence 5:
{LOCATION=[Georgia]}
Sentence 6:
Sentence 7:
{LOCATION=[Atlanta, Fulton County]}
Sentence 8:
Sentence 9:
{ORGANIZATION=[City Purchasing Department]}
Sentence 10:
{LOCATION=[Georgia]}
Sentence 11:
Sentence 12:
{ORGANIZATION=[State Welfare Department]}
Sentence 13:
{ORGANIZATION=[State Welfare Department], LOCATION=[Fulton County, Fulton County]}

Answer 1

You can parse the files and collect the data you need as follows. The snippet below collects all lines that contain organizations.

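    // Requires java.io.File, java.util.Scanner, java.util.List, java.util.ArrayList.
    List<String> orgLines = new ArrayList<>();
    // try-with-resources closes the file; new Scanner(File) may throw FileNotFoundException.
    try (Scanner scanner = new Scanner(new File("path-to-file"))) {
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            // contains() also catches lines where ORGANIZATION is not the first key.
            if (line.contains("ORGANIZATION=")) {
                orgLines.add(line);
            }
        }
    }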
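The snippet above keeps whole lines, which is hard to compare entity by entity. Here is a minimal sketch that groups the lines under their `Sentence N:` header and splits each `{TYPE=[a, b], ...}` map into per-type lists. It assumes the layout shown in the question, that entities inside the brackets are separated by commas, and that entity text never contains a comma or `]`. The names `EntityFileParser`, `parseLine` and `parseSentences` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityFileParser {
    // Only these types are scored; DATE and anything else is ignored.
    static final Set<String> TYPES = Set.of("ORGANIZATION", "PERSON", "LOCATION");

    // Matches one "TYPE=[entity, entity]" group inside the braces.
    static final Pattern GROUP = Pattern.compile("([A-Z]+)=\\[([^\\]]*)\\]");

    // Adds the entities found on one "{TYPE=[...], ...}" line to the given map.
    // Assumption: entities are comma-separated and contain no comma or ']'.
    public static void parseLine(String line, Map<String, List<String>> into) {
        Matcher m = GROUP.matcher(line);
        while (m.find()) {
            String type = m.group(1);
            if (!TYPES.contains(type) || m.group(2).isEmpty()) {
                continue;
            }
            into.computeIfAbsent(type, k -> new ArrayList<>())
                .addAll(Arrays.asList(m.group(2).split(",\\s*")));
        }
    }

    // Returns one map per "Sentence N:" header, in file order.
    // Several entity lines under the same header are merged into one map.
    public static List<Map<String, List<String>>> parseSentences(List<String> lines) {
        List<Map<String, List<String>>> sentences = new ArrayList<>();
        Map<String, List<String>> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("Sentence ")) {
                current = new LinkedHashMap<>();
                sentences.add(current);
            } else if (current != null && line.startsWith("{")) {
                parseLine(line, current);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "Sentence 1:",
            "{LOCATION=[Fulton County]}",
            "{ORGANIZATION=[State Welfare Department]}",
            "{LOCATION=[Fulton County]}",
            "Sentence 2:");
        System.out.println(parseSentences(lines));
    }
}
```

Since both files list the same sentence headers in the same order, the two resulting lists can be walked in parallel by index. To read a real file, `java.nio.file.Files.readAllLines` gives you the `List<String>` to pass in.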

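Once you have the results for both files, you can use retainAll to find the exact matches. Note that this compares whole lines, so a line only counts as a match if the entire map is identical in both files.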

    orgLines.retainAll(orgLines2);
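One thing to watch with the `retainAll` call above: it keeps every element of the receiver that occurs anywhere in the argument, so repeated entities (such as the two `Fulton County` entries in sentence 13 of the gold set) are not paired one-to-one. If you want each gold entity to match at most one system entity, a small counting loop is one option. This is only a sketch, and `ExactMatchCounter` and `countExact` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class ExactMatchCounter {
    // Counts entities present in both lists, pairing each gold entity
    // with at most one system entity so duplicates are not over-counted.
    public static int countExact(List<String> gold, List<String> system) {
        List<String> remaining = new ArrayList<>(system); // copy, so the input is not mutated
        int matches = 0;
        for (String entity : gold) {
            // remove(Object) deletes only the first equal element and reports success
            if (remaining.remove(entity)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> gold = List.of("Fulton County", "State Welfare Department", "Fulton County");
        List<String> system = List.of("State Welfare Department", "Fulton County");
        System.out.println(countExact(gold, system)); // prints 2
    }
}
```

With the same two lists, `retainAll` on a copy of the gold list would leave all three gold entries in place, because each of them occurs somewhere in the system list.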

For partial matches, you need to iterate over all entries and count them according to your own matching logic.
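As a rough illustration of that last step, here is a sketch that counts exact and partial matches for one sentence and one entity type, then derives precision (matches divided by the number of system entities) and recall (matches divided by the number of gold entities). The partial-match rule is purely an assumption on my part: two entities match partially when one string contains the other, e.g. `City of Atlanta` and `Atlanta`. You may well want token overlap or something stricter. `NerScorer` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class NerScorer {
    // Assumption: a "partial match" means one entity string contains the other
    // but they are not equal. Other definitions are equally valid.
    public static boolean isPartial(String gold, String system) {
        return !gold.equals(system) && (gold.contains(system) || system.contains(gold));
    }

    // Returns {exact, partial} for one sentence and one entity type.
    public static int[] countMatches(List<String> gold, List<String> system) {
        List<String> goldLeft = new ArrayList<>(gold);
        List<String> sysLeft = new ArrayList<>(system);
        int exact = 0;
        int partial = 0;
        // First pass: pair off exact matches one-to-one.
        for (String g : gold) {
            if (sysLeft.remove(g)) {
                goldLeft.remove(g);
                exact++;
            }
        }
        // Second pass: pair each leftover gold entity with one leftover system entity.
        for (String g : goldLeft) {
            for (int i = 0; i < sysLeft.size(); i++) {
                if (isPartial(g, sysLeft.get(i))) {
                    sysLeft.remove(i);
                    partial++;
                    break;
                }
            }
        }
        return new int[] {exact, partial};
    }

    // Precision: fraction of system entities that were matched.
    public static double precision(int matched, int systemTotal) {
        return systemTotal == 0 ? 0.0 : (double) matched / systemTotal;
    }

    // Recall: fraction of gold entities that were matched.
    public static double recall(int matched, int goldTotal) {
        return goldTotal == 0 ? 0.0 : (double) matched / goldTotal;
    }

    public static void main(String[] args) {
        // Illustrative LOCATION lists, loosely based on the question's data.
        List<String> gold = List.of("City of Atlanta", "Fulton County");
        List<String> system = List.of("Atlanta", "Fulton County", "Georgia");
        int[] c = countMatches(gold, system);
        System.out.printf("exact=%d partial=%d%n", c[0], c[1]);
        System.out.printf("strict  P=%.2f R=%.2f%n",
            precision(c[0], system.size()), recall(c[0], gold.size()));
        System.out.printf("lenient P=%.2f R=%.2f%n",
            precision(c[0] + c[1], system.size()), recall(c[0] + c[1], gold.size()));
    }
}
```

In practice you would call `countMatches` per sentence and per type, sum the counts and the list sizes over all sentences, and compute precision and recall once at the end rather than averaging per-sentence values.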

Answer 2

If you are using Stanford NER, why not use the built-in command to test the classifier?

    java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier path/to/ner-model.ser.gz -testFile gold-annotated-text.tsv

You will have to convert your gold set to the format this command expects (see the reference below).

Reference: http://nlp.stanford.edu/software/crf-faq.html#a
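As far as I understand the FAQ, the default test file layout is one token per line, followed by a tab and its label, with `O` for tokens outside any entity. A rough sketch of producing such rows for one sentence is shown here. It assumes you still have the sentence text, that whitespace tokens are close enough to the tokenization the classifier expects, and that each entity occurs once in the sentence (only the first occurrence is labeled). `GoldTsvSketch`, `toTsv` and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class GoldTsvSketch {
    // Labels each token with its entity type, or "O" when it belongs to no entity.
    // Assumption: an entity's words appear as a contiguous run in the token list.
    public static List<String> toTsv(List<String> tokens, Map<String, List<String>> entities) {
        String[] labels = new String[tokens.size()];
        Arrays.fill(labels, "O");
        for (Map.Entry<String, List<String>> entry : entities.entrySet()) {
            for (String entity : entry.getValue()) {
                List<String> words = Arrays.asList(entity.split("\\s+"));
                // Start index of the first occurrence of the entity's words, or -1.
                int start = Collections.indexOfSubList(tokens, words);
                if (start < 0) {
                    continue; // entity text not found in this sentence
                }
                for (int i = 0; i < words.size(); i++) {
                    labels[start + i] = entry.getKey();
                }
            }
        }
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            rows.add(tokens.get(i) + "\t" + labels[i]);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sentence tokens; the question does not include the raw text.
        List<String> tokens = List.of("The", "Fulton", "County", "Grand", "Jury", "said", "Friday");
        Map<String, List<String>> gold = Map.of("ORGANIZATION", List.of("Fulton County Grand Jury"));
        toTsv(tokens, gold).forEach(System.out::println);
    }
}
```

The label names have to match the ones your model was trained with, so check them against the classifier you load before relying on the scores.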

Iterating over two files containing named-entity maps and computing precision and recall
