I need to get confidence scores for the predictions made by spaCy NER.
CSV file
Text,Amount & Nature,Percent of Class
"T. Rowe Price Associates, Inc.","28,223,360 (1)",8.7% (1)
100 E. Pratt Street,Not Listed,Not Listed
"Baltimore, MD 21202",Not Listed,Not Listed
"BlackRock, Inc.","21,871,854 (2)",6.8% (2)
55 East 52nd Street,Not Listed,Not Listed
"New York, NY 10022",Not Listed,Not Listed
The Vanguard Group,"21,380,085 (3)",6.64% (3)
100 Vanguard Blvd.,Not Listed,Not Listed
"Malvern, PA 19355",Not Listed,Not Listed
FMR LLC,"20,784,414 (4)",6.459% (4)
245 Summer Street,Not Listed,Not Listed
"Boston, MA 02210",Not Listed,Not Listed
Code
import csv

import pandas as pd
import spacy

# Load the model once, outside the row loop, rather than on every iteration.
nlp = spacy.load('en_core_web_sm')

data1 = [["Text", "Amount & Nature", "Prediction"]]
with open('/path/table.csv') as csvfile:
    reader1 = csv.DictReader(csvfile)
    for row in reader1:
        AmountNature = row["Amount & Nature"]
        doc1 = nlp(row["Text"])
        for ent in doc1.ents:
            # output = [ent.text, ent.start_char, ent.end_char, ent.label_]
            label1 = ent.label_
            data1.append([str(doc1), AmountNature, label1])

my_df1 = pd.DataFrame(data1)
my_df1.columns = my_df1.iloc[0]
my_df1 = my_df1.drop(my_df1.index[[0]])
my_df1.to_csv('/path/output.csv', index=False,
              header=["Text", "Amount & Nature", "Prediction"])
Output CSV
Text,Amount & Nature,Prediction
"T. Rowe Price Associates, Inc.","28,223,360 (1)",ORG
100 E. Pratt Street,Not Listed,FAC
"Baltimore, MD 21202",Not Listed,CARDINAL
"BlackRock, Inc.","21,871,854 (2)",ORG
55 East 52nd Street,Not Listed,LOC
"New York, NY 10022",Not Listed,DATE
The Vanguard Group,"21,380,085 (3)",ORG
100 Vanguard Blvd.,Not Listed,FAC
"Malvern, PA 19355",Not Listed,DATE
FMR LLC,"20,784,414 (4)",ORG
245 Summer Street,Not Listed,CARDINAL
"Boston, MA 02210",Not Listed,GPE
Given the output above, is it possible to get a confidence score for each of spaCy's NER predictions? If so, how can this be done?
Can someone help me with this?
Answer 0 (score: 1)
Either obtain a fully annotated dataset, or annotate one manually yourself (since you already have a CSV file, that is probably your preferred option). That way you can compare the ground truth against spaCy's predictions. From there you can compute a confusion matrix. I would recommend using the F1 score as your measure of confidence. A sketch of this evaluation is shown below.
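A minimal sketch of that evaluation, assuming you have added a hand-annotated "Gold" column to the output CSV; the file path and column names here are placeholders:

import pandas as pd
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical file: the output CSV with an extra hand-annotated "Gold" column.
df = pd.read_csv('/path/output_annotated.csv')
gold = df['Gold']         # ground-truth labels
pred = df['Prediction']   # spaCy's predicted labels

labels = sorted(set(gold) | set(pred))
print(confusion_matrix(gold, pred, labels=labels))
# Macro-averaging weights every label equally, which is sensible when one
# label (e.g. ORG) dominates the data.
print('F1:', f1_score(gold, pred, labels=labels, average='macro'))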
Answer 1 (score: 0)
No, it is not possible to get a confidence score for the model's predictions in spaCy (unfortunately). As described in issue #881, it may be possible to obtain scores using beam parsing, although that approach appears to be buggy, as mentioned in the thread.
While F1 scores allow for an overall evaluation, I would like spaCy to provide an individual confidence score for each of its own predictions, which it currently does not.
Answer 2 (score: 0)
Although there is no officially supported API for this yet, you can get confidence scores from the beam search using the code from this discussion:
from collections import defaultdict

import spacy

nlp = spacy.load('en_core_web_sm')

text = content  # `content` holds your input string
doc = nlp.make_doc(text)
beams = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)

entity_scores = defaultdict(float)
for score, ents in nlp.entity.moves.get_beam_parses(beams[0]):
    print(score, ents)
    for start, end, label in ents:
        # Each candidate parse adds its probability to every span it proposes.
        entity_scores[(start, end, label)] += score
print('entity_scores', entity_scores)
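Each score yielded by get_beam_parses is the probability of one complete candidate parse, so summing over the parses that agree on a span gives that span a marginal confidence. The next answer turns the same idea into a complete, runnable example.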
Answer 3 (score: 0)
There is no direct way to do this.
First, spaCy implements two distinct objectives for named-entity parsing:
The greedy imitation-learning objective. This objective asks: "If I act from this state, which of the available actions will not introduce new errors?"
The global beam-search objective. Rather than optimizing individual transition decisions, the global model asks whether the final parse is correct. To optimize this objective, we construct the set of the top-k most likely incorrect parses and the top-k most likely correct parses.
Please find the full explanation and code inspiration here.
Note: the code below was tested with spaCy v2.0.13.
import spacy
from collections import defaultdict

nlp = spacy.load('en')
text = 'Hi there! Hope you are doing good. Greetings from India.'

# Tokenize without running the NER pipe; we beam-parse manually below.
with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
# Number of alternate analyses to consider. More is slower, and not
# necessarily better -- you need to experiment on your problem.
beam_width = 16
# This clips solutions at each step. We multiply the score of the top-ranked
# action by this value, and use the result as a threshold. This prevents the
# parser from exploring options that look very unlikely, saving a bit of
# efficiency. Accuracy may also improve, because we've trained on the greedy
# objective.
beam_density = 0.0001

beams, _ = nlp.entity.beam_parse([doc], beam_width, beam_density)

# Sum each span's probability mass over all candidate parses in the beam.
entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

for key in entity_scores:
    start, end, label = key
    score = entity_scores[key]
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))
Output:
Label: GPE, Text: India, Score: 0.9999509961251819
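Since the scores are probabilities summed over the candidate parses in the beam, a span that every parse agrees on approaches 1.0, as "India" does here, while spans the candidate parses disagree on receive proportionally lower scores.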