I am running fine-tuned BERT and ALBERT models for Question Answering, and I am evaluating their performance on a subset of the questions from SQuAD v2.0, using SQuAD's official evaluation script. I use Huggingface transformers.
Below you can find the actual code I am running, with an example (this might also be helpful for anyone trying to run a fine-tuned ALBERT model on SQuAD v2.0):
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("ktrapeznikov/albert-xlarge-v2-squad-v2")
model = AutoModelForQuestionAnswering.from_pretrained("ktrapeznikov/albert-xlarge-v2-squad-v2")
question = "Why aren't the examples of bouregois architecture visible today?"
text = """Exceptional examples of the bourgeois architecture of the later periods were not restored by the communist authorities after the war (like mentioned Kronenberg Palace and Insurance Company Rosja building) or they were rebuilt in socialist realism style (like Warsaw Philharmony edifice originally inspired by Palais Garnier in Paris). Despite that the Warsaw University of Technology building (1899\u20131902) is the most interesting of the late 19th-century architecture. Some 19th-century buildings in the Praga district (the Vistula\u2019s right bank) have been restored although many have been poorly maintained. Warsaw\u2019s municipal government authorities have decided to rebuild the Saxon Palace and the Br\u00fchl Palace, the most distinctive buildings in prewar Warsaw."""
input_dict = tokenizer.encode_plus(question, text, return_tensors="pt")
input_ids = input_dict["input_ids"].tolist()
start_scores, end_scores = model(**input_dict)
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1]).replace('▁', '')
print(answer)
The output is the following:
[CLS] why aren ' t the examples of bour ego is architecture visible today ? [SEP] exceptional examples of the bourgeois architecture of the later periods were not restored by the communist authorities after the war
As you can see, the answer contains BERT's special tokens, namely [CLS] and [SEP].
I understand that in cases where the answer is just [CLS] (i.e. torch.argmax returns tensor(0) for both start_scores and end_scores), it basically means the model thinks there is no answer to the question in the context. In those cases, I simply set the answer to that question to an empty string when running the evaluation script.
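That mapping can be sketched as a small helper (this is my own illustrative code, not part of the official evaluation script):

```python
import torch

# Illustrative helper (not part of the official script): turn span predictions
# into a SQuAD-style answer string, mapping the [CLS]-only case to "".
def extract_answer(start_scores, end_scores, all_tokens):
    start = torch.argmax(start_scores).item()
    end = torch.argmax(end_scores).item()
    if start == 0 and end == 0:
        # the model points at [CLS]: treat the question as unanswerable
        return ""
    return " ".join(all_tokens[start:end + 1]).replace("▁", "")
```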
But for examples like the one above, I wonder whether I should also assume that the model could not find an answer and set the answer to an empty string, or whether I should leave the answer as is when evaluating the model's performance.
I am asking because, as far as I understand, the score computed by the evaluation script may differ depending on which of the two I do (please correct me if I am wrong), and I might not get a true picture of these models' performance.
Answer (score: 1):
You should simply treat them as invalid, because you are trying to predict a valid answer span within the variable text; everything else should be invalid. This is also how Huggingface treats such predictions:

"We could hypothetically create invalid predictions, e.g., predict that the start of the span is in the question. We throw out all invalid predictions."
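The filtering described in that quote can be sketched roughly as follows; the function and parameter names here are illustrative, not the library's actual internals:

```python
# Rough sketch of invalid-span filtering (illustrative, not transformers internals):
# a candidate (start, end) pair is thrown out if it starts before the context
# (i.e. inside the question or special tokens), runs past the context, ends
# before it starts, or exceeds the maximum answer length.
def valid_spans(start_indexes, end_indexes, context_start, context_end, max_answer_length):
    spans = []
    for s in start_indexes:
        for e in end_indexes:
            if s < context_start or e > context_end:
                continue  # span (partly) outside the context
            if e < s or e - s + 1 > max_answer_length:
                continue  # backwards or overly long span
            spans.append((s, e))
    return spans
```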
You should also note that they use a more sophisticated method to obtain the prediction for each question (don't ask me why they show torch.argmax in their examples). Please have a look at the following example:
from transformers.data.processors.squad import SquadResult, SquadExample, SquadFeatures,SquadV2Processor, squad_convert_examples_to_features
from transformers.data.metrics.squad_metrics import compute_predictions_logits, squad_evaluate
###
#your example code
###
outputs = model(**input_dict)

def to_list(tensor):
    return tensor.detach().cpu().tolist()

# grab the start and end logits of the first (and only) example in the batch
output = [to_list(output[0]) for output in outputs]
start_logits, end_logits = output

all_results = []
all_results.append(SquadResult(1000000000, start_logits, end_logits))
#this is the answers section from the evaluation dataset
answers = [{'text':'not restored by the communist authorities', 'answer_start':77}, {'text':'were not restored', 'answer_start':72}, {'text':'not restored by the communist authorities after the war', 'answer_start':77}]
examples = [SquadExample('0', question, text, 'not restored by the communist authorities', 75, 'Warsaw', answers, False)]
# this does basically the same as tokenizer.encode_plus(), but stores the result in a SquadFeatures object and splits the context into chunks if necessary
# arguments: max_seq_length=512, doc_stride=100, max_query_length=64, is_training=True
features = squad_convert_examples_to_features(examples, tokenizer, 512, 100, 64, True)
predictions = compute_predictions_logits(
    examples,
    features,
    all_results,
    20,                     # n_best_size
    30,                     # max_answer_length
    True,                   # do_lower_case
    'pred.file',            # output_prediction_file
    'nbest_file',           # output_nbest_file
    'null_log_odds_file',   # output_null_log_odds_file
    False,                  # verbose_logging
    True,                   # version_2_with_negative
    0.0,                    # null_score_diff_threshold
    tokenizer
)
result = squad_evaluate(examples, predictions)
print(predictions)
for x in result.items():
    print(x)
Output:
OrderedDict([('0', 'communist authorities after the war')])
('exact', 0.0)
('f1', 72.72727272727273)
('total', 1)
('HasAns_exact', 0.0)
('HasAns_f1', 72.72727272727273)
('HasAns_total', 1)
('best_exact', 0.0)
('best_exact_thresh', 0.0)
('best_f1', 72.72727272727273)
('best_f1_thresh', 0.0)
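For intuition about the exact and f1 numbers above: the official scorer's token-level F1 behaves roughly like the simplified sketch below (the real compute_f1 in squad_metrics additionally lowercases the strings and strips punctuation and articles before comparing, which is how the prediction above still reaches about 72.7 against the longest gold answer). An empty prediction scores 1.0 only against an unanswerable question, so leaving a wrong non-empty answer in place does give a different score than emitting an empty string.

```python
from collections import Counter

# Simplified sketch of SQuAD's token-level F1 (no text normalization):
# if either side is empty, the score is 1.0 only when both are empty.
def f1_score(prediction, gold):
    pred_toks = prediction.split()
    gold_toks = gold.split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```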