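BERT: modifying the run_squad.py prediction file

Time: 2019-06-20 13:12:38

Tags: python json python-3.x tensorflow bert-language-model

I am new to BERT, and I am trying to edit the output of run_squad.py to build a question answering system and get an output file with the following structure: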

{
    "data": [
      {
            "id": "ID1",
            "title": "Alan_Turing",
            "question": "When Alan Turing was born?",
            "context": "Alan Mathison Turing (23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist. [...] . However, both Julius and Ethel wanted their children to be brought up in Britain, so they moved to Maida Vale, London, where Alan Turing was born on 23 June 1912, as recorded by a blue plaque on the outside of the house of his birth, later the Colonnade Hotel. Turing had an elder brother, John (the father of Sir John Dermot Turing, 12th Baronet of the Turing baronets).",
            "answers": [
              {"text": "on 23 June 1912",   "probability": 0.891726, "start_logit": 4.075,  "end_logit": 4.15},
              {"text": "on 23 June", "probability": 0.091726, "start_logit": 2.075, "end_logit": 1.15},
              {"text": "June 1912", "probability": 0.051726, "start_logit": 1.075, "end_logit": 0.854}
            ]
        },
        {
            "id": "ID2",
            "title": "Title2",
            "question": "Question2",
            "context": "Context 2 ...",
            "answers": [
              {"text": "text1", "probability": 0.891726, "start_logit": 4.075, "end_logit": 4.15},
              {"text": "text2", "probability": 0.091726, "start_logit": 2.075, "end_logit": 1.15},
              {"text": "text3", "probability": 0.051726, "start_logit": 1.075, "end_logit": 0.854}
            ]
        }
    ]
}

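First, in the read_squad_examples function (line 227 of run_squad.py), BERT reads the SQuAD JSON file (the input file) into a list of SquadExample objects. That file contains the first four fields I need: id, title, question, and context.

The SquadExamples are then converted to features, after which the write_predictions stage (line 741) can start.

In write_predictions, BERT writes an output file called nbest_predictions.json, which contains all the candidate answers for a given context together with their probabilities.

On lines 891-898, the last four fields I need (text, probability, start_logit, end_logit) are appended: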

    nbest_json = []
    for (i, entry) in enumerate(nbest):
      output = collections.OrderedDict()
      output["text"] = entry.text
      output["probability"] = probs[i]
      output["start_logit"] = entry.start_logit
      output["end_logit"] = entry.end_logit
      nbest_json.append(output)
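
The probs list used in that loop is not part of the excerpt. As far as I can tell from the surrounding code in write_predictions, it is a softmax over start_logit + end_logit of the n-best entries for one question. A minimal standalone sketch of that computation follows; compute_softmax and the sample logits are stand-ins for illustration, not BERT's own helper or real model output.

```python
import math


def compute_softmax(scores):
    """Softmax over a list of raw scores (stand-in helper, not BERT's own)."""
    if not scores:
        return []
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative n-best entries for a single question (made-up values).
entries = [
    {"text": "text1", "start_logit": 4.0757, "end_logit": 4.1554},
    {"text": "text2", "start_logit": -0.5180, "end_logit": 4.1554},
]

# Assumption: each candidate is scored by the sum of its two logits.
total_scores = [e["start_logit"] + e["end_logit"] for e in entries]
probs = compute_softmax(total_scores)

for entry, prob in zip(entries, probs):
    print(entry["text"], round(prob, 4))
```

Because the softmax is taken over every candidate kept for a question, the probabilities of one question's answers sum to 1 only across its full n-best list, not across a truncated top few.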

The output file nbest_predictions.json has the following structure:

{
    "ID-1": [
        {
            "text": "text1", 
            "probability": 0.3617, 
            "start_logit": 4.0757, 
            "end_logit": 4.1554
        }, {
            "text": "text2", 
            "probability": 0.0036, 
            "start_logit": -0.5180, 
            "end_logit": 4.1554
        }
    ], 
    "ID-2": [
        {
            "text": "text1", 
            "probability": 0.2487, 
            "start_logit": -1.6009, 
            "end_logit": -0.2818
        }, {
            "text": "text2", 
            "probability": 0.0070, 
            "start_logit": -0.9566, 
            "end_logit": -1.5770
        }
    ]
}
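
Since the file is a plain JSON object that maps each question id to a list of answers, it can be read back with the standard json module. The sketch below uses an inline string with the same shape instead of the real file, and top_answers is a hypothetical helper name chosen for this example.

```python
import json

# Inline sample with the same shape as nbest_predictions.json:
# question id -> list of candidate answers.
nbest_text = """
{
  "ID-1": [
    {"text": "text1", "probability": 0.3617, "start_logit": 4.0757, "end_logit": 4.1554},
    {"text": "text2", "probability": 0.0036, "start_logit": -0.5180, "end_logit": 4.1554}
  ],
  "ID-2": [
    {"text": "text1", "probability": 0.2487, "start_logit": -1.6009, "end_logit": -0.2818},
    {"text": "text2", "probability": 0.0070, "start_logit": -0.9566, "end_logit": -1.5770}
  ]
}
"""
nbest = json.loads(nbest_text)


def top_answers(nbest, k=1):
    """Return the k most probable answers for each question id."""
    return {
        qas_id: sorted(answers, key=lambda a: a["probability"], reverse=True)[:k]
        for qas_id, answers in nbest.items()
    }


best = top_answers(nbest, k=1)
print(best["ID-1"][0]["text"])
```

With the real file, json.loads(nbest_text) would be replaced by json.load on the opened nbest_predictions.json in the output directory.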

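Now... I do not fully understand how the nbest_predictions file is generated. How can I edit this function to get a JSON file with the structure I showed at the beginning of this post?

With that in mind, I think I have two options:

  1. Create a new data structure and append the fields I need to it.
  2. Edit the write_predictions function so that it builds nbest_predictions.json the way I want.

Which is the better solution?

For now, I have written a new function that reads the input file and adds the id, title, question, and context to a data structure: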

import json
import tensorflow as tf  # both are already imported at the top of run_squad.py


def read_squad_examples2(input_file, is_training):
  """Reads a SQuAD json file into a JSON string of id/title/question/context records."""
  # is_training is unused; it is kept only to mirror read_squad_examples.
  with tf.gfile.Open(input_file, "r") as reader:
    input_data = json.load(reader)["data"]

  sup_data = []

  for entry in input_data:
    for paragraph in entry["paragraphs"]:
      for qa in paragraph["qas"]:
        # Build a fresh dict per question. Reusing one shared dict would
        # leave every list element pointing at the same (last) record.
        sup_data.append({
            "id": qa["id"],
            "title": entry["title"],
            "question": qa["question"],
            "context": paragraph["context"],
        })

  return json.dumps(sup_data)

What I get is:

[{
    "question": "Question 1?",
    "id": "ID 1 ",
    "context": "The context 1",
    "title": "Title 1"
}, {
    "question": "Question 2?",
    "id": "ID 2 ",
    "context": "The context 2",
    "title": "Title 2"
}]

At this point, how can I append an answers field containing "text", "probability", "start_logit" and "end_logit" to this data structure?
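
One possible approach, sketched under assumptions, is to leave write_predictions untouched and join the two structures on the question id after prediction. Here merge_answers and top_k are hypothetical names, and the sample data is made up to match the shapes shown in this post.

```python
import json


def merge_answers(examples_json, nbest, top_k=3):
    """Attach the n-best answers to each example, matched by question id."""
    examples = json.loads(examples_json)
    merged = []
    for example in examples:
        record = dict(example)  # copy so the parsed input is not mutated
        # Questions missing from the predictions get an empty answers list.
        record["answers"] = nbest.get(example["id"], [])[:top_k]
        merged.append(record)
    return {"data": merged}


# Stand-in for the string returned by read_squad_examples2.
examples_json = json.dumps([
    {"id": "ID-1", "title": "Title 1", "question": "Question 1?", "context": "The context 1"},
    {"id": "ID-2", "title": "Title 2", "question": "Question 2?", "context": "The context 2"},
])

# Stand-in for the parsed contents of nbest_predictions.json.
nbest = {
    "ID-1": [
        {"text": "text1", "probability": 0.3617, "start_logit": 4.0757, "end_logit": 4.1554},
        {"text": "text2", "probability": 0.0036, "start_logit": -0.5180, "end_logit": 4.1554},
    ],
}

result = merge_answers(examples_json, nbest)
print(json.dumps(result, indent=4))
```

This corresponds to the first option (a new data structure). It only works if the ids in the input file match the keys of nbest_predictions.json exactly, so stray whitespace in an id would leave that question with an empty answers list.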

Thanks.

0 answers:

No answers