如何为问答神经网络构建训练和测试集?

时间:2018-08-01 14:34:13

标签: json python-3.x split neural-network

我遵循了tutorial on neural networks,在阅读了项目DrQA Facebook Research Team之后,我想要一个长达一周的项目,以使本教程适应问题和答案。 我有一组数据,其中包含来自上下文的问题,上下文和答案,我想知道是否可以制作一个将为神经网络创建load_data ()的Python脚本train_x_orig, train_y, test_x_orig, test_y,。 数据集来自dev set of Stanford Question Answering Dataset SQuAD2.0

{
  "version": "v2.0",
  "data": [
    {
      "title": "Normans",
      "paragraphs": [
        {
          "qas": [
            {
              "question": "In what country is Normandy located?",
              "id": "56ddde6b9a695914005b9628",
              "answers": [
                {
                  "text": "France",
                  "answer_start": 159
                },
                {
                  "text": "France",
                  "answer_start": 159
                },
                {
                  "text": "France",
                  "answer_start": 159
                },
                {
                  "text": "France",
                  "answer_start": 159
                }
              ],
              "is_impossible": false
            },
            ...
            {
              "plausible_answers": [
                {
                  "text": "10th century",
                  "answer_start": 671
                }
              ],
              "question": "When did the Frankish identity emerge?",
              "id": "5ad39d53604f3c001a3fe8d4",
              "answers": [],
              "is_impossible": true
            }
          ],
          "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."
        },
...
        }
      ]
    }
  ]
}

使用图形表示进行总结,这是json文件的外观。

![json file description

该文件的根名为datadata包含文章Wikipedia,titleparagraphs的标题。在paragraphs中,有qas用于“问题和答案”。 qas中的上下文...

我想这样得到:

train_x_orig, train_y, test_x_orig, test_y, classes = load_data()

使用train_x_origtest_x_orig问题和段落以及train_ytest_y答案。

尝试

直到现在我还是这样做:

# for train and test
from sklearn.model_selection import train_test_split

# for the file
import json

def load_data():

    with open('dev-v2.0.json') as f:
        source = json.load(f)

    contexts_questions = []
    answers = []

    # We extract and create a line of context question and answer
    for data in source['data']:
        for paragraphs in data['paragraphs']:
            context = paragraphs['context']
            for qas in paragraphs['qas']:
                question = qas['question']
                if qas['answers']:
                    answer = qas['answers']
                elif qas['plausible_answers']:
                    answer = qas['plausible_answers']
                contexts_questions.append([context, question])
                answers.append([answer])

    # split in train and test sets
    train_x_orig, test_x_orig, train_y, test_y = train_test_split(contexts_questions,answers)
    return train_x_orig, train_y, test_x_orig, test_y

它似乎工作正常,但是我仍然没有设法为文本中未解决的问题找到一种体系结构。

index = 10
print("Context:")
print (train_x_orig[index][0])
print("Question:")
print (train_x_orig[index][1])
print("Answer:")
print (train_y[index])

En efecto como puede verloaquí:

Context:
Gasquet (1908) claimed that the Latin name atra mors (Black Death) for the 14th-century epidemic first appeared in modern times in 1631 in a book on Danish history by J.I. Pontanus...
Question:
In what year was J.I. Pontanus born?
Answer:
[[{'answer_start': 9, 'text': '1908'}]]

答案不是一个好答案。问题是"is_impossible": "true"

0 个答案:

没有答案