How do I access parent and child JSON records when the parent is not the top-level element?

Time: 2019-02-22 19:54:52

Tags: json python-3.x pandas

I am trying to load the SQuAD dataset using Pandas. The JSON elements in my dataset are structured as follows, where every name ending in "s" represents a list:

- data
-- title
-- paragraphs
--- context
--- qas
---- id
---- question
---- answers
----- answerStart
----- answerText
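
For concreteness, a single record with this shape might look like the sketch below. The key names follow the outline above; every value, and the variable names `data` and `first_qa`, are invented for illustration.

```python
# Hypothetical record: key names follow the outline above, values are invented.
data = [
    {
        "title": "Example article",
        "paragraphs": [
            {
                "context": "Paris is the capital of France.",
                "qas": [
                    {
                        "id": "q-001",
                        "question": "What is the capital of France?",
                        "answers": [
                            {"answerStart": 0, "answerText": "Paris"},
                        ],
                    }
                ],
            }
        ],
    }
]

# Walking down the nesting to reach one question and its first answer.
first_qa = data[0]["paragraphs"][0]["qas"][0]
print(first_qa["id"], first_qa["answers"][0]["answerText"])
```

Note that "id" and "answers" sit two lists below the top-level record, which is what makes them awkward to reach from a single flat path.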

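I would like to create a DataFrame with the following columns:

question  title  context  answerText

However, I only want one "answerText" value per question, so only one answer per "qas" entry. Since each "qas" pair has a unique ID, it might be best to create an "answers" DataFrame and then a second DataFrame that looks like this:

qas_id  answer_id

However, I am not sure how best to set up this schema. Here is what I have tried: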

with open(filename) as file:
    data = json.load(file)["data"]
    questions = pd.io.json.json_normalize(data,record_path=["paragraphs","qas","question"],meta=["paragraphs","qas","id"])
    answers = pd.io.json.json_normalize(data,record_path=["paragraphs","qas","answers"],meta=["paragraphs","qas","id"])

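Since meta apparently only allows access to children of the top-level element, how can I create a DataFrame that contains both the "id" element of "qas" and the "answerStart" and "answerText" elements of the answers?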

1 Answer:

Answer 0 (score: 0)

I believe I have a working solution:

import json

import pandas as pd


def readFile(filename):
    with open(filename) as file:
        data = json.load(file)["data"]

    # One row per "qas" entry, carrying the article title along.
    # pd.json_normalize is the pandas >= 1.0 name for pd.io.json.json_normalize.
    qas = pd.json_normalize(data, record_path=["paragraphs", "qas"], meta=["title"])

    # For each question, record the row position of its first answer in the
    # flattened answers table built below. A list (rather than a set) keeps
    # these positions aligned with the rows of qas.
    # This assumes every question has at least one answer; a question with an
    # empty "answers" list would be paired with the next question's answer.
    # Haven't found a more efficient way to do this yet.
    answer_ids = []
    answerId = 0
    for index, row in qas.iterrows():
        answer_ids.append(answerId)
        answerId = answerId + len(row["answers"])
    print("Finished with answer ids.")

    # Map qas pairs to answer IDs directly by row position, instead of
    # merging on the question text, which is not guaranteed to be unique.
    qas["answer_id"] = answer_ids
    print("Finished creating intermediary table.")

    # Load answers into a data frame.
    answers = pd.json_normalize(data, record_path=["paragraphs", "qas", "answers"])
    answers = answers.rename(columns={"text": "answer_text"})
    # Give each answer an ID (its row position).
    answers["answer_id"] = answers.index
    print("Finished creating answers dataframe.")

    # Not needed any longer; we have the answers!
    qas = qas.drop(columns=["answers"])

    # Merge qas with answers; only the first answer of each question matches.
    returnDataFrame = pd.merge(qas, answers, how="inner", on="answer_id")
    print("Done!")
    return returnDataFrame
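
As a possible alternative, `meta` also accepts nested paths written as lists, so a single `json_normalize` call can reach the answers while carrying along fields from each enclosing level. The sketch below assumes pandas 1.0 or later (for `pd.json_normalize`) and that each answer object stores its text under the key `text`, as the rename above suggests (the question's outline calls it "answerText", so adjust the rename to match your file). The function name `flatten_squad`, the `sample` data, and the renamed column names are made up for illustration.

```python
import pandas as pd


def flatten_squad(data):
    """Return one row per question, keeping only its first listed answer."""
    flat = pd.json_normalize(
        data,
        record_path=["paragraphs", "qas", "answers"],
        # Each nested meta path is a list; the resulting column name is the
        # path joined with ".", e.g. "paragraphs.qas.id".
        meta=[
            "title",
            ["paragraphs", "context"],
            ["paragraphs", "qas", "id"],
            ["paragraphs", "qas", "question"],
        ],
    )
    flat = flat.rename(
        columns={
            "text": "answer_text",  # assumed key name, see the note above
            "paragraphs.context": "context",
            "paragraphs.qas.id": "qas_id",
            "paragraphs.qas.question": "question",
        }
    )
    # Keep the first answer of each question.
    flat = flat.drop_duplicates(subset="qas_id", keep="first")
    columns = ["qas_id", "question", "title", "context", "answer_text"]
    return flat[columns].reset_index(drop=True)


# Invented sample data with two questions; the first has two answers.
sample = [
    {
        "title": "T1",
        "paragraphs": [
            {
                "context": "C1",
                "qas": [
                    {
                        "id": "q1",
                        "question": "Q1?",
                        "answers": [
                            {"answer_start": 0, "text": "a"},
                            {"answer_start": 5, "text": "b"},
                        ],
                    },
                    {
                        "id": "q2",
                        "question": "Q2?",
                        "answers": [{"answer_start": 3, "text": "c"}],
                    },
                ],
            }
        ],
    }
]

result = flatten_squad(sample)
print(result)
```

With this approach a question whose "answers" list is empty contributes no rows at all, so it is silently left out of the result rather than being paired with another question's answer.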