How do I extract two values from nested data into a pandas DataFrame?

Asked: 2019-10-07 11:44:43

Tags: python pandas dictionary

I am using Stanford's dataset (see Dev Set 2.0). The file is in JSON format. When I read it, I get a dictionary, which I then convert to a DataFrame:

import json
import pandas as pd

# Read the JSON file; the context manager closes it automatically
with open("dev-v2.0.json", "r") as json_file:
    json_data = json.load(json_file)

df = pd.DataFrame.from_dict(json_data)
df = df[0:2]  # for this example, only a subset

All the information I need is in the df['data'] column. Each row holds data in this format:

{'title': 'Normans', 'paragraphs': [{'qas': [{'question': 'In what country is Normandy located?', 'id': '56ddde6b9a695914005b9628', 'answers': [{'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}], 'is_impossible': False}, {'question': 'When were the Normans in Normandy?', 'id': '56ddde6b9a695914005b9629', 'answers': [{'text': '10th and 11th centuries', 'answer_start': 94}, {'text': 'in the 10th and 11th centuries', 'answer_start': 87}

I want to pull all questions and answers out of every row of the DataFrame. Ideally, the output would look like this:

Question                                         Answer 
'In what country is Normandy located?'          'France'
'When were the Normans in Normandy?'            'in the 10th and 11th centuries'

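Sorry! I have read the 'Good example' post, but I find it hard to produce reproducible data for this example, because it looks like a dictionary that contains a list, the list contains small dictionaries, which sit inside another dictionary, and then there is yet another dictionary... When I use print(df["data"]), it only prints a small part (which does not help in reproducing the problem).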

print(df['data'])
0    {'title': 'Normans', 'paragraphs': [{'qas': [{...
1    {'title': 'Computational_complexity_theory', '...
Name: data, dtype: object
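
One way a small, postable sample could be produced from such a deeply nested record is to cut every list down to its first element and print the result as indented JSON. The sketch below is only an illustration: `truncate` is a hypothetical helper, and `record` is a hand-typed stand-in built from the fragment quoted above, not the full row.

```python
import json


def truncate(obj, max_items=1, max_chars=60):
    """Return a copy of JSON-like data with lists and long strings shortened."""
    if isinstance(obj, dict):
        return {k: truncate(v, max_items, max_chars) for k, v in obj.items()}
    if isinstance(obj, list):
        # Keep only the first `max_items` elements of every list
        return [truncate(v, max_items, max_chars) for v in obj[:max_items]]
    if isinstance(obj, str) and len(obj) > max_chars:
        return obj[:max_chars] + "..."
    return obj


# Stand-in for one row of df["data"], typed from the fragment shown above
record = {
    "title": "Normans",
    "paragraphs": [{"qas": [
        {"question": "In what country is Normandy located?",
         "id": "56ddde6b9a695914005b9628",
         "answers": [{"text": "France", "answer_start": 159},
                     {"text": "France", "answer_start": 159}],
         "is_impossible": False},
        {"question": "When were the Normans in Normandy?",
         "id": "56ddde6b9a695914005b9629",
         "answers": [{"text": "10th and 11th centuries", "answer_start": 94},
                     {"text": "in the 10th and 11th centuries", "answer_start": 87}]},
    ]}],
}

small = truncate(record)
print(json.dumps(small, indent=2))
```

With the real data, the same call would presumably be `truncate(df["data"][0])`; the indented output shows every nesting level while staying short enough to paste into a question.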

Thanks a lot!

2 answers:

Answer 0 (score: 1)

This should get you started.

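I am not sure how you want to handle the case where the answers field is empty, so you may want to come up with a better solution. Example: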

"question": " After 1945, what challenged the British empire?", "id": "5ad032b377cf76001a686e0d", "answers": [], "is_impossible": true

import json
import pandas as pd


with open("dev-v2.0.json", "r") as f:
    data = json.load(f)

questions, answers = [], []

# Walk articles -> paragraphs -> question/answer entries
for article in data["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            q = qa["question"]
            try:  # only takes the first element, since the rest of the values are duplicated?
                a = qa["answers"][0]["text"]
            except IndexError:  # when `"answers": []`
                a = "None"

            questions.append(q)
            answers.append(a)

d = {
    "Questions": questions,
    "Answers": answers,
}

pd.DataFrame(d)

                                               Questions                      Answers
0                   In what country is Normandy located?                       France
1                     When were the Normans in Normandy?      10th and 11th centuries
2          From which countries did the Norse originate?  Denmark, Iceland and Norway
3                              Who was the Norse leader?                        Rollo
4      What century did the Normans first gain their ...                 10th century
...                                                  ...                          ...
11868  What is the seldom used force unit equal to on...                       sthène
11869           What does not have a metric counterpart?                         None
11870  What is the force exerted by standard gravity ...                         None
11871  What force leads to a commonly used unit of mass?                         None
11872        What force is part of the modern SI system?                         None

[11873 rows x 2 columns]
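
For the empty-answer case mentioned above, one possible refinement is to keep every distinct answer text per question and to carry the `is_impossible` flag along, instead of storing the string "None". This is only a sketch: `extract_qa` is a hypothetical helper, and `sample` is assembled from the snippets quoted in this thread (the third question is placed under the same article purely for illustration).

```python
def extract_qa(dataset):
    """Flatten a SQuAD-style dict into one row (dict) per question."""
    rows = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # dict.fromkeys drops duplicate answer texts and keeps their order
                texts = list(dict.fromkeys(a["text"] for a in qa["answers"]))
                rows.append({
                    "Question": qa["question"],
                    "Answers": texts,  # empty list when there is no answer
                    "Impossible": qa.get("is_impossible", False),
                })
    return rows


# Hand-built sample; not the real file
sample = {"data": [{"title": "Normans", "paragraphs": [{"qas": [
    {"question": "In what country is Normandy located?",
     "id": "56ddde6b9a695914005b9628",
     "answers": [{"text": "France", "answer_start": 159},
                 {"text": "France", "answer_start": 159}],
     "is_impossible": False},
    {"question": "When were the Normans in Normandy?",
     "id": "56ddde6b9a695914005b9629",
     "answers": [{"text": "10th and 11th centuries", "answer_start": 94},
                 {"text": "in the 10th and 11th centuries", "answer_start": 87}]},
    {"question": " After 1945, what challenged the British empire?",
     "id": "5ad032b377cf76001a686e0d",
     "answers": [],
     "is_impossible": True},
]}]}]}

rows = extract_qa(sample)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` should give one row per question. Keeping a list of answers avoids silently dropping variants such as 'in the 10th and 11th centuries', which is the answer shown in the desired output of the question.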

Answer 1 (score: 1)

The following page (Convert SQuAD (Stanford Q&A) JSON to a pandas DataFrame) covers converting dev-v1.1.json to a DataFrame.
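
The linked page is not reproduced here. As a rough sketch of what such a conversion might look like (not necessarily what that page does), `pd.json_normalize` can flatten the `paragraphs` -> `qas` nesting in one call. This assumes pandas 1.0 or later, where `json_normalize` is available at the top level; `sample` is a hand-built stand-in for the loaded JSON, using only values quoted in this thread.

```python
import pandas as pd

# Hand-built stand-in for json.load(...) on the real file
sample = {"data": [{"title": "Normans", "paragraphs": [{"qas": [
    {"question": "In what country is Normandy located?",
     "answers": [{"text": "France", "answer_start": 159}],
     "is_impossible": False},
    {"question": " After 1945, what challenged the British empire?",
     "answers": [],
     "is_impossible": True},
]}]}]}

# One row per entry of data[i]["paragraphs"][j]["qas"]
qas = pd.json_normalize(sample["data"], record_path=["paragraphs", "qas"])

# The "answers" column still holds lists of dicts; take the first text if any
qa = pd.DataFrame({
    "Question": qas["question"],
    "Answer": [a[0]["text"] if a else None for a in qas["answers"]],
})
print(qa)
```

Questions without answers end up as missing values in the `Answer` column, so they can be filtered with `qa.dropna()` if they are not needed.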