I am working with the Stanford dataset (see Dev Set 2.0). The file is in JSON format. When I read it, I get a dictionary, which I then convert to a DataFrame:
import json
import pandas as pd  # needed for pd.DataFrame below

# The with-block closes the file automatically
with open("dev-v2.0.json", "r") as json_file:
    json_data = json.load(json_file)

df = pd.DataFrame.from_dict(json_data)
df = df[0:2]  # for this example, only a subset
All the information I need is in the df['data'] column. Each row holds data in this format:
{'title': 'Normans', 'paragraphs': [{'qas': [{'question': 'In what country is Normandy located?', 'id': '56ddde6b9a695914005b9628', 'answers': [{'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}], 'is_impossible': False}, {'question': 'When were the Normans in Normandy?', 'id': '56ddde6b9a695914005b9629', 'answers': [{'text': '10th and 11th centuries', 'answer_start': 94}, {'text': 'in the 10th and 11th centuries', 'answer_start': 87}
I want to extract every question and answer from all rows of the DataFrame. Ideally, the output would look like this:
Question Answer
'In what country is Normandy located?' 'France'
'When were the Normans in Normandy?' 'in the 10th and 11th centuries'
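Sorry! I have read the 'Good example' post, but I find it hard to produce reproducible data for this example, because the structure looks like a dictionary that contains a list, which contains a small dictionary, inside another dictionary, and then yet another dictionary... When I use print(df["data"]), it only prints a small part of each row (which does not help in reproducing the problem).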
print(df['data'])
0 {'title': 'Normans', 'paragraphs': [{'qas': [{...
1 {'title': 'Computational_complexity_theory', '...
Name: data, dtype: object
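One possible way around the truncated preview, as a rough sketch: cut the nested structure down to a single paragraph and question, then dump it with json.dumps. The inline `sample` dictionary is a hand-trimmed stand-in for one entry of df['data'], and `trim` is a hypothetical helper name, not part of pandas or the dataset tooling.

```python
import json

# Hand-trimmed stand-in for one entry of df['data'] (values copied from the row shown above)
sample = {
    "title": "Normans",
    "paragraphs": [
        {
            "qas": [
                {
                    "question": "In what country is Normandy located?",
                    "id": "56ddde6b9a695914005b9628",
                    "answers": [
                        {"text": "France", "answer_start": 159},
                        {"text": "France", "answer_start": 159},
                    ],
                    "is_impossible": False,
                }
            ]
        }
    ],
}


def trim(article, n_paragraphs=1, n_questions=1):
    """Hypothetical helper: keep only the first few paragraphs and questions."""
    trimmed = {"title": article["title"], "paragraphs": []}
    for paragraph in article["paragraphs"][:n_paragraphs]:
        kept = dict(paragraph)  # shallow copy so the original is not modified
        kept["qas"] = paragraph["qas"][:n_questions]
        trimmed["paragraphs"].append(kept)
    return trimmed


small = trim(sample)
# indent=2 prints the full nesting instead of the one-line pandas preview
print(json.dumps(small, indent=2))
```

With the real file, the call would presumably be trim(df['data'][0]), and the printed JSON could be pasted into a question as reproducible input.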
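Thank you very much!
Answer 0 (score: 1)
This should get you started.
I am not sure how you want to handle the case where the answers field is empty, so you may want to come up with a better solution. Example: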
"question": " After 1945, what challenged the British empire?", "id": "5ad032b377cf76001a686e0d", "answers": [], "is_impossible": true
import json
import pandas as pd

with open("dev-v2.0.json", "r") as f:
    data = json.load(f)

questions, answers = [], []
# Walk the nesting: articles -> paragraphs -> question/answer entries
for article in data["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            q = qa["question"]
            try:  # only takes first element since the rest of values are duplicated?
                a = qa["answers"][0]["text"]
            except IndexError:  # when `"answers": []`
                a = "None"
            questions.append(q)
            answers.append(a)

d = {
    "Questions": questions,
    "Answers": answers
}
pd.DataFrame(d)
Questions Answers
0 In what country is Normandy located? France
1 When were the Normans in Normandy? 10th and 11th centuries
2 From which countries did the Norse originate? Denmark, Iceland and Norway
3 Who was the Norse leader? Rollo
4 What century did the Normans first gain their ... 10th century
... ... ...
11868 What is the seldom used force unit equal to on... sthène
11869 What does not have a metric counterpart? None
11870 What is the force exerted by standard gravity ... None
11871 What force leads to a commonly used unit of mass? None
11872 What force is part of the modern SI system? None
[11873 rows x 2 columns]
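As an alternative to the explicit loops, pd.json_normalize can flatten the same nesting through its record_path argument. The following is only a sketch: the inline `data` dictionary is a stand-in assembled from the two snippets quoted in this thread (in the real file they belong to different articles, and only the keys used here are included), and the column name `answer` is my own choice. It keeps a real missing value instead of the string "None" and carries is_impossible along, so unanswerable questions stay distinguishable.

```python
import pandas as pd

# Stand-in for the parsed file; with the real data this would come from json.load(f)
data = {
    "data": [
        {
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "In what country is Normandy located?",
                            "id": "56ddde6b9a695914005b9628",
                            "answers": [{"text": "France", "answer_start": 159}],
                            "is_impossible": False,
                        },
                        {
                            "question": "After 1945, what challenged the British empire?",
                            "id": "5ad032b377cf76001a686e0d",
                            "answers": [],
                            "is_impossible": True,
                        },
                    ]
                }
            ]
        }
    ]
}

# One row per entry found under paragraphs -> qas
qa = pd.json_normalize(data["data"], record_path=["paragraphs", "qas"])

# First answer text, or None when the answers list is empty
qa["answer"] = qa["answers"].apply(lambda a: a[0]["text"] if a else None)

result = qa[["question", "answer", "is_impossible"]]
print(result)
```

Whether the first answer is the right one to keep depends on the use case; the loop version above makes the same choice.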
Answer 1 (score: 1)
The following page (Convert SQuAD (Stanford Q&A) JSON to a Pandas DataFrame) covers converting dev-v1.1.json to a DataFrame.
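Since that page targets v1.1, here is a small sketch of a traversal that should tolerate both versions, assuming the only relevant difference is that v1.1 entries lack the is_impossible key. The function name `iter_qa` and the inline `sample` are my own illustration (the sample values come from the row quoted in the question), not code from the linked page. It also keeps every distinct answer text rather than only the first one.

```python
def iter_qa(squad):
    """Yield (question, distinct answer texts, is_impossible) from a parsed SQuAD-style dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # dict.fromkeys drops duplicate answer texts while keeping their order
                texts = list(dict.fromkeys(a["text"] for a in qa.get("answers", [])))
                # Assumption: v1.1 entries have no is_impossible key, so default to False
                yield qa["question"], texts, qa.get("is_impossible", False)


# Stand-in input built from the row quoted in the question
sample = {
    "data": [
        {
            "title": "Normans",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "When were the Normans in Normandy?",
                            "id": "56ddde6b9a695914005b9629",
                            "answers": [
                                {"text": "10th and 11th centuries", "answer_start": 94},
                                {"text": "in the 10th and 11th centuries", "answer_start": 87},
                                {"text": "10th and 11th centuries", "answer_start": 94},
                            ],
                        }
                    ]
                }
            ],
        }
    ]
}

rows = list(iter_qa(sample))
for question, texts, impossible in rows:
    print(question, texts, impossible)
```

The resulting tuples can be passed to pd.DataFrame(rows, columns=[...]) if a DataFrame is needed.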