Generating a heatmap from nested JSON

Time: 2019-02-07 10:37:06

Tags: python pandas heatmap

I am trying to generate a heatmap from the SQuAD v1.1 dataset given here.


The SQuAD dataset is structured like this:

Document/
├── Paragraph1/
│   ├── Question
│   ├── Answer1
│   ├── Answer2
│   └── Answer3
├── Paragraph2/
│   ├── Question
│   └── Answer1

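A document can have multiple paragraphs (contexts). Each paragraph (context) can have multiple questions, and each question can have multiple answers. The format is described here.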
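To make the nesting concrete, here is a minimal sketch that walks a toy dictionary shaped like the dataset. The key names (`data`, `paragraphs`, `context`, `qas`, `question`, `answers`, `text`) are the ones used in my conversion code further down; the text values and the helper name `iter_rows` are made up for illustration.

```python
# Toy document mirroring the nesting of the SQuAD JSON.
# All text values are placeholders, not real dataset content.
squad_like = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "Context1",
                    "qas": [
                        {
                            "question": "Question1",
                            "answers": [
                                {"text": "Answer1"},
                                {"text": "Answer2"},
                                {"text": "Answer1"},  # repeated answer text
                            ],
                        },
                        {"question": "Question2", "answers": [{"text": "Answer1"}]},
                    ],
                },
                {
                    "context": "Context2",
                    "qas": [{"question": "Question3", "answers": [{"text": "Answer1"}]}],
                },
            ]
        }
    ]
}


def iter_rows(squad):
    """Yield one (context, question, answer) tuple per distinct answer text."""
    for document in squad["data"]:
        for paragraph in document["paragraphs"]:
            for qa in paragraph.get("qas", []):
                # dict.fromkeys drops repeated texts while keeping first-seen order
                texts = dict.fromkeys(a["text"] for a in qa.get("answers", []))
                for text in texts:
                    yield paragraph["context"], qa["question"], text


rows = list(iter_rows(squad_like))
for row in rows:
    print(row)
```

Each question contributes one row per distinct answer text, so a question with two different answers appears twice with the same context and question.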

I am planning to normalize the JSON into a CSV like the following, which may be the wrong approach:

Context,Question,Answer
Context1,Question1,Answer1
Context1,Question1,Answer2
Context1,Question2,Answer1
...
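One property of this layout is worth checking early: a (Context, Question) pair is repeated once per answer. The sketch below counts rows per pair using only the three sample rows above (the `...` line is left out); the variable names are illustrative.

```python
import csv
import io
from collections import Counter

# The three sample rows from the planned layout, held in memory.
sample_csv = """Context,Question,Answer
Context1,Question1,Answer1
Context1,Question1,Answer2
Context1,Question2,Answer1
"""

reader = csv.DictReader(io.StringIO(sample_csv))

# Count how many rows share each (Context, Question) key.
pair_counts = Counter((row["Context"], row["Question"]) for row in reader)

# Keep only the keys that occur more than once.
repeated_pairs = {pair: n for pair, n in pair_counts.items() if n > 1}
print(repeated_pairs)  # {('Context1', 'Question1'): 2}
```

Any operation that expects exactly one row per (Context, Question) pair will have to deal with these repeats.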

So far, I have used the following code to normalize the nested JSON into a CSV file:

import json
import csv

# Load the raw SQuAD JSON file
with open(r'SQUAD v1.json', encoding='UTF-8') as squad_data_file_handle:
    squad_data = json.load(squad_data_file_handle)

with open('SQUAD_11_CSV.csv', 'w', newline='', encoding='UTF-8') as squad_csv_handle:
    writer = csv.writer(squad_csv_handle, dialect='excel', delimiter=',')
    writer.writerow(["Context", "Question", "Answer"])
    for data in squad_data["data"]:
        for paragraph in data["paragraphs"]:
            context = str(paragraph["context"])
            question_answer_pairs = paragraph.get("qas", [])

            for qa_pair in question_answer_pairs:
                question = str(qa_pair["question"])
                # Distinct answer texts, sorted so the output order is reproducible
                answers = sorted({str(answer.get("text")) for answer in qa_pair.get("answers", [])})
                # One CSV row per distinct answer
                for answer in answers:
                    writer.writerow([context, question, answer])

This produces a CSV like the following (header plus the first two rows):

Context,Question,Answer
"Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the ""golden anniversary"" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as ""Super Bowl L""), so that the logo could prominently feature the Arabic numerals 50.",Which NFL team represented the AFC at Super Bowl 50?,Denver Broncos
"Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the ""golden anniversary"" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as ""Super Bowl L""), so that the logo could prominently feature the Arabic numerals 50.",Which NFL team represented the NFC at Super Bowl 50?,Carolina Panthers

This is the code I use to generate the heatmap:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(r"SQUAD_11_CSV.csv")
df = df.pivot(index="Context", columns="Question", values="Answer")
sns.heatmap(df)
plt.show()

When I try to generate the heatmap, the following exception is raised:

  

ValueError: Index contains duplicate entries, cannot reshape
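For what it's worth, the error can be reproduced on a toy frame: `pivot` needs each (index, column) pair to occur at most once, and a question with several answers breaks that. The sketch below uses placeholder data, not SQuAD rows. Counting answers per cell with `pd.crosstab` is only an assumption about what the heatmap could show; `sns.heatmap` also needs numeric cells, which the answer text is not.

```python
import pandas as pd

# Toy rows in the same Context/Question/Answer layout (placeholder values).
df = pd.DataFrame(
    {
        "Context": ["Context1", "Context1", "Context1", "Context2"],
        "Question": ["Question1", "Question1", "Question2", "Question3"],
        "Answer": ["Answer1", "Answer2", "Answer1", "Answer1"],
    }
)

# Rows whose (Context, Question) key occurs more than once;
# these are the entries that make pivot() refuse to reshape.
dupes = df[df.duplicated(subset=["Context", "Question"], keep=False)]
print(dupes)

# One possible numeric matrix (an assumption, not a requirement of the data):
# the number of answers for each (Context, Question) cell, 0 where none exist.
counts = pd.crosstab(df["Context"], df["Question"])
print(counts)
```

A matrix like `counts` can be passed to `sns.heatmap(counts)`. Whether the number of answers is the right quantity to plot depends on what the heatmap is meant to show.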

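Any hints or pointers on how to generate the heatmap, and on any obvious mistakes in how I am modeling the SQuAD JSON data as a CSV, would be greatly appreciated.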


Update

The heatmap should look like this:

[image not available]

0 answers:

No answers yet