Question

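I'm having trouble reading nested (child-level) data with Pandas.

Background

I used the NYT Archive API to download a set of data and stored it in a JSON file, which actually contains a list of JSON objects.

Steps:

I read the JSON file with the read_json method.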

pandas_df = pd.read_json("data.json")

When I look at a sample of the result with head, it looks like this:

 pandas_df.head()
    copyright  \
0  Copyright (c) 2013 The New York Times Company....   
1  Copyright (c) 2013 The New York Times Company....   
2  Copyright (c) 2013 The New York Times Company....   
3  Copyright (c) 2013 The New York Times Company....   
4  Copyright (c) 2013 The New York Times Company....   

                                            response  
0  {'docs': [{'subsection_name': None, 'slideshow...  
1  {'docs': [{'subsection_name': None, 'slideshow...  
2  {'docs': [{'subsection_name': None, 'slideshow...  
3  {'docs': [{'subsection_name': None, 'slideshow...  
4  {'docs': [{'subsection_name': None, 'slideshow...

I only need the information in response, so I changed the code to the following:

print(pandas_df["response"].head())
0    {'docs': [{'subsection_name': None, 'slideshow...
1    {'docs': [{'subsection_name': None, 'slideshow...
2    {'docs': [{'subsection_name': None, 'slideshow...
3    {'docs': [{'subsection_name': None, 'slideshow...
4    {'docs': [{'subsection_name': None, 'slideshow...
Name: response, dtype: object

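Question:

How can I get at the data in the elements inside docs, such as subsection_name, the slideshow fields, and so on? Can I view it in a tabular format, such as a DataFrame?

Please let me know if more information is needed.

Thanks.

Edit 1:

Adding the first element from the JSON file. The full file is too large to post, at around 1 GB.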

{
  "copyright": "Copyright (c) 2013 The New York Times Company.  All Rights Reserved.",
  "response": {
    "meta": {
      "hits": 7652
    },
    "docs": [
      {
        "web_url": "http://www.nytimes.com/interactive/2016/technology/personaltech/cord-cutting-guide.html",
        "snippet": "We teamed up with The Wirecutter to come up with cord-cutter bundles for movie buffs, sports addicts, fans of premium TV shows, binge watchers and families with children.",
        "lead_paragraph": "We teamed up with The Wirecutter to come up with cord-cutter bundles for movie buffs, sports addicts, fans of premium TV shows, binge watchers and families with children.",
        "abstract": null,
        "print_page": null,
        "blog": [],
        "source": "The New York Times",
        "multimedia": [
          {
            "width": 190,
            "url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbWide.jpg",
            "height": 126,
            "subtype": "wide",
            "legacy": {
              "wide": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbWide.jpg",
              "wideheight": "126",
              "widewidth": "190"
            },
            "type": "image"
          },
          {
            "width": 600,
            "url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-articleLarge.jpg",
            "height": 346,
            "subtype": "xlarge",
            "legacy": {
              "xlargewidth": "600",
              "xlarge": "images/2016/10/13/business/13TECHFIX/06TECHFIX-articleLarge.jpg",
              "xlargeheight": "346"
            },
            "type": "image"
          },
          {
            "width": 75,
            "url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbStandard.jpg",
            "height": 75,
            "subtype": "thumbnail",
            "legacy": {
              "thumbnailheight": "75",
              "thumbnail": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbStandard.jpg",
              "thumbnailwidth": "75"
            },
            "type": "image"
          }
        ],
        "headline": {
          "main": "The Definitive Guide to Cord-Cutting in 2016, Based on Your Habits",
          "kicker": "Tech Fix"
        },
        "keywords": [
          {
            "rank": "1",
            "is_major": "N",
            "name": "subject",
            "value": "Video Recordings, Downloads and Streaming"
          },
          {
            "rank": "2",
            "is_major": "N",
            "name": "subject",
            "value": "Television Sets and Media Devices"
          },
          {
            "rank": "1",
            "is_major": "Y",
            "name": "subject",
            "value": "Television"
          }
        ],
        "pub_date": "2016-01-01T05:00:00Z",
        "document_type": "multimedia",
        "news_desk": "Technology / Personal Tech",
        "section_name": "Technology",
        "subsection_name": "Personal Tech",
        "byline": {
          "person": [
            {
              "firstname": "Brian",
              "middlename": "X.",
              "lastname": "CHEN",
              "rank": 1,
              "role": "reported",
              "organization": ""
            }
          ],
          "original": "By BRIAN X. CHEN"
        },
        "type_of_material": "Interactive Feature",
        "_id": "57fdfb9895d0e022439c2b57",
        "word_count": null,
        "slideshow_credits": null
      }]}}

Answer 1

You should be able to extract all the elements of the docs list, which is nested inside the response dictionary, into a DataFrame.

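import json
import pandas as pd

# Assumes data.json holds a single top-level object like the sample above
with open('data.json') as f:
    data = json.load(f)
# One row per article in response -> docs
df = pd.DataFrame(data['response']['docs'])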
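
The question says the file actually holds a list of such objects, so a variant that walks every object may be closer to what is needed. The following is only a minimal sketch, assuming pandas 1.0 or later (for pd.json_normalize) and a file that fits in memory. The helper name docs_to_frame is hypothetical, and the inline sample is trimmed from the element posted in the question.

```python
import json
import pandas as pd


def docs_to_frame(records):
    """Flatten every article under response -> docs into one DataFrame."""
    # Treat a single top-level object as a one-element list.
    if isinstance(records, dict):
        records = [records]
    # record_path walks into response -> docs; each doc becomes one row.
    # Nested dicts such as "headline" turn into dotted columns
    # ("headline.main"); list-valued fields stay as Python lists.
    return pd.json_normalize(records, record_path=["response", "docs"])


# Small stand-in for the real file, trimmed from the posted element.
sample = [
    {
        "copyright": "Copyright (c) 2013 The New York Times Company.",
        "response": {
            "meta": {"hits": 7652},
            "docs": [
                {
                    "headline": {
                        "main": "The Definitive Guide to Cord-Cutting in 2016",
                        "kicker": "Tech Fix",
                    },
                    "section_name": "Technology",
                    "subsection_name": "Personal Tech",
                    "slideshow_credits": None,
                }
            ],
        },
    }
]

df = docs_to_frame(sample)
print(df[["section_name", "subsection_name", "headline.main"]])

# With the real file (assumed here to fit in memory):
# with open("data.json") as f:
#     df = docs_to_frame(json.load(f))
```

Fields that hold lists of dicts, such as keywords or multimedia, remain as lists in a single column. They could be normalized separately with their own record_path if a table of those is needed.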

Reading child-level JSON data with Pandas
