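I'm having trouble reading nested data with Pandas.
Background
I used the NYT Archive API to download a set of data and stored it in a JSON file, which effectively contains a list of JSON objects.
Steps:
I read the JSON file with the read_json method.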
import pandas as pd
pandas_df = pd.read_json("data.json")
When I look at a sample of the result with head, it looks like this:
pandas_df.head()
copyright \
0 Copyright (c) 2013 The New York Times Company....
1 Copyright (c) 2013 The New York Times Company....
2 Copyright (c) 2013 The New York Times Company....
3 Copyright (c) 2013 The New York Times Company....
4 Copyright (c) 2013 The New York Times Company....
response
0 {'docs': [{'subsection_name': None, 'slideshow...
1 {'docs': [{'subsection_name': None, 'slideshow...
2 {'docs': [{'subsection_name': None, 'slideshow...
3 {'docs': [{'subsection_name': None, 'slideshow...
4 {'docs': [{'subsection_name': None, 'slideshow...
I only need the information in the response column, so I changed the code to the following:
print(pandas_df["response"].head())
0 {'docs': [{'subsection_name': None, 'slideshow...
1 {'docs': [{'subsection_name': None, 'slideshow...
2 {'docs': [{'subsection_name': None, 'slideshow...
3 {'docs': [{'subsection_name': None, 'slideshow...
4 {'docs': [{'subsection_name': None, 'slideshow...
Name: response, dtype: object
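Question:
How can I get at the data using the elements inside docs, such as the subsection, slideshow, and so on? Can I view it in a tabular format, like a DataFrame?
Let me know if you need more information.
Thanks.
Edit 1:
Adding the first element from the JSON file. The whole file is too large to post, at around 1 GB.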
{
"copyright": "Copyright (c) 2013 The New York Times Company. All Rights Reserved.",
"response": {
"meta": {
"hits": 7652
},
"docs": [
{
"web_url": "http://www.nytimes.com/interactive/2016/technology/personaltech/cord-cutting-guide.html",
"snippet": "We teamed up with The Wirecutter to come up with cord-cutter bundles for movie buffs, sports addicts, fans of premium TV shows, binge watchers and families with children.",
"lead_paragraph": "We teamed up with The Wirecutter to come up with cord-cutter bundles for movie buffs, sports addicts, fans of premium TV shows, binge watchers and families with children.",
"abstract": null,
"print_page": null,
"blog": [],
"source": "The New York Times",
"multimedia": [
{
"width": 190,
"url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbWide.jpg",
"height": 126,
"subtype": "wide",
"legacy": {
"wide": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbWide.jpg",
"wideheight": "126",
"widewidth": "190"
},
"type": "image"
},
{
"width": 600,
"url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-articleLarge.jpg",
"height": 346,
"subtype": "xlarge",
"legacy": {
"xlargewidth": "600",
"xlarge": "images/2016/10/13/business/13TECHFIX/06TECHFIX-articleLarge.jpg",
"xlargeheight": "346"
},
"type": "image"
},
{
"width": 75,
"url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbStandard.jpg",
"height": 75,
"subtype": "thumbnail",
"legacy": {
"thumbnailheight": "75",
"thumbnail": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbStandard.jpg",
"thumbnailwidth": "75"
},
"type": "image"
}
],
"headline": {
"main": "The Definitive Guide to Cord-Cutting in 2016, Based on Your Habits",
"kicker": "Tech Fix"
},
"keywords": [
{
"rank": "1",
"is_major": "N",
"name": "subject",
"value": "Video Recordings, Downloads and Streaming"
},
{
"rank": "2",
"is_major": "N",
"name": "subject",
"value": "Television Sets and Media Devices"
},
{
"rank": "1",
"is_major": "Y",
"name": "subject",
"value": "Television"
}
],
"pub_date": "2016-01-01T05:00:00Z",
"document_type": "multimedia",
"news_desk": "Technology / Personal Tech",
"section_name": "Technology",
"subsection_name": "Personal Tech",
"byline": {
"person": [
{
"firstname": "Brian",
"middlename": "X.",
"lastname": "CHEN",
"rank": 1,
"role": "reported",
"organization": ""
}
],
"original": "By BRIAN X. CHEN"
},
"type_of_material": "Interactive Feature",
"_id": "57fdfb9895d0e022439c2b57",
"word_count": null,
"slideshow_credits": null
}]}}
Answer 0 (score: 0)
You should be able to extract all of the elements of the docs list, which is nested inside the response dictionary, into a DataFrame.
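import json
import pandas as pd

with open('data.json') as f:
    data = json.load(f)

# The file holds a list of API responses, so gather the docs from every one
docs = [doc for item in data for doc in item['response']['docs']]
df = pd.DataFrame(docs)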
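A DataFrame built directly from docs keeps nested dictionaries (headline, byline) and lists (keywords, multimedia) as single object columns. If those need to be flattened too, pandas.json_normalize may help. The following is only a minimal sketch, assuming pandas 1.0 or later; the data list is a small made-up stand-in that mimics the structure shown in the question, and its _id values and headlines are placeholders rather than real API output.

```python
import pandas as pd

# Made-up stand-in for the parsed data.json: a list of API responses,
# each with the "response" -> "docs" nesting shown in the question.
data = [
    {
        "copyright": "Copyright (c) 2013 The New York Times Company.",
        "response": {
            "meta": {"hits": 2},
            "docs": [
                {
                    "_id": "a1",  # placeholder id
                    "section_name": "Technology",
                    "subsection_name": "Personal Tech",
                    "headline": {"main": "First headline", "kicker": "Tech Fix"},
                    "keywords": [
                        {"rank": "1", "name": "subject", "value": "Television"},
                        {"rank": "2", "name": "subject", "value": "Streaming"},
                    ],
                },
                {
                    "_id": "b2",  # placeholder id
                    "section_name": "World",
                    "subsection_name": None,
                    "headline": {"main": "Second headline"},
                    "keywords": [],
                },
            ],
        },
    }
]

# Collect the articles from every response in the list.
docs = [doc for item in data for doc in item["response"]["docs"]]

# One row per article; nested dicts become dotted columns such as "headline.main".
articles = pd.json_normalize(docs)

# One row per keyword, carrying the article's _id along so it can be joined back.
keywords = pd.json_normalize(docs, record_path="keywords", meta=["_id"])

print(articles[["_id", "section_name", "subsection_name", "headline.main"]])
print(keywords)
```

Keeping the keywords in their own table, linked by _id, avoids repeating every article column once per keyword. For a file of around 1 GB, loading everything with json.load may need a lot of memory, so it could be worth trying this on a small slice first.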