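Hi, I'm new to working with JSON files. I have a sample JSON file like the one below (scraped Twitter data). Each [] in it contains several JSON objects, and I want to extract the text from those objects. The problem is that json.load can't handle a structure made of multiple JSON arrays ([][][]).
In the sample below, the first [] contains three JSON objects and the second contains two.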
[{
"created_at": "2014-12-07 02:17:18",
"favorite_count": 5,
"id_str": "541416129567019008",
"in_reply_to_screen_name": "panellington",
"retweet_count": 15,
"retweeted": false,
"text": "minutes ago. #berkeley #BlackLivesMatter #EricGarner #LAPD"
},
{
"created_at": "2014-12-04 19:21:13",
"favorite_count": 13,
"id_str": "540586640595369984",
"in_reply_to_screen_name": null,
"retweet_count": 38,
"retweeted": false,
"text": "#BlackLivesMatter"
},
{
"created_at": "2014-12-13 00:50:27",
"favorite_count": 1,
"id_str": "543568596299808768",
"in_reply_to_screen_name": null,
"retweet_count": 0,
"retweeted": false,
"text": "MLK Riot is language of unheard #Ferguson #ICantBreathe #BlackLivesMatter"
}]
[{
"created_at": "2015-04-28 13:21:35",
"favorite_count": 0,
"id_str": "593042377658519552",
"in_reply_to_screen_name": null,
"retweet_count": 5,
"retweeted": false,
"text": "RT @fsmith827: A lot of folks speaking against civil unrest have been willfully blind, willfully silent @ #BlackLivesMatter & #PoliceBrutal\u2026"
},
{
"created_at": "2014-12-07 03:17:27",
"favorite_count": 0,
"id_str": "541431264897937408",
"in_reply_to_screen_name": null,
"retweet_count": 456,
"retweeted": false,
"text": "RT @thecrisismag: #ICantBreathe Protesters in Paris march in solidarity with #EricGarner and #MikeBrown #BlackLivesMatter #GrandJury"
}]
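I'd like to read this in as JSON objects and then process it (for example, data[0]['text']).
The problem is that my file contains multiple JSON arrays: [some arbitrary number of JSON objects], [some arbitrary number of JSON objects], and so on.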
import json

with open('tweets.json') as json_data:
    d = json.load(json_data)
Because of this [][][]... structure, json.load doesn't work.
## error raised
Error:
raise JSONDecodeError("Extra data", s, end)
JSONDecodeError: Extra data
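Answer 0 (score: 0)
Maybe I'm not clear on what you're trying to do here, but it looks like you just want to iterate over a list of your JSON strings. Something like this: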
In [1]: import json

In [2]: json_data = [
   ...:     '{ "created_at": "2014-12-07 02:17:18", "favorite_count": 5, "id_str": "541416129567019008", "in_reply_to_screen_name": "panellington", "retweet_count": 15, "retweeted": false, "text": "minutes ago. #berkeley #BlackLivesMatter #EricGarner #LAPD" }',
   ...:     '{ "created_at": "2014-12-04 19:21:13", "favorite_count": 13, "id_str": "540586640595369984", "in_reply_to_screen_name": null, "retweet_count": 38, "retweeted": false, "text": "#BlackLivesMatter" }',
   ...:     '{ "created_at": "2014-12-13 00:50:27", "favorite_count": 1, "id_str": "543568596299808768", "in_reply_to_screen_name": null, "retweet_count": 0, "retweeted": false, "text": "MLK Riot is language of unheard #Ferguson #ICantBreathe #BlackLivesMatter" }',
   ...: ]

In [3]: for tweet in json_data:
   ...:     print(json.loads(tweet))
   ...:
{'created_at': '2014-12-07 02:17:18', 'favorite_count': 5, 'id_str': '541416129567019008', 'in_reply_to_screen_name': 'panellington', 'retweet_count': 15, 'retweeted': False, 'text': 'minutes ago. #berkeley #BlackLivesMatter #EricGarner #LAPD'}
{'created_at': '2014-12-04 19:21:13', 'favorite_count': 13, 'id_str': '540586640595369984', 'in_reply_to_screen_name': None, 'retweet_count': 38, 'retweeted': False, 'text': '#BlackLivesMatter'}
{'created_at': '2014-12-13 00:50:27', 'favorite_count': 1, 'id_str': '543568596299808768', 'in_reply_to_screen_name': None, 'retweet_count': 0, 'retweeted': False, 'text': 'MLK Riot is language of unheard #Ferguson #ICantBreathe #BlackLivesMatter'}
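
The loop above assumes the tweets are already available as a list of separate JSON strings. For a file that holds several arrays back to back, as described in the question, one possible approach is to decode one top-level value at a time with json.JSONDecoder.raw_decode, which returns the parsed value together with the index where it ended. The following is only a sketch: iter_json_values is a hypothetical helper name, the sample string is made up for illustration, and it assumes the whole file fits in memory. The standard json module is strict, so input containing trailing commas would still need cleaning first.

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value from a string of concatenated JSON documents.

    Hypothetical helper for illustration, not part of the json module.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Returns the decoded value and the index just past it in `text`.
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Made-up sample with the same [][]... shape as the file in the question.
sample = '[{"text": "a"}, {"text": "b"}]\n[{"text": "c"}]'

# Flatten every array into a single list of tweet dicts.
tweets = [tweet for batch in iter_json_values(sample) for tweet in batch]
texts = [tweet["text"] for tweet in tweets]
print(texts)
```

With a real file, the same helper could be fed the result of open('tweets.json').read(), after which tweets[0]['text'] works the way the question asks for.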