JSON file not reading into pandas

Time: 2018-10-27 00:32:32

Tags: python json pandas jupyter-notebook

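I have a JSON file (about 1 GB) containing acoustic features of music tracks. I am trying to read it into my Jupyter notebook, for use with pandas, using

dataf = "/home/work/my.json"
d = json.load(open(dataf, 'r'))

but it keeps giving me an error saying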

  

Extra data: line 2 column 1 (char 499)

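I know that character 499 is the start of the next track, but I have looked online and am still not sure how to read the file. Below is a sample of the data.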

  

{"_id":{"$oid":"5b2cff21aecd2a723459cd65"},"id":1,"sp_id":"0XLOf9LhyazPX9Ld8jPiUq","danceability":0.7079999999999999627,"energy":0.60999999999999998668,"key":"2","loudness":-4.5220000000000002416,"mode":"1","speechiness":0.057399999999999999634,"acousticness":0.020400000000000001465,"instrumentalness":4.4499999999999997457e-06,"liveness":0.064100000000000004197,"valence":0.30499999999999999334,"tempo":123.0379999999999967,"time_signature":"4","track_uri":"spotify:track:0XLOf9LhyazPX9Ld8jPiUq"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd66"},"id":2,"sp_id":"7aF09WaavZAmAWuUeYxlYD","danceability":0.59299999999999997158,"energy":0.86799999999999999378,"key":"1","loudness":-3.5729999999999999538,"mode":"0","speechiness":0.29499999999999998446,"acousticness":0.182999999999999996,"instrumentalness":0.0,"liveness":0.36499999999999999112,"valence":0.49599999999999999645,"tempo":104.9879999,"time_signature":"4","track_uri":"spotify:track:7aF09WaavZAmAWuUeYxlYD"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd67"},"id":3,"sp_id":"0tKcYR2II1VCQWT79i5NrW","danceability":0.5999999999999999778,"energy":0.81000000000000005329,"key":"0","loudness":-4.748999999999999666,"mode":"1","speechiness":0.047899999999999998135,"acousticness":0.0068300000000000001335,"instrumentalness":0.20999999999999999223,"liveness":0.15499999999999999889,"valence":0.297999999999999987545,"tempo":…,"time_signature":"4","track_uri":"spotify:track:0tKcYR2II1VCQWT79i5NrW"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd68"},"id":4,"sp_id":"6TWSVHx6z6E42JiwloGv1k","danceability":0.50300000000000000266,"energy":0.91800000000000003819,"key":"11","loudness":-5.0099999999999997868,"mode":"1","speechiness":0.046399999999999996803,"acousticness":0.016199999999999999123,"instrumentalness":0.024400000000000001549,"liveness":0.18599999999999999867,"valence":0.417999999999999980.0,"tempo":…,"time_signature":"4","track_uri":"spotify:track:6TWSVHx6z6E42JiwloGv1k"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd69"},"id":5,"sp_id":"5QqyRUZeBE04yJxsD1OC0I","danceability":0.76000000000000000888,"energy":0.56100000000000005418,"key":"1","loudness":-8.6969999999999991758,"mode":"1","speechiness":0.13400000000000000799,"acousticness":0.018499999999999999084,"instrumentalness":1.9400000000000000604e-05,"liveness":0.19900000000000001021,"valence":0.12099999999999999645,"tempo":134.98300000000000409,"time_signature":"4","track_uri":"spotify:track:5QqyRUZeBE04yJxsD1OC0I"}

2 Answers:

Answer 0: (score: 2)

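Your JSON cannot be parsed because it is not valid JSON. The character the parser complains about comes right after the first newline. Apparently a number of objects were dumped into the file line by line, and taken together they do not form a single valid JSON document. See: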

>>> json.loads(s[:499])
{'_id': {'$oid': '5b2cff21aecd2a723459cd65'},
 'id': 1,
 'sp_id': '0XLOf9LhyazPX9Ld8jPiUq',
 'danceability': 0.708,
 'energy': 0.61,
 'key': '2',
 'loudness': -4.522,
 'mode': '1',
 'speechiness': 0.0574,
 'acousticness': 0.0204,
 'instrumentalness': 4.45e-06,
 'liveness': 0.0641,
 'valence': 0.305,
 'tempo': 123.038,
 'time_signature': '4',
 'track_uri': 'spotify:track:0XLOf9LhyazPX9Ld8jPiUq'}
>>> json.loads(s[499:973])
{'_id': {'$oid': '5b2cff21aecd2a723459cd66'},
 'id': 2,
 'sp_id': '7aF09WaavZAmAWuUeYxlYD',
 'danceability': 0.593,
 'energy': 0.868,
 'key': '1',
 'loudness': -3.573,
 'mode': '0',
 'speechiness': 0.295,
 'acousticness': 0.183,
 'instrumentalness': 0.0,
 'liveness': 0.365,
 'valence': 0.496,
 'tempo': 104.988,
 'time_signature': '4',
 'track_uri': 'spotify:track:7aF09WaavZAmAWuUeYxlYD'}

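(Here s is your sample input, loaded into a string.) The objects were printed to the file one after another. You either have to change the syntax so that the file becomes a list of objects (add square brackets and commas), or parse the file line by line, calling json.loads on each line of input.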
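As a minimal sketch of the second option, the snippet below first reproduces the error on a small two-object string and then parses the same text line by line. The sample values are made up for illustration, and the helper name parse_json_lines is hypothetical.

```python
import json

# Two small JSON objects separated by a newline (made-up sample values).
s = '{"id": 1, "key": "2"}\n{"id": 2, "key": "1"}\n'

# Parsing the whole string at once fails, because a JSON document
# may contain only one top-level value.
try:
    json.loads(s)
except json.JSONDecodeError as err:
    error_message = str(err)  # begins with "Extra data"


def parse_json_lines(text):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(s)
```

Skipping blank lines is a defensive choice here: a trailing newline at the end of the file would otherwise hand an empty string to json.loads.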

Now, don't quote me on this, but it is quite easy to hack the input into valid JSON:

>>> len(json.loads('[' + s.replace('\n', ',') + ']'))
5

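If the file is large, you may not want to apply the hack above and parse everything in one sitting, because that carries a huge memory overhead. In that case I suggest parsing the file object by object. Assuming your file contains one object per line, all you need is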

with open(infile) as f:  # the with block closes the file once parsing is done
    dat = [json.loads(line) for line in f]

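where infile is the path to your concatenated JSON file. A large file will take a long time, and the result will take up a lot of memory, but I would expect the extra overhead of parsing to be lower.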
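Since the end goal is a pandas DataFrame, it may be worth noting that pandas can read this one-object-per-line layout directly: pd.read_json accepts lines=True, and together with chunksize it returns an iterator of DataFrames instead of one large frame. The sketch below is only an illustration under stated assumptions: the temporary file, the two made-up records, and the chunk size are hypothetical stand-ins for the real 1 GB file.

```python
import os
import tempfile

import pandas as pd

# Write a tiny newline-delimited JSON file (made-up records) to stand in
# for the real file; in practice `path` would point at the actual file.
sample = (
    '{"id": 1, "energy": 0.61, "sp_id": "abc"}\n'
    '{"id": 2, "energy": 0.868, "sp_id": "def"}\n'
)
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# lines=True tells pandas that every line is a separate JSON object.
df = pd.read_json(path, lines=True)

# With chunksize, read_json returns an iterator of DataFrames, so the
# file is read a few lines at a time (chunksize=1 only for this demo).
reader = pd.read_json(path, lines=True, chunksize=1)
chunks = list(reader)
df_chunked = pd.concat(chunks, ignore_index=True)

os.remove(path)
```

For a real file a much larger chunksize would be more sensible, and each chunk could be filtered or reduced before concatenating. Note that read_json infers dtypes by default, so string fields that look numeric may be converted unless dtype is set explicitly.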

Answer 1: (score: 1)

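It looks like you are reading records that came from a MongoDB database. What you have is a series of JSON objects stored one per line, which means the file as a whole is not a valid JSON object, as @Andras pointed out.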

Instead, reading the data directly from MongoDB seems much more efficient.

You can use PyMongo like this:

import pandas as pd
from pymongo import MongoClient

mdbClient = MongoClient('mongodb://localhost:27017/')
db = mdbClient['db']
collection = db['col']

results = collection.find({})
df = pd.DataFrame.from_records(results)
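
If the MongoDB-specific _id column is not needed for the analysis, it can be dropped once the records are loaded. The sketch below uses a plain list of made-up dicts in place of a live cursor so that it runs without a database; the helper name records_to_frame is hypothetical.

```python
import pandas as pd


def records_to_frame(records, drop_id=True):
    """Build a DataFrame from an iterable of dict records.

    `records` can be any iterable of dicts, for example a PyMongo cursor.
    """
    df = pd.DataFrame.from_records(list(records))
    if drop_id and "_id" in df.columns:
        df = df.drop(columns="_id")
    return df


# Made-up records standing in for the documents a cursor would yield.
fake_cursor = [
    {"_id": "a1", "id": 1, "energy": 0.61},
    {"_id": "a2", "id": 2, "energy": 0.868},
]
frame = records_to_frame(fake_cursor)
```

With a real connection, the result of collection.find({}) could be passed in the same way. PyMongo also accepts a projection as the second argument to find, such as {"_id": 0}, which leaves the field out on the server side.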