I have more than 500 huge JSON files, each about 400 MB in compressed form (around 3 GB uncompressed). I am using the standard json library in Python 2.7 to process the data, and it is taking far too long; I believe json.loads() is the main culprit for the time consumed. I am considering using pandas in Python to load the data from the gzip files and do the analysis there.
I have only just heard of pandas and I am not sure whether it is the right tool for this. My concern is: will using pandas actually help me improve speed?
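For reference, a minimal sketch of what loading with pandas might look like, assuming the data is line-delimited JSON (one object per line, as in the samples below) and a pandas version whose read_json supports lines=True and compression='gzip'; the file name is a placeholder, not one of the actual files:

```python
import pandas as pd

# Minimal sketch, assuming line-delimited JSON and a pandas version that
# supports lines=True / compression="gzip" in read_json.
# "tweets-000.json.gz" is a placeholder file name.
df = pd.read_json("tweets-000.json.gz", lines=True, compression="gzip")

# Flat columns such as "lang" or "text" are then cheap to analyse:
print(df["lang"].value_counts().head())
```

Note that read_json still has to parse every object, so the raw parsing cost does not disappear; pandas mainly pays off once the data is in tabular form.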
NB: Of course I can parallelize the work and gain some speed, but things are still quite sluggish. Also, the data will be added by reading with gzip.open(), converting the JSON into dicts with json.loads(), and then storing them in sqlite3, which will in any case help me with further analysis.
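The pipeline described above might look roughly like this (a sketch only: the file name, table layout and chosen columns are placeholders, and it assumes one JSON object per line):

```python
import gzip
import json
import sqlite3

# Sketch of the gzip.open() -> json.loads() -> sqlite3 pipeline described above.
# File name, table name and columns are placeholders for illustration.
conn = sqlite3.connect("tweets.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, lang TEXT, text TEXT)"
)

with gzip.open("tweets-000.json.gz", "rb") as f:
    rows = []
    for line in f:
        entry = json.loads(line)       # this call dominates the run time
        if "delete" in entry:          # skip the occasional delete notices
            continue
        rows.append((entry["id"], entry["lang"], entry["text"]))
    conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)", rows)
    conn.commit()

conn.close()
```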
Sample JSON entry:
{"created_at":"Sun Dec 01 01:19:00 +0000 2013","id":406955558441193472,"id_str":"406955558441193472","text":"Todo va a estar bn :D","source":"\u003ca href=\"http:\/\/blackberry.com\/twitter\" rel=\"nofollow\"\u003eTwitter for BlackBerry\u00ae\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":483470963,"id_str":"483470963","name":"katheryn Rodriguez","screen_name":"katheryn_93","location":"","url":null,"description":"No pretendo ser nadie mas y no soy perfecta lo se, tengo muchos errores tambi\u00e9n lo se pero me acepto y me amo como soy.","protected":false,"followers_count":71,"friends_count":64,"listed_count":0,"created_at":"Sun Feb 05 02:04:16 +0000 2012","favourites_count":218,"utc_offset":-21600,"time_zone":"Central Time (US & Canada)","geo_enabled":true,"verified":false,"statuses_count":10407,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"DBE9ED","profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/378800000116209016\/ff11dc9f5a2e05d2800a91cff08c2c73.jpeg","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/378800000116209016\/ff11dc9f5a2e05d2800a91cff08c2c73.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000736604157\/b6d36df6332a2cacb0d30b5328b668d6_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000736604157\/b6d36df6332a2cacb0d30b5328b668d6_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/483470963\/1385144720","profile_link_color":"9D1DCF","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E6F6F9","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"es"}
Occasionally you can find JSON entries like this:
{"delete":{"status":"id":380315814080937984,"user_id":318430801,"id_str":"380315814080937984","user_id_str":"318430801"}}}
Answer 0 (score: 0)
A 3 GB JSON file will be huge when stored as nested dicts in Python, quite possibly many times larger, and will therefore use a great deal of memory. Watch how memory usage grows while one of these files is being loaded; you will most likely notice your machine start swapping.
You need to parse each line as JSON (if the file is line-delimited) or split the file into smaller chunks.
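As a concrete illustration of the line-by-line approach (a sketch, assuming the files are line-delimited as in the samples above; the file name, batch size and helper name are placeholders):

```python
import gzip
import json

# Sketch of streaming, line-by-line parsing: only one small batch of decoded
# objects is held in memory at a time, instead of the whole 3 GB structure.
def iter_tweet_batches(path, batch_size=10000):
    batch = []
    with gzip.open(path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if "delete" in entry:      # delete notices carry no tweet fields
                continue
            batch.append(entry)
            if len(batch) >= batch_size:
                yield batch            # caller processes and discards the batch
                batch = []
    if batch:
        yield batch

for batch in iter_tweet_batches("tweets-000.json.gz"):
    # do the analysis or the sqlite3 inserts for this batch,
    # then let it go out of scope so the memory can be reclaimed
    pass
```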