Using ijson in Python

Time: 2017-05-11 08:50:17

Tags: python mysql json python-2.7 ijson

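I came across several threads about using ijson to load huge JSON files in Python, since it parses them without consuming all available memory.

My file is about 1.4 GB and has several nodes (see the image below). I am only interested in one node, c_driver_location, which holds most of the data.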

[Image: JSON_1.4GB]

My goal: I only want to extract the c_driver_location node data and insert it into a MySQL table (it will have four columns: id, longitude, latitude, timestamp).

Table DDL:

CREATE TABLE drv_locations_backup7May2017 (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    drv_fb_id VARCHAR(50),
    latitude DECIMAL(10,8) NOT NULL,
    longitude DECIMAL(11,8) NOT NULL,
    timestamp INT
)

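My problem: I ran the first part of the attached code (everything before the MySQL connection), but after 20 hours it still had not finished parsing the JSON. (I tested it on a smaller file and it worked fine.)

Is there a better way to make this faster and more efficient?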

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import ijson
import pymysql.cursors
import pymysql


filename = "D:\json_file.json"
drv_col_list = ['drv_fb_id','latitude','longitude','timestamp']
drv_df = DataFrame(columns = drv_col_list)
drv_df.timestamp = drv_df.timestamp.astype(int)

counter = 0
with open(filename, 'r') as fd:
    parser = ijson.parse(fd)
    for prefix, event, value in parser:
        if prefix == 'c_driver_location' and str(event) == 'map_key':
            drv_fb_id = value
            counter = counter + 1
        elif prefix.endswith('.latitude'):
            latitude = value
        elif prefix.endswith('.longitude'):
            longitude = value
        elif prefix.endswith('.timestamp'):
            timestamp = value
        elif prefix.endswith(drv_fb_id) and str(event) == 'end_map':
            drv_df = drv_df.append(pd.DataFrame({'drv_fb_id':drv_fb_id,'latitude':latitude,'longitude':longitude,'timestamp':timestamp},index=[0]),ignore_index=True)
connection = pymysql.connect(host='53.000.00.00',
                             port = 3306,
                             user='user',
                             password='abcdefg',
                             db ='newdb',
                             # charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
# write to mysql 
drv_df.to_sql(con=connection, name='drv_locations_backup7May2017', if_exists='replace', flavor='mysql')                                               
connection.close()

1 Answer:

Answer 0 (score: 0)

You only need to modify your code slightly to produce a data dump.

import ijson

# Raw strings keep the backslashes in Windows paths literal.
outfile = r"D:\upload_data.txt"
filename = r"D:\json_file.json"

counter = 0
# None marks "no driver seen yet"; endswith() below needs a string, not 0.
drv_fb_id = latitude = longitude = timestamp = None

# Open the JSON in binary mode, which ijson prefers; "with" closes both files.
with open(filename, 'rb') as fd, open(outfile, 'w') as ofile:
    for prefix, event, value in ijson.parse(fd):
        if prefix == 'c_driver_location' and event == 'map_key':
            drv_fb_id = value
            counter = counter + 1
        elif prefix.endswith('.latitude'):
            latitude = value
        elif prefix.endswith('.longitude'):
            longitude = value
        elif prefix.endswith('.timestamp'):
            timestamp = value
        elif drv_fb_id is not None and prefix.endswith(drv_fb_id) and event == 'end_map':
            # One comma-separated line per driver entry.
            ofile.write(",".join(map(str, [drv_fb_id, latitude, longitude, timestamp])) + "\n")

You now have comma-separated output in D:\upload_data.txt.

The code is untested.
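One way to check the event handling without the 1.4 GB file is to move it into a generator that accepts any iterable of (prefix, event, value) triples, which is the shape ijson.parse yields. The sketch below is only that, a sketch: the name iter_driver_rows is made up for illustration, and it assumes each driver entry is a flat map sitting directly under c_driver_location with latitude, longitude and timestamp keys. The sample events are written by hand, not produced from your file.

```python
FIELDS = ("latitude", "longitude", "timestamp")
STRUCTURAL = {"start_map", "end_map", "start_array", "end_array", "map_key"}


def iter_driver_rows(events, root="c_driver_location"):
    """Yield (drv_fb_id, latitude, longitude, timestamp) tuples.

    `events` is any iterable of (prefix, event, value) triples, the shape
    produced by ijson.parse(). Assumed layout: root -> driver id -> fields.
    """
    drv_fb_id = None   # key of the driver currently being read
    base = None        # prefix of that driver's map, e.g. "root.drv1"
    row = {}
    for prefix, event, value in events:
        if prefix == root and event == "map_key":
            # A new driver entry begins: remember its key, reset the fields.
            drv_fb_id = value
            base = root + "." + value
            row = {}
        elif base is None:
            continue  # not inside the node of interest yet
        elif prefix == base and event == "end_map":
            # The driver's map is complete: emit one row.
            yield (drv_fb_id,) + tuple(row.get(f) for f in FIELDS)
        elif prefix.startswith(base + ".") and event not in STRUCTURAL:
            field = prefix[len(base) + 1:]
            if field in FIELDS:
                row[field] = value


# Hand-written events standing in for ijson.parse(fd) on a tiny document.
sample_events = [
    ("", "start_map", None),
    ("", "map_key", "c_driver_location"),
    ("c_driver_location", "start_map", None),
    ("c_driver_location", "map_key", "drv1"),
    ("c_driver_location.drv1", "start_map", None),
    ("c_driver_location.drv1", "map_key", "latitude"),
    ("c_driver_location.drv1.latitude", "number", 31.5),
    ("c_driver_location.drv1", "map_key", "longitude"),
    ("c_driver_location.drv1.longitude", "number", 74.3),
    ("c_driver_location.drv1", "map_key", "timestamp"),
    ("c_driver_location.drv1.timestamp", "number", 1494000000),
    ("c_driver_location.drv1", "end_map", None),
    ("c_driver_location", "end_map", None),
    ("", "end_map", None),
]
rows = list(iter_driver_rows(sample_events))
```

With the real file, the same generator would be fed by ijson.parse(fd) on a file opened in binary mode, and each yielded tuple could be written out as one line of the dump.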

I don't have a MySQL database to test against right now. I believe the MySQL manual is easy to follow, and your table structure is not very complicated.
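For the loading step, one option is to read the dump back with the csv module and insert it in batches through cursor.executemany, which is part of the Python DB-API that pymysql implements. Treat this as a sketch rather than a tested MySQL loader: the function name load_csv_in_batches and its placeholder and batch_size parameters are assumptions made for illustration, while the table and column names come from the DDL in the question. It is written for Python 3, and the demo at the bottom uses the standard-library sqlite3 module (placeholder ?) only so the snippet runs without a MySQL server; with pymysql the placeholder would be %s.

```python
import csv
import os
import sqlite3
import tempfile

TABLE = "drv_locations_backup7May2017"


def load_csv_in_batches(conn, path, placeholder="%s", batch_size=1000):
    """Insert rows from a comma-separated dump using executemany.

    `conn` is any DB-API connection (pymysql, sqlite3, ...).
    Returns the number of rows inserted.
    """
    marks = ", ".join([placeholder] * 4)
    sql = ("INSERT INTO " + TABLE +
           " (drv_fb_id, latitude, longitude, timestamp) VALUES (" + marks + ")")
    cur = conn.cursor()
    total = 0
    batch = []
    with open(path, newline="") as fh:
        for rec in csv.reader(fh):
            batch.append(tuple(rec))
            if len(batch) >= batch_size:
                # Send one full batch to the server, then start a new one.
                cur.executemany(sql, batch)
                total += len(batch)
                batch = []
    if batch:
        # Flush the final, partially filled batch.
        cur.executemany(sql, batch)
        total += len(batch)
    conn.commit()
    return total


# Demo against in-memory SQLite so the sketch runs without a MySQL server.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE " + TABLE + " (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "drv_fb_id TEXT, latitude REAL, longitude REAL, timestamp INTEGER)")
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, newline="") as tmp:
    tmp.write("drv1,31.5,74.3,1494000000\n"
              "drv2,31.6,74.4,1494000060\n"
              "drv3,31.7,74.5,1494000120\n")
inserted = load_csv_in_batches(demo, tmp.name, placeholder="?", batch_size=2)
os.remove(tmp.name)
stored = demo.execute("SELECT drv_fb_id, timestamp FROM " + TABLE +
                      " ORDER BY id").fetchall()
```

Against MySQL, the call would look something like load_csv_in_batches(pymysql.connect(...), r"D:\upload_data.txt") with your own connection settings. MySQL's LOAD DATA statement is another route for comma-separated files; the manual covers its options.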