How can I speed up loading and reading JSON files in Python?

Time: 2014-12-10 17:41:12

Tags: python json

I am running a script (in multiprocessing mode) that extracts some parameters from a bunch of JSON files, but at the moment it is very slow. Here is the script:

from __future__ import print_function, division
import os
from glob import glob
from os import getpid
from time import time
from sys import stdout
import resource
from multiprocessing import Pool
import subprocess
try:
    import simplejson as json
except ImportError:
    import json


path = '/data/data//*.A.1'
print("Running with PID: %d" % getpid())

def process_file(file):
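    """Extract 'rrname' and 'rdata' from each JSON line and write one ip|domain pair per address to a matching _DI output file."""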
    start = time()
    filename = file.split('/')[-1]
    print(file)
    with open('/data/data/A.1/%s_DI' % filename, 'w') as w:
        with open(file, 'r') as f:
            for n, line in enumerate(f):
                d = json.loads(line)
                try:

                    domain = d['rrname']
                    ips = d['rdata']
                    for i in ips:
                        print("%s|%s" % (i, domain), file=w)
                except:
                    print(d)
                    pass

if __name__ == "__main__":
    files_list = glob(path)
    cores = 12
    print("Using %d cores" % cores)
    pp = Pool(processes=cores)
    pp.imap_unordered(process_file, files_list)
    pp.close()
    pp.join()

Does anyone know how to speed this up?

4 Answers:

Answer 0 (score: 1)

First, figure out where your bottleneck is.
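
As a rough check (a minimal sketch, not part of the original answer; the sample path is hypothetical), time the raw file read separately from the JSON decoding on one of your input files:

from time import time
import json  # or simplejson / ujson, whichever you want to test

sample = '/data/data/example.A.1'  # hypothetical: point this at one real input file

# Pass 1: raw I/O only
start = time()
with open(sample) as f:
    lines = f.readlines()
io_time = time() - start

# Pass 2: decode the same lines, now already in memory
start = time()
for line in lines:
    json.loads(line)
decode_time = time() - start

print("read: %.2fs  decode: %.2fs" % (io_time, decode_time))

If the decode pass dominates, the parser is your bottleneck; if the read pass dominates, you are I/O bound.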

If it is the JSON decoding/encoding step, try switching to ultrajson:

UltraJSON is an ultra-fast JSON encoder and decoder written in pure C with bindings for Python 2.5+ and 3.

The change would be as simple as changing the import section:

try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json

I also did a simple benchmark at What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary?; have a look.
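
If you want to compare the decoders on a record shaped like your own data, a minimal sketch along these lines works (the sample line is made up; adjust it to match your files):

from time import time

line = '{"rrname": "example.com", "rdata": ["1.2.3.4", "5.6.7.8"]}'  # hypothetical sample record

for name in ('json', 'simplejson', 'ujson'):
    try:
        mod = __import__(name)
    except ImportError:
        print('%-10s not installed' % name)
        continue
    start = time()
    for _ in range(100000):
        mod.loads(line)
    print('%-10s %.3fs for 100k loads' % (name, time() - start))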

Answer 1 (score: 0)

I updated the script to try different experiments and found that, yes, the json parsing is CPU bound. I got 28 MB/s, which is better than your .04 Gig per minute (> 1 MB/s), so I am not sure what is going on there. When skipping the json step and just writing to the file, I got 996 MB/s.

In the code below you can generate a dataset with python slow.py create and test several scenarios by changing the code marked todo:. My dataset was only 800 MB, so the I/O was absorbed by the RAM cache (run it twice to make sure the files to be read have already been cached).

I was surprised that json decoding is so CPU intensive.

from __future__ import print_function, division
import os
from glob import glob
from os import getpid
from time import time
from sys import stdout
import resource
from multiprocessing import Pool, cpu_count
import subprocess

# todo: pick your poison
#import json
#import ujson as json
import simplejson as json

import sys

# todo: choose your data path
#path = '/data/data//*.A.1'
#path = '/tmp/mytest'
path = os.path.expanduser('~/tmp/mytest')

# todo: choose your cores
#cores = 12
cores = cpu_count()

print("Running with PID: %d" % getpid())

def process_file(file):
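    """Decode each JSON line, write ip|domain pairs to a sibling .out file, and return the input file size in bytes."""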
    start = time()
    filename = file.split('/')[-1]
    print(file)
    with open(file + '.out', 'w', buffering=1024*1024) as w:
        with open(file, 'r', buffering=1024*1024) as f:
            for n, line in enumerate(f):

                # todo: for pure bandwidth calculations
                #w.write(line)
                #continue

                try:
                    d = json.loads(line)
                except Exception as e:
                    raise RuntimeError("'%s' in %s: %s" % (str(e), file, line))
                try:

                    domain = d['rrname']
                    ips = d['rdata']
                    for i in ips:
                        print("%s|%s" % (i, domain), file=w)
                except:
                    print(d, 'error')
                    pass
    return os.stat(file).st_size

def create_files(path, files, entries):
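    """Generate JSON-lines test files for benchmarking: 'files' files of 'entries' records each under 'path'."""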
    print('creating files')
    extra = [i for i in range(32)]
    if not os.path.exists(path):
        os.makedirs(path)
    for i in range(files):
        fn = os.path.join(path, 'in%d.json' % i)
        print(fn)
        with open(fn, 'w') as fp:
            for j in range(entries):
                json.dump({'rrname':'fred', 
                     'rdata':[str(k) for k in range(10)],
                     'extra':extra},fp)
                fp.write('\n')


if __name__ == "__main__":
    if 'create' in sys.argv:
        create_files(path, 1000, 100000)
        sys.exit(0)
    files_list = glob(os.path.join(path, '*.json'))
    print('processing', len(files_list), 'files in', path)
    print("Using %d cores" % cores)
    pp = Pool(processes=cores)
    total = 0
    start = time()
    for result in pp.imap_unordered(process_file, files_list):
        total += result
    pp.close()
    pp.join()
    delta = time() - start
    mb = total/1000000
    print('%d MB total, %d MB/s' % (mb, mb/delta))

Answer 2 (score: 0)

Answer 3 (score: 0)

To install:

pip install orjson 

For the import:

import orjson as json

This is especially effective when you want to dump or load large arrays.
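
Note that orjson is not a complete drop-in replacement: it only provides loads and dumps (there is no dump/load for file objects), and dumps returns bytes rather than str. A minimal sketch of how the per-line decoding in the scripts above could use it (the sample record is hypothetical):

import orjson

line = '{"rrname": "example.com", "rdata": ["1.2.3.4"]}'  # hypothetical sample record

# orjson.loads accepts str or bytes and returns a plain Python object
d = orjson.loads(line)
print(d['rrname'], d['rdata'])

# orjson.dumps returns bytes, so decode (or open the output file in binary mode) before writing text
print(orjson.dumps(d).decode())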