I'm running a script (in multiprocessing mode) that extracts some parameters from a bunch of JSON files, but at the moment it is very slow. Here is the script:
from __future__ import print_function, division
import os
from glob import glob
from os import getpid
from time import time
from sys import stdout
import resource
from multiprocessing import Pool
import subprocess
try:
    import simplejson as json
except ImportError:
    import json
path = '/data/data//*.A.1'
print("Running with PID: %d" % getpid())
def process_file(file):
    start = time()
    filename = file.split('/')[-1]
    print(file)
    with open('/data/data/A.1/%s_DI' % filename, 'w') as w:
        with open(file, 'r') as f:
            for n, line in enumerate(f):
                d = json.loads(line)
                try:
                    domain = d['rrname']
                    ips = d['rdata']
                    for i in ips:
                        print("%s|%s" % (i, domain), file=w)
                except:
                    print(d)
                    pass
if __name__ == "__main__":
    files_list = glob(path)
    cores = 12
    print("Using %d cores" % cores)
    pp = Pool(processes=cores)
    pp.imap_unordered(process_file, files_list)
    pp.close()
    pp.join()
Does anyone have an idea how to speed this up?
Answer 0 (score: 1)
First, find out where your bottleneck is.

If it is in the JSON encoding/decoding step, try switching to ultrajson:

UltraJSON is an ultra fast JSON encoder and decoder written in pure C with bindings for Python 2.5+ and 3.

The change would be as simple as adjusting the import section:
try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json
I also ran a simple benchmark in What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary?; take a look.
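A quick way to confirm where the time goes is a small timeit comparison of the available decoders; a minimal sketch, with a made-up sample record modeled on the rrname/rdata lines from the question:

# timing sketch; the sample line below is hypothetical
import timeit

sample = '{"rrname": "example.com", "rdata": ["1.2.3.4", "5.6.7.8"], "extra": [0, 1, 2]}'

for name in ("json", "simplejson", "ujson"):
    try:
        loads = __import__(name).loads
    except ImportError:
        continue  # decoder not installed, skip it
    seconds = timeit.timeit(lambda: loads(sample), number=100000)
    print("%-10s %.2f s for 100k loads" % (name, seconds))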
Answer 1 (score: 0)
I updated the script to try different experiments and found that, yes, JSON parsing is CPU-bound. I got 28 MB/s, which is better than your .04 Gig per minute (under 1 MB/s), so I'm not sure what is going on there. When skipping the JSON work and just writing to the output file, I got 996 MB/s.

With the code below, you can generate a dataset with python slow.py create and test several scenarios by changing the code marked todo:. My dataset was only 800 MB, so the I/O was absorbed by the RAM cache (run it twice to make sure the files being read are cached).

I was surprised that JSON decoding is so CPU-intensive.
from __future__ import print_function, division
import os
from glob import glob
from os import getpid
from time import time
from sys import stdout
import resource
from multiprocessing import Pool, cpu_count
import subprocess
# todo: pick your poison
#import json
#import ujson as json
import simplejson as json
import sys
# todo: choose your data path
#path = '/data/data//*.A.1'
#path = '/tmp/mytest'
path = os.path.expanduser('~/tmp/mytest')
# todo: choose your cores
#cores = 12
cores = cpu_count()
print("Running with PID: %d" % getpid())
def process_file(file):
    start = time()
    filename = file.split('/')[-1]
    print(file)
    with open(file + '.out', 'w', buffering=1024*1024) as w:
        with open(file, 'r', buffering=1024*1024) as f:
            for n, line in enumerate(f):
                # todo: for pure bandwidth calculations
                #w.write(line)
                #continue
                try:
                    d = json.loads(line)
                except Exception as e:
                    raise RuntimeError("'%s' in %s: %s" % (str(e), file, line))
                try:
                    domain = d['rrname']
                    ips = d['rdata']
                    for i in ips:
                        print("%s|%s" % (i, domain), file=w)
                except:
                    print(d, 'error')
                    pass
    return os.stat(file).st_size
def create_files(path, files, entries):
    print('creating files')
    extra = [i for i in range(32)]
    if not os.path.exists(path):
        os.makedirs(path)
    for i in range(files):
        fn = os.path.join(path, 'in%d.json' % i)
        print(fn)
        with open(fn, 'w') as fp:
            for j in range(entries):
                json.dump({'rrname': 'fred',
                           'rdata': [str(k) for k in range(10)],
                           'extra': extra}, fp)
                fp.write('\n')

if __name__ == "__main__":
    if 'create' in sys.argv:
        create_files(path, 1000, 100000)
        sys.exit(0)
    files_list = glob(os.path.join(path, '*.json'))
    print('processing', len(files_list), 'files in', path)
    print("Using %d cores" % cores)
    pp = Pool(processes=cores)
    total = 0
    start = time()
    for result in pp.imap_unordered(process_file, files_list):
        total += result
    pp.close()
    pp.join()
    delta = time() - start
    mb = total/1000000
    print('%d MB total, %d MB/s' % (mb, mb/delta))
Answer 2 (score: 0)
Change the import from

import json

to

import ujson
https://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/
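Since the script in the question only calls json.loads on each line, ujson can serve as a drop-in for that use; a minimal sketch with a fallback to the standard library (the sample line below is made up):

# minimal sketch: ujson as a drop-in for per-line json.loads, with a stdlib fallback
try:
    import ujson as json
except ImportError:
    import json

line = '{"rrname": "example.com", "rdata": ["1.2.3.4", "5.6.7.8"]}'  # hypothetical record
d = json.loads(line)
for ip in d['rdata']:
    print("%s|%s" % (ip, d['rrname']))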
Answer 3 (score: 0)

To install:

pip install orjson

For the import:

import orjson as json

This is especially effective when you want to dump or load large arrays.
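A minimal sketch of how this fits the per-line decoding from the question, assuming orjson is installed. Note that orjson exposes only loads and dumps (and dumps returns bytes rather than str), so it is not a full drop-in for code that relies on json.dump/json.load with file objects:

# minimal sketch; the sample record below is made up
import orjson as json

line = '{"rrname": "example.com", "rdata": ["1.2.3.4", "5.6.7.8"]}'
d = json.loads(line)                 # orjson.loads accepts str or bytes
print("%s|%s" % (d['rdata'][0], d['rrname']))
print(json.dumps(d))                 # orjson.dumps returns bytes, not str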