I have a lot of archived data stored as Python dicts that I am converting to JSON. I'd like to speed the process up if I can and am wondering whether anyone has any suggestions. This is my current working code:
import gzip
import ast
import json
import glob
import fileinput
from dateutil import parser

line = []
filestobeanalyzed = glob.glob('./data/*.json.gz')

for fileName in filestobeanalyzed:
    inputfilename = fileName
    print inputfilename  # to keep track of where I'm at
    for line in fileinput.input(inputfilename, openhook=fileinput.hook_compressed):
        line = line.strip()
        if not line: continue
        try:
            line = ast.literal_eval(line)
            line = json.dumps(line)
        except:
            continue
        date = json.loads(line).get('created_at')
        if not date: continue
        date_converted = parser.parse(date).strftime('%Y%m%d')
        outputfilename = gzip.open(date_converted, "a")
        outputfilename.write(line)
        outputfilename.write("\n")
        outputfilename.close()
There must be a more efficient way of doing this, I'm just not seeing it. Does anyone have any suggestions?
Answer 0 (score: 1)
First, profile it with http://packages.python.org/line_profiler/, which gives you line-by-line timings. You could also try running the files in parallel: use a multiprocessing Pool and map the list of files onto your function.
Although I suspect IO is the real problem here, that again depends on how large the files are.
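As a rough illustration (not part of the original answer), here is a minimal sketch of driving line_profiler programmatically; it assumes the line_profiler package is installed, and process_file is a hypothetical stand-in for the per-file work in the question:

# Hypothetical example: wrap the function you want timed with LineProfiler,
# run it once on a representative input, then print per-line timings.
from line_profiler import LineProfiler

def process_file(path):
    # ... literal_eval each line, convert the date, write the output ...
    pass

lp = LineProfiler()
timed = lp(process_file)           # returns a wrapped, profiled callable
timed('./data/example.json.gz')    # exercise it on one representative file
lp.print_stats()                   # line-by-line timings on stdout

# Alternatively, decorate process_file with @profile and run:
#   kernprof -l -v yourscript.py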
# Assuming you have one large dictionary spread across multi-part gzip files.
def files_to_be_analyzed(files):
    lines = ast.literal_eval("".join([gzip.open(file).read() for file in files]))
    date = lines['created_at']
    date_converted = parser.parse(date).strftime('%Y%m%d')
    output_file = gzip.open(date_converted, "a")
    output_file.write(json.dumps(lines) + "\n")
    output_file.close()
I'm still not entirely sure what you're trying to achieve, but read and write as little as possible, because IO is a killer. If you need to work over multiple directories, then have a look at multiprocessing along these lines:
import gzip
import ast
import json
import glob
import fileinput
from dateutil import parser
from multiprocessing import Pool

# Assuming you have one large dictionary spread across multi-part gzip files.
def files_to_be_analyzed(files):
    lines = ast.literal_eval("".join([gzip.open(file).read() for file in files]))
    date = lines['created_at']
    date_converted = parser.parse(date).strftime('%Y%m%d')
    output_file = gzip.open(date_converted, "a")
    output_file.write(json.dumps(lines) + "\n")
    output_file.close()

if __name__ == '__main__':
    pool = Pool(processes=5)  # or however many cores you have
    directories = ['/path/to/this/dire', '/path/to/another/dir']
    pool.map(files_to_be_analyzed, [glob.glob(path) for path in directories])
    pool.close()
    pool.join()