I have the code below; it just parses text files containing JSON objects, then converts them to a dataframe before saving to disk as CSV. I'm trying to figure out whether this is the most memory-efficient approach, and what is gradually exhausting memory, since it runs over roughly 200 different files, each saved as 1-10M rows, and by the time it finishes it has consumed more than 50 GB of memory. More specifically: if I don't care about doing any analysis and only want to convert the data to CSV, are Pandas and DataFrames still the best choice, or is there a different implementation that won't eat my memory?
import os
import zipfile

import pandas as pd
import ujson


def readfiles(pattern, sourcefile):
    """Iterate through all log files in a zip archive and yield trigger data."""
    try:
        with zipfile.ZipFile(sourcefile, 'r') as myzip:
            for logfile in myzip.namelist():
                for line in myzip.open(logfile):
                    try:
                        # Lines may carry a trailing comma; strip it before parsing.
                        record = ujson.loads(line.decode('utf-8').rstrip('\n').rstrip(','))
                        if pattern in record:
                            for i in record['key1']:
                                yield (i, record['key2']['key3'],
                                       record['key4']['key5'], record['key6'],
                                       record['key7']['key8'], record['key9']['key10'])
                    except ValueError:
                        pass
    except zipfile.BadZipFile:
        pass
def convertdfcsv(lines, filename):
    """Consumer for the readfiles generator that saves a dataframe as csv."""
    # from_records drains the entire generator, so every row for the file
    # sits in memory at once before anything is written to disk.
    df = pd.DataFrame.from_records(lines)
    df.to_csv(os.path.join(triggertempdir, filename), index=False, header=None)
    print("Completed Processing {}".format(filename))
def main(pattern, min_date, max_date):
    """Main function to initiate the pipeline."""
    sourcezipfiles = retrieve_from_s3(min_date, max_date)
    for sourcefile in sourcezipfiles:
        lines = readfiles(pattern, sourcefile)
        csvout = '{}.csv'.format(sourcefile[:-4])
        convertdfcsv(lines, csvout)
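Worth noting when weighing alternatives: pd.DataFrame.from_records(lines) pulls the whole generator into memory before a single byte reaches disk, so each file's 1-10M rows are fully materialized. A consumer that streams rows straight to disk with the standard-library csv module avoids that; a minimal sketch, assuming the same readfiles generator and triggertempdir directory as above (the function name is just illustrative):

import csv
import os

def writecsvstreaming(lines, filename):
    """Write rows from the readfiles generator to disk one at a time."""
    with open(os.path.join(triggertempdir, filename), 'w', newline='') as f:
        writer = csv.writer(f)
        for row in lines:
            writer.writerow(row)
    print("Completed Processing {}".format(filename))

Dropped in place of convertdfcsv inside main, this keeps peak memory at roughly one row at a time rather than every row of the file.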
Answer 0 (score: 0)
I would just use command-line tools such as jq or json2csv, though I haven't fully followed your parsing.
Here is an example that extracts key1 and key2 and outputs them comma-separated: jq -r '[.key1, .key2] | @csv'
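To run that over the zipped logs end-to-end, and assuming each line is a standalone JSON object with an optional trailing comma (as the rstrip(',') in the Python code suggests), an illustrative pipeline might look like this (the archive and output names are placeholders):

unzip -p source.zip | sed 's/,$//' | jq -r '[.key1, .key2] | @csv' > out.csv

Because jq processes one line at a time in this mode, memory use stays flat no matter how large the input is.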