Python Pandas for creating CSV and memory usage

Date: 2014-10-13 01:04:49

Tags: python pandas

I have the code below, which parses text files containing JSON objects and converts them to a DataFrame before saving to disk as CSV. I am trying to figure out whether this is the most memory-efficient approach, and what is gradually exhausting memory, because it runs over roughly 200 different files, each producing 1-10M rows, and by the time it finishes it has consumed more than 50 GB of RAM. More specifically: if I don't care about doing any analysis and only want to convert the data to CSV, is Pandas with DataFrames still the best choice, or is there a different implementation that won't blow up my memory?

import os
import zipfile

import ujson
import pandas as pd

def readfiles(pattern, sourcefile):
    """Yield one record tuple per matching JSON line in every member of a zip archive."""
    try:
        with zipfile.ZipFile(sourcefile, 'r') as myzip:
            for logfile in myzip.namelist():
                for line in myzip.open(logfile):
                    try:
                        # Lines come back as bytes; strip the trailing newline/comma before parsing.
                        record = ujson.loads(line.decode('utf-8').rstrip('\n').rstrip(','))
                        if pattern in record:
                            for i in record['key1']:
                                yield (i, record['key2']['key3'],
                                       record['key4']['key5'], record['key6'],
                                       record['key7']['key8'], record['key9']['key10'])
                    except ValueError:
                        # Skip lines that are not valid JSON.
                        pass
    except zipfile.BadZipFile:
        # Skip archives that cannot be opened.
        pass

def convertdfcsv(lines, filename):
    """Consumer for the readfiles generator that saves a DataFrame as CSV."""
    # from_records drains the entire generator into memory before writing.
    df = pd.DataFrame.from_records(lines)
    # triggertempdir is assumed to be defined at module level.
    df.to_csv(os.path.join(triggertempdir, filename), index=False, header=None)
    print("Completed Processing {}".format(filename))

def main(pattern, min_date, max_date):
    """Main function to initiate the pipeline."""
    # retrieve_from_s3 is defined elsewhere and returns the zip file paths.
    sourcezipfiles = retrieve_from_s3(min_date, max_date)
    for i in sourcezipfiles:
        lines = readfiles(pattern, i)
        csvout = '{}.csv'.format(i[:-4])  # swap the .zip extension for .csv
        convertdfcsv(lines, csvout)
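
For context on the memory question: pd.DataFrame.from_records(lines) drains the whole generator into RAM before a single byte is written, so peak memory scales with the number of rows in each file. A minimal pandas-free sketch that streams rows straight to disk with the standard-library csv module could look like the following (it reuses the readfiles generator above; convertcsv_streaming is a hypothetical name, not part of the original code):

import csv

def convertcsv_streaming(lines, filename):
    """Hypothetical consumer that writes rows as they are yielded, keeping memory flat."""
    with open(os.path.join(triggertempdir, filename), 'w', newline='') as f:
        # writerows pulls from the generator lazily; only one row is held at a time.
        csv.writer(f).writerows(lines)
    print("Completed Processing {}".format(filename))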

1 Answer:

Answer 0 (score: 0)

I would just use command-line tools such as jq or json2csv, though I haven't fully followed your parsing logic.

Here is an example that extracts key1 and key2 and outputs them comma-separated:

jq -r '[.key1, .key2] | @csv'
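
Assuming a single well-formed JSON object per line, one way to run this over a zipped log without extracting it to disk (the file names here are hypothetical) could be:

unzip -p logs.zip | jq -r '[.key1, .key2] | @csv' > logs.csv

unzip -p streams every archive member to stdout, and jq emits one CSV row per input object, so memory stays flat regardless of file size.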