How can we use a "left outer join" for large pandas DataFrames (larger than 5~20GB)?

Time: 2016-08-22 13:35:09

Tags: python pandas join memory-management merge

I am trying to merge two large DataFrames.

One DataFrame (patent_id) has 5,271,459 rows; the other has more than 10,000 columns.

To merge these two big DataFrames, I use "merge" and split the right DataFrame into chunks (similar to MemoryError with python/pandas and large left outer joins).

But it still runs into a MemoryError. Is there any room for improvement?

Should I use "concat" instead of "merge"? For reference, a sketch of the concat variant I have in mind follows the code below.

Or should I handle this with "csv" instead of "pandas", as in (MemoryError with python/pandas and large left outer joins)?

import pandas as pd

for key in column_name:
    print key
    newname = '{}_post.csv'.format(key)
    # read the right-hand frame for this key in chunks of 10,000 rows
    patent_rotated_chunks = pd.read_csv(newname, iterator=True, chunksize=10000)

    temp = patent_id.copy(deep=True)

    # left-merge each chunk into the accumulated frame
    for patent_rotated in patent_rotated_chunks:
        temp = pd.merge(temp, patent_rotated, on=["patent_id_0"], how='left')

    temp.to_csv('{}_sorted.csv'.format(key))

    del temp
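
The "concat" variant I have in mind would look roughly like this (same names as above; I suspect it still has to materialize the whole right frame in memory, which is why I am asking):

patent_rotated = pd.concat(
    pd.read_csv(newname, chunksize=10000),  # same chunked reader as above
    ignore_index=True,
)
result = pd.merge(patent_id, patent_rotated, on=["patent_id_0"], how='left')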

1 Answer:

Answer 0 (score: 1)

The following approach works for me, adapted from MemoryError with python/pandas and large left outer joins:

import csv

def gen_chunks(reader, chunksize=1000000):
    """Yield successive lists of up to `chunksize` rows from a csv reader."""
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk

for key in column_name:

    # left table: the patent ids
    idata = open("patent_id.csv", "rU")
    # right table: the wide frame for this key
    newcsv = '{}_post.csv'.format(key)
    odata = open(newcsv, "rU")

    leftdata = csv.reader(idata)
    next(leftdata)  # skip the header row

    rightdata = csv.reader(odata)

    # locate the join column, then rewind and re-read the header
    index = next(rightdata).index("patent_id_0")
    odata.seek(0)
    columns = ["project_id"] + next(rightdata)

    # build an in-memory lookup: join key -> full right-hand row (a hash join)
    rd = dict([(rows[index], rows) for rows in rightdata])

    print rd.keys()[0]
    print rd.values()[0]

    with open('{}_sorted.csv'.format(key), "wb") as csvfile:
        output = csv.writer(csvfile)
        output.writerow(columns)  # writerow, not writerows: one header line

        for chunk in gen_chunks(leftdata):
            print key, " New Chunk!"
            # left outer join: keep every left row, pad misses to full width
            ld = [[pid[1]] + rd.get(pid[1], ["NaN"] * (len(columns) - 1))
                  for pid in chunk]
            output.writerows(ld)

    idata.close()
    odata.close()
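
If you want to stay in pandas instead of csv, the same idea can be sketched by loading the right table once and streaming the left file in chunks, appending each merged slice to the output as you go. This is only a sketch under the same assumption as the code above (the right table alone fits in memory), reusing key and newcsv from the loop:

import pandas as pd

right = pd.read_csv(newcsv)  # the right table alone must fit in memory
first = True
for left_chunk in pd.read_csv("patent_id.csv", chunksize=1000000):
    # left outer join of this slice only; the full result never exists in memory
    merged = left_chunk.merge(right, on="patent_id_0", how="left")
    merged.to_csv('{}_sorted.csv'.format(key),
                  mode='w' if first else 'a',
                  header=first, index=False)
    first = False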