I am trying to merge two large data frames.
One data frame (patent_id) has 5,271,459 rows; the other has more than 10,000 columns.
To merge these two large data frames, I use "merge" and split the right-hand data frame into chunks (similar to MemoryError with python/pandas and large left outer joins).
But it still runs into a MemoryError. Is there any room for improvement?
Should I use "concat" instead of "merge"?
Or should I manage this with "csv" instead of "pandas", as in MemoryError with python/pandas and large left outer joins?
import pandas as pd

# patent_id (the 5,271,459-row key frame) is already loaded
for key in column_name:
    print key
    newname = '{}_post.csv'.format(key)
    # read the wide right-hand file in chunks of 10,000 rows
    patent_rotated_chunks = pd.read_csv(newname, iterator=True, chunksize=10000)
    temp = patent_id.copy(deep=True)
    # left-join each chunk onto the accumulated frame
    for patent_rotated in patent_rotated_chunks:
        temp = pd.merge(temp, patent_rotated, on=["patent_id_0"], how='left')
    temp.to_csv('{}_sorted.csv'.format(key))
    del temp
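If it is the accumulated merged frame that exhausts memory, one variation worth sketching is to merge each chunk against only the key column and append the result to the output file as it is produced, so nothing grows in memory. This is an untested sketch, not a verified fix: column_name, the file names, and the patent_id_0 key are taken from the code above, and note that it only writes matched rows (an inner join per chunk), so ids with no match are dropped rather than kept with NaNs as in how='left'.

import pandas as pd

# Sketch: merge each right-hand chunk against just the key column and
# append the merged piece to disk instead of holding it all in memory.
# Assumes patent_id.csv and the {key}_post.csv files from the code above,
# and that patent_id.csv has a patent_id_0 column.
keys_only = pd.read_csv("patent_id.csv", usecols=["patent_id_0"])

for key in column_name:
    outname = '{}_sorted.csv'.format(key)
    first = True
    for chunk in pd.read_csv('{}_post.csv'.format(key), chunksize=10000):
        merged = pd.merge(keys_only, chunk, on="patent_id_0", how="inner")
        # write/append each merged piece as soon as it is produced
        merged.to_csv(outname, mode='w' if first else 'a',
                      header=first, index=False)
        first = False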
Answer 0 (score: 1)
The following approach, based on MemoryError with python/pandas and large left outer joins, works for me:
import csv

def gen_chunks(reader, chunksize=1000000):
    """Yield lists of up to chunksize rows from a csv reader."""
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk

for key in column_name:
    idata = open("patent_id.csv", "rU")
    newcsv = '{}_post.csv'.format(key)
    odata = open(newcsv, "rU")

    leftdata = csv.reader(idata)
    next(leftdata)                     # skip the header of the left-hand file

    # load the whole right-hand file into a dict keyed on patent_id_0
    rightdata = csv.reader(odata)
    index = next(rightdata).index("patent_id_0")
    odata.seek(0)
    columns = ["project_id"] + next(rightdata)
    rd = dict([(rows[index], rows) for rows in rightdata])
    print rd.keys()[0]
    print rd.values()[0]

    with open('{}_sorted.csv'.format(key), "wb") as csvfile:
        output = csv.writer(csvfile)
        output.writerow(columns)       # writerow, not writerows, for one header row
        # stream the big left-hand file in chunks and look each key up in the dict
        for chunk in gen_chunks(leftdata):
            print key, " New Chunk!"
            ld = [[pid[1]] + rd.get(pid[1], ["NaN"]) for pid in chunk]
            output.writerows(ld)
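The core trick here is that the right-hand file is read once into an ordinary dict keyed on the join column, so each lookup is O(1), while the 5,271,459-row left-hand file is only ever streamed in chunks. Each {key}_post.csv must still fit in memory as a dict, but the long key file never does. A minimal, self-contained toy version of that pattern, with invented file names and columns, might look like this:

import csv

# Toy illustration of the dict-lookup left join above (made-up files):
#   left.csv  : patent_id_0                -- the long list of ids to keep
#   right.csv : patent_id_0,feat_a,feat_b  -- the wide table to attach
with open("right.csv", "rU") as f:
    rows = csv.reader(f)
    right_header = next(rows)
    key_col = right_header.index("patent_id_0")
    lookup = dict((r[key_col], r) for r in rows)   # whole right file in memory

with open("left.csv", "rU") as f, open("joined.csv", "wb") as out:
    rows = csv.reader(f)
    next(rows)                                     # skip the left header
    writer = csv.writer(out)
    writer.writerow(right_header)
    for r in rows:
        # unmatched ids get NaN placeholders, mimicking a left join
        writer.writerow(lookup.get(r[0],
                        [r[0]] + ["NaN"] * (len(right_header) - 1)))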