I am working with a bunch of dataframe chunks like the one below, where Version and ASSAY together form a unique identifier:
  Version ASSAY  Resp_Rob_Sigmas
0    A123     F             0.56
1    B234     G             0.78
2    C345     R             0.90
3    D456     F             1.00
4    D456     G             0.30
I need to create a pivot table that looks like this:
         F     G    R
A123  0.56    NA   NA
B234    NA  0.78   NA
C345    NA    NA  0.9
D456   1.0   0.3   NA
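For reference, the target layout can be reproduced on the five sample rows with a small sketch. It assumes Version holds identifiers such as A123 and ASSAY holds the letters, as in the sample; the names `sample` and `wide` are illustrative only.

```python
import pandas as pd

# Sample rows copied from the example data.
sample = pd.DataFrame({
    "Version": ["A123", "B234", "C345", "D456", "D456"],
    "ASSAY": ["F", "G", "R", "F", "G"],
    "Resp_Rob_Sigmas": [0.56, 0.78, 0.9, 1.0, 0.3],
})

# Each (Version, ASSAY) pair is unique, so pivot() works without aggregation.
# Missing combinations come out as NaN.
wide = sample.pivot(index="Version", columns="ASSAY", values="Resp_Rob_Sigmas")
print(wide)
```

Because the pairs are unique, `pivot` is enough here; `pivot_table` would additionally aggregate (mean by default), which is extra work when there is nothing to aggregate.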
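Before chunking and before unzipping, the dataframe is 13 GB, so the pivot table blows up while it is being created and causes a memory error. My current code is as follows: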
import pandas as pd
import zipfile

# Number of lines to be read at a time from csv
chunk_size = 10 ** 5
merged_df = pd.DataFrame([])
folder = zipfile.ZipFile(OP_DIRECTORY + "/file.zip")

# Reading csv in chunks, dropping columns, dropping rows with null responses.
for chunk in pd.read_csv(folder.open("file.csv"), chunksize=chunk_size):
    df = pd.DataFrame(chunk)
    # Operations on df
    ...
    ...
    merged_df = merged_df.append(df)

# Pivoting data to create matrix
df = pd.pivot_table(merged_df, index=['Version'], values=['Resp_Rob_Sigmas'], columns=['ASSAY'])
df.to_csv("output.csv")
How can I prevent the memory error and optimize this?
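One possible direction, offered as a sketch and not tested on the full 13 GB file: read only the three needed columns, pivot each chunk on its own, and merge the partial pivots at the end. The helper `pivot_in_chunks` and the in-memory `csv_text` are hypothetical stand-ins for the real zipped file, and the `float32` dtype assumes reduced precision is acceptable.

```python
import io
import pandas as pd

# Hypothetical stand-in for file.csv; the real file would come from the zip archive.
csv_text = """Version,ASSAY,Resp_Rob_Sigmas
A123,F,0.56
B234,G,0.78
C345,R,0.9
D456,F,1.0
D456,G,0.3
"""

def pivot_in_chunks(source, chunk_size):
    """Pivot each chunk separately, then merge the partial pivots."""
    partials = []
    reader = pd.read_csv(
        source,
        usecols=["Version", "ASSAY", "Resp_Rob_Sigmas"],  # skip unused columns
        dtype={"Resp_Rob_Sigmas": "float32"},  # assumption: float32 is precise enough
        chunksize=chunk_size,
    )
    for chunk in reader:
        chunk = chunk.dropna(subset=["Resp_Rob_Sigmas"])
        partials.append(
            chunk.pivot(index="Version", columns="ASSAY", values="Resp_Rob_Sigmas")
        )
    # A Version can be split across chunks, so collapse duplicate index labels,
    # keeping the first non-null value found in each column.
    combined = pd.concat(partials)
    return combined.groupby(level=0).first()

wide = pivot_in_chunks(io.StringIO(csv_text), chunk_size=2)
print(wide)
```

This avoids growing `merged_df` one chunk at a time, but it does not shrink the final wide table. If that table itself does not fit in memory, a different strategy would be needed.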