How to optimize memory usage in pandas

Date: 2016-06-26 07:15:49

Tags: python pandas out-of-memory

I'm trying to use pandas to merge 3 files of roughly 3 GB, 200 KB and 200 KB, and my machine has 32 GB of RAM, yet the merge still ends with a MemoryError. Is there any way to avoid this problem? My merge code is below:

import pandas as pd

# Product lookup: encode Band as an integer ID, drop the unused columns
product = pd.read_csv("../data/process_product.csv", header=0)
product["bandID"] = pd.factorize(product.Band)[0]
product = product.drop('Band', axis=1)
product = product.drop('Info', axis=1)

# Town/state lookup: one-hot encode State, encode Town as an integer ID
town_state = pd.read_csv("../data/town_state.csv", header=0)
dumies = pd.get_dummies(town_state.State)
town_state = pd.concat([town_state, dumies], axis=1)
town_state["townID"] = pd.factorize(town_state.Town)[0]
town_state = town_state.drop('State', axis=1)
town_state = town_state.drop('Town', axis=1)

# Main (~3 GB) table, merged against both lookups and written back out
train = pd.read_csv("../data/train.csv", header=0)

result = pd.merge(train, town_state, on="Agencia_ID", how='left')
result = pd.merge(result, product, on="Producto_ID", how='left')
result.to_csv("../data/train_data.csv")

1 Answer:

Answer 0 (score: 1)

Here is my attempt at a "micro" optimization:

You don't use (need) the Info column from process_product.csv, so there is no need to read it in the first place:

cols = [<list of columns, EXCEPT Info column>]
product = pd.read_csv("../data/process_product.csv", usecols=cols)
product['Band'] = pd.factorize(product.Band)[0]
product.rename(columns={'Band':'bandID'}, inplace=True)
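
If you don't know the column names up front, one way to build the cols list (a minimal sketch of my own, not part of the original answer) is to read only the header row first; nrows=0 costs almost nothing:

import pandas as pd

# Read just the header row to discover the column names, then keep everything
# except the unused Info column (file path and column name come from the question).
header = pd.read_csv("../data/process_product.csv", nrows=0)
cols = [c for c in header.columns if c != 'Info']
product = pd.read_csv("../data/process_product.csv", usecols=cols)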

We can also try to save some memory on the dumies variable by building the dummies on the fly with get_dummies() and its sparse=True parameter:

town_state = pd.concat([town_state, pd.get_dummies(town_state.State, sparse=True)], axis=1)
del town_state['State']
town_state['Town'] = pd.factorize(town_state.Town)[0]
town_state.rename(columns={'Town':'townID'}, inplace=True)
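
A quick sanity check (my addition, not from the original answer) is to compare the deep memory footprint of the dense and sparse dummies before deciding which to keep; memory_usage(deep=True) reports per-column byte counts:

import pandas as pd

# Re-read the small town_state file and build both variants purely for comparison
ts = pd.read_csv("../data/town_state.csv")
dense = pd.get_dummies(ts.State)
sparse = pd.get_dummies(ts.State, sparse=True)
print("dense :", dense.memory_usage(deep=True).sum(), "bytes")
print("sparse:", sparse.memory_usage(deep=True).sum(), "bytes")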

To save memory while building the result DF, merge in place and delete town_state from memory as soon as it is no longer needed:

train = pd.merge(train, town_state, on="Agencia_ID", how='left')
del town_state
train = pd.merge(train, product, on="Producto_ID", how='left')
del product
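
Another lever worth trying (my suggestion, beyond the original answer) is downcasting the numeric columns of the ~3 GB train.csv at read time. Only Agencia_ID and Producto_ID are known from the question; the other columns would be handled the same way once their names and value ranges are known:

# Assumed dtypes: int32 is usually enough for ID columns, but verify against your data
dtypes = {
    "Agencia_ID": "int32",
    "Producto_ID": "int32",
}
train = pd.read_csv("../data/train.csv", dtype=dtypes)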

PS: I don't know which file/DF is the biggest one (the ~3 GB one), so I assumed it is the train DF. If it is actually the product DF, I would do it this way instead:

product = pd.merge(train, product, on="Producto_ID", how='left')
del train
product = pd.merge(product, town_state, on="Agencia_ID", how='left')
del town_state
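
If even that is not enough, a chunked variant (a sketch under my own assumptions, not part of the original answer) never holds the full 3 GB table in memory: it streams train.csv in pieces, merges each piece against the two small lookup frames, and appends the result straight to the output file:

import pandas as pd

# product and town_state are the small lookup DataFrames prepared above and are
# kept in memory; only one chunk of train.csv is loaded at a time.
first = True
for chunk in pd.read_csv("../data/train.csv", chunksize=1_000_000):
    chunk = pd.merge(chunk, town_state, on="Agencia_ID", how="left")
    chunk = pd.merge(chunk, product, on="Producto_ID", how="left")
    chunk.to_csv("../data/train_data.csv",
                 mode="w" if first else "a",
                 header=first, index=False)
    first = False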