I'm trying to use pandas to merge three files of roughly 3 GB, 200 KB and 200 KB, and my machine has 32 GB of RAM, but it still ends with a MemoryError. Is there any way to avoid this problem? My merge code is as follows:
product = pd.read_csv("../data/process_product.csv", header=0)
product["bandID"] = pd.factorize(product.Band)[0]
product = product.drop('Band', axis=1)
product = product.drop('Info', axis=1)
town_state = pd.read_csv("../data/town_state.csv", header=0)
dumies = pd.get_dummies(town_state.State)
town_state = pd.concat([town_state, dumies], axis=1)
town_state["townID"] = pd.factorize(town_state.Town)[0]
town_state = town_state.drop('State', axis=1)
town_state = town_state.drop('Town', axis=1)
train = pd.read_csv("../data/train.csv", header=0)
result = pd.merge(train, town_state, on="Agencia_ID", how='left')
result = pd.merge(result, product, on="Producto_ID", how='left')
result.to_csv("../data/train_data.csv")
Answer 0 (score: 1)
Here is my attempt at a few "micro" optimizations:
You don't use (need) the Info column from process_product.csv, so there is no need to read it:
cols = [<list of columns, EXCEPT Info column>]
product = pd.read_csv("../data/process_product.csv", usecols=cols)
product['Band'] = pd.factorize(product.Band)[0]
product.rename(columns={'Band':'bandID'}, inplace=True)
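As a side note, one way to build that cols list without typing every name (just a sketch; the actual header of process_product.csv isn't shown in the question) is to read only the header row and filter out 'Info':
import pandas as pd

# Read zero data rows (header only) to discover the column names,
# then keep everything except the unused 'Info' column.
header = pd.read_csv("../data/process_product.csv", nrows=0)
cols = [c for c in header.columns if c != 'Info']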
We can try to save some memory on the dumies variable by using get_dummies() on the fly with the sparse=True parameter:
town_state = pd.concat([town_state, pd.get_dummies(town_state.State, sparse=True)], axis=1)
del town_state['State']
town_state['Town'] = pd.factorize(town_state.Town)[0]
town_state.rename(columns={'Town':'townID'}, inplace=True)
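To check whether the sparse dummies actually pay off, you can print the estimated in-memory size of the frame before and after the change (purely diagnostic, it changes nothing):
# Total estimated memory in bytes; deep=True also counts object (string) columns.
print(town_state.memory_usage(deep=True).sum())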
Try to save memory on the result DF, deleting town_state from memory as soon as possible:
train = pd.merge(train, town_state, on="Agencia_ID", how='left')
del town_state
train = pd.merge(train, product, on="Producto_ID", how='left')
del product
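If you want to be extra sure the freed memory is reclaimed right away (usually unnecessary, since CPython releases a DataFrame as soon as its last reference is dropped), you can also run the garbage collector after the del statements:
import gc

# Force a collection pass after deleting the large intermediate DataFrames.
gc.collect()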
PS: I don't know which file / DF is the largest one (the ~3 GB one), so I assumed it is the train DF. If it is the product DF, then I would do it this way:
product = pd.merge(train, product, on="Producto_ID", how='left')
del train
product = pd.merge(product, town_state, on="Agencia_ID", how='left')
del town_state
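Finally, if train.csv really is the ~3 GB file, another thing worth trying (my own assumption, not part of the answer above: the ID columns would have to fit into 32-bit integers) is to pass dtype= to read_csv so they are loaded as int32 instead of the default int64:
# Assumption: Agencia_ID and Producto_ID fit into 32-bit integers.
train = pd.read_csv("../data/train.csv", header=0,
                    dtype={'Agencia_ID': 'int32', 'Producto_ID': 'int32'})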