Suppose I have 4 small DataFrames df1, df2, df3 and df4:
import pandas as pd
from functools import reduce
import numpy as np
df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])
df1.columns = ['name', 'id', 'price']
df2.columns = ['name', 'id', 'price']
df3.columns = ['name', 'id', 'price']
df4.columns = ['name', 'id', 'price']
df1 = df1.rename(columns={'price':'pricepart1'})
df2 = df2.rename(columns={'price':'pricepart2'})
df3 = df3.rename(columns={'price':'pricepart3'})
df4 = df4.rename(columns={'price':'pricepart4'})
The above creates the 4 DataFrames, which I then merge in the code below.
# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df4, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
# Fill na values with 'missing'
df = df.fillna('missing')
So I've already achieved this for 4 DataFrames that don't have many rows and columns.

Basically, I want to extend the outer-merge solution above to MULTIPLE (48) DataFrames of size 62245 X 3.

So I came up with this solution, building on another StackOverflow answer that used a lambda reduce:
from functools import reduce
import pandas as pd
import numpy as np
dfList = []
# To create the 48 DataFrames of size 62245 X 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)),
                               columns=['name', 'id', 'pricepart' + str(i + 1)]))

# The solution I came up with to extend the merge to more than 3 DataFrames
df_merged = reduce(lambda left, right: pd.merge(left, right, left_on=['name', 'id'],
                                                right_on=['name', 'id'], how='outer'),
                   dfList).fillna('missing')
This results in a MemoryError.

I do not know what to do to stop the kernel from dying; I've been stuck on this for two days. Some code for the EXACT merge operation I am performing that does not cause a MemoryError, or something that gives the same result, would be really appreciated.

Also, the 3 columns in the main DataFrame (NOT the reproducible 48 DataFrames in the example) are of type int64, int64 and float64, and I'd prefer them to stay that way because of the integers and floats they represent.
EDIT:

Instead of iteratively trying to run the merge operations (or using the reduce lambda function), I have done it in groups of 2! Also, I've changed the datatype of some columns; some did not need to be float64, so I brought them down to float16. It gets very far but still ends up throwing a MemoryError.
intermediatedfList = dfList
tempdfList = []

# Merge the 48 frames two at a time, until the list is down to size 2
while len(intermediatedfList) != 2:
    # If there is an even number of DataFrames
    if len(intermediatedfList) % 2 == 0:
        # Reset the auxiliary list for this pass
        tempdfList = []
        # Go in steps of two
        for i in range(0, len(intermediatedfList), 2):
            # Merge the DataFrames at index i and i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1],
                           left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))
            # Append the merged pair to the auxiliary list
            tempdfList.append(df1)
        # After merging the DataFrames two at a time into tempdfList,
        # set intermediatedfList equal to tempdfList so the while loop can continue
        intermediatedfList = tempdfList
    else:
        # If there is an odd number of DataFrames, keep the first DataFrame out
        tempdfList = [intermediatedfList[0]]
        # Go in steps of two starting from 1 instead of 0
        for i in range(1, len(intermediatedfList), 2):
            # Merge the DataFrames at index i and i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1],
                           left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))
            tempdfList.append(df1)
        # Set intermediatedfList equal to tempdfList so the while loop can continue
        intermediatedfList = tempdfList
Is there any way I can optimize my code to avoid the MemoryError? I've even used an AWS instance with 192GB of RAM (I now owe them $7 that I could've given one of y'all); it gets farther than my own machine did, but it still throws a MemoryError after reducing the list of 28 DataFrames to 4.
Answer 0 (score: 3)
Performing an index-aligned concatenation using pd.concat may give you some mileage here. It should be faster and more memory-efficient than an outer merge.
df_list = [df1, df2, ...]
# Step 1: set the join keys as the index of every frame
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)

# Step 2: concatenate along columns, aligning rows on the shared index
df = pd.concat(df_list, axis=1)  # join='inner'
df.reset_index(inplace=True)
Alternatively, you can replace the concat (the second step) with an iterative join:
from functools import reduce
df = reduce(lambda x, y: x.join(y), df_list)
This may or may not be better than the merge. Note that DataFrame.join defaults to how='left'; pass how='outer' if you need the union of keys, matching the original outer merge.
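If you want to check which approach wins on your data, a rough timing sketch along these lines may help; it assumes df_list already holds the indexed frames from the first snippet, and adds how='outer' so the join matches the merge semantics:

import time
from functools import reduce
import pandas as pd

# Time the index-aligned concatenation
start = time.perf_counter()
via_concat = pd.concat(df_list, axis=1)
print('concat: {:.3f}s'.format(time.perf_counter() - start))

# Time the iterative join over the same frames
start = time.perf_counter()
via_join = reduce(lambda x, y: x.join(y, how='outer'), df_list)
print('join:   {:.3f}s'.format(time.perf_counter() - start))

# Footprint of each result (peak usage during the operation will be higher)
via_concat.info(memory_usage='deep')
via_join.info(memory_usage='deep')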
Answer 1 (score: 1)
You can try a simple for loop. The only memory optimization I have applied is downcasting to the most optimal int type via pd.to_numeric.

I am also using a dictionary to store the DataFrames. This is good practice for holding a variable number of variables.
import pandas as pd
dfs = {}
dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])
df = dfs[1].copy()

# Merge in the remaining frames one at a time, renaming the price column (2)
# to a unique label, and marking missing entries with -1
for i in range(2, max(dfs)+1):
    df = pd.merge(df, dfs[i].rename(columns={2: i+1}),
                  left_on=[0, 1], right_on=[0, 1], how='outer').fillna(-1)

# Downcast the price columns to the smallest integer type that fits
df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, downcast='integer')

print(df)
   0  1   2   3   4   5
0  a  1  10  15  -1  -1
1  a  2  20  20  -1  -1
2  b  1   4  -1  -1  -1
3  c  1   2   2  -1  -1
4  e  2  10  -1  20  20
5  d  1  -1  -1  10  10
6  f  1  -1  -1   1  15
As a rule, you should not combine strings such as 'missing' with numeric types, as this turns your entire series into an object-dtype series. We use -1 here, but you may wish to use NaN with a float dtype instead.
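A minimal sketch of that dtype effect:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])
print(s.dtype)                    # float64: NaN keeps the series numeric
print(s.fillna(-1).dtype)         # float64: a numeric sentinel stays numeric
print(s.fillna('missing').dtype)  # object: mixing in a string forces object dtype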
Answer 2 (score: 1)
This looks like part of what dask DataFrames were designed for (out-of-core operations on DataFrames). See Best way to join two large datasets in Pandas for example code. Sorry for not copying and pasting, but I don't want it to look like I'm trying to take credit from the answerer of the linked entry.
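For reference, a minimal sketch of what that could look like, assuming dask is installed and dfList holds the 48 frames from the question; npartitions is an illustrative guess, not a tuned value:

import dask.dataframe as dd
from functools import reduce

# Wrap each pandas DataFrame in a partitioned dask DataFrame
ddfs = [dd.from_pandas(df, npartitions=8) for df in dfList]

# Build the same chain of outer merges lazily; nothing is computed yet
merged = reduce(lambda left, right: dd.merge(left, right, on=['name', 'id'], how='outer'), ddfs)

# Materialize the result as a regular pandas DataFrame; the final frame still
# has to fit in memory (merged.to_parquet('out') would keep it on disk instead)
result = merged.compute()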
Answer 3 (score: 0)
So, you have 48 dfs with 3 columns each - name, id, and a price column that is different for each df.

You don't have to use merge... Instead, concat all the dfs:
df = pd.concat([df1,df2,df3,df4])
You will get:
Out[3]:
   id name  pricepart1  pricepart2  pricepart3  pricepart4
0   1    a        10.0         NaN         NaN         NaN
1   2    a        20.0         NaN         NaN         NaN
2   1    b         4.0         NaN         NaN         NaN
3   1    c         2.0         NaN         NaN         NaN
4   2    e        10.0         NaN         NaN         NaN
0   1    a         NaN        15.0         NaN         NaN
1   2    a         NaN        20.0         NaN         NaN
2   1    c         NaN         2.0         NaN         NaN
0   1    d         NaN         NaN        10.0         NaN
1   2    e         NaN         NaN        20.0         NaN
2   1    f         NaN         NaN         1.0         NaN
0   1    d         NaN         NaN         NaN        10.0
1   2    e         NaN         NaN         NaN        20.0
2   1    f         NaN         NaN         NaN        15.0
Now, group by name and id and take the sum:
df.groupby(['name','id']).sum().fillna('missing').reset_index()
If you try this with the 48 dfs, you will see that it resolves the MemoryError:
dfList = []

# To create the 48 DataFrames of size 62245 X 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)),
                               columns=['name', 'id', 'pricepart' + str(i + 1)]))

df = pd.concat(dfList)
df.groupby(['name', 'id']).sum().fillna('missing').reset_index()
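One caveat: on recent pandas versions, sum() returns 0 rather than NaN for groups where a price column is entirely NaN, so fillna('missing') would find nothing to replace. Passing min_count=1 keeps those cells as NaN:

df.groupby(['name', 'id']).sum(min_count=1).fillna('missing').reset_index()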