I have about 90 data frames, each around 1 GB with roughly 5 million rows.
Each of the 90 contains a unique ID column whose values match across all the other data frames.
Two examples are:
DF1
Year ID Value
1950 1 0.4
1950 2 0.2
1950 3 0.1
1950 4 0.8
DF2
Year ID Value
1951 1 0.9
1951 2 0.6
1951 3 0.7
1951 4 0.6
I want to take the mean of Value across all frames wherever the IDs match. Since each individual file is so large, I can't keep them all in memory. I came up with an approach, but it is extremely slow (it re-reads every file from disk once for each of the ~5 million candidate IDs), and I hope there is a better way.
The current approach is:
import pandas as pd
import os
import numpy as np

# list with unique ids found in all frames
uniques = np.arange(1, 5000000, 1)

# folder containing the csv files to loop through
files = "C:/path_to_csvs"

# empty dataframe to store all means
final = pd.DataFrame()

for i in uniques:
    # empty dataframe to collect the rows matching a single unique ID
    single_combined = pd.DataFrame()
    for f in os.listdir(files):
        df2 = pd.read_csv(os.path.join(files, f))
        # select rows where the id's match
        df2 = df2[(df2['ID'] == i)]
        # if there is a match, append the row
        if df2.shape[0] != 0:
            single_combined = single_combined.append(df2)
    # groupby ID to get the mean of Value
    means = single_combined.groupby(['ID'])[['Value']].mean().reset_index()
    # append the mean to the final dataframe
    final = final.append(means)

print(final)
Answer 0 (score: 1):
Idea: mean = sum / count, so let's read the files one by one, compute the sum and the count (size) of Value per ID for each file, and accumulate them, storing the running sum and count. Once all files have been processed, we can easily compute mean = sum / count.
So consider the following approach:
import glob
import pandas as pd

files = glob.glob('d:/temp/.data/46307213/*.csv')

res = pd.DataFrame()
for f in files:
    res = pd.concat([res,
                     pd.read_csv(f).groupby('ID')['Value']
                       .agg(['sum', 'size'])]) \
            .groupby('ID').sum()
res['mean'] = res.pop('sum') / res.pop('size')
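Note the design choice: re-grouping and summing right after each concat collapses res back to a single row per unique ID, so at any moment memory holds only the running aggregate plus one file's worth of per-ID sums, never the raw rows of all 90 files.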
Demo:
Source CSV files:
1.csv:
Year,ID,Value
1950,1,0.4
1950,2,0.2
1950,3,0.1
1950,4,0.8
2.csv:
Year,ID,Value
1951,1,0.9
1951,2,0.6
1951,5,0.7
1951,6,0.6
3.csv:
Year,ID,Value
1952,1,0.9
1952,1,0.6
1952,5,0.7
1952,5,0.6
Result:
In [103]: %paste
import glob
files = glob.glob('d:/temp/.data/46307213/*.csv')

res = pd.DataFrame()
for f in files:
    res = pd.concat([res,
                     pd.read_csv(f).groupby('ID')['Value']
                       .agg(['sum', 'size'])]) \
            .groupby('ID').sum()
res['mean'] = res.pop('sum') / res.pop('size')
print(res)
## -- End pasted text --
        mean
ID
1   0.700000
2   0.400000
3   0.100000
4   0.800000
5   0.666667
6   0.600000
Conclusion: each file only needs to be read from disk once.
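If even a single CSV is too large to load in one go, the same running-aggregate idea also works per chunk. Below is a minimal sketch, not part of the original answer, using pandas' chunksize parameter; the chunk size of 1_000_000 rows is an arbitrary assumption to tune against available memory:

import glob
import pandas as pd

files = glob.glob('d:/temp/.data/46307213/*.csv')

res = None
for f in files:
    # read_csv with chunksize yields an iterator of DataFrames
    # (chunk size of 1_000_000 rows is an assumption; tune to memory)
    for chunk in pd.read_csv(f, chunksize=1_000_000):
        # partial sum and count of Value per ID for this chunk only
        part = chunk.groupby('ID')['Value'].agg(['sum', 'size'])
        # fold the partial aggregates into the running totals,
        # collapsing duplicate IDs by summing (level 0 is the ID index)
        res = part if res is None else pd.concat([res, part]).groupby(level=0).sum()

res['mean'] = res.pop('sum') / res.pop('size')
print(res)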