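Match rows between dataframes and take the mean of all matches

Time: 2017-09-19 18:12:42

Tags: python-2.7 pandas

I have about 90 dataframes, each roughly 1 GB with about 5 million rows.

Each of the 90 contains a unique ID that matches the IDs in all the other dataframes.

Two examples are: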

DF1

Year ID Value
1950 1  0.4
1950 2  0.2
1950 3  0.1
1950 4  0.8

DF2

Year ID Value
1951 1  0.9
1951 2  0.6
1951 3  0.7
1951 4  0.6

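I want to take the mean of Value across all frames where the ID matches. Because each individual file is so large, I cannot hold them all in memory. I came up with an approach, but it is very slow, and I am hoping there is a better way.

The current approach is: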

import pandas as pd
import os
import numpy as np

#list with unique ids found in all frames
uniques = np.arange(1,5000000, 1)

#loop through files
files = "C:/path_to_csvs"

#empty dataframe to store all means
final = pd.DataFrame()

for i in uniques:

    #empty dataframe to append a single matching unique ID
    single_combined = pd.DataFrame()

    for f in os.listdir(files):

        df2 = pd.read_csv(os.path.join(files, f))     

        #select rows where the id's match
        df2 = df2[(df2['ID'] == i)]

        #if there is a match, append the row
        if df2.shape[0] != 0:

             single_combined =  single_combined.append(df2)

    #groupby ID to get the means of value
    means = single_combined.groupby(['ID'])[['Value']].mean().reset_index()

    #append the mean to the final dataframe
    final = final.append(means)

print(final)

1 answer:

Answer 0 (score: 1)

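The idea: mean = sum / count. So read the files one at a time, compute the sum and count (size) of Value for each ID, and add them to a running cumulative sum and count. Once all files have been processed, the mean follows directly as mean = sum / count.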
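To make the arithmetic concrete, here is a minimal plain-Python sketch of why sums and counts are accumulated instead of per-file means. The numbers are made up for illustration and the helper name `combine` is hypothetical, not part of the original answer.

```python
# Hypothetical values of a single ID in two files (made-up numbers).
file_a = [0.4]
file_b = [0.9, 0.6]


def combine(parts):
    """Combine (sum, count) pairs from several files into one overall mean."""
    total = sum(s for s, _ in parts)
    count = sum(n for _, n in parts)
    return total / count


# One (sum, count) pair per file is all that has to be remembered.
parts = [(sum(file_a), len(file_a)), (sum(file_b), len(file_b))]
overall = combine(parts)

# Averaging the per-file means weights each file equally, which gives a
# different number whenever the files hold different numbers of rows.
mean_of_means = (sum(file_a) / len(file_a) + sum(file_b) / len(file_b)) / 2

print(overall, mean_of_means)
```

Here `overall` equals the mean of all three values taken together, while `mean_of_means` does not, because the second file contributes two rows.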

So consider the following approach:

import glob

import pandas as pd

files = glob.glob('d:/temp/.data/46307213/*.csv')

res = pd.DataFrame()

for f in files:
    res = pd.concat([res,
                     pd.read_csv(f).groupby('ID')['Value']
                       .agg(['sum', 'size'])]) \
            .groupby('ID').sum()

res['mean'] = res.pop('sum') / res.pop('size')
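If even a single 1 GB file is uncomfortable to load, the same idea can be applied to chunks of a file. The following is only a sketch under a few assumptions: the function name `mean_by_id` and the default chunk size are hypothetical, the CSVs are assumed to have `ID` and `Value` columns as in the question, and it uses `count` (which ignores NaN) where the answer above uses `size`. It also starts the running total from `None` and combines partial results with `DataFrame.add(..., fill_value=0)`, which is a swapped-in alternative to the concat-and-regroup step above.

```python
import glob

import pandas as pd


def mean_by_id(paths, chunksize=1000000):
    """Return the per-ID mean of 'Value' across all CSV files in `paths`."""
    totals = None  # running DataFrame indexed by ID, columns 'sum' and 'count'
    for path in paths:
        # usecols skips unneeded columns; chunksize bounds memory per read
        reader = pd.read_csv(path, usecols=["ID", "Value"], chunksize=chunksize)
        for chunk in reader:
            part = chunk.groupby("ID")["Value"].agg(["sum", "count"])
            # add() aligns on ID; fill_value=0 covers IDs present on one side only
            totals = part if totals is None else totals.add(part, fill_value=0)
    if totals is None:
        # no files (or no rows) were seen
        return pd.Series(dtype="float64", name="mean")
    return (totals["sum"] / totals["count"]).rename("mean")


if __name__ == "__main__":
    # Path taken from the question; adjust to wherever the CSVs live.
    print(mean_by_id(glob.glob("C:/path_to_csvs/*.csv")))
```

Each file is still read only once. The running table has one row per distinct ID, so it should stay far smaller than the input files.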

Demo:

Source CSV files:

1.csv:

Year,ID,Value
1950,1,0.4
1950,2,0.2
1950,3,0.1
1950,4,0.8

2.csv:

Year,ID,Value
1951,1,0.9
1951,2,0.6
1951,5,0.7
1951,6,0.6

3.csv:

Year,ID,Value
1952,1,0.9
1952,1,0.6
1952,5,0.7
1952,5,0.6

Result:

In [103]: %paste
import glob

files = glob.glob('d:/temp/.data/46307213/*.csv')
res = pd.DataFrame()
for f in files:
    res = pd.concat([res,
                     pd.read_csv(f).groupby('ID')['Value']
                       .agg(['sum', 'size'])]) \
            .groupby('ID').sum()
res['mean'] = res.pop('sum') / res.pop('size')
print(res)

## -- End pasted text --
        mean
ID
1   0.700000
2   0.400000
3   0.100000
4   0.800000
5   0.666667
6   0.600000

Conclusion: each file is read from disk only once.
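As a sanity check on the demo, the incremental result can be compared with the mean computed directly on the concatenated data, which is feasible here only because the three demo files are tiny. This is a sketch, not part of the original answer: the CSV contents are copied from the demo and held in memory as strings, and the variable names are hypothetical.

```python
import io

import pandas as pd

# The three demo files, held in memory as strings.
csv_texts = [
    "Year,ID,Value\n1950,1,0.4\n1950,2,0.2\n1950,3,0.1\n1950,4,0.8\n",
    "Year,ID,Value\n1951,1,0.9\n1951,2,0.6\n1951,5,0.7\n1951,6,0.6\n",
    "Year,ID,Value\n1952,1,0.9\n1952,1,0.6\n1952,5,0.7\n1952,5,0.6\n",
]

# Incremental route: accumulate per-ID sum and size, one "file" at a time.
totals = None
for text in csv_texts:
    frame = pd.read_csv(io.StringIO(text))
    part = frame.groupby("ID")["Value"].agg(["sum", "size"])
    totals = part if totals is None else totals.add(part, fill_value=0)
incremental = totals["sum"] / totals["size"]

# Direct route: load everything at once and take the mean per ID.
frames = [pd.read_csv(io.StringIO(text)) for text in csv_texts]
direct = pd.concat(frames).groupby("ID")["Value"].mean()

# Both routes should agree up to floating-point rounding.
assert ((incremental - direct).abs() < 1e-12).all()
```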