根据一列添加多个数据框

时间:2018-09-16 12:42:26

标签: python pandas dataframe average

我有几百个具有相同列名的数据框,如下所示:

df1

        wave  num  stlines     fwhm       EWs  MeasredWave  
0    4050.32    3  0.28269  0.07365  22.16080  4050.311360   
1    4208.98    5  0.48122  0.08765  44.90035  4208.972962   
2    4374.94    9  0.71483  0.11429  86.96497  4374.927110   
3    4379.74    9  0.31404  0.09107  30.44271  4379.760601   
4    4398.01   14  0.50415  0.09845  52.83236  4398.007473 
5    5520.50    1  0.06148  0.12556   8.21685  5520.484742 

df2

        wave  num  stlines     fwhm       EWs  MeasredWave  
0    4050.32    3  0.28616  0.07521  22.91064  4050.327388   
1    4208.98    6  0.48781  0.08573  44.51609  4208.990029   
2    4374.94    9  0.71548  0.11437  87.10152  4374.944513   
3    4379.74   10  0.31338  0.09098  30.34791  4379.778009   
4    4398.01   15  0.49950  0.08612  45.78707  4398.020367   
5    4502.21    9  0.56362  0.10114  60.67868  4502.223123   
6    4508.28    3  0.69554  0.11600  85.88428  4508.291777   
7    4512.99    2  0.20486  0.08891  19.38745  4512.999332
8    5520.50    1  0.06148  0.12556   8.21685  5520.484742

这就是我阅读它们的方式

path_to_files = '/home/Desktop/computed_2d/'
lst = []

for filen in dir1:
   df = pd.read_table(path_to_files+filen, skiprows=0, usecols=(0,1,2,3,4,8),names=['wave','num','stlines','fwhm','EWs','MeasredWave'],delimiter=r'\s+')

   lst.append(df)

所需的结果应如下所示:

      wave   num   stlines      fwhm        EWs  MeasredWave
0  4050.32   3.0  0.284425  0.074430  22.535720  4050.319374
1  4208.98   5.5  0.484515  0.086690  44.708220  4208.981496
2  4374.94   9.0  0.715155  0.114330  87.033245  4374.935812
3  4379.74   9.5  0.313710  0.091025  30.395310  4379.769305
4  4398.01  14.5  0.501825  0.092285  49.309715  4398.013920
5  4502.21    9   0.56362   0.10114   60.67868   4502.223123   
6  4508.28    3   0.69554   0.11600   85.88428   4508.291777   
7  4512.99    2   0.20486   0.08891   19.38745   4512.999332
8  5520.50   1.0  0.061480  0.125560  8.216850   5520.484742

如您所见,行数不相同。现在,我想获取基于column1 wave的所有数据帧的平均值,并确保将wave的列df1的每个索引添加到{的正确索引中{1}}

3 个答案:

答案 0 :(得分:1)

以下是执行所需操作的示例:

import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 2, 3],
                    'B': [0, 1, 2, 3],
                    'C': [0, 1, 2, 3],
                    'D': [0, 1, 2, 3]},
                    index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': [4, 5, 6, 7],
                    'B': [4, 5, 6, 7],
                    'C': [4, 5, 6, 7],
                    'D': [4, 5, 6, 7]},
                    index=[0, 1, 2, 3])

df3 = pd.DataFrame({'A': [8, 9, 10, 11],
                    'B': [8, 9, 10, 11],
                    'C': [8, 9, 10, 11],
                    'D': [8, 9, 10, 11]},
                    index=[0, 1, 2, 3])

df4 =  pd.concat([df1, df2, df3])
df5 = pd.concat([df1, df2, df3], ignore_index=True)
print(df4)
print('\n\n')
print(df5)

print(f"Average of column A = {df4['A'].mean()}")

您将拥有

    A   B   C   D
0   0   0   0   0
1   1   1   1   1
2   2   2   2   2
3   3   3   3   3
0   4   4   4   4
1   5   5   5   5
2   6   6   6   6
3   7   7   7   7
0   8   8   8   8
1   9   9   9   9
2  10  10  10  10
3  11  11  11  11



     A   B   C   D
0    0   0   0   0
1    1   1   1   1
2    2   2   2   2
3    3   3   3   3
4    4   4   4   4
5    5   5   5   5
6    6   6   6   6
7    7   7   7   7
8    8   8   8   8
9    9   9   9   9
10  10  10  10  10
11  11  11  11  11

Average of column A = 5.5

答案 1 :(得分:1)

您可以使用pd.concat wich axis = 1将所有数据帧堆叠为一个,并取各列的平均值

df3 = pd.merge(df1,df2,on=['wave'],how ='outer',)
df4 = df3.rename(columns = lambda x: x.split('_')[0]).T
df4.groupby(df4.index).mean().T

出局:

    EWs         MeasredWave fwhm        num stlines      wave
0   22.535720   4050.319374 0.074430    3.0 0.284425    4050.32
1   44.708220   4208.981496 0.086690    5.5 0.484515    4208.98
2   87.033245   4374.935812 0.114330    9.0 0.715155    4374.94
3   30.395310   4379.769305 0.091025    9.5 0.313710    4379.74
4   49.309715   4398.013920 0.092285    14.5 0.501825   4398.01
5   8.216850    5520.484742 0.125560    1.0 0.061480    5520.50
6   60.678680   4502.223123 0.101140    9.0 0.563620    4502.21
7   85.884280   4508.291777 0.116000    3.0 0.695540    4508.28
8   19.387450   4512.999332 0.088910    2.0 0.204860    4512.9

答案 2 :(得分:1)

@Naga Kiran的回答很棒。我在这里更新了整个解决方案:

import pandas as pd

df1 = pd.DataFrame(
  {'wave'        : [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 5520.50],
   'num'         : [3, 5, 9, 9, 14, 1],
   'stlines'     : [0.28269, 0.48122, 0.71483, 0.31404, 0.50415, 0.06148],
   'fwhm'        : [0.07365, 0.08765, 0.11429, 0.09107, 0.09845, 0.12556],
   'EWs'         : [22.16080, 44.90035, 86.96497, 30.44271, 52.83236, 8.21685],
   'MeasredWave' : [4050.311360, 4208.972962, 4374.927110, 4379.760601, 4398.007473, 5520.484742]},
   index=[0, 1, 2, 3, 4, 5])

df2 = pd.DataFrame(
  {'wave'        : [4050.32, 4208.98, 4374.94, 4379.74, 4398.01, 4502.21, 4508.28, 4512.99, 5520.50],
   'num'         : [3, 6, 9, 10, 15, 9, 3, 2, 1],
   'stlines'     : [0.28616, 0.48781, 0.71548, 0.31338, 0.49950, 0.56362, 0.69554, 0.20486, 0.06148],
   'fwhm'        : [0.07521, 0.08573, 0.11437, 0.09098, 0.08612, 0.10114, 0.11600, 0.08891, 0.12556],
   'EWs'         : [22.91064, 44.51609, 87.10152, 30.34791, 45.78707, 60.67868, 85.88428, 19.38745, 8.21685],
   'MeasredWave' : [4050.327388, 4208.990029, 4374.944513, 4379.778009, 4398.020367, 4502.223123, 4508.291777, 4512.999332, 5520.484742]},
   index=[0, 1, 2, 3, 4, 5, 6, 7, 8])

df3 = pd.merge(df1, df2, on='wave', how='outer')
df4 = df3.rename(columns = lambda x: x.split('_')[0]).T
df5 = df4.groupby(df4.index).mean().T
df6 = df5[['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave']]
df7 = df6.sort_values('wave', ascending = True).reset_index(drop=True)
df7