Generate averages on a dictionary of dataframes

Date: 2018-10-17 17:04:39

Tags: python pandas dataframe

I have the following pandas dataframes:

phreatic_level_l2n1_28w_df.head()
       Fecha    Hora    PORVL2N1  # the PORVLxNx column changes its name in each dataframe
0   2012-01-12  01:37:47    0.65
1   2012-01-12  02:37:45    0.65
2   2012-01-12  03:37:50    0.64
3   2012-01-12  04:37:44    0.63
4   2012-01-12  05:37:45    0.61

phreatic_level_l2n2_28w_df.head()
       Fecha    Hora    PORVL2N2  # the PORVLxNx column changes its name in each dataframe
0   2018-01-12  01:58:22    0.71
1   2018-01-12  02:58:22    0.71
2   2018-01-12  03:58:23    0.71
3   2018-01-12  04:58:23    0.71
4   2018-01-12  05:58:24    0.71

phreatic_level_l4n1_28w_df.head()
       Fecha    Hora    PORVL4N1  # the PORVLxNx column changes its name in each dataframe
0   2018-01-12  01:28:49    0.96
1   2018-01-12  02:28:49    0.96
2   2018-01-12  03:28:50    0.96
3   2018-01-12  04:28:52    0.95
4   2018-01-12  05:28:48    0.94

And so on, until there are 25 dataframes of this kind, the last one being phreatic_level_l24n2_28w_df:

.
.
.
phreatic_level_l24n2_28w_df.head()
       Fecha    Hora    PORVL24N2  # the PORVLxNx column changes its name in each dataframe
0   2018-01-12  01:07:28    1.31
1   2018-01-12  02:07:28    1.31
2   2018-01-12  03:07:29    1.31
3   2018-01-12  04:07:27    1.31
4   2018-01-12  05:07:27    1.31

Each dataframe contains data in its PORVLxNx column, with Fecha values ranging from 2018-01-12 to 2018-08-03 and many records per day:

phreatic_level_l24n2_28w_df.tail()
           Fecha      Hora  PORVL24N2
4875  2018-08-03  20:31:01       1.15
4876  2018-08-03  21:31:00       1.15
4877  2018-08-03  22:31:01       1.16
4878  2018-08-03  23:31:02       1.17
4879         NaN       NaN        NaN

My goal is to take each dataframe and generate the daily average of its PORVLxNx column, like this:

       Fecha  PORVL2N1
0 2018-01-12  0.519130
1 2018-01-13  0.138750
2 2018-01-14  0.175417
3 2018-01-15  0.111667
4 2018-01-16  0.291250

My approach is the following: I put the dataframes in a dictionary and reference each one through its name as a string key:

dfs = {
    'phreatic_level_l2n1_28w_df': phreatic_level_l2n1_28w_df,
    # FOR THE MOMENT I ONLY TEST with the first dataframe 

    # 'phreatic_level_l2n2_28w_df': phreatic_level_l2n2_28w_df,
    # 'phreatic_level_l4n1_28w_df': phreatic_level_l4n1_28w_df,
    # 'phreatic_level_l5n1_28w_df': phreatic_level_l5n1_28w_df,
    # 'phreatic_level_l6n1_28w_df': phreatic_level_l6n1_28w_df,
    # 'phreatic_level_l7n1_28w_df': phreatic_level_l7n1_28w_df,
    # 'phreatic_level_l8n1_28w_df': phreatic_level_l8n1_28w_df,
    # 'phreatic_level_l9n1_28w_df': phreatic_level_l9n1_28w_df,
    # 'phreatic_level_l10n1_28w_df': phreatic_level_l10n1_28w_df,
    # 'phreatic_level_l13n1_28w_df': phreatic_level_l13n1_28w_df,
    # 'phreatic_level_l14n1_28w_df': phreatic_level_l14n1_28w_df,
    # 'phreatic_level_l15n1_28w_df': phreatic_level_l15n1_28w_df,
    # 'phreatic_level_l16n1_28w_df': phreatic_level_l16n1_28w_df,
    # 'phreatic_level_l16n2_28w_df': phreatic_level_l16n2_28w_df,
    # 'phreatic_level_l18n1_28w_df': phreatic_level_l18n1_28w_df,
    # 'phreatic_level_l18n2_28w_df': phreatic_level_l18n2_28w_df,
    # 'phreatic_level_l18n3_28w_df': phreatic_level_l18n3_28w_df,
    # 'phreatic_level_l18n4_28w_df': phreatic_level_l18n4_28w_df,
    # 'phreatic_level_l21n1_28w_df': phreatic_level_l21n1_28w_df,
    # 'phreatic_level_l21n2_28w_df': phreatic_level_l21n2_28w_df,
    # 'phreatic_level_l21n3_28w_df': phreatic_level_l21n3_28w_df,
    # 'phreatic_level_l21n4_28w_df': phreatic_level_l21n4_28w_df,
    # 'phreatic_level_l21n5_28w_df': phreatic_level_l21n5_28w_df,
    # 'phreatic_level_l24n1_28w_df': phreatic_level_l24n1_28w_df,
    # 'phreatic_level_l24n2_28w_df': phreatic_level_l24n2_28w_df  

}

I iterate over the dataframes (for the moment I am only testing with the first one, phreatic_level_l2n1_28w_df):

for name, df in dfs.items():
    # Convert the Fecha column values to datetime
    df['Fecha'] = pd.to_datetime(df['Fecha'])

    # Iterate over each PORVLxNx column
    for i in range(1,24):
        if(i==2):
            # To N1
            l2_n1_average_per_day = (df.groupby(pd.Grouper(key='Fecha', freq='D'))['PORVL{}N{}'.format(i,i-1)].mean().reset_index())
            l2_n1_average_per_day.to_csv('L{}N{}_average_per-day.csv'.format(i,i-1), sep=',', header=True, index=False)
            print(l2_n1_average_per_day.head()) 

The output of my l2_n1_average_per_day is:

l2_n1_average_per_day.head()

    Fecha  PORVL2N1
0 2018-01-12  0.519130
1 2018-01-13  0.138750
2 2018-01-14  0.175417
3 2018-01-15  0.111667
4 2018-01-16  0.291250

l2_n1_average_per_day.tail()

        Fecha  PORVL2N1
199 2018-07-30  0.630417
200 2018-07-31  0.609583
201 2018-08-01  0.533333
202 2018-08-02  0.470833
203 2018-08-03  0.713333

Up to here, my idea works.

But when I want to apply this solution (quite possibly not the best one) to the other dataframes contained in my dfs dictionary, I iterate again, this time with a second dataframe added to the dictionary:

dfs = {
    'phreatic_level_l2n1_28w_df': phreatic_level_l2n1_28w_df,
    'phreatic_level_l2n2_28w_df': phreatic_level_l2n2_28w_df,  # I've added the L2N2 phreatic_level_l2n2_28w_df dataframe item
}

for name, df in dfs.items():
    df['Fecha'] = pd.to_datetime(df['Fecha'])
    for i in range(1,24):
        if(i==2):
            # To N1
            l2_n1_average_per_day = (df.groupby(pd.Grouper(key='Fecha', freq='D'))['PORVL{}N{}'.format(i,i-1)].mean().reset_index())
            l2_n1_average_per_day.to_csv('L{}N{}_average_per-day.csv'.format(i,i-1), sep=',', header=True, index=False)

            # To N2. I generate the average per day for L2N2

            l2_n2_average_per_day = (df.groupby(pd.Grouper(key='Fecha', freq='D'))['PORVL{}N{}'.format(i,i)].mean().reset_index())
            l2_n2_average_per_day.to_csv('L{}N{}_average_per-day.csv'.format(i,i), sep=',', header=True, index=False)

In my output, the PORVL2N2 column is not found, which is strange, because the dataframes in the dictionary I am iterating over do have a PORVL2N2 column:

----------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-161-fbe6eaf8a824> in <module>()
     11             print(phreatic_level_l2_n1_average_per_day.tail())
     12             # To N2
---> 13             phreatic_level_l2_n2_average_per_day = (df.groupby(pd.Grouper(key='Fecha', freq='D'))['PORVL{}N{}'.format(i,i)].mean().reset_index())
     14             phreatic_level_l2_n2_average_per_day.to_csv('L{}N{}_average_per-day.csv'.format(i,i), sep=',', header=True, index=False)
     15 

~/anaconda3/envs/sioma/lib/python3.6/site-packages/pandas/core/base.py in __getitem__(self, key)
    265         else:
    266             if key not in self.obj:
--> 267                 raise KeyError("Column not found: {key}".format(key=key))
    268             return self._gotitem(key, ndim=1)
    269 

KeyError: 'Column not found: PORVL2N2'

Is it possible that the dataframes are being overwritten during my iteration, or is something else going on?
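
To double-check this, a quick diagnostic (just a sketch over the dfs dictionary defined above) is to print the columns that each dataframe actually has before grouping:

for name, df in dfs.items():
    # list the columns of every dataframe in the dictionary,
    # to see whether PORVL2N2 is really present before grouping
    print(name, list(df.columns))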

2 answers:

Answer 0 (score: 2):

Your dataframes seem to have a nice and consistent structure, so what you can do is use df.columns[-1] to get the name of the PORVLxNy column you want to take the mean of (it is always the last column). Then, to save the result to a csv file with the right name, you just keep the last 4 characters of the column name:

for name, df in dfs.items():
    df['Fecha'] = pd.to_datetime(df['Fecha'])
    col = df.columns[-1]  # here col = PORVLxNx, with the right x depending on df
    # no need for the inner loop anymore
    lx_ny_average_per_day = (df.groupby(pd.Grouper(key='Fecha', freq='D'))[col]
                               .mean().reset_index())
    lx_ny_average_per_day.to_csv('{}_average_per-day.csv'.format(col[-4:]),
                                 sep=',', header=True, index=False)
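
For illustration, here is a minimal, self-contained sketch of this approach on two small made-up dataframes (the sample data and variable names below are invented for the example, not taken from the question):

import pandas as pd

# hypothetical sample data, only to illustrate the df.columns[-1] idea
df_l2n1 = pd.DataFrame({
    'Fecha': ['2018-01-12', '2018-01-12', '2018-01-13'],
    'Hora':  ['01:37:47', '02:37:45', '01:40:02'],
    'PORVL2N1': [0.65, 0.63, 0.61],
})
df_l2n2 = pd.DataFrame({
    'Fecha': ['2018-01-12', '2018-01-12', '2018-01-13'],
    'Hora':  ['01:58:22', '02:58:22', '01:59:01'],
    'PORVL2N2': [0.71, 0.71, 0.70],
})

dfs = {'phreatic_level_l2n1_28w_df': df_l2n1,
       'phreatic_level_l2n2_28w_df': df_l2n2}

for name, df in dfs.items():
    df['Fecha'] = pd.to_datetime(df['Fecha'])
    col = df.columns[-1]                      # PORVL2N1, PORVL2N2, ...
    daily = (df.groupby(pd.Grouper(key='Fecha', freq='D'))[col]
               .mean().reset_index())
    print(daily)                              # one row of averages per day and dataframe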


Answer 1 (score: 2):

I agree with @Ben.T on indexing with the last entry of the dataframe's columns, df.columns[-1], under the premise that the structure of your dataframes is suited to it.
If not, another way is to index with the corresponding substring of the dictionary key:

'PORV{}'.format(name.split('_')[2].upper())

or simply

'PORV' + name.split('_')[2].upper()

However, IMO the groupby part can also be simplified if you extract the right column as a Series with the date (Fecha) as its index. That lets you use the resample function, which groups time-based data exactly the way you want:

sr = df.set_index('Fecha')['PORVL2N1']  # for indexing, the same as above applies again here
sr.index = pd.to_datetime(sr.index)
avg_per_day = sr.resample('D').mean()
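
Combining this with the key-based column lookup above, a loop over the whole dictionary could look roughly like this (a sketch, assuming every key follows the phreatic_level_lXnY_28w_df naming pattern used in the question):

for name, df in dfs.items():
    # derive the column name from the dictionary key,
    # e.g. 'phreatic_level_l2n1_28w_df' -> 'PORVL2N1'
    col = 'PORV' + name.split('_')[2].upper()

    sr = df.set_index('Fecha')[col]
    sr.index = pd.to_datetime(sr.index)

    # resample to one value per day and write the result to a csv file
    avg_per_day = sr.resample('D').mean()
    avg_per_day.reset_index().to_csv('{}_average_per-day.csv'.format(col),
                                     sep=',', header=True, index=False)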
