Conditional mean and sum of the previous N rows in a pandas DataFrame

Time: 2017-10-13 05:51:06

Tags: python pandas dataframe rolling-computation rolling-sum

Consider this example pandas DataFrame:

      Measurement  Trigger  Valid
   0          2.0    False   True
   1          4.0    False   True
   2          3.0    False   True
   3          0.0     True  False
   4        100.0    False   True
   5          3.0    False   True
   6          2.0    False   True
   7          1.0     True   True

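Whenever Trigger is True, I want to compute the sum and mean of the last 3 valid measurements (counting back from the current row, inclusive). A measurement is considered valid if the Valid column is True. Two examples from the DataFrame above to clarify: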

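  1. Index 3: indices 2, 1, 0 should be used. Expected Sum = 9.0, Mean = 3.0
  2. Index 7: indices 7, 6, 5 should be used. Expected Sum = 6.0, Mean = 2.0

I have tried pandas.rolling and creating new shifted columns, but without success. See the following excerpt from my tests (it should run as-is):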

    import unittest
    import pandas as pd
    import numpy as np
    from pandas.testing import assert_series_equal  # pandas.util.testing was removed in pandas 2.0
    
    def create_sample_dataframe_2():
        df = pd.DataFrame(
            {"Measurement" : [2.0,   4.0,   3.0,   0.0,   100.0, 3.0,   2.0,   1.0 ],
             "Valid"       : [True,  True,  True,  False, True,  True,  True,  True],
             "Trigger"     : [False, False, False, True,  False, False, False, True],
             })
        return df
    
    def expected_result():
        return pd.DataFrame({"Sum" : [np.nan, np.nan, np.nan, 9.0, np.nan, np.nan, np.nan, 6.0],
                             "Mean" :[np.nan, np.nan, np.nan, 3.0, np.nan, np.nan, np.nan, 2.0]})
    
    class Data_Preparation_Functions(unittest.TestCase):
    
        def test_backsummation(self):
            N_SUMMANDS = 3
            temp_vars = []
    
            df = create_sample_dataframe_2()
            for i in range(0,N_SUMMANDS):
                temp_var = "M_{0}".format(i)
                df[temp_var] = df["Measurement"].shift(i)
                temp_vars.append(temp_var)
    
            df["Sum"]  = df[temp_vars].sum(axis=1)
            df["Mean"] = df[temp_vars].mean(axis=1)
            df.loc[(df["Trigger"]==False), "Sum"] = np.nan
            df.loc[(df["Trigger"]==False), "Mean"] = np.nan
    
            assert_series_equal(expected_result()["Sum"],df["Sum"])
            assert_series_equal(expected_result()["Mean"],df["Mean"])
    
        def test_rolling(self):
            df = create_sample_dataframe_2()
            df["Sum"]  = df[(df["Valid"] == True)]["Measurement"].rolling(window=3).sum()
            df["Mean"] = df[(df["Valid"] == True)]["Measurement"].rolling(window=3).mean()
    
            df.loc[(df["Trigger"]==False), "Sum"] = np.nan
            df.loc[(df["Trigger"]==False), "Mean"] = np.nan
            assert_series_equal(expected_result()["Sum"],df["Sum"])
            assert_series_equal(expected_result()["Mean"],df["Mean"])
    
    
    if __name__ == '__main__':
        suite = unittest.TestLoader().loadTestsFromTestCase(Data_Preparation_Functions)
        unittest.TextTestRunner(verbosity=2).run(suite)
    

Any help or solution is much appreciated. Thanks and cheers!

Edit: to clarify, this is the resulting DataFrame I expect:

          Measurement  Trigger  Valid   Sum   Mean
       0          2.0    False   True   NaN    NaN
       1          4.0    False   True   NaN    NaN
       2          3.0    False   True   NaN    NaN
       3          0.0     True  False   9.0    3.0
       4        100.0    False   True   NaN    NaN
       5          3.0    False   True   NaN    NaN
       6          2.0    False   True   NaN    NaN
       7          1.0     True   True   6.0    2.0
    

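EDIT2: Another clarification:

I did not miscalculate; rather, I did not state my intent as clearly as I could have. Here is another attempt using the same DataFrame:

[Image: desired DataFrame, relevant fields highlighted]

Let's first look at the Trigger column: we find the first True at index 3 (green rectangle). So index 3 is the point where we start looking. There is no valid measurement at index 3 (the Valid column is False; red rectangle). So we go back in time until we have accumulated three rows where Valid is True. This happens for indices 2, 1 and 0. For these three indices, we compute the sum and mean of the Measurement column (blue rectangle):

• SUM: 2.0 + 4.0 + 3.0 = 9.0
• MEAN: (2.0 + 4.0 + 3.0) / 3 = 3.0

Now we start the next iteration of this little algorithm: we look for the next True in the Trigger column and find it at index 7 (green rectangle). At index 7 there is also a valid measurement, so this time we include it. For our calculation we use indices 7, 6 and 5 (green rectangle), and thus get:

• SUM: 1.0 + 2.0 + 3.0 = 6.0
• MEAN: (1.0 + 2.0 + 3.0) / 3 = 2.0

I hope this sheds some light on this little problem.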
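For reference, the procedure described above can be written as a plain loop. This is only a slow sketch: the helper name `last_n_valid_stats` is made up for illustration, and leaving NaN when fewer than `n` valid rows are available is an assumption, since that case is not specified above.

```python
import numpy as np
import pandas as pd

def last_n_valid_stats(df, n=3):
    """Hypothetical helper: for each Trigger row, walk backwards (including
    the row itself) and aggregate the first n rows where Valid is True."""
    measurement = df["Measurement"].to_numpy()
    valid = df["Valid"].to_numpy()
    trigger = df["Trigger"].to_numpy()

    sums = np.full(len(df), np.nan)
    means = np.full(len(df), np.nan)

    for pos in np.flatnonzero(trigger):
        picked = []
        for j in range(pos, -1, -1):      # go back in time from the trigger row
            if valid[j]:
                picked.append(measurement[j])
                if len(picked) == n:
                    break
        if len(picked) == n:              # assumption: too few valid rows -> NaN
            sums[pos] = sum(picked)
            means[pos] = sums[pos] / n

    return pd.DataFrame({"Sum": sums, "Mean": means}, index=df.index)

df = pd.DataFrame({
    "Measurement": [2.0, 4.0, 3.0, 0.0, 100.0, 3.0, 2.0, 1.0],
    "Valid":       [True, True, True, False, True, True, True, True],
    "Trigger":     [False, False, False, True, False, False, False, True],
})
result = last_n_valid_stats(df, n=3)
print(result)
```

What I am looking for is a vectorized pandas equivalent of this loop.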

1 answer:

Answer 0 (score: 4)

Here's an option: take the 3-period rolling mean and sum

df['RollM'] = df.Measurement.rolling(window=3,min_periods=0).mean()

df['RollS'] = df.Measurement.rolling(window=3,min_periods=0).sum()

Now set the rows where Trigger is False to NaN

df.loc[df.Trigger == False,['RollS','RollM']] = np.nan

Output

   Measurement  Trigger  Valid     RollM  RollS
0          2.0    False   True       NaN    NaN
1          4.0    False   True       NaN    NaN
2          3.0    False   True       NaN    NaN
3          0.0     True  False  2.333333    7.0
4        100.0    False   True       NaN    NaN
5          3.0    False   True       NaN    NaN
6          2.0    False   True       NaN    NaN
7          1.0     True   True  2.000000    6.0

Edit: updated to take the Valid column into account

df['mean'],df['sum'] = np.nan,np.nan

roller = df.Measurement.rolling(window=3,min_periods=0).agg(['mean','sum'])

df.loc[(df.Trigger == True) & (df.Valid == True),['mean','sum']] = roller

df.loc[(df.Trigger == True) & (df.Valid == False),['mean','sum']] = roller.shift(1)

Output

   Measurement  Trigger  Valid  mean  sum
0          2.0    False   True   NaN  NaN
1          4.0    False   True   NaN  NaN
2          3.0    False   True   NaN  NaN
3          0.0     True  False   3.0  9.0
4        100.0    False   True   NaN  NaN
5          3.0    False   True   NaN  NaN
6          2.0    False   True   NaN  NaN
7          1.0     True   True   2.0  6.0
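
The shifted roller above handles the case where the trigger row itself is the only invalid row in the window. If invalid rows can appear anywhere in the look-back window, one possible variant (a sketch, not tested beyond the sample data) is to roll over the valid rows only and then forward-fill the result onto the full index, so that an invalid trigger row picks up the window ending at the last valid row before it. The names `N`, `Sum` and `Mean` are taken from the question; leaving NaN when fewer than `N` valid rows exist is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Measurement": [2.0, 4.0, 3.0, 0.0, 100.0, 3.0, 2.0, 1.0],
    "Valid":       [True, True, True, False, True, True, True, True],
    "Trigger":     [False, False, False, True, False, False, False, True],
})

N = 3  # number of valid measurements to aggregate

# Roll over valid rows only, so invalid measurements never enter a window.
valid_measurements = df.loc[df["Valid"], "Measurement"]
stats = valid_measurements.rolling(window=N).agg(["sum", "mean"])

# Bring the result back to the full index; an invalid row inherits the
# window that ended at the most recent valid row before it.
stats = stats.reindex(df.index).ffill()

# Keep values only on trigger rows, NaN everywhere else.
df["Sum"] = stats["sum"].where(df["Trigger"])
df["Mean"] = stats["mean"].where(df["Trigger"])
print(df)
```

On the sample data this gives Sum = 9.0, Mean = 3.0 at index 3 and Sum = 6.0, Mean = 2.0 at index 7, which matches the expected result in the question.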