Combining two or more lambda functions into one

Time: 2019-06-08 10:26:15

Tags: python-3.x pandas dataframe lambda

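The code below computes moving averages of two variables over rows filtered by several conditions (DATE_G, ID1_G, ID_C_T). There are many blocks like this. Can they be combined into a single operation? Since the filtering is identical in each block, doing so should speed up the computation.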

df['RES1_2Y'] = df.apply(
    lambda x: (df.loc[
        (
            (df.DATE_G < x.DATE_G)
            & (df.DATE_G >= (x.DATE_G + pd.DateOffset(days=-730)))
            & (df.ID1_G == x.ID1_G)
            & (df.ID_C_T == x.ID_C_T)
        ),
        "RES",
    ].mean()) if x.DATE_G > startdate else x.RES1_2Y,
    axis=1,
)


df['C1_2Y'] = df.apply(
    lambda x: (df.loc[
        (
            (df.DATE_G < x.DATE_G)
            & (df.DATE_G >= (x.DATE_G + pd.DateOffset(days=-730)))
            & (df.ID1_G == x.ID1_G)
            & (df.ID_C_T == x.ID_C_T)
        ),
        "S1",
    ].mean()) if x.DATE_G > startdate else x.C1_2Y,
    axis=1,
)

Resulting table (startdate = 31.12.2018):

 DATE_G     ID1_G   ID_C_T      RES     S1      RES1_2Y     C1_2Y
01.01.2019      1       1       1       5               
01.01.2019      2       2       1       6               
01.01.2019      1       1       1       7       1.00        5.00
02.01.2019      2       2       0       5       1.00        6.00
03.01.2019      1       1       0       4       1.00        6.00
04.01.2019      2       2       1       6       0.50        5.50
04.01.2019      1       1       0       4       0.67        5.33
04.01.2019      2       2       1       6       0.67        5.67
05.01.2019      12      3       1       8               
06.01.2019      1       1       0       6       0.50        5.00
07.01.2019      2       2       0       5       0.75        5.75
08.01.2019      1       3       1       4               
09.01.2019      2       1       0       5               
10.01.2019      2       2       1       3       0.60        5.60
10.01.2019      2       3       0       5               
10.01.2019      2       1       0       6       0.00        5.00
10.01.2019      2       2       0       3       0.67        5.17

3 answers:

Answer 0 (score: 2)

Here is a direct answer to your question (with one minor optimization: the date comparison against startdate is moved out of the lambda function).

df_to_update = df[df.DATE_G > startdate].apply(
    lambda x: (df.loc[
        (
            (df.DATE_G < x.DATE_G)
            & (df.DATE_G >= (x.DATE_G + pd.DateOffset(days=-730)))
            & (df.ID1_G == x.ID1_G)
            & (df.ID_C_T == x.ID_C_T)
        ),
        ["RES", "S1"],
    ].mean()),
    axis=1,
)

df_to_update.columns = ["RES1_2Y", "C1_2Y"]
df.update(df_to_update)
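
One detail of the final step is worth illustrating: `DataFrame.update` aligns on the index and only copies non-NA values. So a row whose look-back window is empty (mean is NaN) keeps its previous value, whereas the code in the question would overwrite it with NaN. The frame and the `patch` name below are made up for the illustration.

```python
import pandas as pd

# Existing values for the two target columns (invented sample data).
df = pd.DataFrame({"RES1_2Y": [0.1, 0.2, 0.3], "C1_2Y": [1.0, 2.0, 3.0]})

# New values for rows 1 and 2 only; one of them is NaN (empty window).
patch = pd.DataFrame(
    {"RES1_2Y": [9.0, float("nan")], "C1_2Y": [8.0, 7.0]},
    index=[1, 2],
)

# In-place update: aligned on index labels, NaN entries in `patch` are skipped.
df.update(patch)
print(df)
```

Row 0 is untouched, row 1 receives both new values, and row 2 keeps its old RES1_2Y because the patch holds NaN there. If NaN should overwrite the old value instead, a plain assignment such as `df.loc[patch.index, patch.columns] = patch` is one alternative.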

Answer 1 (score: 1)

Does this help? You need to replace "# calculate your value here" with your own logic.

i >= arr1.length
// and
j >= arr2.length
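
One possible reading of this suggestion is a single named function that builds the filter once per row and returns every statistic as a `pd.Series`, so that `apply` yields one column per statistic. The sketch below is only an illustration under that assumption: the function name `window_stats`, the `window` variable and the sample data are hypothetical and not taken from the answer.

```python
import pandas as pd

startdate = pd.Timestamp("2018-12-31")
window = pd.Timedelta(days=730)

# Invented sample data with the column names used in the question.
df = pd.DataFrame({
    "DATE_G": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04"]),
    "ID1_G": [1, 1, 1, 2],
    "ID_C_T": [1, 1, 1, 2],
    "RES": [1.0, 0.0, 1.0, 1.0],
    "S1": [5.0, 7.0, 4.0, 6.0],
})


def window_stats(row):
    """Build the shared filter once, then compute every statistic from it."""
    mask = (
        (df["DATE_G"] < row["DATE_G"])
        & (df["DATE_G"] >= row["DATE_G"] - window)
        & (df["ID1_G"] == row["ID1_G"])
        & (df["ID_C_T"] == row["ID_C_T"])
    )
    past = df.loc[mask]
    # Calculate your values here; add more entries as needed.
    return pd.Series({"RES1_2Y": past["RES"].mean(), "C1_2Y": past["S1"].mean()})


# One pass over the rows after startdate produces both columns.
stats = df[df["DATE_G"] > startdate].apply(window_stats, axis=1)
df = df.join(stats)
print(df)
```

This still scans the whole frame once per row, so it removes the duplicated filtering but not the quadratic cost.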

Answer 2 (score: 0)

Here is another way to solve the problem, using groupby and rolling (it should be more efficient on large dataframes).

start_date = pd.Timestamp("2018-12-31")
window_size = pd.offsets.Day(730)

group_cols = ["ID1_G", "ID_C_T", "DATE_G"]
dfg = df[df["DATE_G"] >= (start_date - window_size)].groupby(group_cols).agg({
   "DATE_G": "size", "RES": "sum", "S1": "sum"
})
dfg.columns = ["num_units", "RES_sum", "S1_sum"]  # Rename column names for clarity
dfg["date"] = dfg.index.get_level_values("DATE_G") # Repeat date values as a column for the rolling function

# Group by "ID1_G" and "ID_C_T", then compute time window statistics for each group
dfg_summary = dfg.groupby(["ID1_G", "ID_C_T"]).apply(
   lambda g: g.rolling(window_size, on="date", closed="left").sum()
)

# Compute rolling mean based on rolling sums and total number of units
dfg_summary = dfg_summary[["RES_sum", "S1_sum"]].div(dfg_summary["num_units"], axis=0)

# Join output with the original dataframe
df_to_update = df.join(dfg_summary, on=group_cols, how="inner")[["RES_sum", "S1_sum"]]

# Update the original dataframe
df_to_update.columns = ["RES1_2Y", "C1_2Y"]
df.update(df_to_update)

Side note: this solution would be simpler if pandas' time-based rolling statistics had better support for duplicate timestamps (see this issue).
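
The aggregate-then-roll idea can be seen in isolation on a single group. The following is a minimal sketch with invented sample data and hypothetical column names (`n`, `RES_sum`, `S1_sum`): duplicate dates are first collapsed into per-day sums and counts so that the timestamps become unique, then a left-closed time window sums them, and the mean is recovered as sum divided by count.

```python
import pandas as pd

# One (ID1_G, ID_C_T) group, with a duplicated date (invented sample data).
df = pd.DataFrame({
    "DATE_G": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-03", "2019-01-04"]),
    "RES": [1, 0, 1, 0],
    "S1": [5, 7, 4, 6],
})

# Step 1: one row per date, holding the row count and the per-day sums.
daily = df.groupby("DATE_G").agg(
    n=("RES", "size"), RES_sum=("RES", "sum"), S1_sum=("S1", "sum")
)

# Step 2: sum over the window [date - 730 days, date); closed="left"
# excludes the current date itself.
rolled = daily.rolling("730D", closed="left").sum()

# Step 3: mean over the window = windowed sum / windowed row count.
means = rolled[["RES_sum", "S1_sum"]].div(rolled["n"], axis=0)
print(means)
```

The first date has no earlier rows, so its means are NaN. Later dates average over all rows from strictly earlier dates inside the window, which is the same filter the question applies row by row.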