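pandas: efficiently applying a function that takes the whole DataFrame as input

Time: 2018-02-09 10:37:47

Tags: python pandas time-series feature-extraction

I have a pandas DataFrame that models product purchases by date. I want to add features such as how many purchases happened yesterday, in the last week, and so on. Is there an elegant and efficient way to do this? Right now I am using a loop, which takes a lot of time.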

To get the previous day's sales and the total sales of the previous two days, I loop over the DataFrame `df` (columns `product`, `dates`, `sales`):

import pandas as pd

one_day = pd.to_timedelta(1, unit='d')
two_days = pd.to_timedelta(2, unit='d')

yesterday_sales, last_two_days_sales = [], []
for _, row in df.iterrows():
    # Sales of the same product exactly one day earlier (-1 if not exactly one match)
    yesterday_performance = df.loc[(df["product"] == row["product"]) & (df.dates == (row["dates"]-one_day)) ]
    if yesterday_performance.shape[0] == 1:
        yesterday_sales.append(yesterday_performance.sales.values[0])
    else:
        yesterday_sales.append(-1)

    # Total sales of the same product over the two preceding days (-1 if there are none)
    two_days_sales = df.loc[(df["product"] == row["product"]) & (df["dates"] >= (row["dates"]-two_days)) & (df["dates"] < (row["dates"]))]
    if two_days_sales.shape[0] >= 1:
        last_two_days_sales.append(two_days_sales.sales.sum())
    else:
        last_two_days_sales.append(-1)

df["yesterday_sales"] = yesterday_sales
df["last_two_days_sales"] = last_two_days_sales

Everything inside the loop is time-consuming, but I can't think of a better way.

1 Answer:

Answer 0: (score: 1)

I simplified your code. It is still not vectorized, but if performance is not a concern, it should be easier to maintain:

print (df)
   experiment_a  experiment_b
0  EXPT_2011_03           NaN
1           NaN  EXPT_2009_08
2           NaN  EXPT_2010_06
3  EXPT_2010_07           NaN
4           NaN  EXPT_2011_07

#[500000 rows x 2 columns]
df = pd.concat([df] * 100000, ignore_index=True)

In [41]: %timeit (df.iloc[(np.where(df['experiment_a'].isnull(), df['experiment_b'], df['experiment_a'])).argsort()])
1 loop, best of 3: 318 ms per loop

In [42]: %timeit (df.iloc[df['experiment_a'].fillna(df['experiment_b']).argsort()])
1 loop, best of 3: 335 ms per loop

In [43]: %timeit (df.iloc[df['experiment_a'].combine_first(df['experiment_b']).argsort()])
1 loop, best of 3: 333 ms per loop

In [44]: %timeit (df.iloc[df.experiment_a.where(df.experiment_a.notnull(), df.experiment_b).argsort()])
1 loop, best of 3: 342 ms per loop
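
For the lag features asked about in the question, the per-row loop can in principle be replaced by merging the frame with date-shifted copies of itself. The following is only a minimal sketch, not code from the answer above. It assumes that `dates` is a datetime64 column and that there is at most one row per (`product`, `dates`) pair; the helper name `add_lag_features` and the temporary columns `_lag1` and `_lag2` are made up for illustration.

```python
import pandas as pd


def add_lag_features(df):
    """Add yesterday_sales and last_two_days_sales without a Python-level loop.

    Assumption: at most one row per (product, dates) pair. The merges below
    raise a MergeError (via validate=) if that does not hold.
    """
    base = df[["product", "dates", "sales"]]

    # Shift each row's date forward so that, after merging on (product, dates),
    # a row dated t lines up with the sales recorded on t-1 and t-2.
    lag1 = base.assign(dates=base["dates"] + pd.Timedelta(days=1)).rename(
        columns={"sales": "_lag1"}
    )
    lag2 = base.assign(dates=base["dates"] + pd.Timedelta(days=2)).rename(
        columns={"sales": "_lag2"}
    )

    keys = ["product", "dates"]
    out = df.merge(lag1, on=keys, how="left", validate="many_to_one")
    out = out.merge(lag2, on=keys, how="left", validate="many_to_one")

    # Mirror the loop's convention: -1 where no matching earlier row exists.
    out["yesterday_sales"] = out["_lag1"].fillna(-1)
    # min_count=1 keeps NaN when both lags are missing, so fillna can mark it.
    two_day_total = out[["_lag1", "_lag2"]].sum(axis=1, min_count=1)
    out["last_two_days_sales"] = two_day_total.fillna(-1)

    return out.drop(columns=["_lag1", "_lag2"])


# Small made-up example (not the data from the question).
df = pd.DataFrame(
    {
        "product": ["A", "A", "A", "A", "B"],
        "dates": pd.to_datetime(
            ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-05", "2018-01-02"]
        ),
        "sales": [10, 20, 30, 5, 7],
    }
)
result = add_lag_features(df)
print(result)
```

The two new columns come back as floats because missing matches pass through NaN before being filled with -1, and the left merge returns a fresh 0..n-1 index while keeping the original row order. For longer windows such as "last week", one merge per day gets unwieldy; a time-based rolling window per product is probably the better fit there.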