Question

I have this DataFrame:

import numpy as np
import pandas as pd

a = [1, 2, 3, 4, 5]
b = ['2019-08-01', '2019-09-01', '2019-10-23', '2019-11-12', '2019-11-30']
c = [12, 0, 0, 0, 0]
d = [0, 23, 0, 0, 0]
e = [12, 24, 35, 0, 0]
f = [0, 0, 44, 56, 82]
g = [21, 22, 17, 75, 63]

df = pd.DataFrame({'ID': a, 'Date': b, 'Unit_sold_8': c,
                   'Unit_sold_9': d, 'Unit_sold_10': e, 'Unit_sold_11': f,
                   'Unit_sold_12': g})
df['Date'] = pd.to_datetime(df['Date'])

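I want to calculate the average units sold for each ID based on its date. For example, if an ID's opening date is in September, the average for that ID should be taken from September onward. I tried np.select, but I realized that this approach makes my code very long.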

col = df.columns

mask1 = (df['Date'] >= "08/01/2019") & (df['Date'] < "09/01/2019")
mask2 = (df['Date'] >= "09/01/2019") & (df['Date'] < "10/01/2019")
mask3 = (df['Date'] >= "10/01/2019") & (df['Date'] < "11/01/2019")
mask4 = (df['Date'] >= "11/01/2019") & (df['Date'] < "12/01/2019")
mask5 = (df['Date'] >= "12/01/2019")

condition2 = [mask1, mask2, mask3, mask4, mask5]
result2 = [df[col[2:]].mean(skipna = True, axis = 1),
          df[col[3:]].mean(skipna = True, axis = 1),
          df[col[4:]].mean(skipna = True, axis = 1),
          df[col[5:]].mean(skipna = True, axis = 1),
          df[col[6:]].mean(skipna = True, axis = 1)]
df.loc[:, 'Mean'] = np.select(condition2, result2, default = np.nan)

Is there a faster way to solve this, especially when the time range grows (12 months, 24 months, etc.)?

Answer 1

Does this help?

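from datetime import datetime
from dateutil import relativedelta

# Fixed reference date, chosen so that it reproduces the output shown below.
# With datetime.today() the value of n_months depends on the day the script is run.
check_date = datetime(2020, 1, 31)
# .months is only the months component of the difference (whole years are dropped),
# so this works only for spans shorter than 12 months.
df['n_months'] = df['Date'].apply(lambda x: relativedelta.relativedelta(check_date, x).months)
# Sum the Unit_sold_* columns (position 2 up to, but excluding, n_months).
# Months before the opening date contribute nothing because they are 0 in this data.
df['total'] = df.iloc[:, 2:df.shape[1] - 1].sum(axis=1)
df['avg'] = df['total'] / df['n_months']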

print(df)

   ID       Date  Unit_sold_8  ...  n_months  total    avg
0   1 2019-08-01           12  ...         5     45   9.00
1   2 2019-09-01            0  ...         4     69  17.25
2   3 2019-10-23            0  ...         3     96  32.00
3   4 2019-11-12            0  ...         2    131  65.50
4   5 2019-11-30            0  ...         2    145  72.50
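
The month count above depends on the reference date, so here is a minimal sketch that derives it from the calendar months alone. It assumes that the last sales column, Unit_sold_12, stands for December 2019; the names last_year, last_month and unit_cols are introduced only for this illustration. Like the code above, it sums every Unit_sold_* column, which is only correct because the months before the opening date are 0 in this data.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Date': pd.to_datetime(['2019-08-01', '2019-09-01', '2019-10-23',
                            '2019-11-12', '2019-11-30']),
    'Unit_sold_8': [12, 0, 0, 0, 0],
    'Unit_sold_9': [0, 23, 0, 0, 0],
    'Unit_sold_10': [12, 24, 35, 0, 0],
    'Unit_sold_11': [0, 0, 44, 56, 82],
    'Unit_sold_12': [21, 22, 17, 75, 63],
})

# Assumption: the last sales column (Unit_sold_12) is December 2019.
last_year, last_month = 2019, 12

# Whole calendar months from the opening month through the last month, inclusive.
df['n_months'] = ((last_year - df['Date'].dt.year) * 12
                  + (last_month - df['Date'].dt.month) + 1)

# Select the sales columns by name instead of by position.
unit_cols = [c for c in df.columns if c.startswith('Unit_sold_')]
df['total'] = df[unit_cols].sum(axis=1)
df['avg'] = df['total'] / df['n_months']

print(df[['ID', 'n_months', 'total', 'avg']])
```

Because the year difference is part of the calculation, the same expression should keep working when the range grows past 12 months, provided last_year and last_month are set to the final sales month.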

Answer 2

# Run this on the original df (before any helper columns are added).
M = (
    df
    # melt the data so that each Unit_sold_* column becomes a row
    .melt(id_vars=['ID', 'Date'])
    # temporary columns: the opening month from Date, and the month
    # number taken from the end of the column name
    .assign(Mth=lambda x: x['Date'].dt.month,
            oda_detail=lambda x: x['variable'].str.split('_').str[-1])
    .sort_values(['ID', 'Mth'])
    # keep only rows where the opening month is less than or equal to the sales month
    .loc[lambda x: x['Mth'].astype(int).le(x['oda_detail'].astype(int))]
    # group by ID and take the mean
    .groupby(['ID', 'Date'])['value'].mean()
    .reset_index()
    .drop(['ID', 'Date'], axis=1)
    .rename({'value': 'Mean'}, axis=1)
)

Join it back to the original DataFrame:

pd.concat([df,M],axis=1)

   ID       Date  Unit_sold_8  Unit_sold_9  Unit_sold_10  Unit_sold_11  Unit_sold_12   Mean
0   1 2019-08-01           12            0            12             0            21   9.00
1   2 2019-09-01            0           23            24             0            22  17.25
2   3 2019-10-23            0            0            35            44            17  32.00
3   4 2019-11-12            0            0             0            56            75  65.50
4   5 2019-11-30            0            0             0            82            63  72.50
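
As a possible alternative that avoids reshaping the data, the same "keep only the months on or after the opening month" rule can be written as a boolean mask built with NumPy broadcasting. This is only a sketch under the assumption that every sales column is named Unit_sold_<month number> and that all months fall in the same year; unit_cols, col_months, open_months and mask are names made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Date': pd.to_datetime(['2019-08-01', '2019-09-01', '2019-10-23',
                            '2019-11-12', '2019-11-30']),
    'Unit_sold_8': [12, 0, 0, 0, 0],
    'Unit_sold_9': [0, 23, 0, 0, 0],
    'Unit_sold_10': [12, 24, 35, 0, 0],
    'Unit_sold_11': [0, 0, 44, 56, 82],
    'Unit_sold_12': [21, 22, 17, 75, 63],
})

unit_cols = [c for c in df.columns if c.startswith('Unit_sold_')]
# Month number encoded in each column name, e.g. 'Unit_sold_10' -> 10.
col_months = np.array([int(c.rsplit('_', 1)[-1]) for c in unit_cols])
# Opening month of each row.
open_months = df['Date'].dt.month.to_numpy()

# mask[i, j] is True when column j's month is on or after row i's opening month.
mask = col_months[None, :] >= open_months[:, None]

# where() keeps the values where mask is True and puts NaN elsewhere;
# mean() skips NaN by default, so earlier months do not count.
df['Mean'] = df[unit_cols].where(mask).mean(axis=1)

print(df[['ID', 'Date', 'Mean']])
```

Adding more month columns does not require more code, since the mask is derived from the column names. For a range that spans several years, the column names would need to carry the year as well, and the comparison would have to be made on year and month together.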
