Question

Suppose I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'Event': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'B', 'A', 'C'],
                    'Date': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-03-01', '2019-02-15', 
                             '2019-03-15', '2019-04-05', '2019-04-05', '2019-04-15', '2019-06-10'],
                    'Sale':[100,200,150,200,150,100,300,250,500,400]})
df['Date'] = pd.to_datetime(df['Date'])
df

Event         Date
    A   2019-01-01
    B   2019-02-01
    A   2019-03-01
    A   2019-03-01
    B   2019-02-15
    C   2019-03-15
    B   2019-04-05
    B   2019-04-05
    A   2019-04-15
    C   2019-06-10

I would like to obtain the following result:

Event         Date  Previous_Event_Count
    A   2019-01-01                     0
    B   2019-02-01                     0
    A   2019-03-01                     1
    A   2019-03-01                     1
    B   2019-02-15                     1
    C   2019-03-15                     0
    B   2019-04-05                     2
    B   2019-04-05                     2
    A   2019-04-15                     3
    C   2019-06-10                     1

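where df['Previous_Event_Count'] is the number of rows with the same event (df['Event']) that occurred before that row's date (df['Date']). For example,

the number of event A occurrences before 2019-01-01 is 0,
the number of event A occurrences before 2019-03-01 is 1, and
the number of event A occurrences before 2019-04-15 is 3.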

I can get the desired result with this line:

df['Previous_Event_Count'] = [df.loc[(df.loc[i, 'Event'] == df['Event']) & (df.loc[i, 'Date'] > df['Date']), 
                                     'Date'].count() for i in range(len(df))]

It is slow, but it works. I believe there is a better way to do this. I have tried this line:

df['Previous_Event_Count'] = df.query('Date < Date').groupby(['Event', 'Date']).cumcount()

but it produces NaNs.

Answer 1

Dates can be treated as numbers. Use groupby to get the counting logic.

'min'
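
The code for this answer is missing apart from the 'min' fragment above. One possible reading, assuming the fragment was the method argument of a grouped rank, is sketched below. With method='min', rows that share a date all receive the lowest rank in their tie, which equals one plus the number of strictly earlier rows, so subtracting 1 gives the count. This is a guess at the approach, not a recovery of the original code.

```python
import pandas as pd

df = pd.DataFrame({'Event': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'B', 'A', 'C'],
                   'Date': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-03-01', '2019-02-15',
                            '2019-03-15', '2019-04-05', '2019-04-05', '2019-04-15', '2019-06-10']})
df['Date'] = pd.to_datetime(df['Date'])

# Rank the dates within each event. Tied dates share the minimum rank,
# so rank - 1 is the number of same-event rows with a strictly earlier date.
df['Previous_Event_Count'] = (df.groupby('Event')['Date']
                                .rank(method='min')
                                .astype(int) - 1)
print(df)
```

Because ranking depends only on the date values, this sketch does not require the rows to be sorted.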

Answer 2

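First get the counts per Event and Date with GroupBy.size, then within the first index level apply shift and a cumulative sum, and finally join the result back onto the original DataFrame: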

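s = (df.groupby(['Event', 'Date'])
       .size()                                # rows per (Event, Date)
       .groupby(level=0, group_keys=False)    # per Event; group_keys=False keeps the 2-level index for the join
       .apply(lambda x: x.shift(1).cumsum())  # running total of rows on earlier dates
       .fillna(0)                             # first date of each event has no earlier rows
       .astype(int))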

df = df.join(s.rename('Previous_Event_Count'), on=['Event','Date'])
print(df)
  Event       Date  Previous_Event_Count
0     A 2019-01-01                     0
1     B 2019-02-01                     0
2     A 2019-03-01                     1
3     A 2019-03-01                     1
4     B 2019-02-15                     1
5     C 2019-03-15                     0
6     B 2019-04-05                     2
7     B 2019-04-05                     2
8     A 2019-04-15                     3
9     C 2019-06-10                     1
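
The shift followed by a cumulative sum amounts to an exclusive running total of the group sizes. An equivalent sketch that avoids apply subtracts each group's own size from the inclusive cumulative sum. The names sizes, prior, and out are illustrative and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'Event': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'B', 'A', 'C'],
                   'Date': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-03-01', '2019-02-15',
                            '2019-03-15', '2019-04-05', '2019-04-05', '2019-04-15', '2019-06-10']})
df['Date'] = pd.to_datetime(df['Date'])

# Number of rows for each (Event, Date) pair; the result is sorted by Event, then Date.
sizes = df.groupby(['Event', 'Date']).size()

# Inclusive running total per Event, minus the current group's own size,
# leaves only the rows on strictly earlier dates.
prior = sizes.groupby(level=0).cumsum() - sizes

# Map the per-(Event, Date) counts back onto the original rows.
out = df.join(prior.rename('Previous_Event_Count'), on=['Event', 'Date'])
print(out)
```

Since no shift is involved, no NaN appears and the counts stay integers without fillna or astype.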

Answer 3

I finally found a better, faster way to get the desired result, and it turns out to be quite simple. One can try:

# Assumes the rows of each Event are already in chronological order, as in the example.
df['Previous_Event_Count'] = df.groupby('Event').cumcount() \
                           - df.groupby(['Event', 'Date']).cumcount()
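
The difference of the two cumcount calls counts rows by position, so it relies on each event's rows appearing in date order, which holds for the sample data. A sketch for unsorted input follows, assuming a stable sort on Date is acceptable. The names shuffled, ordered, counts, and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Event': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'B', 'A', 'C'],
                   'Date': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-03-01', '2019-02-15',
                            '2019-03-15', '2019-04-05', '2019-04-05', '2019-04-15', '2019-06-10']})
df['Date'] = pd.to_datetime(df['Date'])

# Shuffle the rows to simulate input that is not in date order.
shuffled = df.sample(frac=1, random_state=0)

# Sort by date so that positional counting follows chronological order.
ordered = shuffled.sort_values('Date', kind='stable')

# Rows seen so far for the event, minus rows seen so far for the same event and date,
# leaves the rows of that event on strictly earlier dates.
counts = (ordered.groupby('Event').cumcount()
          - ordered.groupby(['Event', 'Date']).cumcount())

# Assignment aligns on the index, so each count lands on its own row.
shuffled['Previous_Event_Count'] = counts
result = shuffled.sort_index()
print(result)
```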

Conditional running count of all previous rows in pandas
