Question

I have a DataFrame that looks like this:

       | event_type | object_id
------ | ------     | ------
0      | A          | 1
1      | D          | 1
2      | A          | 1
3      | D          | 1
4      | A          | 2
5      | A          | 2
6      | D          | 2
7      | A          | 3
8      | D          | 3
9      | A          | 3

What I want is the index of the next row where event_type is A and the object_id is still the same. As an additional column, it would look like this:

       | event_type | object_id | next_A
------ | ------     | ------    | ------
0      | A          | 1         | 2
1      | D          | 1         | 2
2      | A          | 1         | NaN
3      | D          | 1         | NaN
4      | A          | 2         | 5
5      | A          | 2         | NaN
6      | D          | 2         | NaN
7      | A          | 3         | 9
8      | D          | 3         | 9
9      | A          | 3         | NaN

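And so on.

I want to avoid .apply() because my DataFrame is very large. Is there a vectorized way to do this?

Edit: when the same object_id has several A / D pairs, I want each row to always use the index of the next A, like this: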

       | event_type | object_id | next_A
------ | ------     | ------    | ------
0      | A          | 1         | 2
1      | D          | 1         | 2
2      | A          | 1         | 4
3      | D          | 1         | 4
4      | A          | 1         | NaN

Answer 1

You can do this with groupby:

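import pandas as pd

def populate_next_a(object_df):
    object_df = object_df.copy()  # avoid mutating the group handed over by groupby
    # Index label on 'A' rows, NaN on every other row
    object_df['a_index'] = pd.Series(object_df.index, index=object_df.index)[object_df.event_type == 'A']
    # Back-fill so each row holds the index of the nearest 'A' at or after it
    object_df['a_index'] = object_df['a_index'].bfill()
    # 'A' rows must look past themselves, so they take the value from the following row
    object_df['next_A'] = object_df['a_index'].where(object_df.event_type != 'A', object_df['a_index'].shift(-1))
    return object_df.drop('a_index', axis=1)  # drop() is not in-place, so return its result

result = df.groupby(['object_id'], group_keys=False).apply(populate_next_a)
print(result)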
  event_type  object_id  next_A
0          A          1     2.0
1          D          1     2.0
2          A          1     NaN
3          D          1     NaN
4          A          2     5.0
5          A          2     NaN
6          D          2     NaN
7          A          3     9.0
8          D          3     9.0
9          A          3     NaN

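GroupBy.apply should carry much less overhead than a plain row-wise apply, because the function is called once per object_id group rather than once per row.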
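
The same back-fill and shift idea can probably be written without a per-group Python function, using the built-in groupby bfill and shift methods. The following is only a sketch: it assumes the rows of each object_id are already in the desired order, and next_a_vectorized is a name made up for illustration.

```python
import pandas as pd

# Sample data matching the table in the question.
df = pd.DataFrame({
    "event_type": list("ADADAADADA"),
    "object_id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
})

def next_a_vectorized(frame):
    # Hypothetical helper: index label on 'A' rows, NaN everywhere else.
    a_index = frame.index.to_series().where(frame["event_type"].eq("A"))
    # Within each object, back-fill so each row sees the nearest 'A' at or after it.
    filled = a_index.groupby(frame["object_id"]).bfill()
    # Shift up by one row within each object so a row never points at itself.
    return filled.groupby(frame["object_id"]).shift(-1)

df["next_A"] = next_a_vectorized(df)
print(df)
```

Shifting every row, not only the A rows, is enough here: for a D row the back-filled value and the value from the following row refer to the same upcoming A.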

Note that you cannot (yet) store NaN in an integer column (see http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na), so the values end up as floats.
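
Later pandas releases added a nullable integer dtype, "Int64" (with a capital I), which may be an option if float indexes are inconvenient. A minimal sketch, assuming a pandas version that supports this dtype and that every non-missing value is a whole number:

```python
import pandas as pd

# A float column with missing values, like next_A above.
next_a = pd.Series([2.0, 2.0, float("nan"), float("nan"), 5.0])

# Convert to the nullable integer dtype; NaN becomes <NA>.
as_int = next_a.astype("Int64")
print(as_int)
```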
