Question

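I have a dataframe df with 3 columns: time (timestamp), id (str), and red (boolean). I want to add another boolean column that checks, for each row, whether this row or either of the next two rows in chronological order for the same id is red. (If fewer than two rows with the same id follow this row, we only consider the rows we have.)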

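What is an elegant way to do this? My approach is not elegant: I sorted by time, created an empty list called new_col, and filled it by looping over all rows with for row_number in xrange(len(df)-2)..., using iloc. I then assigned it with df['col']=new_col. This is slow and not very readable.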

Answer 1

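Assuming you first sort by timestamp, you can group by id and, for each group, shift the values of red up by one row and by two rows, then take the logical OR of the original and the two shifted series: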

 df['col'] = df.red.groupby(df.id, group_keys=False).apply(lambda g: g | g.shift(-1) | g.shift(-2))

For example:

In [100]: df = pd.DataFrame({'red': [True, True, True, False, False, True, True, True], 'id': [0] * 6 + [1] * 2})

In [101]: df.red.groupby(df.id, group_keys=False).apply(lambda g: g | g.shift(-1) | g.shift(-2))
Out[101]: 
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
Name: red, dtype: bool
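The example above contains no row where the result is False, and it skips the sorting step. Here is a minimal end-to-end sketch of the same shift-and-OR idea, assuming a reasonably recent pandas where shift accepts fill_value. The sample values are made up for illustration; only the column names come from the question. It calls shift on the grouped column directly instead of going through apply, which is a variation on the answer rather than the answer's exact code.

```python
import pandas as pd

# Hypothetical sample data: column names follow the question, values are invented.
df = pd.DataFrame({
    'time': pd.to_datetime(['2016-09-03', '2016-09-01', '2016-09-04',
                            '2016-09-02', '2016-09-02', '2016-09-01']),
    'id':   ['a', 'a', 'a', 'a', 'b', 'b'],
    'red':  [True, False, False, False, False, True],
})

# Sort so that "next row" means "next in time" within each id.
df = df.sort_values(['id', 'time']).reset_index(drop=True)

# Shift red up by one and by two rows within each id. Positions past the end
# of a group are filled with False, so short groups only use the rows they have.
grouped = df.groupby('id')['red']
df['col'] = (df['red']
             | grouped.shift(-1, fill_value=False)
             | grouped.shift(-2, fill_value=False)).astype(bool)

print(df)
```

The last row of id a stays False even though the first row of id b is red, because the shift is applied per group and never looks across ids.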

Answer 2

I agree with Ami, but I think you only want to check whether the following rows are red or not, so I would drop the first term, g, from the OR:

# df1 (original df)
#   id    red        time
# 0  1   True  2016-09-01
# 1  1   True  2016-09-02
# 2  1   True  2016-09-03
# 3  2   True  2016-09-02
# 4  3  False  2016-09-03
# 5  4  False  2016-09-04
# 6  5  False  2016-09-05

# Within each id, OR together the next row and the row after it.
# fill_value=False treats rows past the end of a group as not red.
next_red = df1.groupby('id', group_keys=False)['red'].apply(
    lambda g: g.shift(-1, fill_value=False) | g.shift(-2, fill_value=False))
df1.join(next_red.rename('next red'))

Output:

  id    red        time next red
0  1   True  2016-09-01     True
1  1   True  2016-09-02     True
2  1   True  2016-09-03    False
3  2   True  2016-09-02    False
4  3  False  2016-09-03    False
5  4  False  2016-09-04    False
6  5  False  2016-09-05    False
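The two answers differ only in whether the current row takes part in the OR. The sketch below puts both readings side by side on the df1 data above. The helper any_red_ahead and its parameters n_ahead and include_current are hypothetical names introduced for illustration, not part of pandas or of either answer, and the code assumes the frame is already sorted by time within each id.

```python
import pandas as pd

def any_red_ahead(df, n_ahead=2, include_current=True):
    """Hypothetical helper: True where any of the next n_ahead rows of the
    same id (and optionally the row itself) is red.
    Assumes df is already sorted by time within each id."""
    grouped = df.groupby('id')['red']
    if include_current:
        result = df['red'].copy()
    else:
        result = pd.Series(False, index=df.index)
    for k in range(1, n_ahead + 1):
        # Rows past the end of a group count as not red.
        result = result | grouped.shift(-k, fill_value=False)
    return result.astype(bool)

# Same data as df1 in the answer above.
df1 = pd.DataFrame({
    'id': [1, 1, 1, 2, 3, 4, 5],
    'red': [True, True, True, True, False, False, False],
    'time': pd.to_datetime(['2016-09-01', '2016-09-02', '2016-09-03',
                            '2016-09-02', '2016-09-03', '2016-09-04',
                            '2016-09-05']),
})

df1['this or next red'] = any_red_ahead(df1, include_current=True)   # Answer 1 reading
df1['next red'] = any_red_ahead(df1, include_current=False)          # Answer 2 reading
print(df1)
```

With include_current=False the result matches the next red column shown above; with include_current=True, rows 2 and 3 also become True because those rows are themselves red.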

Adding a column to a Pandas dataframe based on adjacent values of an existing column
