Question

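I want to port my old SAS code to Python using pandas DataFrames. In SAS I often use this kind of code (assume the rows are sorted by group_id, where group_id takes the values 1 to 10 and each group_id has multiple observations):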

data want;
  set have;
  by group_id;
  if first.group_id then c=1; else c=0;
run;

What happens here is that I flag the first observation for each ID: the new variable c is 1 on the first row of each group_id and 0 on every other row.

How can I do this in Python with a DataFrame? Assume I start with only the group_id vector.

Answer 1

If you are using pandas 0.13+, you can use the groupby cumcount method, which numbers the rows within each group starting from 0:

In [11]: df
Out[11]: 
   group_id
0         1
1         1
2         1
3         2
4         2
5         2
6         3
7         3
8         3

In [12]: df.groupby('group_id').cumcount() == 0
Out[12]: 
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: bool

You can cast the result to int instead of bool:

In [13]: df['c'] = (df.groupby('group_id').cumcount() == 0).astype(int)
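
For anyone who wants to run this outside an interactive session, here is a minimal self-contained sketch. The sample frame is rebuilt by hand to match the one shown above. The second column, `c_shift`, is not part of the answer above: it is an alternative I am adding for comparison, and it assumes the rows are already sorted by group_id, as the SAS BY statement requires.

```python
import pandas as pd

# Sample data matching the session above: three rows per group_id.
df = pd.DataFrame({"group_id": [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# cumcount numbers rows within each group from 0, so 0 marks the first row.
df["c"] = (df.groupby("group_id").cumcount() == 0).astype(int)

# Alternative (assumes rows are sorted by group_id): a row is "first" when
# its group_id differs from the previous row's. shift() puts NaN in the
# first position, and any value compares as not-equal to NaN.
df["c_shift"] = df["group_id"].ne(df["group_id"].shift()).astype(int)

print(df)
```

On sorted data the two columns agree. On unsorted data they can differ: cumcount flags only the first occurrence of each group_id in the whole frame, while the shift comparison flags the start of every consecutive run.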
