Question

I have a dataset like this:

Id   Status

1     0
1     0
1     0
1     0
1     1
2     0
1     0
2     0
3     0
3     0

I want to drop all rows of an ID after its status becomes 1, i.e. my new dataset will be:

Id   Status

1     0
1     0
1     0
1     0
1     1
2     0
2     0
3     0
3     0

Since my dataset is very large (over 200 GB), how can I do this efficiently?

Thanks for your help.

Answer 1

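Edit: Coming back to this question a month later, there is a much simpler way using groupby and cumsum: group by Id, take the cumulative sum of Status, and keep only the rows where no 1 has been seen before the current row. Subtracting Status from the cumulative sum keeps the row where the status first becomes 1; a plain cumsum() < 1 filter would drop that row as well:

df[df.groupby('Id')['Status'].cumsum() - df['Status'] < 1]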
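To make the cumsum idea easy to check, here is a minimal runnable sketch. The sample DataFrame is an assumption reconstructed from the tables in this thread, and the names `seen` and `result` are illustrative only. It uses the "count of 1s before the current row" form of the mask, so the row where Status first becomes 1 is retained, matching the expected output in the question.

```python
import pandas as pd

# Sample data (assumed for illustration, reconstructed from the thread).
df = pd.DataFrame({
    "Id":     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    "Status": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Running count of 1s within each Id, including the current row.
seen = df.groupby("Id")["Status"].cumsum()

# Subtracting the current Status gives the count *before* this row, so the
# first Status == 1 row is kept and every later row of that Id is dropped.
result = df[(seen - df["Status"]) < 1]
print(result)
```

Because the mask is built from a single grouped cumulative sum, no Python-level function is called per group, which tends to matter on large inputs.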

The best approach I found is to locate the index of the first 1 and slice each group up to it. If a group contains no 1, it is returned unchanged:

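def remove(group):
    # Reset to 0-based labels so the label of the first 1 is also its position.
    indexless = group.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        # Keep everything up to and including the first row with Status == 1.
        return indexless.iloc[:ones.index[0] + 1]
    else:
        # No 1 in this group: return it unchanged.
        return indexless

df.groupby('Id').apply(remove).reset_index(drop=True)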

Output:

   Id  Status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
6   2       0
7   3       0
8   3       0

Answer 2

Here's an idea:

You can build a dictionary that maps each ID to the first index at which its status is 1 (this assumes the DataFrame is sorted by ID):

# Keep only the first Status == 1 row per Id, then map Id -> its index label.
d = df.loc[df["Status"]==1].drop_duplicates(subset="Id")
d = dict(zip(d["Id"], d.index))

Then create a column that holds, for each ID, the index of its first status=1 row:

df["first"] = df["Id"].map(d)

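Finally, keep only the rows whose index is at most the value in the first column. IDs that never reach status 1 have NaN there, so they are kept explicitly:

df = df.loc[df["first"].isna() | (df.index <= df["first"])]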
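The steps of this answer can be put together as the following sketch. It assumes the sample data from the thread and a default, increasing integer index, so that comparing index labels reflects row order. The names `ones`, `first_one`, `cutoff` and `kept` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data (assumed for illustration, reconstructed from the thread).
df = pd.DataFrame({
    "Id":     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    "Status": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Index label of the first Status == 1 row for each Id.
ones = df.loc[df["Status"] == 1].drop_duplicates(subset="Id")
first_one = dict(zip(ones["Id"], ones.index))

# Ids that never reach 1 are missing from the dict, so map() yields NaN.
cutoff = df["Id"].map(first_one)

# Keep rows up to and including the cutoff; Ids without a cutoff are kept whole.
kept = df[cutoff.isna() | (df.index <= cutoff)]
print(kept)
```

Holding the cutoff in a separate Series rather than a new column leaves the original DataFrame untouched; whether that is preferable depends on how the result is used afterwards.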

Answer 3

Use groupby together with cumsum to find the rows at or after the point where the status becomes 1.

res = df.groupby('Id', group_keys=False).apply(lambda x: x[x.Status.cumsum() > 0])
res

    Id  Status
4   1   1
6   1   0

Then exclude the indices of those rows where Status==0.

not_select_id = res[res.Status==0].index

df[~df.index.isin(not_select_id)]

    Id  Status
0   1   0
1   1   0
2   1   0
3   1   0
4   1   1
5   2   0
7   2   0
8   3   0
9   3   0
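The two steps above can also be sketched without `apply`, by computing the grouped cumulative sum directly. This is a variant for illustration rather than the answer's exact code; the sample data and the names `after_first_one`, `drop_idx` and `result` are assumptions. One caveat of this approach: only the trailing rows with Status == 0 are dropped, so a second 1 appearing later within the same Id would be kept.

```python
import pandas as pd

# Sample data (assumed for illustration, reconstructed from the thread).
df = pd.DataFrame({
    "Id":     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    "Status": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# True for rows at or after the first Status == 1 within each Id.
after_first_one = df.groupby("Id")["Status"].cumsum() > 0

# Among those rows, the ones with Status == 0 are the rows to drop.
drop_idx = df[after_first_one & (df["Status"] == 0)].index

result = df[~df.index.isin(drop_idx)]
print(result)
```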

Drop rows of an ID after a specific column value in Pandas

3 answers: