Question

I have a dataset like this:

    Id   Status

    1     0
    1     0
    1     0
    1     0
    1     1
    2     0
    1     0 # --> gets removed since this row appears after id 1 already had a status of 1
    2     0
    3     0
    3     0

I want to drop all rows of an Id that come after its Status has become 1, i.e. my new dataset will be:

    Id   Status

    1     0
    1     0
    1     0
    1     0
    1     1
    2     0
    2     0
    3     0
    3     0

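Since I have a very large dataset (more than 200 GB), I would like to learn how to perform this computation efficiently.

My current solution is to find the index of the first 1 and slice each group that way. If no 1 exists, the group is returned unchanged: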

def remove(series):
    indexless = series.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        return indexless.iloc[:ones.index[0] + 1]

    else:
        return indexless

df.groupby('Id').apply(remove).reset_index(drop=True)

However, this runs very slowly. Is there any way to fix this or speed up the computation?

Answer 1

The first idea is to take a cumulative sum per group over a boolean mask. A shift is also needed so that the first 1 itself is not dropped:

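# pandas 0.24+
# group_keys=False keeps the original row index on the result so the mask
# aligns with df (recent pandas versions otherwise add Id as an index level)
s = ((df['Status'] == 1)
        .groupby(df['Id'], group_keys=False)
        .apply(lambda x: x.shift(fill_value=0).cumsum()))
# older pandas
#s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift().fillna(0).cumsum())
df = df[s == 0]
print(df)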
   Id  Status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
7   2       0
8   3       0
9   3       0
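
The same mask can also be built without `apply`, which may matter on large data. The following is only a sketch on the question's sample data, not a benchmarked solution: it uses the built-in grouped `cumsum` and subtracts the current row's flag instead of shifting. The names `flag`, `seen_inclusive`, `seen_before` and `result` are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'Id':     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    'Status': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# 1 where Status is 1, else 0 (integers keep the arithmetic unambiguous).
flag = df['Status'].eq(1).astype(int)

# Running count of 1s per Id, including the current row.
seen_inclusive = flag.groupby(df['Id']).cumsum()

# Subtracting the current row's own flag leaves the number of 1s
# strictly before it, which is what shift + cumsum computes.
seen_before = seen_inclusive - flag

# Keep rows that have no earlier 1 within their Id.
result = df[seen_before.eq(0)]
print(result)
```

Because no Python function is called per group, the work stays inside pandas' grouped operations.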

Another solution is a custom function built on Series.idxmax:

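def f(x):
    if x['new'].any():
        # idxmax returns an index label, not a position, so slice with .loc
        # (label-based, end inclusive; assumes unique index labels)
        return x.loc[:x['new'].idxmax(), :]
    else:
        return x

df1 = (df.assign(new=(df['Status'] == 1))
        .groupby(df['Id'], group_keys=False)
        .apply(f).drop('new', axis=1))
print(df1)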
   Id  Status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
7   2       0
8   3       0
9   3       0
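
A note on `Series.idxmax`, which the function above relies on: on a boolean Series it returns the index label of the first True rather than its position, and it still returns the first label when no True is present, which is why `f` checks `.any()` first. A small sketch with made-up flags (the Series `flags` and `no_ones` are hypothetical, with non-consecutive labels like a group cut out of a larger frame would have):

```python
import pandas as pd

# Hypothetical flags for one group; labels are not 0..n-1.
flags = pd.Series([False, False, True, False], index=[5, 7, 8, 10])

label = flags.idxmax()                 # index label of the first True: 8
position = flags.to_numpy().argmax()   # position of the first True: 2

# With no True at all, idxmax falls back to the first label.
no_ones = pd.Series([False, False], index=[9, 11])
fallback = no_ones.idxmax()            # 9, although nothing is True

print(label, position, fallback)
```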

Or a slightly modified version of the first solution: pick out only the groups that contain a 1 and apply the solution only to those:

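m = df['Status'].eq(1)
ids = df.loc[m, 'Id'].unique()
print(ids)
[1]

# rows whose Id has at least one Status of 1
m1 = df['Id'].isin(ids)
m2 = (m[m1].groupby(df['Id'], group_keys=False)
            .apply(lambda x: x.shift(fill_value=0).cumsum())
            .eq(0))

# rows of Ids without any 1 are missing from m2, so they default to True (kept)
df = df[m2.reindex(df.index, fill_value=True)]
print(df)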
   Id  Status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
7   2       0
8   3       0
9   3       0
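
For data that does not fit in memory, the cumulative-sum mask can in principle be applied chunk by chunk, carrying over the set of Ids that have already reached 1. The sketch below is an illustration under assumptions not stated in the question: that the data can be read sequentially in row order (a CSV is used here), and that the set of "closed" Ids fits in memory. The function `filter_chunks` and the in-memory `csv_text` stand-in are hypothetical.

```python
import io
import pandas as pd

def filter_chunks(chunks):
    """Yield each chunk with rows after an Id's first Status == 1 removed."""
    closed = set()  # Ids whose first 1 appeared in an earlier chunk
    for chunk in chunks:
        # Drop rows of Ids that were already closed by a previous chunk.
        chunk = chunk[~chunk['Id'].isin(closed)]
        flag = chunk['Status'].eq(1).astype(int)
        # Number of 1s strictly before each row, within its Id and this chunk.
        seen_before = flag.groupby(chunk['Id']).cumsum() - flag
        yield chunk[seen_before.eq(0)]
        # Remember the Ids that reached 1 in this chunk.
        closed.update(chunk.loc[flag.eq(1), 'Id'].unique().tolist())

# Stand-in for a large CSV file: the question's data, read 4 rows at a time.
csv_text = "Id,Status\n1,0\n1,0\n1,0\n1,0\n1,1\n2,0\n1,0\n2,0\n3,0\n3,0\n"
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
result = pd.concat(list(filter_chunks(reader)))
print(result)
```

With real data each filtered chunk would more likely be appended to an output file than concatenated in memory.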

Answer 2

Let's start with this dataset.

import numpy as np
import pandas as pd

l = [[1,0],[1,0],[1,0],[1,0],[1,1],[2,0],[1,0],[2,0],[2,1],[3,0],[2,0],[3,0]]
df_ = pd.DataFrame(l, columns=['id', 'status'])

We find the index of the status = 1 row for each id.

status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')

    index
id  
1   4
2   8

Now we join df_ and status_1_indice together:

join_table  = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)

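.fillna(np.inf) handles the ids that never have a status of 1: their cutoff becomes infinity, so all of their rows pass the filter. The result: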

    level_0  id  status     index
0         0   1       0  4.000000
1         1   1       0  4.000000
2         2   1       0  4.000000
3         3   1       0  4.000000
4         4   1       1  4.000000
5         5   2       0  8.000000
6         6   1       0  4.000000
7         7   2       0  8.000000
8         8   2       1  8.000000
9         9   3       0       inf
10       10   2       0  8.000000
11       11   3       0       inf

The required dataframe can then be obtained with:

join_table.query('level_0 <= index')[['id', 'status']]

All together:

status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
join_table  = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
required_df = join_table.query('level_0 <= index')[['id', 'status']]


   id   status
0   1   0
1   1   0
2   1   0
3   1   0
4   1   1
5   2   0
7   2   0
8   2   1
9   3   0
11  3   0

I can't vouch for the performance, but this is more straightforward than the method in the question.
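
As a variation on the same idea, the per-id cutoff can be looked up with `Series.map` instead of a join. This is a sketch rather than a tested replacement; taking the minimum position per id is an added assumption that covers ids with more than one status = 1 row, where the join would produce duplicate rows. The names `pos`, `is_one`, `first_one` and `cutoff` are illustrative.

```python
import numpy as np
import pandas as pd

rows = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [2, 0],
        [1, 0], [2, 0], [2, 1], [3, 0], [2, 0], [3, 0]]
df_ = pd.DataFrame(rows, columns=['id', 'status'])

# Row position of every row, as a Series aligned with df_.
pos = pd.Series(np.arange(len(df_)), index=df_.index)

# Position of the first status == 1 for each id (ids without a 1 are absent).
is_one = df_['status'].eq(1)
first_one = pos[is_one].groupby(df_.loc[is_one, 'id']).min()

# Look up each row's cutoff; ids without a 1 get +inf, so all their rows pass.
cutoff = df_['id'].map(first_one).fillna(np.inf)

required_df = df_[pos <= cutoff]
print(required_df)
```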

Efficiently removing rows from a pandas DataFrame

2 answers: