Question

I have 4 different DataFrames: X, step25, step26 and step27.

X is my main DataFrame, with shape (155854, 4). The other 3 DataFrames are created from X as follows:

X = data.loc[:, ['ContextID', 'BacksGas_Flow_sccm', 'StepID', 'Time_ms', 'Time_Elapsed']]
step25 = pd.DataFrame(columns=['ContextID', 'BacksGas_Flow_sccm', 'StepID', 'Time_ms'])
step26 = step25.copy()
step27 = step25.copy()

for _, group in df.groupby('ContextID'):
    step25 = step25.append(group[group.index.get_loc(group[group.StepID.eq(24)].index[0]):][group.StepID.eq(1)])
    step26 = step26.append(group[group.index.get_loc(group[group.StepID.eq(24)].index[0]):][group.StepID.eq(2)])
    step27 = step27.append(group[group.index.get_loc(group[group.StepID.eq(24)].index[0]):][group.StepID.eq(3)])
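
A note for readers on pandas 2.0 or later: DataFrame.append was removed there, so the loop above has to collect the pieces and call pd.concat instead. Below is a minimal sketch of an equivalent, not the asker's code. The helper name steps_after_marker, its parameters and the toy frame are hypothetical, and skipping groups that contain no StepID 24 row is an assumption (the original loop would raise an IndexError for such a group).

```python
import pandas as pd


def steps_after_marker(df, step_id, marker=24):
    """Rows with StepID == step_id that come after the first `marker` row of each ContextID."""
    parts = []
    for _, group in df.groupby("ContextID"):
        is_marker = group["StepID"].eq(marker).to_numpy()
        if not is_marker.any():
            continue  # assumption: ignore groups that never reach the marker step
        # argmax on a boolean array gives the position of the first True
        tail = group.iloc[is_marker.argmax():]
        parts.append(tail[tail["StepID"].eq(step_id)])
    # Return an empty frame with the same columns if nothing matched
    return pd.concat(parts) if parts else df.iloc[0:0]


# Toy data standing in for the real frame
demo = pd.DataFrame({
    "ContextID": [1, 1, 1, 1, 1, 2, 2, 2],
    "StepID":    [1, 2, 24, 1, 2, 1, 24, 1],
})
step25_demo = steps_after_marker(demo, step_id=1)
step26_demo = steps_after_marker(demo, step_id=2)
```

Because the pieces keep their original index labels, concatenating them with the source frame later is what produces the repeated index values.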

This gives me 3 more DataFrames with the following shapes:

step25 (2978, 5)
step26 (4926, 5)
step27 (11810, 5)

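All three of these DataFrames have a column named StepID whose values are 1, 2 and 3 respectively, so I replace them with 25, 26 and 27 and then concatenate X, step25, step26 and step27 as follows: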

step25['StepID'] = 25
step26['StepID'] = 26
step27['StepID'] = 27
united_data = pd.concat([X, step25, step26, step27], sort=True)

Now some rows in united_data share the same index. For example:

        BacksGas_Flow_sccm ContextID  StepID  Time_Elapsed         Time_ms
104082            1.757812   7325335       3       153.238 08:49:06.900000
104082            1.757812   7325335      27       153.238 08:49:06.900000
205388            1.757812   7324656       2         145.9 07:16:31.660000
205388            1.757812   7324656      26         145.9 07:16:31.660000
105119            1.953125   7290176       1       139.695 09:30:39.170000
105119            1.953125   7290176      25       139.695 09:30:39.170000

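What I want to do now is find the rows that share an index, keep only the one whose StepID is 25, 26 or 27, and drop the one whose StepID is 1, 2 or 3. All rows whose index is not duplicated must be kept.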

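So the desired output would be:

        BacksGas_Flow_sccm ContextID  StepID  Time_Elapsed         Time_ms
104082            1.757812   7325335      27       153.238 08:49:06.900000
205388            1.757812   7324656      26         145.9 07:16:31.660000
105119            1.953125   7290176      25       139.695 09:30:39.170000

and the rows that get dropped would be:

        BacksGas_Flow_sccm ContextID  StepID  Time_Elapsed         Time_ms
104082            1.757812   7325335       3       153.238 08:49:06.900000
205388            1.757812   7324656       2         145.9 07:16:31.660000
105119            1.953125   7290176       1       139.695 09:30:39.170000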

Answer 1

I think the simplest solution is to leave X out of the concat:

united_data = pd.concat([step25, step26, step27], sort=True)

I believe Series.isin combined with Index.duplicated and boolean indexing is enough here:

df1 = united_data[united_data['StepID'].isin([25, 26, 27]) & united_data.index.duplicated(keep=False)]
print(df1)
        BacksGas_Flow_sccm  ContextID  StepID  Time_Elapsed          Time_ms
104082            1.757812    7325335      27       153.238  08:49:06.900000
205388            1.757812    7324656      26       145.900  07:16:31.660000
105119            1.953125    7290176      25       139.695  08:49:06.900000
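
The mask above selects only the relabelled rows of each duplicated pair. The question also asks to keep every row whose index is not duplicated, which can be done by combining the same two conditions differently. This is a minimal sketch on toy data, not part of the original answer; the small united_data frame and the names is_dup, is_new_step, cleaned and dropped are made up for illustration.

```python
import pandas as pd

# Toy stand-in for united_data: labels 10 and 20 appear twice, 30 only once
united_data = pd.DataFrame(
    {"StepID": [3, 27, 2, 26, 24]},
    index=[10, 10, 20, 20, 30],
)

# True for every row whose index label occurs more than once
is_dup = united_data.index.duplicated(keep=False)
# True for the relabelled rows; converted to a NumPy array so both masks are positional
is_new_step = united_data["StepID"].isin([25, 26, 27]).to_numpy()

# Keep all non-duplicated rows plus the relabelled row of each duplicated pair
cleaned = united_data[~is_dup | is_new_step]
# The rows that were removed: duplicated index and still carrying the old StepID
dropped = united_data[is_dup & ~is_new_step]
```

Both masks are plain boolean arrays, so no index alignment is involved and the repeated labels cause no trouble.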

Answer 2

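It looks like you are only changing the StepID column. In that case it may be simpler to make the change directly, without concatenating anything: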

step25['StepID'] = 25
step26['StepID'] = 26
step27['StepID'] = 27
united_data = X.copy()     # unsure whether useful or not

for step in [step25, step26, step27]:
    # .loc is needed for label-based (row, column) assignment
    united_data.loc[step.index, 'StepID'] = step.StepID
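
To see the in-place approach end to end, here is a small self-contained sketch using .loc for label-based assignment. The toy X frame and the way the step frames are sliced from it are invented for illustration, and the sketch assumes X has a unique index.

```python
import pandas as pd

# Toy stand-in for X with a unique index
X = pd.DataFrame({"StepID": [24, 1, 2, 3, 1]}, index=[100, 101, 102, 103, 104])

# Pretend these rows were selected by the groupby loop
step25 = X.loc[[101]].copy()
step26 = X.loc[[102]].copy()
step27 = X.loc[[103]].copy()

united_data = X.copy()  # leave X itself untouched
for step, new_id in [(step25, 25), (step26, 26), (step27, 27)]:
    # Overwrite StepID only at the index labels present in this step frame
    united_data.loc[step.index, "StepID"] = new_id
```

Since nothing is concatenated, the index stays unique and there are no duplicate rows to clean up afterwards.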

How to compare 2 rows that have the same index and drop one of them based on a certain condition?
