Question

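I have a large DataFrame (more than 100 columns and several hundred thousand rows) in which a number of rows contain duplicate data. I am trying to remove the duplicate rows, keeping the one with the largest value in a different column.

Essentially, I am sorting the data into individual bins based on time period, so across periods one would expect to find a lot of duplication, since most entities exist in every time period. What is not allowed, however, is for the same entity to appear more than once within a given time period.

I tried the approach from "python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B" on a subset of the data, with the plan to recombine the result with the original DataFrame, df.

Example data subset: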

              unique_id   period_id   liq
index                                   
19            CAN00CE0     199001  0.017610
1903          **USA07WG0** 199001  1.726374
12404         **USA07WG0** 199001  0.090525
13330         USA08DE0     199001  1.397143
14090         USA04U80     199001  2.000716
12404         USA07WG0     199002  0.090525
13330         USA08DE0     199002  1.397143
14090         USA04U80     199002  2.000716

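In the example above, I would like to keep the first instance (because its liq is higher, at 1.72) and discard the second instance (its liq is lower, at 0.09). Note that there can be more than two duplicates in a given period_id.

I tried this, but it was very slow for me (I stopped it after more than 5 minutes):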

def h(x):
    x = x.dropna()  # idxmax fails on NaNs, and I am happy to throw out rows where liq is NaN.
    return x.loc[x.liq.idxmax()]

df.groupby(['holt_unique_id', 'period_id'], group_keys=False).apply(h)

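I ended up doing the following, which is more verbose and ugly and simply throws out all but one duplicate, but this is also very slow! Given the speed of other operations of similar complexity, I thought I would ask here for a better solution.

So my request is really to fix the code above so that it is fast. The code below is given as guidance; if the answer is in the same vein, perhaps I could also discard the duplicates based on the index, rather than using the reset_index / set_index approach I have employed: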

def do_remove_duplicates(df):
    sub_df = df[['period_id', 'unique_id']]
    grp = sub_df.groupby(['period_id', 'unique_id'], as_index=False)
    # Apply drop_duplicates to each group. This line is the slow bit!
    # (The keyword is `subset`; older pandas versions called it `cols`.)
    cln = grp.apply(lambda x: x.drop_duplicates(subset='unique_id'))
    cln = cln.reset_index()  # remove the index levels that apply added
    del cln['level_0']       # drop the group-number level
    cln.set_index('level_1', inplace=True)  # set the index back to the original (same as df)
    # Join the cleaned frame with the original; the left join discards the duplicate rows.
    df_cln = cln.join(df, how='left', rsuffix='_right')
    return df_cln

Answer 1

Do it in two steps:

Update all columns with the max data.
Pick a row (say the first).

This should be much faster, as it is vectorized.

In [11]: g = df.groupby(["unique_id", "period_id"], as_index=False)

In [12]: g.transform("max")
Out[12]:
            liq
index
19     0.017610
1903   1.726374
12404  1.726374
13330  1.397143
14090  2.000716
12404  0.090525
13330  1.397143
14090  2.000716

In [13]: df.update(g.transform("max"))

In [14]: g.nth(0)
Out[14]:
          unique_id  period_id       liq
index
19         CAN00CE0     199001  0.017610
1903   **USA07WG0**     199001  1.726374
13330      USA08DE0     199001  1.397143
14090      USA04U80     199001  2.000716
12404      USA07WG0     199002  0.090525
13330      USA08DE0     199002  1.397143
14090      USA04U80     199002  2.000716

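Note: I would like to use the groupby first or last here, but I think there is a bug where they throw away your old index, which I don't think they should... nth does work, however.

An alternative is to first slice out the rows that do not equal the liq max: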
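
For readers who want to run the two steps end to end, here is a minimal self-contained sketch. It rebuilds the sample rows from the question with a fresh unique index (an assumption; the original index repeats labels such as 12404), and it uses drop_duplicates as a stand-in for nth(0) to pick the first row of each group. The names keys, value_cols and result are only illustrative.

```python
import pandas as pd

# Sample rows from the question, rebuilt with a default 0..7 index (assumption).
df = pd.DataFrame(
    {
        "unique_id": ["CAN00CE0", "USA07WG0", "USA07WG0", "USA08DE0", "USA04U80",
                      "USA07WG0", "USA08DE0", "USA04U80"],
        "period_id": [199001] * 5 + [199002] * 3,
        "liq": [0.017610, 1.726374, 0.090525, 1.397143, 2.000716,
                0.090525, 1.397143, 2.000716],
    }
)

keys = ["unique_id", "period_id"]
value_cols = [c for c in df.columns if c not in keys]

# Step 1: overwrite every non-key column with its per-group maximum (vectorized).
df[value_cols] = df.groupby(keys)[value_cols].transform("max")

# Step 2: keep one row per group. After step 1 the rows of a group agree on
# all value columns, so keeping the first one loses nothing.
result = df.drop_duplicates(subset=keys, keep="first")
print(result)
```

Note that step 1 takes the maximum of each column independently, so with many value columns the surviving row can mix values from different original rows. If the whole row with the largest liq is wanted, filtering on liq is the better fit.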

(df[df["liq"] == g["liq"].transform("max")]  # keep only max liq rows
 .groupby(["unique_id", "period_id"])
 .nth(0))
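
The filtering idea can also be written without nth, which may be easier to follow. The sketch below is one possible version under stated assumptions: the sample frame is rebuilt with a fresh unique index, drop_duplicates replaces nth(0), and the sort-based variant is a different but related technique rather than the answer's own method. The names is_max, deduped and deduped_sorted are illustrative.

```python
import pandas as pd

# Sample rows from the question, rebuilt with a default 0..7 index (assumption).
df = pd.DataFrame(
    {
        "unique_id": ["CAN00CE0", "USA07WG0", "USA07WG0", "USA08DE0", "USA04U80",
                      "USA07WG0", "USA08DE0", "USA04U80"],
        "period_id": [199001] * 5 + [199002] * 3,
        "liq": [0.017610, 1.726374, 0.090525, 1.397143, 2.000716,
                0.090525, 1.397143, 2.000716],
    }
)

keys = ["unique_id", "period_id"]

# True where a row's liq equals the maximum liq of its (unique_id, period_id) group.
# Rows with NaN liq compare as False and are dropped, which the question permits.
is_max = df["liq"] == df.groupby(keys)["liq"].transform("max")

# Ties on the maximum would leave several rows per group, so keep only the first.
deduped = df[is_max].drop_duplicates(subset=keys, keep="first")

# Related variant: sort so the largest liq comes first, keep the first row seen
# for each key pair, then restore the original row order.
# Unlike the mask version, a group whose liq is entirely NaN keeps one row here.
deduped_sorted = (
    df.sort_values("liq", ascending=False)
      .drop_duplicates(subset=keys, keep="first")
      .sort_index()
)
print(deduped)
```

Both variants keep whole rows intact, so the other 100-plus columns stay consistent with the liq value that was kept.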
