duplicated / drop_duplicates not working as expected

Time: 2019-10-16 07:38:56

Tags: python pandas

This is a follow-up to the question: How to compare two txt files and then apply changes in one of them

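I chose the "biopandas" module. However, I think there is some kind of problem with the data frames imported by this module, specifically with duplicated / drop_duplicates. I have a large data frame: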

df:

col1  col2  col3  col4  col5  col6    col7   col8    col9
0     ATOM  N     SER   15    17.203  0.286  72.985  4pxz
1     ATOM  CA    SER   15    16.713  1.342  73.869  4pxz
2     ATOM  C     SER   15    17.885  2.188  74.412  4pxz
3     ATOM  O     SER   15    18.028  3.351  74.013  4pxz
4     ATOM  CB    SER   15    15.889  0.750  75.014  4pxz
...   ...   ...   ...   ...   ...     ...    ...     ...
3     ATOM  CD    ARG   93    12.319  8.102  61.886  hatp
4     ATOM  NE    ARG   93    11.978  6.754  61.425  hatp
5     ATOM  CZ    ARG   93    11.731  5.714  62.217  hatp
6     ATOM  NH2   ARG   93    11.430  4.535  61.694  hatp
7     ATOM  NH1   ARG   93    11.793  5.843  63.538  hatp

3148 rows × 8 columns

I want to check the extent of the duplicates using:

df2 = df[df.duplicated(['col3','col4','col5'])] # show me duplicates containing identical type(col3), abbreviation(col4) and number(col5).

and this is what I get:

col1  col2  col3  col4  col5  col6    col7     col8    col9
2132  ATOM  CA    HIS   1063  38.442  -16.479  -5.209  4pxz
2136  ATOM  CB    HIS   1063  37.502  -15.555  -6.008  4pxz
2138  ATOM  CG    HIS   1063  38.007  -15.211  -7.378  4pxz
2140  ATOM  ND1   HIS   1063  38.342  -16.194  -8.293  4pxz
2142  ATOM  CD2   HIS   1063  38.213  -14.000  -7.943  4pxz
2144  ATOM  CE1   HIS   1063  38.749  -15.553  -9.375  4pxz
2146  ATOM  NE2   HIS   1063  38.688  -14.231  -9.213  4pxz
0     ATOM  CA    ARG   93    11.357  9.429    58.493  hatp
1     ATOM  CB    ARG   93    12.236  9.564    59.757  hatp
2     ATOM  CG    ARG   93    11.569  9.166    61.087  hatp
3     ATOM  CD    ARG   93    12.319  8.102    61.886  hatp
4     ATOM  NE    ARG   93    11.978  6.754    61.425  hatp
5     ATOM  CZ    ARG   93    11.731  5.714    62.217  hatp
6     ATOM  NH2   ARG   93    11.430  4.535    61.694  hatp
7     ATOM  NH1   ARG   93    11.793  5.843    63.538  hatp

Expected output:

col1  col2  col3  col4  col5  col6    col7   col8    col9
606   ATOM  CA    ARG   93    11.357  9.429  58.493  4pxz
609   ATOM  CB    ARG   93    12.236  9.564  59.757  4pxz
610   ATOM  CG    ARG   93    13.088  8.333  60.120  4pxz
611   ATOM  CD    ARG   93    13.985  7.822  58.995  4pxz
612   ATOM  NE    ARG   93    14.503  6.485  59.295  4pxz
613   ATOM  CZ    ARG   93    15.012  5.642  58.400  4pxz
614   ATOM  NH1   ARG   93    15.074  5.979  57.116  4pxz
615   ATOM  NH2   ARG   93    15.455  4.453  58.780  4pxz
0     ATOM  CA    ARG   93    11.357  9.429  58.493  hatp
1     ATOM  CB    ARG   93    12.236  9.564  59.757  hatp
2     ATOM  CG    ARG   93    11.569  9.166  61.087  hatp
3     ATOM  CD    ARG   93    12.319  8.102  61.886  hatp
4     ATOM  NE    ARG   93    11.978  6.754  61.425  hatp
5     ATOM  CZ    ARG   93    11.731  5.714  62.217  hatp
6     ATOM  NH2   ARG   93    11.430  4.535  61.694  hatp
7     ATOM  NH1   ARG   93    11.793  5.843  63.538  hatp

As you can see, it does not follow what I passed to the duplicated() method (drop_duplicates behaves exactly the same way). Instead, I have to use:

df2=df[df['col5'] == 93 ]

What is going wrong?

2 answers:

Answer 0: (score: 0)

Answer 1: (score: 0)

The correct answer:

df2 = df[df.duplicated(subset = ['col3','col4','col5'], keep = False)]
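
For anyone wondering why this helps: duplicated() defaults to keep='first', which leaves the first occurrence of each key unflagged, so only the later copies are selected. With keep=False every row whose key appears more than once is flagged. Here is a minimal sketch on a made-up toy frame (the rows are illustrative only, not taken from the real data; the column names simply mirror the question):

```python
import pandas as pd

# Toy data, invented for illustration: the same residue atoms appear
# once in "4pxz" and once in "hatp", plus one unique row.
df = pd.DataFrame({
    "col3": ["CA", "CB", "CA", "CB", "N"],
    "col4": ["ARG", "ARG", "ARG", "ARG", "SER"],
    "col5": [93, 93, 93, 93, 15],
    "col9": ["4pxz", "4pxz", "hatp", "hatp", "4pxz"],
})
subset = ["col3", "col4", "col5"]

# Default keep='first': the first occurrence of each key is NOT flagged,
# so only the later (hatp) rows come back.
later_only = df[df.duplicated(subset=subset)]

# keep=False: every row whose key occurs more than once is flagged.
all_dups = df[df.duplicated(subset=subset, keep=False)]

print(later_only["col9"].tolist())  # ['hatp', 'hatp']
print(all_dups["col9"].tolist())    # ['4pxz', '4pxz', 'hatp', 'hatp']
```

drop_duplicates() takes the same keep argument, which would explain why it appeared to behave "exactly the same": with the default it keeps the first copy of each key, and with keep=False it drops every copy.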

Thank you all very much!