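This is a follow-up to the question: How to compare two txt files and then apply changes in one of them
I chose the biopandas module. However, I think there is some kind of problem with the DataFrames imported by this module, specifically with duplicated / drop_duplicates. I have a large DataFrame: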
df:
col1 col2 col3 col4 col5 col6 col7 col8 col9
0 ATOM N SER 15 17.203 0.286 72.985 4pxz
1 ATOM CA SER 15 16.713 1.342 73.869 4pxz
2 ATOM C SER 15 17.885 2.188 74.412 4pxz
3 ATOM O SER 15 18.028 3.351 74.013 4pxz
4 ATOM CB SER 15 15.889 0.750 75.014 4pxz
... ... ... ... ... ... ... ... ...
3 ATOM CD ARG 93 12.319 8.102 61.886 hatp
4 ATOM NE ARG 93 11.978 6.754 61.425 hatp
5 ATOM CZ ARG 93 11.731 5.714 62.217 hatp
6 ATOM NH2 ARG 93 11.430 4.535 61.694 hatp
7 ATOM NH1 ARG 93 11.793 5.843 63.538 hatp
3148 rows × 8 columns
I wanted to check the extent of the duplicates using:
df2 = df[df.duplicated(['col3','col4','col5'])] # show me duplicates containing identical type(col3), abbreviation(col4) and number(col5).
and this is what I got:
col1 col2 col3 col4 col5 col6 col7 col8
2132 ATOM CA HIS 1063 38.442 -16.479 -5.209 4pxz
2136 ATOM CB HIS 1063 37.502 -15.555 -6.008 4pxz
2138 ATOM CG HIS 1063 38.007 -15.211 -7.378 4pxz
2140 ATOM ND1 HIS 1063 38.342 -16.194 -8.293 4pxz
2142 ATOM CD2 HIS 1063 38.213 -14.000 -7.943 4pxz
2144 ATOM CE1 HIS 1063 38.749 -15.553 -9.375 4pxz
2146 ATOM NE2 HIS 1063 38.688 -14.231 -9.213 4pxz
0 ATOM CA ARG 93 11.357 9.429 58.493 hatp
1 ATOM CB ARG 93 12.236 9.564 59.757 hatp
2 ATOM CG ARG 93 11.569 9.166 61.087 hatp
3 ATOM CD ARG 93 12.319 8.102 61.886 hatp
4 ATOM NE ARG 93 11.978 6.754 61.425 hatp
5 ATOM CZ ARG 93 11.731 5.714 62.217 hatp
6 ATOM NH2 ARG 93 11.430 4.535 61.694 hatp
7 ATOM NH1 ARG 93 11.793 5.843 63.538 hatp
Expected output:
col1 col2 col3 col4 col5 col6 col7 col8 col9
606 ATOM CA ARG 93 11.357 9.429 58.493 4pxz
609 ATOM CB ARG 93 12.236 9.564 59.757 4pxz
610 ATOM CG ARG 93 13.088 8.333 60.120 4pxz
611 ATOM CD ARG 93 13.985 7.822 58.995 4pxz
612 ATOM NE ARG 93 14.503 6.485 59.295 4pxz
613 ATOM CZ ARG 93 15.012 5.642 58.400 4pxz
614 ATOM NH1 ARG 93 15.074 5.979 57.116 4pxz
615 ATOM NH2 ARG 93 15.455 4.453 58.780 4pxz
0 ATOM CA ARG 93 11.357 9.429 58.493 hatp
1 ATOM CB ARG 93 12.236 9.564 59.757 hatp
2 ATOM CG ARG 93 11.569 9.166 61.087 hatp
3 ATOM CD ARG 93 12.319 8.102 61.886 hatp
4 ATOM NE ARG 93 11.978 6.754 61.425 hatp
5 ATOM CZ ARG 93 11.731 5.714 62.217 hatp
6 ATOM NH2 ARG 93 11.430 4.535 61.694 hatp
7 ATOM NH1 ARG 93 11.793 5.843 63.538 hatp
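As you can see, the result does not follow what I asked of the duplicated() method: the matching 4pxz rows are missing (drop_duplicates behaves exactly the same way). To get them I had to use: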
df2=df[df['col5'] == 93 ]
What is going wrong?
Answer 0 (score: 0)
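Are you calling df.duplicated? If so, make sure to pass keep=False. With the default keep='first', the first occurrence of each duplicated key is not marked as a duplicate, so only the later copies are returned. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html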
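To make the effect of keep concrete, here is a minimal sketch. The rows are made-up toy data, not the asker's real DataFrame; only the column names col3, col4, col5 and col9 are borrowed from the question, and the variable names key, first_only and all_dupes are purely illustrative.

```python
import pandas as pd

# Toy data invented for illustration; column names follow the question.
df = pd.DataFrame(
    {
        "col3": ["CA", "CB", "CA", "CB", "N"],
        "col4": ["ARG", "ARG", "ARG", "ARG", "SER"],
        "col5": [93, 93, 93, 93, 15],
        "col9": ["4pxz", "4pxz", "hatp", "hatp", "4pxz"],
    }
)
key = ["col3", "col4", "col5"]

# Default keep="first": the first occurrence of each key is NOT flagged,
# so only the later copies (here the hatp rows) are selected.
first_only = df[df.duplicated(subset=key)]

# keep=False: every row whose key occurs more than once is flagged,
# so both the 4pxz and the hatp copies are selected.
all_dupes = df[df.duplicated(subset=key, keep=False)]

print(first_only)
print(all_dupes)
```

In this sketch the unique SER row is excluded in both cases. The only difference is whether the first copy of each repeated key is included in the result.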
Answer 1 (score: 0)
The correct answer:
df2 = df[df.duplicated(subset = ['col3','col4','col5'], keep = False)]
Thank you all very much!