Question

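I want to use pandas to compare certain column values in a single row against all other rows. I wrote the loop below, but because my DataFrame has about 400,000 rows it takes forever to run. Is there a smarter or faster way to do this? Sorry, I'm not very fluent in Python; I'm more used to coding in .NET languages.

My DataFrame looks like this: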

    NAME            PROFILE URL             Final Addres
0   ProfileA    appexample.co/userxyz       http://example.com
1   ProfileB    appexample.co/userxyz_1     http://example.com  
2   ProfileC    appexample.co/userabc       http://anotherexample.com
3   ProfileD    appexample.co/userabc_3     http://anotherexample.com
4   ProfileE    appexample.co/userjyl       http://example123.com

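In this case I'm trying to identify duplicate profiles. (ProfileA and ProfileB) and (ProfileC and ProfileD) are duplicates because they:
1. have the same profile URL (e.g. user is contained in both user and user_1)
2. have the same final address

I'm currently using the following code: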

possible_dup = []
# iterrows() yields (index, row) tuples, so row[1] is the row itself
for row in test.iterrows():
    first = str(row[1]['PROFILE URL'])
    first_url = str(row[1]['Final Addres'])  # column name as shown in the table above
    for sec_row in test.iterrows():
        second = str(sec_row[1]['PROFILE URL'])
        second_url = str(sec_row[1]['Final Addres'])
        # skip rows with an identical profile URL (including the row itself)
        if first == second:
            continue
        if (first in second) and (first_url == second_url):
            dup = '{} , {}'.format(first, second)
            possible_dup.append(dup)

It has been running for more than 24 hours and still hasn't finished. I'm using a Jupyter notebook.

Answer 1

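Check out the duplicated() method. From the documentation:

Return boolean Series denoting duplicate rows.

Particularly useful for you is the optional subset parameter, which restricts the comparison to a subset of columns. Depending on your exact goal, there are several things you can do with duplicated():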

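To identify which rows are duplicates (keep=False flags every member of a duplicate group), use

duplicates = test.duplicated(subset=['PROFILE URL', 'Final Addres'], keep=False)

To select the duplicated users (with keep='first', every occurrence after the first), use

duplicate_users = test[test.duplicated(subset=['PROFILE URL', 'Final Addres'], keep='first')]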

To return a DataFrame without duplicates (each formerly duplicated row now appears only once):

duplicates = test.duplicated(subset=['PROFILE URL', 'Final Addres'])
duplicate_free_df = test.loc[~duplicates]
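
As a minimal, runnable sketch of the three calls above on the question's sample data: note that duplicated() only matches exact values, so on this sample a check over both columns flags nothing, because every PROFILE URL differs. The normalization step below (stripping a trailing "_<digits>" from the URL) is an assumption about how the variants are formed, not something the question confirms.

```python
import pandas as pd

# Sample data copied from the question; `test` follows the question's naming.
test = pd.DataFrame({
    "NAME": ["ProfileA", "ProfileB", "ProfileC", "ProfileD", "ProfileE"],
    "PROFILE URL": [
        "appexample.co/userxyz",
        "appexample.co/userxyz_1",
        "appexample.co/userabc",
        "appexample.co/userabc_3",
        "appexample.co/userjyl",
    ],
    "Final Addres": [
        "http://example.com",
        "http://example.com",
        "http://anotherexample.com",
        "http://anotherexample.com",
        "http://example123.com",
    ],
})

cols = ["PROFILE URL", "Final Addres"]

# Exact match on both columns: nothing is flagged, since all URLs differ.
exact = test.duplicated(subset=cols, keep=False)

# ASSUMPTION: a trailing "_<digits>" marks a variant of the same profile.
base_url = test["PROFILE URL"].str.replace(r"_\d+$", "", regex=True)
normalized = test.assign(**{"PROFILE URL": base_url})

# keep=False flags every member of a duplicate group.
all_dups = normalized.duplicated(subset=cols, keep=False)

# keep="first" flags only the occurrences after the first one.
later_dups = normalized.duplicated(subset=cols, keep="first")

# Drop the later occurrences, keeping one row per duplicate group.
duplicate_free_df = test.loc[~later_dups]

print(test[all_dups])
print(duplicate_free_df)
```

If the variants are not formed by a numeric suffix, the normalization line would need to change; the duplicated() calls themselves stay the same.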

Answer 2

Pass keep=False to duplicated() so that every duplicate is flagged, not just the later occurrences.

df2 = df[df.duplicated(subset=['Final Addres'],keep=False)]

print(df2)


       NAME              PROFILE URL               Final Addres
0  ProfileA    appexample.co/userxyz         http://example.com
1  ProfileB  appexample.co/userxyz_1         http://example.com
2  ProfileC    appexample.co/userabc  http://anotherexample.com
3  ProfileD  appexample.co/userabc_3  http://anotherexample.com
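
If the goal is the list of pairs that the question's loop builds, one possible sketch is to self-merge on the final address, so rows are only compared with rows sharing that address, and then apply the loop's two conditions. The "_first"/"_second" suffixes and the variable names are illustrative choices, not part of the original answers.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "NAME": ["ProfileA", "ProfileB", "ProfileC", "ProfileD", "ProfileE"],
    "PROFILE URL": [
        "appexample.co/userxyz",
        "appexample.co/userxyz_1",
        "appexample.co/userabc",
        "appexample.co/userabc_3",
        "appexample.co/userjyl",
    ],
    "Final Addres": [
        "http://example.com",
        "http://example.com",
        "http://anotherexample.com",
        "http://anotherexample.com",
        "http://example123.com",
    ],
})

# Pair each row only with rows that share its final address.
pairs = df.merge(df, on="Final Addres", suffixes=("_first", "_second"))

# Same conditions as the question's loop: the URLs differ,
# and the first URL is a substring of the second.
different = pairs["PROFILE URL_first"] != pairs["PROFILE URL_second"]
contained = pd.Series(
    [a in b for a, b in zip(pairs["PROFILE URL_first"], pairs["PROFILE URL_second"])],
    index=pairs.index,
)

possible_dup = pairs[different & contained]
print(possible_dup[["NAME_first", "NAME_second", "Final Addres"]])
```

The self-merge grows with the square of each address group's size, so this is only cheaper than the nested loop when no single final address is shared by a very large number of rows.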

Comparing values in one row against all other rows in pandas
