Remove the entire row if specific columns contain duplicate values

Asked: 2019-10-24 09:33:37

Tags: python pandas dataframe

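I have read a CSV file (containing clients' names and addresses) and loaded the data into a DataFrame.

Description of the CSV file (or DataFrame)

The DataFrame contains multiple rows and 7 columns.

Sample data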

Client_id Client_Name Address1        Address3       Post_Code   City_Name              Full_Address
 C0000001     A       10000009    37 RUE DE LA GARE    L-7535     MERSCH           37 RUE DE LA GARE,L-7535, MERSCH
 C0000001     A       10000009    37 RUE DE LA GARE    L-7535     MERSCH           37 RUE DE LA GARE,L-7535, MERSCH
 C0000001     A       10000009    37 RUE DE LA GARE    L-7535     MERSCH           37 RUE DE LA GARE,L-7535, MERSCH
 C0000002     B       10001998    RUE EDWARD STEICHEN  L-1855     LUXEMBOURG       RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
 C0000002     B       10001998    RUE EDWARD STEICHEN  L-1855     LUXEMBOURG       RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
 C0000002     B       10001998    RUE EDWARD STEICHEN  L-1855     LUXEMBOURG       RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
 C0000003     C       11000051    9 RUE DU BRILL       L-3898     FOETZ            9 RUE DU BRILL,L-3898 ,FOETZ
 C0000003     C       11000051    9 RUE DU BRILL       L-3898     FOETZ            9 RUE DU BRILL,L-3898 ,FOETZ
 C0000003     C       11000051    9 RUE DU BRILL       L-3898     FOETZ            9 RUE DU BRILL,L-3898 ,FOETZ
 C0000004     D       10000009    37 RUE DE LA GARE    L-7535     MERSCH           37 RUE DE LA GARE,L-7535, MERSCH
 C0000005     E       10001998    RUE EDWARD STEICHEN  L-1855     LUXEMBOURG       RUE EDWARD STEICHEN,L-1855,LUXEMBOURG

So far, I have written the following code to generate the table above:

import pandas as pd
import glob
Excel_file = 'Address.xlsx'
Address_Info = pd.read_excel(Excel_file)

# Rename the columns
Address_Info.columns = ['Client_ID', 'Client_Name','Address_ID','Street_Name','Post_Code','City_Name','Country'] 

# Extract specific columns into a new DataFrame
Bin_Address= Address_Info[['Client_Name','Address_ID','Street_Name','Post_Code','City_Name','Country']].copy()


# Clean existing whitespace from the ends of the strings
Bin_Address= Bin_Address.apply(lambda x: x.str.strip(), axis=1)  # ← added

# Add a new column (Full_Address) that concatenates the address columns into one string,
# for example: Karlaplan 13,115 20,STOCKHOLM,Stockholms län, Sweden
Bin_Address['Full_Address'] = Bin_Address[Bin_Address.columns[1:]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)

Bin_Address['Full_Address']=Bin_Address[['Full_Address']].copy()


Bin_Address['latitude'] = 'None'
Bin_Address['longitude'] = 'None'

# Remove repeated addresses
#Temp = list( dict.fromkeys(Bin_Address.Full_Address) )

# Remove repeated values (I believe the modification should be here)
Temp = list( dict.fromkeys(Address_Info.Client_ID) )

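I want to remove the entire row when the Client id, Client name, and Full_Address columns all contain duplicate values. So far the code raises no errors, but it does not produce the expected result. (I believe the change belongs in the last line of the code above.)

The expected output is: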

Client_id Client_Name Address1        Address3       Post_Code   City_Name              Full_Address                            
 C0000001     A       10000009    37 RUE DE LA GARE    L-7535     MERSCH           37 RUE DE LA GARE,L-7535, MERSCH            
 C0000002     B       10001998    RUE EDWARD STEICHEN  L-1855     LUXEMBOURG       RUE EDWARD STEICHEN,L-1855,LUXEMBOURG         
 C0000003     C       11000051    9 RUE DU BRILL       L-3898     FOETZ            9 RUE DU BRILL,L-3898 ,FOETZ         
 C0000004     D       10000009    37 RUE DE LA GARE    L-7535     MERSCH           37 RUE DE LA GARE,L-7535, MERSCH     
 C0000005     E       10001998    RUE EDWARD STEICHEN  L-1855     LUXEMBOURG       RUE EDWARD STEICHEN,L-1855,LUXEMBOURG     

2 answers:

Answer 0 (score: 0)

Try:

df = df.drop_duplicates(['Client_id', 'Client_Name', 'Full_Address'])
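
As a minimal sketch of this approach, the snippet below rebuilds a reduced version of the question's sample data (only three of its seven columns, with column names assumed from the table header) and drops rows that repeat across all three columns. The names `df` and `deduped` are illustrative, not taken from the question's code.

```python
import pandas as pd

# Reduced copy of the question's sample data (three of the seven columns).
df = pd.DataFrame({
    "Client_id": ["C0000001", "C0000001", "C0000002", "C0000002", "C0000004", "C0000005"],
    "Client_Name": ["A", "A", "B", "B", "D", "E"],
    "Full_Address": [
        "37 RUE DE LA GARE,L-7535, MERSCH",
        "37 RUE DE LA GARE,L-7535, MERSCH",
        "RUE EDWARD STEICHEN,L-1855,LUXEMBOURG",
        "RUE EDWARD STEICHEN,L-1855,LUXEMBOURG",
        "37 RUE DE LA GARE,L-7535, MERSCH",
        "RUE EDWARD STEICHEN,L-1855,LUXEMBOURG",
    ],
})

# A row is dropped only when all three columns match an earlier row,
# so clients D and E survive even though they share an address with A and B.
deduped = df.drop_duplicates(subset=["Client_id", "Client_Name", "Full_Address"])

# Optional: renumber the index so it runs 0..n-1 after the drop.
deduped = deduped.reset_index(drop=True)
print(deduped)
```

Note that `drop_duplicates` returns a new DataFrame rather than modifying `df`, which is why the result is assigned to a variable.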

Answer 1 (score: 0)

You can use the built-in pandas method drop_duplicates(). It also comes with several useful options out of the box.

<your_dataframe>.drop_duplicates(subset=["Client_id", "Client_Name", "Full_Address"])

You can also choose whether to keep the first or the last occurrence of each set of duplicates.

  <your_dataframe>.drop_duplicates(subset=["Client_id", "Client_Name", "Full_Address"], keep="first") # "first" or "last"

By default, it keeps the first occurrence.
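
To make the effect of `keep` concrete, here is a small sketch on a made-up four-row frame (the data and the variable names `first`, `last`, and `unique_only` are assumptions for illustration). Besides "first" and "last", pandas also accepts `keep=False`, which drops every row that has a duplicate.

```python
import pandas as pd

# Rows 0, 1 and 2 are identical; row 3 differs in Client_id and Client_Name.
df = pd.DataFrame({
    "Client_id": ["C0000001", "C0000001", "C0000001", "C0000004"],
    "Client_Name": ["A", "A", "A", "D"],
    "Full_Address": ["37 RUE DE LA GARE,L-7535, MERSCH"] * 4,
})
cols = ["Client_id", "Client_Name", "Full_Address"]

first = df.drop_duplicates(subset=cols, keep="first")      # keeps rows 0 and 3
last = df.drop_duplicates(subset=cols, keep="last")        # keeps rows 2 and 3
unique_only = df.drop_duplicates(subset=cols, keep=False)  # keeps row 3 only

print(list(first.index), list(last.index), list(unique_only.index))
# prints: [0, 3] [2, 3] [3]
```

The original index labels are preserved, which makes it easy to see which occurrence survived in each case.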