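I need to create a function/expression that compares multiple columns ('Cust ID Count', 'Revenue', and possibly 'Family Name') to find matching records, and then keeps only the first record of each match based on ascending order. The function should handle two situations in which several similar records exist:
1. Multiple records match in every column/series except 'street' (records 0 & 1).
2. Multiple records match in every column/series except 'street' and 'Family Name' (records 3 & 4).
I realize that this seems to leave only 'Cust ID Count' and 'Revenue' as matching parameters, but I would also like to be able to use 'Family Name' as an option where possible.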
Dataset:
idx  Cust ID Count  Family Name  street    Revenue
0    10             Smith        spring    50       #match
1    10             Smith        wilbur    50       #match
2    45             Jerry        jane      35       #not a match
3    25             Cole         mary      20       #match
4    25             Stein        mary sue  20       #match
Output:
idx  Cust ID Count  Family Name  street    Revenue
0    10             Smith        spring    50       #spring is kept due to alphabetical order
1    45             Jerry        jane      35       #not a match
2    25             Cole         mary      20       #mary is kept due to alphabetical order
Answer 0 (score: 0)
Try this:
(df.sort_values('Family Name')
   .drop_duplicates(['Cust ID Count', 'Revenue'], keep='first')
   .sort_index()
   .reset_index(drop=True))
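For a self-contained check, the sketch below rebuilds the sample data from the question and applies the same chain. Two things here are assumptions rather than part of the answer above: the helper name `dedupe_first` is hypothetical, and 'street' is added as a second sort key so that the tie between the two Smith rows is resolved alphabetically instead of depending on the sort algorithm.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Cust ID Count": [10, 10, 45, 25, 25],
    "Family Name": ["Smith", "Smith", "Jerry", "Cole", "Stein"],
    "street": ["spring", "wilbur", "jane", "mary", "mary sue"],
    "Revenue": [50, 50, 35, 20, 20],
})


def dedupe_first(frame, keys=("Cust ID Count", "Revenue"),
                 order=("Family Name", "street")):
    """Hypothetical helper: keep the alphabetically first row of each key group."""
    return (frame.sort_values(list(order))                 # alphabetical order decides which row is "first"
                 .drop_duplicates(list(keys), keep="first")  # one row per combination of keys
                 .sort_index()                              # restore the original row order
                 .reset_index(drop=True))


result = dedupe_first(df)
print(result)
```

To treat 'Family Name' as a matching column as well, pass `keys=("Cust ID Count", "Family Name", "Revenue")`. With the sample data, records 3 and 4 then no longer count as a match and both are kept.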
Answer 1 (score: 0)
Or, less elegantly than Chris A's answer:
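import numpy as np

# Sort so that matching rows are adjacent; the last two keys decide which row comes first
df1 = df.sort_values(by=["Cust ID Count", "Revenue", "Family Name", "street"])
# True where a row has the same value as the row directly above it
same_id = df1["Cust ID Count"].values[1:] == df1["Cust ID Count"].values[:-1]
same_rev = df1["Revenue"].values[1:] == df1["Revenue"].values[:-1]
dup = np.insert(same_id & same_rev, 0, False)  # the first row is never a duplicate
# Drop the duplicates, then restore the original row order
df1[~dup].sort_index().reset_index(drop=True)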
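The adjacent-row comparison above amounts to building a boolean mask that marks repeated rows. As a rough sketch, `DataFrame.duplicated` produces that kind of mask directly. The sample data is rebuilt from the question, and sorting on 'Family Name' and 'street' first is an assumption made here so that the alphabetically first row is the one treated as the original.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Cust ID Count": [10, 10, 45, 25, 25],
    "Family Name": ["Smith", "Smith", "Jerry", "Cole", "Stein"],
    "street": ["spring", "wilbur", "jane", "mary", "mary sue"],
    "Revenue": [50, 50, 35, 20, 20],
})

# Assumed tie-break: alphabetical by family name, then by street
ordered = df.sort_values(["Family Name", "street"])

# True for every row whose key columns already appeared earlier in `ordered`
mask = ordered.duplicated(subset=["Cust ID Count", "Revenue"], keep="first")

# Keep the unmarked rows and restore the original row order
kept = ordered[~mask].sort_index().reset_index(drop=True)
print(kept)
```

Inspecting `mask` before filtering shows which records would be dropped, which can help when deciding whether 'Family Name' belongs in `subset`.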