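I have two DataFrames that both contain an "Email" column. Ideally the email addresses would match one to one, but because of typos and other input errors, many addresses have no match in the other DataFrame. How can I ignore case in both columns, strip special characters, and then merge on the email addresses?
My DataFrames look like this: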
import pandas as pd

df1 = pd.DataFrame({'URL': ['/', '/', '/instr-analytics'],
                    'Email': ['apple@gmail.com', 'bananA@gmail.com', 'peaR@gmail.com']})
df2 = pd.DataFrame({'URL': ['/s', '/d', '/qinstr-analytics'],
                    'Email': ['Apple@gmail.com', 'banana@gmail.com', 'peaR@gmail.com']})
How can I match the email addresses in this case?
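Answer 0: (score: 0)
If the addresses differ only in letter case, try lowercasing the Email column with lower() and then joining on that column with pd.merge.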
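A minimal sketch of that suggestion, using the sample data from the question. It assumes "special symbols" means whitespace and any character outside a typical address character set; the helper `normalize_email` and the `Email_key` column are hypothetical names introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'URL': ['/', '/', '/instr-analytics'],
                    'Email': ['apple@gmail.com', 'bananA@gmail.com', 'peaR@gmail.com']})
df2 = pd.DataFrame({'URL': ['/s', '/d', '/qinstr-analytics'],
                    'Email': ['Apple@gmail.com', 'banana@gmail.com', 'peaR@gmail.com']})


def normalize_email(emails: pd.Series) -> pd.Series:
    """Hypothetical helper: lowercase, then drop characters outside [a-z0-9._+@-]."""
    return (emails.str.lower()
                  .str.replace(r"[^a-z0-9._+@-]", "", regex=True))


# Merge on the normalized key so the original Email columns stay untouched.
merged = (df1.assign(Email_key=normalize_email(df1["Email"]))
             .merge(df2.assign(Email_key=normalize_email(df2["Email"])),
                    on="Email_key", suffixes=("_df1", "_df2")))
print(merged[["Email_key", "URL_df1", "URL_df2"]])
```

Keeping the normalized value in a separate column means the addresses as originally entered are still available after the merge. The character set in the regular expression is an assumption and may need widening for your data.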
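Answer 1: (score: 0)
How practical my solution is depends on the size of the two DataFrames, because it compares every email in one with every email in the other.
Code: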
import re

import pandas as pd

# Email validation pattern (applied to the lowercased address)
pattern = r'^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$'


def distance(a, b):
    """Calculate the Levenshtein distance between a and b."""
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n, m)) space
        a, b = b, a
        n, m = m, n
    current_row = list(range(n + 1))  # Keep current and previous row, not entire matrix
    for i in range(1, m + 1):
        previous_row, current_row = current_row, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete, change = previous_row[j] + 1, current_row[j - 1] + 1, previous_row[j - 1]
            if a[j - 1] != b[i - 1]:
                change += 1
            current_row[j] = min(add, delete, change)
    return current_row[n]


def prepare_df(df):
    df["Email_lower"] = df["Email"].str.lower()
    df["is_valid"] = df["Email_lower"].apply(lambda x: 0 if re.match(pattern, x) is None else 1)
    # Drop all invalid emails; copy so that adding "key" does not trigger SettingWithCopyWarning
    df = df[df["is_valid"] == 1].copy()
    df["key"] = 0  # constant key: merging on it pairs every row with every row
    return df


df1 = pd.DataFrame({'URL': ['/', '/', '/instr-analytics'],
                    'Email': ['apple@gmail.com', 'bananA@gmail.com', 'peaR@gmail.com']})
df2 = pd.DataFrame({'URL': ['/s', '/d', '/qinstr-analytics'],
                    'Email': ['Apple@gmail.com', 'banana@gmail.com', 'peaR@gmail.com']})

prepared_df1 = prepare_df(df1)
prepared_df2 = prepare_df(df2)
# Cross join: every row of df1 paired with every row of df2
cross_merge = prepared_df1.merge(prepared_df2, on="key", how="outer")
cross_merge["dist"] = cross_merge.apply(lambda row: distance(row["Email_lower_x"], row["Email_lower_y"]), axis=1)
# dist < 1 keeps exact matches only; raise the threshold to tolerate typos
cross_merge[cross_merge["dist"] < 1]
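This does not scale to large DataFrames, because the cross join produces len(df1) * len(df2) rows, but you can optimize the solution.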
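One possible optimization, sketched here under assumptions that are not part of the answer: pair rows only when their domains agree (a simple blocking key), and score each pair with the standard library's `difflib.SequenceMatcher` rather than a hand-written Levenshtein function. The helper `with_keys`, the `domain` column, and the `THRESHOLD` value are all hypothetical.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({'URL': ['/', '/', '/instr-analytics'],
                    'Email': ['apple@gmail.com', 'bananA@gmail.com', 'peaR@gmail.com']})
df2 = pd.DataFrame({'URL': ['/s', '/d', '/qinstr-analytics'],
                    'Email': ['Apple@gmail.com', 'banana@gmail.com', 'peaR@gmail.com']})


def with_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add the lowercased email and its domain as new columns."""
    out = df.copy()
    out["Email_lower"] = out["Email"].str.lower()
    out["domain"] = out["Email_lower"].str.split("@").str[-1]
    return out


# Pair rows only when the domains agree, instead of pairing every row with every row.
pairs = with_keys(df1).merge(with_keys(df2), on="domain", suffixes=("_x", "_y"))

# Similarity in [0, 1]; 1.0 means the lowercased addresses are identical.
pairs["similarity"] = [
    difflib.SequenceMatcher(None, a, b).ratio()
    for a, b in zip(pairs["Email_lower_x"], pairs["Email_lower_y"])
]

THRESHOLD = 0.9  # assumed cut-off; tune it on real data
matches = pairs[pairs["similarity"] >= THRESHOLD]
print(matches[["Email_x", "Email_y", "similarity"]])
```

Blocking on the domain only helps when addresses are spread across many domains, and it will miss pairs whose typo is in the domain itself. `SequenceMatcher.ratio()` is a similarity score, not an edit distance, so its threshold is not interchangeable with the `dist` cut-off above.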
Answer 2: (score: 0)
Something like this:
df1["Email"] = df1["Email"].str.lower()
df2["Email"] = df2["Email"].str.lower()
df1.merge(df2, on="Email")
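To see which addresses still fail to match after lowercasing, an outer merge with `indicator=True` can help. The sketch below reuses the question's data and adds one made-up row (`kiwi@gmail.com`) purely so that there is something unmatched to show; the variable names are illustrative.

```python
import pandas as pd

# The question's sample data plus one made-up row ('kiwi@gmail.com') with no partner in df2.
df1 = pd.DataFrame({'URL': ['/', '/', '/instr-analytics', '/x'],
                    'Email': ['apple@gmail.com', 'bananA@gmail.com', 'peaR@gmail.com', 'kiwi@gmail.com']})
df2 = pd.DataFrame({'URL': ['/s', '/d', '/qinstr-analytics'],
                    'Email': ['Apple@gmail.com', 'banana@gmail.com', 'peaR@gmail.com']})

# Lowercase into new frames so df1 and df2 themselves are not modified.
left = df1.assign(Email=df1["Email"].str.lower())
right = df2.assign(Email=df2["Email"].str.lower())

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
audit = left.merge(right, on="Email", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["Email", "_merge"]])
```

Rows marked `left_only` or `right_only` are the candidates for manual review or a fuzzy-matching pass.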