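I currently have two DataFrames, one for donors and one for fundraisers. Ideally, I want to find out whether any fundraisers have also donated and, if so, copy some information into my fundraisers dataset (donor name, email, and the date of their first donation). My data has a few problems:
1) I need to match on name and email, but a user may have a slightly different name in each dataset (e.g. Kat and Kathy).
2) There are duplicate names in both donors and fundraisers.
2a) For donors, I can reduce the data to unique name/email combinations, since I only care about the first donation date.
2b) For fundraisers, however, I need to keep both rows so that I don't lose data such as the date.
Sample code I have so far: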
import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib
donors = pd.DataFrame({
    "name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]),
    "Email": pd.Series(['a@a.ca','a@a.ca','b@b.ca','c@c.ca','something@a.ca','d@d.ca']),
    "Date": (["27/03/2013 10:00:00 AM","1/03/2013 10:39:00 AM","2/03/2013 10:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:39:00 AM","27/03/2013 10:39:00 AM"])})
fundraisers = pd.DataFrame({
    "name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),
    "Email": pd.Series(['a@a.ca','a@a.ca','d@d.ca','asdf@asdf.ca','something@a.ca']),
    "Date": pd.Series(["2/03/2013 10:39:00 AM","27/03/2013 11:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:40:00 AM","27/03/2013 10:39:00 AM"])})
# Parse the day-first date strings.
donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
# NOTE: this line reads donors["Date"], not fundraisers["Date"] (probably unintended); the outputs shown in this thread reflect it as written.
fundraisers["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
# Build a name+email key and keep only the earliest donation for each key.
donors["code"] = donors.apply(lambda row: str(row['name'])+' '+str(row['Email']), axis=1)
idx = donors.groupby('code')["Date"].transform("min") == donors['Date']
donors = donors[idx].reset_index(drop=True)
So this gives me the first donation for each donor (assuming that anyone with exactly the same name and email is the same person).
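The "first donation per name/email pair" step can also be expressed with `sort_values` and `drop_duplicates`. The following is only a minimal sketch on a cut-down copy of the sample data (ISO-formatted dates are used here for simplicity, and `first_donations` is a name introduced for illustration):

```python
import pandas as pd

# Cut-down sample: John Doe donated twice, Tom Smith once.
donors = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Tom Smith"],
    "Email": ["a@a.ca", "a@a.ca", "b@b.ca"],
    "Date": pd.to_datetime([
        "2013-03-27 10:00:00",
        "2013-03-01 10:39:00",
        "2013-03-02 10:39:00",
    ]),
})

# Sort oldest first, then keep the first row seen for each (name, Email) pair.
first_donations = (
    donors.sort_values("Date")
          .drop_duplicates(subset=["name", "Email"], keep="first")
          .reset_index(drop=True)
)
print(first_donations)
```

This avoids building a separate concatenated key column, but it makes the same assumption as before: identical name and email means the same person.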
Ideally, I would like my fundraisers dataset to look like this:
Date                 Email           name        Donor Name  Donor Email     Donor Date
2013-03-27 10:00:00  a@a.ca          John Doe    John Doe    a@a.ca          2013-03-27 10:00:00
2013-01-03 10:39:00  a@a.ca          John Doe    John Doe    a@a.ca          2013-03-27 10:00:00
2013-02-03 10:39:00  d@d.ca          Kathy test  Kat test    d@d.ca          2013-03-27 10:39:00
2013-03-03 10:39:00  asdf@asdf.ca    Tes Ester
2013-04-03 10:39:00  something@a.ca  Jane Doe    Jane Doe    something@a.ca  2013-04-03 10:39:00
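I tried following this thread: is it possible to do fuzzy match merge with python pandas? but I keep getting index-out-of-range errors (my guess is that it doesn't like the duplicate names in fundraisers) :( So, any ideas on how to match/merge these datasets?
Doing it with for loops (it works but is extremely slow, and I feel there must be a better way):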
fundraisers["donor name"] = ""
fundraisers["donor email"] = ""
fundraisers["donor date"] = ""
for donindex in range(len(donors.index)):
    # Best score seen so far for this donor; a name ratio must exceed 75 to count.
    best = 75
    for funindex in range(len(fundraisers.index)):
        aname = donors["name"][donindex]
        comp = fundraisers["name"][funindex]
        ratio = fuzz.ratio(aname, comp)
        if ratio > best:
            # A matching email doubles the score.
            if (donors["Email"][donindex] == fundraisers["Email"][funindex]):
                ratio *= 2
            best = ratio
            # .loc avoids chained-assignment problems when writing back.
            fundraisers.loc[funindex, "donor name"] = aname
            fundraisers.loc[funindex, "donor email"] = donors["Email"][donindex]
            fundraisers.loc[funindex, "donor date"] = donors["Date"][donindex]
Answer 0 (score: 4)
Here is some more pythonic (in my opinion), working (on your example) code, without explicit loops:
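def get_donors(row):
    # Score every donor against this fundraiser: doubled name ratio if the emails match, otherwise 1.
    d = donors.apply(lambda x: fuzz.ratio(x['name'], row['name']) * 2 if row['Email'] == x['Email'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*3
    else:
        # .loc replaces the .ix indexer, which has been removed from pandas.
        v = donors.loc[d.idxmax(), ['name','Email','Date']].values
    return pd.Series(v, index=['donor name', 'donor email', 'donor date'])

# Append the three donor columns to the fundraisers frame.
pd.concat((fundraisers, fundraisers.apply(get_donors, axis=1)), axis=1)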
Output:
   Date                 Email           name        donor name  donor email     donor date
0  2013-03-27 10:00:00  a@a.ca          John Doe    John Doe    a@a.ca          2013-03-01 10:39:00
1  2013-03-01 10:39:00  a@a.ca          John Doe    John Doe    a@a.ca          2013-03-01 10:39:00
2  2013-03-02 10:39:00  d@d.ca          Kathy test  Kat test    d@d.ca          2013-03-27 10:39:00
3  2013-03-03 10:39:00  asdf@asdf.ca    Tes Ester
4  2013-03-04 10:39:00  something@a.ca  Jane Doe    Jane Doe    something@a.ca  2013-03-04 10:39:00
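Note that the scoring rule in `get_donors` gives a score of 1 whenever the emails differ, so only donors with an identical Email can pass the threshold. Under that reading, one possible way to avoid comparing every fundraiser with every donor is to join on Email first and fuzzy-check the name afterwards. The sketch below is an assumption-laden illustration, not the answer's method: it uses the standard library's `difflib.SequenceMatcher` in place of `fuzz.ratio`, a hypothetical cutoff of 0.6, and it assumes each Email appears at most once in the de-duplicated donors frame.

```python
import difflib
import pandas as pd

# De-duplicated donors (one row per email in this sample) and raw fundraisers.
donors = pd.DataFrame({
    "name": ["John Doe", "Tom Smith", "Jane Doe", "Jane Doe", "Kat test"],
    "Email": ["a@a.ca", "b@b.ca", "c@c.ca", "something@a.ca", "d@d.ca"],
    "Date": pd.to_datetime([
        "2013-03-01 10:39", "2013-03-02 10:39", "2013-03-03 10:39",
        "2013-03-04 10:39", "2013-03-27 10:39",
    ]),
})
fundraisers = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Kathy test", "Tes Ester", "Jane Doe"],
    "Email": ["a@a.ca", "a@a.ca", "d@d.ca", "asdf@asdf.ca", "something@a.ca"],
})

# Rename donor columns so they do not collide with the fundraiser columns.
donor_cols = donors.rename(columns={"name": "donor name", "Date": "donor date"})
donor_cols["donor email"] = donor_cols["Email"]

# A left join keeps every fundraiser row, including duplicates;
# validate raises if an Email occurs more than once among donors.
merged = fundraisers.merge(donor_cols, on="Email", how="left",
                           validate="many_to_one")

def name_score(row):
    """Similarity in [0, 1] between fundraiser and donor name (0 if no donor)."""
    if pd.isna(row["donor name"]):
        return 0.0
    return difflib.SequenceMatcher(
        None, row["name"].lower(), row["donor name"].lower()
    ).ratio()

merged["name score"] = merged.apply(name_score, axis=1)

# Blank out donor details where the names are too different (assumed cutoff).
weak = merged["name score"] < 0.6
for col in ["donor name", "donor email", "donor date"]:
    merged[col] = merged[col].where(~weak)

print(merged)
```

The trade-off is that a donor who used a different email address as a fundraiser will not be found at all, which is also true of the scoring rule above.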
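Answer 1 (score: 1)
I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
This is how I would do it with Jaro-Winkler from the jellyfish package: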
import pandas
import jellyfish

def get_closest_match(x, list_strings):
    # Return the string in list_strings with the highest Jaro-Winkler score against x.
    best_match = None
    highest_jw = 0
    for current_string in list_strings:
        # Recent jellyfish releases expose this function as jaro_winkler_similarity.
        current_score = jellyfish.jaro_winkler(x, current_string)
        if(current_score > highest_jw):
            highest_jw = current_score
            best_match = current_string
    return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
# Replace each df2 index label with its closest label in df1, then join on the index.
df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))
df1.join(df2)
Output:
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
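One thing to keep in mind is that `get_closest_match` always returns some candidate, even when nothing is really similar. A possible variation is to apply a cutoff and return `None` below it. The sketch below uses the standard library's `difflib.get_close_matches` (a different similarity measure from Jaro-Winkler) so that it runs without extra packages; the helper name `closest_or_none` and the 0.6 cutoff are assumptions for illustration.

```python
import difflib

def closest_or_none(x, candidates, cutoff=0.6):
    """Return the most similar candidate, or None if none reaches the cutoff."""
    matches = difflib.get_close_matches(x, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

left_keys = ["one", "two", "three", "four", "five"]
right_keys = ["one", "too", "three", "fours", "five", "xyz"]

# Map each right-hand key to its closest left-hand key (or None).
mapping = {key: closest_or_none(key, left_keys) for key in right_keys}
print(mapping)
```

Keys that map to `None` can then be left unmatched instead of being silently joined to an unrelated row.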
Update: use jaro_winkler from the Levenshtein module for better performance.
from jellyfish import jaro_winkler as jf_jw
from Levenshtein import jaro_winkler as lv_jw
%timeit jf_jw("appel", "apple")
>> 339 ns ± 1.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit lv_jw("appel", "apple")
>> 193 ns ± 0.675 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Answer 2 (score: 1)
How to identify fuzzy duplicates in a DataFrame using pandas:
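from fuzzywuzzy import fuzz

def get_ratio(row):
    # Compare the row's name against a fixed reference string, ignoring word order.
    name = row['Name_1']
    return fuzz.token_sort_ratio(name,"Ceylon Hotels Corporation")

# df is assumed to be an existing DataFrame with a 'Name_1' column; keep rows scoring above 70.
df[df.apply(get_ratio, axis=1) > 70]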
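To illustrate the idea behind a token-sort comparison (split into words, sort them, then compare the re-joined strings), here is a rough standard-library approximation. It is not fuzzywuzzy's implementation: it uses `difflib.SequenceMatcher`, skips the punctuation stripping that fuzzywuzzy performs, and the helper names and sample company names are made up for the example.

```python
import difflib

def token_sort_key(text):
    """Lowercase the text, split it on whitespace, and re-join the sorted words."""
    return " ".join(sorted(text.lower().split()))

def token_sort_score(a, b):
    """Similarity from 0 to 100 that ignores word order."""
    ratio = difflib.SequenceMatcher(
        None, token_sort_key(a), token_sort_key(b)
    ).ratio()
    return round(100 * ratio)

names = [
    "Ceylon Hotels Corporation",
    "Hotels Corporation Ceylon",
    "Ceylon Hotel Corp",
    "Lanka Tea Exports",
]
target = "Ceylon Hotels Corporation"

# Keep the names that score above 70 against the reference string.
similar = [n for n in names if token_sort_score(n, target) > 70]
print(similar)
```

Because the words are sorted before comparison, "Hotels Corporation Ceylon" is treated as identical to the reference string, while an unrelated name falls below the threshold.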