Join two dataframes based on string occurrence / similarity

Time: 2019-09-08 16:53:07

Tags: python pandas dataframe

I have 2 dataframes (say df1 and df2).

df1 has (name, surname, department)

df2 has (id, filename)

What I want is to merge them (say into df3):

df3 -> (id, filename, name, surname, department)

The link between them is that each filename ends with the worker's name.

Example:

Filename : /company/workers/john 

Name : john (there are no duplicate name values in df1 or df2)

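Normally a merge uses a common column, but here there is none. How can I use this kind of match/similarity to combine the two dataframes? And if I can't use this similarity, how should I merge them?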

3 answers:

Answer 0: (score: 1)

You can simply split the filename column of df2 on / and take the last component:

df2['name'] = df2['filename'].str.split('/').str[-1]

Then use the name column in df2 as the key for the merge :)
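A minimal end-to-end sketch of this approach, assuming small made-up sample frames that follow the column layout from the question (the values are hypothetical, not the asker's data) and that names in df1 are stored in the same lower case as in the paths:

```python
import pandas as pd

# Hypothetical sample data using the column layout described in the question.
df1 = pd.DataFrame({
    "name": ["john", "steven"],
    "surname": ["smith", "lee"],
    "department": ["dep1", "dep2"],
})
df2 = pd.DataFrame({
    "id": [240, 250],
    "filename": ["/company/workers/steven", "/company/workers/john"],
})

# Derive the merge key: the last path component of each filename.
df2["name"] = df2["filename"].str.split("/").str[-1]

# Inner merge on the derived key, then order the columns like df3 in the question.
df3 = df2.merge(df1, on="name")[["id", "filename", "name", "surname", "department"]]
print(df3)
```

Because the merge is an exact match on the derived key, any difference in case or surrounding whitespace between df1's names and the path component would leave rows unmatched.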

Answer 1: (score: 1)

Try this:

pd.merge(df1, df2.apply(lambda x: pd.Series({"name": x.filename.split("/")[-1], "file_id": x.id, "filename": x.filename}), axis=1), on="name", how="left")
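The one-liner above builds the key row by row with apply. As a hedged alternative, the same left merge can be sketched with vectorized string methods, plus merge's indicator flag to show which workers found no file. The sample data, including the worker "anna" who has no file, is hypothetical.

```python
import pandas as pd

# Hypothetical sample data; "anna" deliberately has no matching file.
df1 = pd.DataFrame({
    "name": ["john", "steven", "anna"],
    "surname": ["smith", "lee", "kim"],
    "department": ["dep1", "dep2", "dep3"],
})
df2 = pd.DataFrame({
    "id": [240, 250],
    "filename": ["/company/workers/steven", "/company/workers/john"],
})

# Same columns as the apply-based version: name, file_id, filename.
keyed = df2.rename(columns={"id": "file_id"}).assign(
    name=lambda d: d["filename"].str.split("/").str[-1]
)

# how="left" keeps every row of df1; indicator=True adds a "_merge" column.
merged = pd.merge(df1, keyed, on="name", how="left", indicator=True)

# Rows of df1 for which no filename matched.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(unmatched["name"].tolist())
```

With a left merge, unmatched workers keep their df1 columns and get missing values for file_id and filename, so file_id is no longer an integer column when any row fails to match.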

Answer 2: (score: 0)

Use str.rsplit(r"/",n=1,expand=True)[1].str.title(), where
rsplit: split starting from the right
n=1: split at most once
r"/": raw string, so no escape sequences are interpreted
expand=True: return the pieces as separate columns
title: steven --> Steven
Then merge the two frames on "name".


In [25]: df1=pd.DataFrame( {"name":["John","Steven"], "surname":["Smith","Lee"], "departmen":["dep1","dep2"]})                

In [26]: df2=pd.DataFrame({"id":[240,250], "filename":["/company/workers/steven", "/company/workers/john"]})                  

In [27]: df1                                                                                                                  
Out[27]: 
     name surname departmen
0    John   Smith      dep1
1  Steven     Lee      dep2

In [28]: df2                                                                                                                  
Out[28]: 
    id                 filename
0  240  /company/workers/steven
1  250    /company/workers/john

In [29]: df2["name"]= df2.filename.str.rsplit(r"/",n=1,expand=True)[1].str.title()                                            

In [30]: df2                                                                                                                  
Out[30]: 
    id                 filename    name
0  240  /company/workers/steven  Steven
1  250    /company/workers/john    John

In [31]: pd.merge(df2,df1, on="name")                                                                                         
Out[31]: 
    id                 filename    name surname departmen
0  240  /company/workers/steven  Steven     Lee      dep2
1  250    /company/workers/john    John   Smith      dep1
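The session above relies on str.title() producing exactly the spelling used in df1. A hedged variant, assuming the names may differ in case or carry stray whitespace, is to lower-case both sides into a temporary helper column (called "key" here, a hypothetical name) and merge on that instead. The sample values are made up.

```python
import pandas as pd

# Hypothetical sample data with inconsistent casing and whitespace in df1.
df1 = pd.DataFrame({
    "name": ["John ", "STEVEN"],
    "surname": ["Smith", "Lee"],
    "department": ["dep1", "dep2"],
})
df2 = pd.DataFrame({
    "id": [240, 250],
    "filename": ["/company/workers/steven", "/company/workers/john"],
})

# Normalise both sides to a lower-case key; "key" is a helper column only.
df1["key"] = df1["name"].str.strip().str.lower()
df2["key"] = df2["filename"].str.rsplit("/", n=1).str[-1].str.lower()

# Merge on the helper key and drop it from the result.
df3 = df2.merge(df1, on="key").drop(columns="key")
print(df3)
```

This keeps the original spelling of name from df1 in the result and avoids depending on how title() capitalises each name.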