Question

I have the following two dataframes:

First df

#df1 -----

    location            Ethnic Origins       Percent(1)
0   Beaches-East York   English              18.9 
1   Davenport           Portuguese           22.7
2   Eglinton-Lawrence   Polish               12.0

Second df

#df2 -----

    location                                            lat        lng
0   Beaches—East York, Old Toronto, Toronto, Golde...   43.681470   -79.306021
1   Davenport, Old Toronto, Toronto, Golden Horses...   43.671561   -79.448293
2   Eglinton—Lawrence, North York, Toronto, Golden...   43.719265   -79.429765

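Expected output:

I want to use the location column from #df1 because it is cleaner, and keep all the other columns. I don't need the city or country information in the location column.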

    location            Ethnic Origins   Percent(1)  lat         lng
0   Beaches-East York   English          18.9        43.681470   -79.306021
1   Davenport           Portuguese       22.7        43.671561   -79.448293
2   Eglinton-Lawrence   Polish           12.0        43.719265   -79.429765

I have tried several ways to merge them, but to no avail.

This returns NaN for all lat and lng values:

df3 = pd.merge(df1, df2, on="location", how="left")

This returns NaN for all Ethnic Origins and Percent(1) values:

df3 = pd.merge(df1, df2, on="location", how="right")

Answer 1

We can use findall to create the merge key:

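df2['location'] = df2.location.str.replace('—', '-', regex=False)  # df2 uses an em dash, df1 a hyphen
df2['location'] = df2.location.str.findall('|'.join(df1.location)).str[0]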
df3 = pd.merge(df1, df2, on="location", how="left")
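For anyone who wants to try this end to end, here is a minimal sketch of the findall approach. The frames are rebuilt by hand from the question, and the df2 strings are shortened stand-ins because the originals are truncated in the post. Escaping the keys with re.escape is my own precaution, not something the question requires.

```python
import re
import pandas as pd

# Hand-built stand-ins for the question's frames; the df2 strings are
# shortened assumptions, since the originals are truncated in the post.
df1 = pd.DataFrame({
    "location": ["Beaches-East York", "Davenport", "Eglinton-Lawrence"],
    "Ethnic Origins": ["English", "Portuguese", "Polish"],
    "Percent(1)": [18.9, 22.7, 12.0],
})
df2 = pd.DataFrame({
    "location": [
        "Beaches\u2014East York, Old Toronto, Toronto",  # \u2014 is the em dash
        "Davenport, Old Toronto, Toronto",
        "Eglinton\u2014Lawrence, North York, Toronto",
    ],
    "lat": [43.681470, 43.671561, 43.719265],
    "lng": [-79.306021, -79.448293, -79.429765],
})

# Escape each key so regex metacharacters in a name are matched literally.
pattern = "|".join(re.escape(name) for name in df1["location"])

# Normalise the em dash, then keep the first df1 key found in each string.
normalised = df2["location"].str.replace("\u2014", "-", regex=False)
df2["location"] = normalised.str.findall(pattern).str[0]

df3 = pd.merge(df1, df2, on="location", how="left")
```

One caveat: if one key is a prefix of another, the alternation picks whichever appears first in the pattern, so sorting the keys longest-first may be safer on real data.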

Answer 2

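I am guessing the problem is that the columns you are merging on are not identical, i.e. the values you want to merge on cannot be found in df2.location. Try changing them first and it should work: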

df1
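To confirm the mismatch before changing anything, a quick check along these lines may help. It is only a sketch: the frames are rebuilt by hand from the question, and the df2 strings are shortened stand-ins for the truncated values shown there.

```python
import pandas as pd

# Hand-built stand-ins for the question's frames; the df2 strings are
# shortened assumptions, since the originals are truncated in the post.
df1 = pd.DataFrame({
    "location": ["Beaches-East York", "Davenport", "Eglinton-Lawrence"],
    "Ethnic Origins": ["English", "Portuguese", "Polish"],
    "Percent(1)": [18.9, 22.7, 12.0],
})
df2 = pd.DataFrame({
    "location": [
        "Beaches\u2014East York, Old Toronto, Toronto",  # \u2014 is the em dash
        "Davenport, Old Toronto, Toronto",
        "Eglinton\u2014Lawrence, North York, Toronto",
    ],
    "lat": [43.681470, 43.671561, 43.719265],
    "lng": [-79.306021, -79.448293, -79.429765],
})

# Keys present in both frames; an empty set means no row can match.
shared_keys = set(df1["location"]) & set(df2["location"])

# indicator=True adds a "_merge" column recording where each row came from.
check = pd.merge(df1, df2, on="location", how="outer", indicator=True)
both_count = int((check["_merge"] == "both").sum())
```

If shared_keys is empty and both_count is zero, the merge itself is fine and the keys are what need fixing.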

Answer 3

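As others have pointed out, the problem is that the location columns do not share any values. One solution is to use a regular expression to remove everything from the first comma to the end of the string: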

df2.location = df2.location.replace(r',.*', '', regex=True)

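With the exact data you provided this still will not work, because the dashes in the two dataframes are different. You can fix that in a similar way (no regular expression needed this time):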

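# .str.replace works on substrings; plain Series.replace only matches whole values
df2.location = df2.location.str.replace('—', '-', regex=False)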

Then merge as you proposed:

df3 = pd.merge(df1, df2, on="location", how="left")
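Putting the steps above together, a self-contained sketch might look like the following. The frames are rebuilt by hand from the question, the df2 strings are shortened stand-ins for the truncated originals, and chaining the two clean-up calls is just my preference.

```python
import pandas as pd

# Hand-built stand-ins for the question's frames; the df2 strings are
# shortened assumptions, since the originals are truncated in the post.
df1 = pd.DataFrame({
    "location": ["Beaches-East York", "Davenport", "Eglinton-Lawrence"],
    "Ethnic Origins": ["English", "Portuguese", "Polish"],
    "Percent(1)": [18.9, 22.7, 12.0],
})
df2 = pd.DataFrame({
    "location": [
        "Beaches\u2014East York, Old Toronto, Toronto",  # \u2014 is the em dash
        "Davenport, Old Toronto, Toronto",
        "Eglinton\u2014Lawrence, North York, Toronto",
    ],
    "lat": [43.681470, 43.671561, 43.719265],
    "lng": [-79.306021, -79.448293, -79.429765],
})

df2["location"] = (
    df2["location"]
    .str.replace(r",.*", "", regex=True)      # drop everything from the first comma
    .str.replace("\u2014", "-", regex=False)  # em dash -> plain hyphen
)

# The keys now match df1, so a left merge keeps df1's rows and adds lat/lng.
df3 = pd.merge(df1, df2, on="location", how="left")
```

The result should have the columns location, Ethnic Origins, Percent(1), lat and lng, with no NaN values for these three rows.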

Merge two dataframes and keep unique columns