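Case-insensitive pandas dataframe.merge

Asked: 2015-04-21 02:39:15

Tags: python csv pandas

I am trying to find the simplest way to do a case-insensitive merge in pandas. Is there a way to do this as part of the merge itself? Do I need to use (?i) or a regular expression with ignorecase? In my snippet below I join on country names, where the same country name may be written with different capitalization in the two files, and I just want to take case out of the equation. Thanks!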

import pandas as pd
import csv
import sys

env_path = sys.argv[1]
map_path = sys.argv[2]


df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")

df_merged = df_address.merge(df_CountryMapping, left_on="Country", right_on="NAME", how="left")

....

5 Answers:

Answer 0 (score: 7)

Lowercase the values in the two columns used for the merge, then merge on the lowercased columns:

df_address['country_lower'] = df_address['Country'].str.lower()
df_CountryMapping['name_lower'] = df_CountryMapping['NAME'].str.lower()
df_merged = df_address.merge(df_CountryMapping, left_on="country_lower", right_on="name_lower", how="left")
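To see the approach end to end, here is a minimal sketch with small in-memory frames standing in for the two CSV files. The sample rows and the CODE column are assumptions made up for illustration; only the column names Country and NAME come from the question. It also drops the helper columns afterwards, which the answer does not do.

```python
import pandas as pd

# Hypothetical stand-ins for address.csv and CountryMapping.csv.
df_address = pd.DataFrame({"Country": ["United States", "CANADA", "Mexico"]})
df_CountryMapping = pd.DataFrame(
    {"NAME": ["UNITED STATES", "Canada"], "CODE": ["US", "CA"]}
)

# Build lowercase helper keys; the original columns keep their casing.
df_address["country_lower"] = df_address["Country"].str.lower()
df_CountryMapping["name_lower"] = df_CountryMapping["NAME"].str.lower()

df_merged = df_address.merge(
    df_CountryMapping, left_on="country_lower", right_on="name_lower", how="left"
)

# Remove the helper keys once the merge is done.
df_merged = df_merged.drop(columns=["country_lower", "name_lower"])
print(df_merged)
```

Rows with no counterpart in the mapping (Mexico here) are kept by how="left" and get missing values in the mapping columns.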

Answer 1 (score: 2)

df_merged = pd.merge(df_address, df_CountryMapping, left_on=df_address["Country"].str.lower(), right_on=df_CountryMapping["NAME"].str.lower(), how="left")
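The one-liner above passes lowercased Series as the join keys instead of column names, so no helper columns are added to the inputs. A small sketch of how that might look in practice follows; the sample data and the CODE column are hypothetical. When keys are passed as arrays, pandas may add the computed key to the result as an extra column (commonly named key_0), so the sketch drops it if it is present.

```python
import pandas as pd

# Hypothetical sample data; only Country and NAME come from the question.
df_address = pd.DataFrame({"Country": ["United States", "CANADA"]})
df_CountryMapping = pd.DataFrame(
    {"NAME": ["UNITED STATES", "Canada"], "CODE": ["US", "CA"]}
)

# Join on lowercased Series rather than on column names.
df_merged = pd.merge(
    df_address,
    df_CountryMapping,
    left_on=df_address["Country"].str.lower(),
    right_on=df_CountryMapping["NAME"].str.lower(),
    how="left",
)

# Drop the computed key column if this pandas version added one.
df_merged = df_merged.drop(columns=["key_0"], errors="ignore")
print(df_merged)
```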

Answer 2 (score: 1)

I suggest lowercasing the column names after reading the files:

df_address.columns=[c.lower() for c in df_address.columns]
df_CountryMapping.columns=[c.lower() for c in df_CountryMapping.columns]

Then lowercase the values:

df_address['country']=df_address['country'].str.lower()
df_CountryMapping['name']=df_CountryMapping['name'].str.lower()

Only then perform the merge:

df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")

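Answer 3 (score: 1)

One solution is to convert the column names of both dataframes to lowercase. Note that this normalizes only the column labels; the values in those columns still need the same treatment to match regardless of case. Something like this: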

df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")

# Lowercase the column labels (this does not change the values)
df_address.rename(columns=lambda x: x.lower(), inplace=True)
df_CountryMapping.rename(columns=lambda x: x.lower(), inplace=True)

# After the rename the key columns are "country" and "name"
df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")

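Answer 4 (score: 0)

Another option is to use ".str.casefold()", which folds case more aggressively and so handles non-ASCII characters and other languages more thoroughly. If you are only working with English alphabetic characters, it should behave the same as ".str.lower()".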

df_address['country_casefolded'] = df_address['Country'].str.casefold()
df_CountryMapping['name_casefolded'] = df_CountryMapping['NAME'].str.casefold()
df_merged = df_address.merge(df_CountryMapping, left_on="country_casefolded", right_on="name_casefolded", how="left")
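A quick way to see the difference between the two methods is the German sharp s, which lower() leaves alone but casefold() expands to "ss". The sample strings below are illustrative and not taken from the question's data.

```python
import pandas as pd

# Two spellings that differ only by case under Unicode case folding.
names = pd.Series(["Straße", "STRASSE"])

lowered = names.str.lower()
folded = names.str.casefold()

print(lowered.tolist())  # ['straße', 'strasse'] -> still two distinct keys
print(folded.tolist())   # ['strasse', 'strasse'] -> one shared key
```

With lower() these two values would not match in a merge, while with casefold() they would.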