Generate a new column with the matched value if a column contains a word

Time: 2019-06-06 10:36:10

Tags: python regex pandas

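I am trying to clean up the df['Country'] variable by creating a new df['Country Clean'] column that takes the country name whenever it is found in the df['Country'] column.

I found that if I repeat this command, I also wipe out the earlier matches, and I end up with a column that only reports the "Russia" matches.

Is there a way to do this?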

import pandas as pd

data = {'Number': ['1', '2', '1', '2', '1', '2'], 'Country': ['Italy 1', 'Italie', 'Ecco', 'Russia is in Euroasia', 'Yugoslavia', 'Russia']}
df = pd.DataFrame(data)
df['Country Clean'] = df['Country'].str.replace(r'(^.*Italy.*$)', 'Italy', regex=True)
# This starts again from df['Country'], so the Italy result above is overwritten
df['Country Clean'] = df['Country'].str.replace(r'(^.*Russia.*$)', 'Russia', regex=True)

Expected output

import numpy as np

data2 = {'Number': ['1', '2', '1', '2', '1', '2'], 'Country': ['Italy', 'Italy', np.nan, 'Russia', np.nan, 'Russia']}
exp = pd.DataFrame(data2)
exp

2 answers:

Answer 0: (score: 1)

Use:

In [15]: countries = ["italy", "russia", "yugoslavia", "italie"]

In [16]: for i in countries:df.loc[lambda x:x.Country.str.lower().str.contains(i), 'Country Clean'] = i.capitalize()

In [17]: df
Out[17]:
  Number                Country Country Clean
0      1                Italy 1         Italy
1      2                 Italie        Italie
2      1                   Ecco           NaN
3      2  Russia is in Euroasia        Russia
4      1             Yugoslavia    Yugoslavia
5      2                 Russia        Russia
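The same result can also be sketched without the explicit loop. The snippet below is an alternative, not the answer's method: it swaps the per-country loop for a single `str.extract` call with a case-insensitive alternation, and the use of `re.escape` is an added precaution.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Number": ["1", "2", "1", "2", "1", "2"],
    "Country": ["Italy 1", "Italie", "Ecco", "Russia is in Euroasia", "Yugoslavia", "Russia"],
})
countries = ["italy", "russia", "yugoslavia", "italie"]

# One capture group holding every country name as an alternative
pattern = "({})".format("|".join(map(re.escape, countries)))

# extract returns the first match per row (NaN when nothing matches);
# capitalize mirrors the i.capitalize() call used in the loop version
df["Country Clean"] = (
    df["Country"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.capitalize()
)
print(df)
```

One difference to keep in mind: when a row mentions several countries, the loop keeps the last matching entry of the list, while `str.extract` keeps the first match found in the string.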

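Answer 1: (score: 1)

I suggest normalizing the country names first, and then setting the values of the 'Country Clean' column based on a list of allowed countries: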

normalize_countries={"Italie": "Italy", "Rusia": "Russia"}    # Spelling corrections
pattern = r"\b(?:{})\b".format("|".join(normalize_countries)) # Regex to find misspellings

countries = ["Italy", "Russia"]                               # Country list
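# regex=True is needed for a callable replacement in recent pandas versions
df['Country Clean'] = df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()], regex=True)
def applyFunc(s):
    # Return the first allowed country contained in s (plain substring test)
    for e in countries:
        if e in s:
            return e
    return 'NaN'  # the string 'NaN', not a real missing value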

df['Country Clean'] = df['Country Clean'].apply(applyFunc)

Output:

>>> df
  Number                Country Country Clean
0      1                Italy 1         Italy
1      2                 Italie         Italy
2      1                   Ecco           NaN
3      2  Russia is in Euroasia        Russia
4      1             Yugoslavia           NaN
5      2                 Russia        Russia

The df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()]) line searches the Country column for all misspelled country names as whole words and replaces them with the correctly spelled variants.
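To see the whole-word behaviour in isolation, here is a minimal sketch on a made-up Series. The sample values and the `re.escape` call are assumptions added for illustration; the dictionary and the pattern shape come from the answer.

```python
import re
import pandas as pd

normalize_countries = {"Italie": "Italy", "Rusia": "Russia"}  # spelling corrections
# \b ... \b restricts matches to whole words; re.escape is an added precaution
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, normalize_countries)))

sample = pd.Series(["Italie", "Italies", "Rusia and Italie"])  # hypothetical values
fixed = sample.str.replace(
    pattern,
    lambda m: normalize_countries[m.group()],  # look up the corrected spelling
    regex=True,
)
print(fixed.tolist())
# ['Italy', 'Italies', 'Russia and Italy']
```

"Italies" is left alone because the trailing "s" means there is no word boundary after "Italie".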

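You can also add a whole-word check when searching for countries: use regular expressions for the entries in the countries list and call re.search inside applyFunc instead of the plain substring test (if e in s).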
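A possible shape for that whole-word variant is sketched below. The helper name `find_country`, the sample values, and the choice to return `np.nan` rather than the string 'NaN' are assumptions made for this example, not part of the original answer.

```python
import re
import numpy as np
import pandas as pd

countries = ["Italy", "Russia"]  # allowed country list
# One compiled whole-word pattern per country (re.escape is an added precaution)
country_patterns = {c: re.compile(r"\b{}\b".format(re.escape(c))) for c in countries}

def find_country(s):
    """Return the first listed country found in s as a whole word, else NaN."""
    for name, rx in country_patterns.items():
        if rx.search(s):
            return name
    return np.nan

sample = pd.Series(["Italy 1", "Italyx", "Russia is in Euroasia", "Ecco"])  # hypothetical values
result = sample.apply(find_country)
print(result.tolist())
```

With the substring test, "Italyx" would be reported as Italy; with the word boundaries it is treated as no match.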