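I have a pandas DataFrame in which one column holds a string that sometimes contains a country name.
I also have an array containing all possible country names.
I want to add a new column to the DataFrame that holds the country name when it appears in the first column, and a null value otherwise.
The DataFrame I expect: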
import numpy as np
import pandas as pd

country = ['Angola', 'Belgium']
df = pd.DataFrame(np.array([['A product for Angola', 'Angola'], ['A product for Belgium', 'Belgium']]), columns=['Product', 'Country'])
Answer 0 (score: 2)
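Use Series.str.extract with a regular expression, joining all the values with | (regex OR). Because some country names contain regex metacharacters such as parentheses, escape each name with re.escape first: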
import re

import numpy as np
import pandas as pd

country = ['Angola', "Korea (Democratic People's Republic of)"]
df = pd.DataFrame(np.array([['A product for Angola', 'Angola'],
                            ["A product for Korea (Democratic People's Republic of)",
                             "Korea (Democratic People's Republic of)"],
                            ['A product for new', None]]), columns=['Product', 'Country'])

# Escape regex metacharacters such as ( and ), then join the names with | (regex OR)
pat = '|'.join(re.escape(x) for x in country)
# The capture group returns the matched name; rows without a match get NaN
df['newCountry'] = df['Product'].str.extract('(' + pat + ')', expand=False)
print(df)
                                             Product  \
0                               A product for Angola
1  A product for Korea (Democratic People's Repub...
2                                  A product for new

                                   Country  \
0                                   Angola
1  Korea (Democratic People's Republic of)
2                                     None

                                newCountry
0                                   Angola
1  Korea (Democratic People's Republic of)
2                                      NaN
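
One caveat with the joined pattern: regex alternation tries the alternatives from left to right, so if one name is a prefix of another (for example Niger and Nigeria), the shorter one can match first. A minimal sketch of one way around this, assuming it is acceptable to sort the names longest first; the helper name extract_country and the sample data are hypothetical and not part of the original answer:

```python
import re

import pandas as pd


def extract_country(text, names):
    """Return the first name from `names` found in each string of `text`, else NaN.

    Hypothetical helper for illustration only.
    """
    # Longest names first, so 'Nigeria' is tried before its prefix 'Niger'.
    ordered = sorted(names, key=len, reverse=True)
    # Escape each name, join with | (regex OR), and wrap in one capture group.
    pattern = '(' + '|'.join(re.escape(name) for name in ordered) + ')'
    return text.str.extract(pattern, expand=False)


# Made-up sample data to exercise the prefix case.
products = pd.Series(['A product for Nigeria',
                      'A product for Niger',
                      'A product for new'])
result = extract_country(products, ['Niger', 'Nigeria'])
print(result.tolist())
```

With the names sorted this way, the first two rows should yield Nigeria and Niger respectively, and the last row a missing value. If case-insensitive matching is needed, Series.str.extract also accepts a flags argument such as re.IGNORECASE.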