I have two DataFrames:
import pandas as pd

data = {'account_name':['prepaid', 'postpaid', 'books', 'stationary','software','printer', 'mouse'], 'category':['admin','admin','admin','admin','it','it','it']}
df1 = pd.DataFrame(data)
df1>
account_name category
0 prepaid admin
1 postpaid admin
2 books admin
3 stationary admin
4 software it
5 printer it
6 mouse it
data2 = {'account_name':['stationary costs', 'prepaid expenses', 'postpaid expenses', 'mouse', 'software expenses']}
df2 = pd.DataFrame(data2)
df2>
account_name
0 stationary costs
1 prepaid expenses
2 postpaid expenses
3 mouse
4 software expenses
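What I want to do is match the df1['account_name'] column against the df2['account_name'] column: if any word in a df2 value matches an account name in df1, return the corresponding category in df2. The match can be partial or exact.
Any idea how to do this?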
Answer 0 (score: 1)
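If the keys may span several words and should be matched as a whole, use Series.str.extract with a pattern built from the keys of the lookup Series d, then map each extracted key to its category with Series.map: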
d = df1.set_index('account_name')['category']
pat = '|'.join(r"\b{}\b".format(x) for x in d.keys())
df2['keyword'] = df2['account_name'].str.extract('('+pat+')', expand=False).map(d)
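A slightly more defensive variant of the same idea is sketched below. It assumes the account names might contain regex metacharacters or overlap with each other (for example a hypothetical 'mouse' and 'mouse pad'), neither of which occurs in the question's data. The names keys, matched and result are illustrative only.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'account_name': ['prepaid', 'postpaid', 'books', 'stationary',
                     'software', 'printer', 'mouse'],
    'category': ['admin', 'admin', 'admin', 'admin', 'it', 'it', 'it'],
})
df2 = pd.DataFrame({
    'account_name': ['stationary costs', 'prepaid expenses',
                     'postpaid expenses', 'mouse', 'software expenses'],
})

# Lookup Series: account_name -> category
d = df1.set_index('account_name')['category']

# Try longer keys first, so that a longer key is preferred over a
# shorter key that starts at the same position (assumption: keys may overlap)
keys = sorted(d.index, key=len, reverse=True)

# re.escape guards against keys containing characters such as '.' or '+'
pat = '|'.join(r"\b{}\b".format(re.escape(k)) for k in keys)

# Extract the first matching key from each row, then look up its category
matched = df2['account_name'].str.extract('(' + pat + ')', expand=False)
result = matched.map(d)
print(result)
```

For the sample data this gives the same categories as the pattern without escaping; the differences only show up with keys that overlap or contain special characters.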
Alternatively, to split each value on whitespace and map every word separately, use:
d = df1.set_index('account_name')['category']
# Return the category of the first matching word, or None if no word matches
f = lambda x: next((d[y] for y in x.split() if y in d), None)
df2['category'] = df2['account_name'].apply(f)
print(df2)
account_name category
0 stationary costs admin
1 prepaid expenses admin
2 postpaid expenses admin
3 mouse it
4 software expenses it
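The word-by-word mapping can also be written without a Python-level lambda. The sketch below is not part of the original answer: it swaps in Series.str.split, Series.explode (pandas 0.25 or later) and a groupby on the original row index. It uses a df1 without 'prepaid' to show that a row with no matching word ends up with a missing value; the names words and cats are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'account_name': ['postpaid', 'books', 'stationary',
                     'software', 'printer', 'mouse'],
    'category': ['admin', 'admin', 'admin', 'it', 'it', 'it'],
})
df2 = pd.DataFrame({
    'account_name': ['stationary costs', 'prepaid expenses',
                     'postpaid expenses', 'mouse', 'software expenses'],
})

# Lookup Series: account_name -> category
d = df1.set_index('account_name')['category']

# One row per word; the original row label is repeated in the index
words = df2['account_name'].str.split().explode()

# Map each word to a category (words that are not keys become missing)
cats = words.map(d)

# Keep the first non-missing category for each original row
df2['category'] = cats.groupby(level=0).first()
print(df2)
```

Rows whose words are all unknown ('prepaid expenses' here) keep a missing value rather than raising an error.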
A test showing that NaN is returned for values with no match:
data = {'account_name':[ 'postpaid', 'books', 'stationary','software','printer', 'mouse'],
'category':['admin','admin','admin','it','it','it']}
df1 = pd.DataFrame(data)
data2 = {'account_name':['stationary costs', 'prepaid expenses',
'postpaid expenses', 'mouse', 'software expenses']}
df2 = pd.DataFrame(data2)
d = df1.set_index('account_name')['category']
pat = '|'.join(r"\b{}\b".format(x) for x in d.keys())
df2['keyword'] = df2['account_name'].str.extract('('+pat+')', expand=False).map(d)
print(df2)
account_name keyword
0 stationary costs admin
1 prepaid expenses NaN
2 postpaid expenses admin
3 mouse it
4 software expenses it