Match words in a pandas DataFrame column and return their values

Time: 2021-07-14 10:19:20

Tags: python pandas dataframe

I have two dataframes:

    import pandas as pd

    data = {'account_name': ['prepaid', 'postpaid', 'books', 'stationary', 'software', 'printer', 'mouse'],
            'category': ['admin', 'admin', 'admin', 'admin', 'it', 'it', 'it']}

    df1 = pd.DataFrame(data)

    print(df1)
      account_name category
    0      prepaid    admin
    1     postpaid    admin
    2        books    admin
    3   stationary    admin
    4     software       it
    5      printer       it
    6        mouse       it

    data2 = {'account_name': ['stationary costs', 'prepaid expenses', 'postpaid expenses', 'mouse', 'software expenses']}

    df2 = pd.DataFrame(data2)

    print(df2)
            account_name
    0   stationary costs
    1   prepaid expenses
    2  postpaid expenses
    3              mouse
    4  software expenses

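What I want to do is match the values in df2['account_name'] against the df1['account_name'] column and, if any word in a row matches, return the corresponding category in df2. The match may be partial or exact.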

Any idea how to do this?

1 answer:

Answer 0: (score: 1)

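If the keys may consist of several words that have to be matched together, use Series.str.extract with a pattern built from the keys of the lookup Series, then map the extracted key to its category with Series.map: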

d = df1.set_index('account_name')['category']

pat = '|'.join(r"\b{}\b".format(x) for x in d.keys())
df2['keyword'] = df2['account_name'].str.extract('('+pat+')', expand=False).map(d)

If instead the values can be split on whitespace and each word looked up separately, use:

d = df1.set_index('account_name')['category']

# Take the category of the first word found in d; fall back to NaN when no word matches
f = lambda x: next((d[y] for y in x.split() if y in d), float('nan'))
df2['category'] = df2['account_name'].apply(f)
print(df2)
        account_name category
0   stationary costs    admin
1   prepaid expenses    admin
2  postpaid expenses    admin
3              mouse       it
4  software expenses       it
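
As a rough sketch of how the word-by-word lookup behaves when a row has no matching word, the version below uses a named function and passes an explicit default to next(). The names lookup and first_matching_category are made up for this illustration, and 'prepaid' is deliberately left out of df1 so that one row has no match.

```python
import pandas as pd

df1 = pd.DataFrame({
    "account_name": ["postpaid", "books", "stationary", "software", "printer", "mouse"],
    "category": ["admin", "admin", "admin", "it", "it", "it"],
})
df2 = pd.DataFrame({
    "account_name": ["stationary costs", "prepaid expenses", "postpaid expenses",
                     "mouse", "software expenses"],
})

# Plain dict for word -> category lookups (hypothetical helper name).
lookup = dict(zip(df1["account_name"], df1["category"]))


def first_matching_category(text, default=float("nan")):
    """Return the category of the first whitespace-separated word found in lookup."""
    # next() with a default avoids StopIteration when no word matches.
    return next((lookup[word] for word in text.split() if word in lookup), default)


df2["category"] = df2["account_name"].apply(first_matching_category)
print(df2)
```

Only exact whole-word hits count here, so 'prepaid expenses' ends up with a missing value instead of raising an error.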

Test that NaN is returned for values with no match:

data = {'account_name':[ 'postpaid', 'books', 'stationary','software','printer', 'mouse'],
              'category':['admin','admin','admin','it','it','it']}
    
df1 = pd.DataFrame(data)


data2 = {'account_name':['stationary costs', 'prepaid expenses',
                         'postpaid expenses', 'mouse', 'software expenses']}
    
df2 =pd.DataFrame(data2)


d = df1.set_index('account_name')['category']

pat = '|'.join(r"\b{}\b".format(x) for x in d.keys())
df2['keyword'] = df2['account_name'].str.extract('('+pat+')', expand=False).map(d)
print (df2)
        account_name keyword
0   stationary costs   admin
1   prepaid expenses     NaN
2  postpaid expenses   admin
3              mouse      it
4  software expenses      it
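
One possible hardening of the pattern, offered as a sketch and not as part of the original answer: escape each key with re.escape so that regex metacharacters in an account name are treated literally, and try longer keys first so a multi-word key is preferred over a shorter key it contains. The sample data is the same as in the test above; the result column is named category here as an illustrative choice.

```python
import re

import pandas as pd

df1 = pd.DataFrame({
    "account_name": ["postpaid", "books", "stationary", "software", "printer", "mouse"],
    "category": ["admin", "admin", "admin", "it", "it", "it"],
})
df2 = pd.DataFrame({
    "account_name": ["stationary costs", "prepaid expenses", "postpaid expenses",
                     "mouse", "software expenses"],
})

# Lookup Series: index is the key word, value is the category.
d = df1.set_index("account_name")["category"]

# Longest keys first, each one escaped, all wrapped in one capture group
# between word boundaries.
keys = sorted(d.index, key=len, reverse=True)
pat = r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b"

# extract() returns the matched key (or NaN); map() turns the key into its category.
df2["category"] = df2["account_name"].str.extract(pat, expand=False).map(d)
print(df2)
```

Note that the word boundaries only behave as expected when keys start and end with word characters, which holds for the account names used here.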