I want to fill the "Category" column of the df1 dataframe with the correct values from the "Category" column of the df2 dataframe.
import pandas as pd
df1 = pd.DataFrame({"Receiver": ["Insurance company", "Shop", "Pizza place", "Library", "Gas station 24/7", "Something else", "Whatever receiver"], "Category": ["","","","","","",""]})
df2 = pd.DataFrame({"Category": ["Insurances", "Groceries", "Groceries", "Fastfood", "Fastfood", "Car"], "Searchterm": ["Insurance", "Shop", "Market", "Pizza", "Burger", "Gas"]})
Output:
df1
            Receiver Category
0  Insurance company
1               Shop
2        Pizza place
3            Library
4   Gas station 24/7
5     Something else
6  Whatever receiver
df2
     Category Searchterm
0  Insurances  Insurance
1   Groceries       Shop
2   Groceries     Market
3    Fastfood      Pizza
4    Fastfood     Burger
5         Car        Gas
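I want to compare df1["Receiver"] with df2["Searchterm"] row by row and, wherever the latter matches the former even partially, assign that row's df2["Category"] to df1["Category"].
For example, "Pizza" in df2["Searchterm"] partially matches "Pizza place" in df1["Receiver"], so I want to assign "Fastfood" (the category for "Pizza" in df2["Category"]) as the category of "Pizza place" in df1["Category"].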
The desired output is:
            Receiver    Category
0  Insurance company  Insurances
1               Shop   Groceries
2        Pizza place    Fastfood
3            Library
4   Gas station 24/7         Car
5     Something else
6  Whatever receiver
So how can I fill df1["Category"] with the correct categories? Thanks.
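Answer 0 (score: 5)
Assuming the number of categories is small relative to the number of receivers, one strategy is to iterate over the categories. Note that with this solution, when a receiver matches more than one search term, only the last match sticks.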
for tup in df2.itertuples(index=False):
    mask = df1['Receiver'].str.contains(tup.Searchterm, regex=False)
    df1.loc[mask, 'Category'] = tup.Category

print(df1)

#      Category           Receiver
# 0  Insurances  Insurance company
# 1   Groceries               Shop
# 2    Fastfood        Pizza place
# 3                        Library
# 4         Car   Gas station 24/7
# 5                 Something else
# 6              Whatever receiver
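The "last match sticks" behaviour can be seen on a receiver that contains two search terms. The following is a minimal sketch, not part of the original answer: the sample data and the helper names fill_last_match and fill_first_match are made up for illustration. The second helper only writes to rows whose category is still empty, so the first match wins instead.

```python
import pandas as pd

# Hypothetical data: "Pizza Shop" contains both "Shop" and "Pizza".
df1 = pd.DataFrame({"Receiver": ["Pizza Shop", "Library"], "Category": ["", ""]})
df2 = pd.DataFrame({"Category": ["Groceries", "Fastfood"],
                    "Searchterm": ["Shop", "Pizza"]})

def fill_last_match(receivers, terms):
    """Loop from the answer above: later search terms overwrite earlier ones."""
    out = receivers.copy()
    for tup in terms.itertuples(index=False):
        mask = out["Receiver"].str.contains(tup.Searchterm, regex=False)
        out.loc[mask, "Category"] = tup.Category
    return out

def fill_first_match(receivers, terms):
    """Variant: only fill rows that have no category yet, so the first match wins."""
    out = receivers.copy()
    for tup in terms.itertuples(index=False):
        mask = out["Receiver"].str.contains(tup.Searchterm, regex=False)
        mask &= out["Category"].eq("")
        out.loc[mask, "Category"] = tup.Category
    return out

last = fill_last_match(df1, df2)
first = fill_first_match(df1, df2)
print(last["Category"].tolist())   # ['Fastfood', '']
print(first["Category"].tolist())  # ['Groceries', '']
```

Which behaviour is wanted depends on how the search terms in df2 are ordered, so this is a design choice rather than a fix.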
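As mentioned, this solution scales better with the number of rows in df1 than with the number of categories in df2. To illustrate, consider the performance below for input dataframes of different sizes.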
def jpp(df1, df2):
    for tup in df2.itertuples(index=False):
        df1.loc[df1['Receiver'].str.contains(tup.Searchterm, regex=False), 'Category'] = tup.Category
    return df1

def user347(df1, df2):
    df1['Category'] = df1['Receiver'].replace((df2['Searchterm'] + r'.*').values,
                                              df2['Category'].values,
                                              regex=True)
    df1.loc[df1['Receiver'].isin(df1['Category']), 'Category'] = ''
    return df1

# Case 1: df1 repeated 10**4 times, df2 unchanged
df1 = pd.concat([df1]*10**4, ignore_index=True)
df2 = pd.concat([df2], ignore_index=True)

%timeit jpp(df1, df2)      # 145 ms per loop
%timeit user347(df1, df2)  # 364 ms per loop

# Case 2: df1 left as is, df2 repeated 100 times
df1 = pd.concat([df1], ignore_index=True)
df2 = pd.concat([df2]*100, ignore_index=True)

%timeit jpp(df1, df2)      # 666 ms per loop
%timeit user347(df1, df2)  # 88 ms per loop
Answer 1 (score: 3)
Another solution, using str.extract:
pat = '('+'|'.join(df2['Searchterm'])+')'
df1["Category"] = df1['Receiver'].str.extract(pat)[0].map(df2.set_index('Searchterm')['Category'].to_dict()).fillna('')
            Receiver    Category
0  Insurance company  Insurances
1               Shop   Groceries
2        Pizza place    Fastfood
3            Library
4   Gas station 24/7         Car
5     Something else
6  Whatever receiver
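One caveat with building the pattern by joining the raw search terms: any term containing regex metacharacters (such as "+", "." or "(") is interpreted as regex syntax rather than literal text. A minimal sketch of a more defensive variant, assuming the terms are meant literally, escapes each one with re.escape first. The sample data here is made up for illustration.

```python
import re

import pandas as pd

# Hypothetical data: "C++" contains regex metacharacters.
df1 = pd.DataFrame({"Receiver": ["Gas station 24/7", "C++ bookshop", "Library"]})
df2 = pd.DataFrame({"Category": ["Car", "Books"], "Searchterm": ["Gas", "C++"]})

# Escape each term so it is matched literally, then join into one alternation.
pat = "(" + "|".join(re.escape(term) for term in df2["Searchterm"]) + ")"

# Map each extracted search term back to its category; unmatched rows become ''.
lookup = dict(zip(df2["Searchterm"], df2["Category"]))
df1["Category"] = (
    df1["Receiver"].str.extract(pat, expand=False).map(lookup).fillna("")
)
print(df1["Category"].tolist())  # ['Car', 'Books', '']
```

With an alternation like this, the match found leftmost in each Receiver string is the one that gets extracted.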
def jpp(df1, df2):
    for tup in df2.itertuples(index=False):
        df1.loc[df1['Receiver'].str.contains(tup.Searchterm, regex=False), 'Category'] = tup.Category
    return df1

def user347(df1, df2):
    df1['Category'] = df1['Receiver'].replace((df2['Searchterm'] + r'.*').values,
                                              df2['Category'].values,
                                              regex=True)
    df1.loc[df1['Receiver'].isin(df1['Category']), 'Category'] = ''
    return df1

def vai(df1, df2):
    pat = '('+'|'.join(df2['Searchterm'])+')'
    df1["Category"] = df1['Receiver'].str.extract(pat)[0].map(df2.set_index('Searchterm')['Category'].to_dict()).fillna('')
    return df1

df1 = pd.concat([df1]*10**4, ignore_index=True)
df2 = pd.concat([df2], ignore_index=True)

%timeit jpp(df1, df2)
%timeit user347(df1, df2)
%timeit vai(df1, df2)
120 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
221 ms ± 4.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
78.2 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df1 = pd.concat([df1], ignore_index=True)
df2 = pd.concat([df2]*100, ignore_index=True)

%timeit jpp(df1, df2)
%timeit user347(df1, df2)
%timeit vai(df1, df2)
11.4 s ± 276 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
20.4 s ± 296 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
98.3 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Answer 2 (score: 3)
You can use Series.replace with regex=True for a vectorised approach:
df1['Category'] = df1['Receiver'].replace(
    (df2['Searchterm'] + r'.*').values,
    df2['Category'].values,
    regex=True
)
df1.loc[df1['Receiver'].isin(df1['Category']), 'Category'] = ''
print(df1)
     Category           Receiver
0  Insurances  Insurance company
1   Groceries               Shop
2    Fastfood        Pizza place
3                        Library
4         Car   Gas station 24/7
5                 Something else
6              Whatever receiver
Note that this assumes each Searchterm string appears at the start of the Receiver string it matches. If that is not the case, adjust the regex accordingly.
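As one possible adjustment (a sketch under assumptions, not the answerer's code), the pattern can be wrapped in .* on both sides so that a search term anywhere in the Receiver string replaces the whole string. The sketch also escapes the terms with re.escape, assuming they are meant literally, and resets unmatched rows with an explicit str.contains mask instead of the isin comparison. The sample data is made up for illustration.

```python
import re

import pandas as pd

# Hypothetical data: the search term is not at the start of the receiver.
df1 = pd.DataFrame({"Receiver": ["Big Pizza place", "Library"]})
df2 = pd.DataFrame({"Category": ["Fastfood"], "Searchterm": ["Pizza"]})

# Treat each term as literal text and let it match anywhere in the string.
escaped = [re.escape(term) for term in df2["Searchterm"]]
patterns = [r".*" + term + r".*" for term in escaped]

# Replace the whole Receiver string with the category of a matching term.
replaced = df1["Receiver"].replace(patterns, df2["Category"].tolist(), regex=True)

# Rows that contain no search term keep an empty category.
matched = df1["Receiver"].str.contains("|".join(escaped), regex=True)
df1["Category"] = replaced.where(matched, "")
print(df1["Category"].tolist())  # ['Fastfood', '']
```

Because the patterns are applied one after another, a category name that itself contains a later search term could be replaced again, so the order and wording of the terms still matter.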