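I'm trying to figure out how to do a rolling assignment based on regex matching. I have a dataframe of keys (keys_df) and a dataframe of new data (new_df).
For each name in new_df, if the name contains any of the substrings in the keys_df.contains column, assign that key's parent_id and parent_name to the new record. If there is no match, leave them null.
Given these two dataframes: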
import pandas as pd

keys_df = pd.DataFrame([["steve", "2266", "Steve, Inc"],
                        ["edward", "3377", "Ed, Inc"],
                        ["Juan", "4488", "Juan, Inc"],
                        ["Pedro", "5599", "Pedro, Inc"]],
                       columns=["contains", "parent_id", "parent_name"])
new_df = pd.DataFrame([["9845", "steve (bikes) qc", None, None],
                       ["9846", "mark inc", None, None],
                       ["9847", "young steve", None, None],
                       ["9845", "Juan 22", None, None],
                       ["9845", "Zak", None, None]],
                      columns=["id", "name", "parent_name", "parent_id"])
I would like the output to look like this:
id name parent_id parent_name
"9845" "steve (bikes) qc" "2266" "Steve, Inc"
"9846" "mark inc" None None
"9847" "young steve" "2266" "Steve, Inc"
"9845" "Juan 22" "4488" "Juan, Inc"
"9845" "Zak" None None
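There is also an efficiency concern here. The output dataframe will be appended to a SQLite table, so if there is a way to do this in SQLite through pandas, that would be appreciated.
Thanks for your help.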
Answer 0 (score: 2)
Use pandas str.extract together with merge:
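import re
# Build one alternation pattern; re.escape guards keys that contain regex metacharacters
pat = '(' + '|'.join(map(re.escape, keys_df['contains'])) + ')'
# expand=False returns a Series holding the first matching key (NaN if none)
new_df['contains'] = new_df['name'].str.extract(pat, expand=False)
df = new_df.loc[:, ['id', 'name', 'contains']].merge(keys_df, on='contains', how='left')
df = df.drop(columns='contains')
print(df)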
id name parent_id parent_name
0 9845 steve (bikes) qc 2266 Steve, Inc
1 9846 mark inc NaN NaN
2 9847 young steve 2266 Steve, Inc
3 9845 Juan 22 4488 Juan, Inc
4 9845 Zak NaN NaN
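Since the question mentions appending the result to a SQLite table, here is a minimal sketch of that last step using DataFrame.to_sql with a standard-library sqlite3 connection. The table name records and the in-memory database are assumptions for illustration only, and the sample data is a shortened copy of the frames above.

```python
import re
import sqlite3

import pandas as pd

# Shortened sample data, same column layout as in the question.
keys_df = pd.DataFrame(
    [["steve", "2266", "Steve, Inc"], ["Juan", "4488", "Juan, Inc"]],
    columns=["contains", "parent_id", "parent_name"],
)
new_df = pd.DataFrame(
    [["9845", "steve (bikes) qc"], ["9846", "mark inc"]],
    columns=["id", "name"],
)

# Same extract + merge technique as in the answer.
pat = "(" + "|".join(map(re.escape, keys_df["contains"])) + ")"
new_df["contains"] = new_df["name"].str.extract(pat, expand=False)
result = new_df.merge(keys_df, on="contains", how="left").drop(columns="contains")

# ":memory:" and the table name "records" are hypothetical placeholders;
# a real setup would point at the existing database file and table.
conn = sqlite3.connect(":memory:")
result.to_sql("records", conn, if_exists="append", index=False)

# Read the rows back to check what was stored; unmatched rows hold NULL.
stored = pd.read_sql_query("SELECT * FROM records", conn)
conn.close()
```

With if_exists="append", the table is created on the first call if it does not exist yet, and later calls add rows to it, so the frame's column names need to match the existing table's columns.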
Explanation:
print(new_df.name.str.extract(pat))
0
0 steve
1 NaN
2 steve
3 Juan
4 NaN
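If both tables already live in SQLite, the substring match could also be done in SQL rather than in pandas. The following is only a sketch under assumptions: the table names keys_tbl and new_records are hypothetical, and it relies on SQLite's built-in instr function, which does a plain, case-sensitive substring search rather than a regex match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # hypothetical in-memory database
conn.executescript(
    """
    CREATE TABLE keys_tbl (contains TEXT, parent_id TEXT, parent_name TEXT);
    CREATE TABLE new_records (id TEXT, name TEXT);
    """
)
conn.executemany(
    "INSERT INTO keys_tbl VALUES (?, ?, ?)",
    [
        ("steve", "2266", "Steve, Inc"),
        ("edward", "3377", "Ed, Inc"),
        ("Juan", "4488", "Juan, Inc"),
        ("Pedro", "5599", "Pedro, Inc"),
    ],
)
conn.executemany(
    "INSERT INTO new_records VALUES (?, ?)",
    [
        ("9845", "steve (bikes) qc"),
        ("9846", "mark inc"),
        ("9847", "young steve"),
        ("9845", "Juan 22"),
        ("9845", "Zak"),
    ],
)

# LEFT JOIN keeps every new record; instr(...) > 0 is true when the
# key substring occurs anywhere inside the name.
rows = conn.execute(
    """
    SELECT n.id, n.name, k.parent_id, k.parent_name
    FROM new_records AS n
    LEFT JOIN keys_tbl AS k ON instr(n.name, k.contains) > 0
    ORDER BY n.rowid
    """
).fetchall()
conn.close()
```

One difference from the str.extract approach: if a name contains more than one key, this join returns one row per matching key, whereas str.extract keeps only the first match in the name.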