I have a sample pandas DataFrame as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'notes': pd.Series(['meth cook makes meth with purity of over 96%', 'meth cook is also called Heisenberg', 'meth cook has cancer', 'he is known as the best meth cook', 'Meth Dealer added chili powder to his batch', 'Meth Dealer learned to make the best meth', 'everyone goes to this Meth Dealer for best shot', 'girlfriend of the meth dealer died', 'this lawyer is a people pleasing person', 'cinnabon has now hired the lawyer as a baker', 'lawyer had to take off in the end', 'lawyer has a lot of connections who knows other guy']),
    'name': pd.Series([np.nan, 'Walter White', np.nan, np.nan, np.nan, np.nan, 'Jessie Pinkman', np.nan, 'Saul Goodman', np.nan, np.nan, np.nan]),
    'occupation': pd.Series(['meth cook', np.nan, np.nan, np.nan, np.nan, np.nan, 'meth dealer', np.nan, np.nan, 'lawyer', np.nan, np.nan])
})
name            notes                                                occupation
NaN             meth cook makes meth with purity of over 96%         meth cook
Walter White    meth cook is also called Heisenberg                  NaN
NaN             meth cook has cancer                                 NaN
NaN             he is known as the best meth cook                    NaN
NaN             Meth Dealer added chili powder to his batch          NaN
NaN             Meth Dealer learned to make the best meth            NaN
Jessie Pinkman  everyone goes to this Meth Dealer for best shot      meth dealer
NaN             girlfriend of the meth dealer died                   NaN
Saul Goodman    this lawyer is a people pleasing person              NaN
NaN             cinnabon has now hired the lawyer as a baker         lawyer
NaN             lawyer had to take off in the end                    NaN
NaN             lawyer has a lot of connections who knows other guy  NaN
So in total we have three occupations:
pd.unique(df.occupation.dropna())
array(['meth cook', 'meth dealer', 'lawyer'], dtype=object)
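I want to look up the 'occupation' values in the 'notes' column and, wherever a note mentions an occupation that already exists in the 'occupation' column, fill that row's missing occupation with the match. For example, the occupation is missing in the second row. But if we search the 'notes' column for ('meth cook', 'meth dealer', 'lawyer'), we see that the second row's note contains 'meth cook', so the missing occupation should be filled with 'meth cook'.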
I tried:
df.occupation[df.occupation.notnull()].apply(lambda x: df.occupation.str.extract('('+x+')'))
However, it doesn't give me the result I want. I would like to see the following:
name            notes                                                occupation
NaN             meth cook makes meth with purity of over 96%         meth cook
Walter White    meth cook is also called Heisenberg                  meth cook
NaN             meth cook has cancer                                 meth cook
NaN             he is known as the best meth cook                    meth cook
NaN             Meth Dealer added chili powder to his batch          meth dealer
NaN             Meth Dealer learned to make the best meth            meth dealer
Jessie Pinkman  everyone goes to this Meth Dealer for best shot      meth dealer
NaN             girlfriend of the meth dealer died                   meth dealer
Saul Goodman    this lawyer is a people pleasing person              lawyer
NaN             cinnabon has now hired the lawyer as a baker         lawyer
NaN             lawyer had to take off in the end                    lawyer
NaN             lawyer has a lot of connections who knows other guy  lawyer
Can anyone offer any suggestions?
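Answer 0: (score: 1)
You can do this with a for loop: for each known occupation, use str.contains to find the rows whose notes mention it, then use fillna to fill the missing values in occupation for those rows: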
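# Distinct occupations that are already filled in
occ = pd.unique(df.occupation[df.occupation.notnull()])
for pa in occ:
    # Boolean mask of rows whose notes mention this occupation (case-insensitive)
    subset = df.notes.str.contains(pa, case=False)
    # Assign through .loc to avoid chained assignment; only NaN entries are filled
    df.loc[subset, 'occupation'] = df.loc[subset, 'occupation'].fillna(pa)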
In [40]: df
Out[40]:
    name            notes                                                occupation
0   NaN             meth cook makes meth with purity of over 96%         meth cook
1   Walter White    meth cook is also called Heisenberg                  meth cook
2   NaN             meth cook has cancer                                 meth cook
3   NaN             he is known as the best meth cook                    meth cook
4   NaN             Meth Dealer added chili powder to his batch          meth dealer
5   NaN             Meth Dealer learned to make the best meth            meth dealer
6   Jessie Pinkman  everyone goes to this Meth Dealer for best shot      meth dealer
7   NaN             girlfriend of the meth dealer died                   meth dealer
8   Saul Goodman    this lawyer is a people pleasing person              lawyer
9   NaN             cinnabon has now hired the lawyer as a baker         lawyer
10  NaN             lawyer had to take off in the end                    lawyer
11  NaN             lawyer has a lot of connections who knows othe...    lawyer
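
As an alternative to the loop, the same fill can probably be done in one pass with str.extract, which is what the attempt in the question was reaching for. This is only a minimal sketch under a few assumptions, not the method used above: it uses a shortened stand-in for the question's frame, it assumes the occupation labels are stored in lowercase (so each match is lowercased before filling), and when a note mentions more than one occupation it takes the first one found. The names pattern and found are illustrative.

```python
import re

import numpy as np
import pandas as pd

# Shortened stand-in for the question's frame (illustrative subset of rows)
df = pd.DataFrame({
    'notes': [
        'meth cook makes meth with purity of over 96%',
        'meth cook is also called Heisenberg',
        'Meth Dealer added chili powder to his batch',
        'everyone goes to this Meth Dealer for best shot',
        'this lawyer is a people pleasing person',
        'cinnabon has now hired the lawyer as a baker',
    ],
    'occupation': ['meth cook', np.nan, np.nan, 'meth dealer', np.nan, 'lawyer'],
})

# Known occupations, escaped so they are matched as literal text
occ = df['occupation'].dropna().unique()
pattern = '(' + '|'.join(re.escape(o) for o in occ) + ')'

# First occupation mentioned in each note, ignoring case (NaN if none)
found = df['notes'].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Assumption: labels are lowercase, so normalize the match before filling
df['occupation'] = df['occupation'].fillna(found.str.lower())
print(df)
```

Because fillna only touches missing entries, occupations that were already present are left unchanged, just as in the loop version.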