I have a dataframe like the following:
import numpy as np
import pandas as pd

dfx = pd.DataFrame({'Title': [
    'These cats are very cute',
    'dogs and horse is a loyal animal',
    'chicken layes eggs full of proteins',
    'lion is the king of jungle']})
and another dataframe of keywords like this:
kwx = pd.DataFrame({
    'Tag': ['cats', 'cute', 'dogs', 'horse', 'chicken', 'proteins', 'lion', 'jungle', 'eggs'],
    'Area': ['animal', np.nan, 'animal', 'animal', 'bird', 'food', 'animal', 'place', 'food']
})
What I want to do is search each Title in dfx for the Tag values in kwx. If a tag is present, the kwx row for that tag should be merged with the title.
This is what I did: split the title, search the title for each tag, and return the first two matches.
dfx['splittitle'] = dfx['Title'].str.lower().str.split()  # lowercase each title and split it into words
dfx['matchedName'] = dfx['splittitle'].apply(lambda x: [item for item in x if item in kwx['Tag'].tolist()])  # keep the words that are tags
dfx[['term1', 'term2']] = dfx.matchedName.apply(pd.Series).iloc[:, 0:2]  # keep only the first two matches
dfx.drop('splittitle', axis=1, inplace=True)
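The matching step above can be sketched so that it keeps every match rather than the first two. This is only a minimal sketch: the names `tags`, `matched`, `terms` and `result` are hypothetical, and the tag list is copied from kwx into a set so that each word lookup is a hash lookup instead of a list scan.

```python
import pandas as pd

dfx = pd.DataFrame({'Title': [
    'These cats are very cute',
    'dogs and horse is a loyal animal',
    'chicken layes eggs full of proteins',
    'lion is the king of jungle']})

# Assumption: the same tags as kwx['Tag'], held in a set for fast membership tests.
tags = {'cats', 'cute', 'dogs', 'horse', 'chicken', 'proteins', 'lion', 'jungle', 'eggs'}

# Keep every word of the lowercased title that is a known tag, in title order.
matched = dfx['Title'].str.lower().str.split().apply(
    lambda words: [w for w in words if w in tags])

# One column per match position; titles with fewer matches are padded with missing values.
terms = pd.DataFrame(matched.tolist(), index=dfx.index)
terms.columns = [f'term{i + 1}' for i in range(terms.shape[1])]

result = pd.concat([dfx, terms], axis=1)
print(result)
```

The number of term columns follows the title with the most matches, so nothing is truncated.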
Output:
Title                                matchedName                      term1    term2
These cats are very cute             ['cats', 'cute']                 cats     cute
dogs and horse is a loyal animal     ['dogs', 'horse']                dogs     horse
chicken layes eggs full of proteins  ['chicken', 'eggs', 'proteins']  chicken  eggs
lion is the king of jungle           ['lion', 'jungle']               lion     jungle
The next step I performed was to merge the term1 and term2 columns with the kwx dataframe:
# Look up term1 in kwx, then term2; the second merge suffixes the clashing Tag/Area columns with _1 and _2
merged_dfx = dfx.merge(kwx, how='inner', left_on=['term1'], right_on='Tag', suffixes=('_1', '_2'))
merged_dfx = merged_dfx.merge(kwx, how='inner', left_on=['term2'], right_on='Tag', suffixes=('_1', '_2'))
merged_dfx.drop(['Tag_1', 'Tag_2'], axis=1, inplace=True)
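The output I want: instead of being limited to the first two matches, I want all matches, and I want the dataframe to have the following shape.
Output: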
Title                                matchedName                      term1    term2   Area_1  Area_2
These cats are very cute             ['cats', 'cute']                 cats     cute    animal  NaN
dogs and horse is a loyal animal     ['dogs', 'horse']                dogs     horse   animal  animal
chicken layes eggs full of proteins  ['chicken', 'eggs', 'proteins']  chicken  eggs    bird    food
lion is the king of jungle           ['lion', 'jungle']               lion     jungle  animal  place
PS: Because of the space limits here, and to keep the code tidy, I dropped the matchedName column.
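One possible way to reach that wide shape for any number of matches is to go through a long format first: one row per (title, tag) pair, a left merge with kwx, a per-title counter, and a pivot. The sketch below assumes that titles and tags are both unique. The names `long_df`, `wide` and `n` are hypothetical helpers, not part of the original code.

```python
import numpy as np
import pandas as pd

dfx = pd.DataFrame({'Title': [
    'These cats are very cute',
    'dogs and horse is a loyal animal',
    'chicken layes eggs full of proteins',
    'lion is the king of jungle']})
kwx = pd.DataFrame({
    'Tag': ['cats', 'cute', 'dogs', 'horse', 'chicken', 'proteins', 'lion', 'jungle', 'eggs'],
    'Area': ['animal', np.nan, 'animal', 'animal', 'bird', 'food', 'animal', 'place', 'food']})

tags = set(kwx['Tag'])
terms = dfx['Title'].str.lower().str.split().apply(
    lambda words: [w for w in words if w in tags])

# Long format: one row per (title, matched tag), with that tag's Area attached.
long_df = (dfx.assign(term=terms)
              .explode('term')
              .merge(kwx, how='left', left_on='term', right_on='Tag')
              .drop(columns='Tag'))

# Number the matches within each title: 1, 2, 3, ...
long_df['n'] = long_df.groupby('Title').cumcount() + 1

# Wide format: one term/Area column pair per match position.
wide = long_df.pivot(index='Title', columns='n', values=['term', 'Area'])
wide.columns = [f'term{n}' if name == 'term' else f'Area_{n}' for name, n in wide.columns]

# pivot sorts the titles, so restore the original row order.
wide = wide.reindex(dfx['Title'].tolist()).rename_axis('Title').reset_index()
print(wide)
```

The left merge keeps a matched tag even when its Area is missing, as with `cute`, and titles with fewer matches get missing values in the trailing columns.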
Answer 0 (score: 1)
t = kwx.Tag.tolist()  # collect all Tag strings in a list
dfx['term'] = dfx.Title.str.split(' ')  # split each Title into a list of words in a new column, term
dfx['term'] = dfx['term'].map(lambda x: [*{*x} & {*t}])  # set intersection keeps the words that are also tags (order is not preserved)
dfx = dfx.assign(Tag=dfx.term)  # copy term into a Tag column to merge on
newdf = pd.merge(dfx.explode('Tag'), kwx).drop(columns=['Tag'])  # one row per tag, merged with kwx on Tag
newdf.groupby(['Title', newdf['term'].str.join(',')])['Area'].agg(list)  # group by Title and joined terms, collecting Area into a list
Title                                term
These cats are very cute             cute,cats              [nan, animal]
chicken layes eggs full of proteins  proteins,chicken,eggs  [food, bird, food]
dogs and horse is a loyal animal     horse,dogs             [animal, animal]
lion is the king of jungle           jungle,lion            [place, animal]
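The set intersection in this answer does not keep the words in title order, which is why the output shows `cute,cats` rather than `cats,cute`. If order matters, a variant that swaps in a lookup Series for the explode, merge and groupby steps might look like the sketch below. It assumes the tags in kwx are unique, and `area_by_tag`, `terms`, `areas` and `result` are hypothetical names.

```python
import numpy as np
import pandas as pd

dfx = pd.DataFrame({'Title': [
    'These cats are very cute',
    'dogs and horse is a loyal animal',
    'chicken layes eggs full of proteins',
    'lion is the king of jungle']})
kwx = pd.DataFrame({
    'Tag': ['cats', 'cute', 'dogs', 'horse', 'chicken', 'proteins', 'lion', 'jungle', 'eggs'],
    'Area': ['animal', np.nan, 'animal', 'animal', 'bird', 'food', 'animal', 'place', 'food']})

# Lookup Series: index is Tag, values are Area (assumes each tag appears once).
area_by_tag = kwx.set_index('Tag')['Area']
tags = set(area_by_tag.index)

# Matched tags per title, in the order they appear in the title.
terms = dfx['Title'].str.lower().str.split().apply(
    lambda words: [w for w in words if w in tags])

# Areas in the same order as the matched tags; a tag without an Area yields NaN.
areas = terms.apply(lambda ts: area_by_tag.reindex(ts).tolist())

result = dfx.assign(term=terms, Area=areas)
print(result)
```

Each row then holds two parallel lists, so the i-th Area belongs to the i-th term.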