Search for keywords from one dataframe in another, then merge the two (pandas, Python)

Time: 2020-10-19 06:21:27

Tags: python pandas dataframe

I have a dataframe like the following

import numpy as np
import pandas as pd

dfx = pd.DataFrame({'Title':
                    ['These cats are very cute',
                     'dogs and horse is a loyal animal',
                     'chicken layes eggs full of proteins',
                     'lion is the king of jungle']})

and another dataframe of keywords like this

kwx = pd.DataFrame({'Tag': ['cats', 'cute', 'dogs', 'horse', 'chicken', 'proteins', 'lion', 'jungle', 'eggs'],
                    'Area': ['animal', np.nan, 'animal', 'animal', 'bird', 'food', 'animal', 'place', 'food']})

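What I want to do is search for each Tag from kwx in the Title column of dfx. If a tag is present, the kwx row for that tag should be merged with the title.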

This is what I have done so far: split the title, search for each tag among the title's words, and return the first two matches.

dfx['splittitle'] = dfx['Title'].str.lower().str.split()  # split the title into lowercase words
dfx['matchedName'] = dfx['splittitle'].apply(lambda x: [item for item in x if item in kwx['Tag'].tolist()])
dfx[['term1','term2']] = dfx.matchedName.apply(pd.Series).iloc[:,0:2]  # return only the first two matches
dfx.drop('splittitle',axis=1,inplace=True)

Output

Title                                matchedName                      term1    term2
These cats are very cute             ['cats', 'cute']                 cats     cute
dogs and horse is a loyal animal     ['dogs', 'horse']                dogs     horse
chicken layes eggs full of proteins  ['chicken', 'eggs', 'proteins']  chicken  eggs
lion is the king of jungle           ['lion', 'jungle']               lion     jungle

The next step I performed was to merge the term1 and term2 columns with the kwx dataframe


# look up the kwx row for term1, then for term2
merged_dfx = dfx.merge(kwx, how='inner', left_on=['term1'], right_on='Tag', suffixes=('_1','_2'))
merged_dfx = merged_dfx.merge(kwx, how='inner', left_on=['term2'], right_on='Tag', suffixes=('_1','_2'))
# the Tag columns duplicate term1 and term2, so drop them
merged_dfx.drop(['Tag_1','Tag_2'], axis=1, inplace=True)
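One thing to watch with the two inner merges in this step: a title that matches fewer than two tags has NaN in term2, and how='inner' drops that row because no Tag equals NaN. The following is only a minimal sketch comparing how='inner' with how='left'. The title 'dogs bark' and the helper merge_terms are hypothetical and do not appear in the original data.

```python
import numpy as np
import pandas as pd

kwx = pd.DataFrame({'Tag': ['cats', 'cute', 'dogs'],
                    'Area': ['animal', np.nan, 'animal']})

# Hypothetical input: the second title matches only one tag, so term2 is NaN.
dfx = pd.DataFrame({'Title': ['These cats are very cute', 'dogs bark'],
                    'term1': ['cats', 'dogs'],
                    'term2': ['cute', np.nan]})

def merge_terms(how):
    """Look up the Area for term1 and then term2, as in the question."""
    out = dfx.merge(kwx, how=how, left_on='term1', right_on='Tag')
    out = out.merge(kwx, how=how, left_on='term2', right_on='Tag',
                    suffixes=('_1', '_2'))
    return out.drop(columns=['Tag_1', 'Tag_2'])

inner = merge_terms('inner')  # the one-tag title is dropped
left = merge_terms('left')    # the one-tag title is kept, with NaN in Area_2
```

If rows with a single match should survive, how='left' is probably the safer choice for both merges.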

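The output I want: instead of being limited to the first two matches, I want all of the results, and I want the dataframe to have the following shape
Output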

Title                               matchedName                     term1   term2   Area_1  Area_2
These cats are very cute            ['cats', 'cute']                cats    cute    animal  
dogs and horse is a loyal animal    ['dogs', 'horse']               dogs    horse   animal  animal
chicken layes eggs full of proteins ['chicken', 'eggs', 'proteins'] chicken eggs    bird    food
lion is the king of jungle          ['lion', 'jungle']              lion    jungle  animal  place

PS: Because of the space limit here, and to keep the code tidy, I dropped the matchedName column

1 Answer:

Answer 0: (score: 1)

t = kwx.Tag.tolist()  # put all strings in Tag into a list
dfx['term'] = dfx.Title.str.split(' ')  # split each Title into a list of words in a new column, term
dfx['term'] = dfx['term'].map(lambda x: [*{*x} & {*t}])  # use set intersection to keep the words present in both t and term
dfx = dfx.assign(Tag=dfx.term)  # create a column called Tag
newdf = pd.merge(dfx.explode('Tag'), kwx).drop(columns=['Tag'])  # expand dfx to one row per tag so it can be merged with kwx
newdf.groupby(['Title', newdf['term'].str.join(',')])['Area'].agg(list)  # group by Title and term and collect Area into a list
  


Title                                term                 
These cats are very cute             cute,cats                     [nan, animal]
chicken layes eggs full of proteins  proteins,chicken,eggs    [food, bird, food]
dogs and horse is a loyal animal     horse,dogs                 [animal, animal]
lion is the king of jungle           jungle,lion                 [place, animal]
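
The set intersection in this answer does not preserve the order of the words in the title (note 'cute,cats' in the output), and the result is a Series of lists rather than the wide term/Area columns the question asks for. The following is only a sketch of one possible way to get every match in that wide shape, not the answerer's method. It assumes titles are unique and each Tag appears once in kwx; the names tags, long_df, wide, result, and the helper column n are made up for illustration.

```python
import numpy as np
import pandas as pd

dfx = pd.DataFrame({'Title': ['These cats are very cute',
                              'dogs and horse is a loyal animal',
                              'chicken layes eggs full of proteins',
                              'lion is the king of jungle']})
kwx = pd.DataFrame({'Tag': ['cats', 'cute', 'dogs', 'horse', 'chicken',
                            'proteins', 'lion', 'jungle', 'eggs'],
                    'Area': ['animal', np.nan, 'animal', 'animal', 'bird',
                             'food', 'animal', 'place', 'food']})

tags = set(kwx['Tag'])  # set membership avoids rebuilding a list per row

# Keep every matching word, in the order it appears in the title.
dfx['matchedName'] = (dfx['Title'].str.lower().str.split()
                      .apply(lambda words: [w for w in words if w in tags]))

# One row per (title, tag) pair, then look up each tag's Area.
long_df = (dfx.explode('matchedName')
              .rename(columns={'matchedName': 'Tag'})
              .merge(kwx, on='Tag', how='left'))
# Position of each match within its title: 1, 2, 3, ...
long_df['n'] = long_df.groupby('Title').cumcount() + 1

# Pivot to one row per title with term1..termN and Area_1..Area_N columns.
wide = long_df.pivot(index='Title', columns='n', values=['Tag', 'Area'])
wide.columns = [f'term{n}' if col == 'Tag' else f'Area_{n}'
                for col, n in wide.columns]

# Join back on Title so the rows keep the original order of dfx.
result = dfx.join(wide, on='Title')
```

Titles with fewer matches than the longest one get NaN in the unused term and Area columns.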