I have two DataFrames. The first:
   IDs   Keywords
0  1234  APPLE ABCD
1  1234  ORANGE
2  1234  LEMONS
3  5346  ORANGE
4  5346  STRAWBERRY
5  5346  BLUEBERRY
6  8793  TEA COFFEE
The second DataFrame:
   IDs   Name
0  1234  APPLE ABCD ONE
1  5346  APPLE ABCD
2  1234  STRAWBERRY YES
3  8793  ORANGE AVAILABLE
4  8793  TEA AVAILABLE
5  8793  TEA COFFEE
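I want to search for the keywords at the ID level: for each ID, take its keywords from the first DataFrame and search the Name column of the second DataFrame in the rows with the same ID. If the Name contains any of that ID's keywords, the indicator should be True, otherwise False.
For example, for ID 1234 the keywords are APPLE ABCD, ORANGE and LEMONS. So in the second DataFrame, the row at index 0 with APPLE ABCD ONE will be True, because "APPLE ABCD" is one of the keywords.
For ID 5346 the keywords are ORANGE, STRAWBERRY and BLUEBERRY. So in the second DataFrame, the row at index 1 with APPLE ABCD will be False.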
   IDs   Name              Indicator
0  1234  APPLE ABCD ONE    True
1  5346  APPLE ABCD        False
2  1234  STRAWBERRY YES    False
3  8793  ORANGE AVAILABLE  False
4  8793  TEA AVAILABLE     False
5  8793  TEA COFFEE        True
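To make the example reproducible, here is a minimal sketch that builds both frames from the tables above and computes a plain-Python reference result. The variable names `df1`, `df2`, `keywords_by_id` and `reference` are assumptions chosen for illustration, not names given in the question.

```python
import pandas as pd

# Keyword table from the question (the name df1 is an assumption).
df1 = pd.DataFrame({
    "IDs": [1234, 1234, 1234, 5346, 5346, 5346, 8793],
    "Keywords": ["APPLE ABCD", "ORANGE", "LEMONS", "ORANGE",
                 "STRAWBERRY", "BLUEBERRY", "TEA COFFEE"],
})

# Name table from the question (the name df2 is an assumption).
df2 = pd.DataFrame({
    "IDs": [1234, 5346, 1234, 8793, 8793, 8793],
    "Name": ["APPLE ABCD ONE", "APPLE ABCD", "STRAWBERRY YES",
             "ORANGE AVAILABLE", "TEA AVAILABLE", "TEA COFFEE"],
})

# Expected Indicator column, copied from the table in the question.
expected = [True, False, False, False, False, True]

# Reference check: a row is True when any keyword listed for its ID
# occurs as a substring of its Name.
keywords_by_id = df1.groupby("IDs")["Keywords"].apply(list).to_dict()
reference = [
    any(kw in name for kw in keywords_by_id.get(id_, []))
    for id_, name in zip(df2["IDs"], df2["Name"])
]
```

The `reference` list can serve as a baseline when comparing the answers.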
Answer 0 (score: 0)
You need:
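# create a list of (ID, keyword) tuples from the 1st dataframe
kw = list(zip(df1.IDs, df1.Keywords))

def func(ids, name):
    # compares only the first word of the name with the keywords of this ID
    return (ids, name.split(" ")[0]) in kw

# this answer's df2 calls the column 'Names' (the question shows 'Name')
df2['Indicator'] = df2.apply(lambda x: func(x['IDs'], x['Names']), axis=1)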
Edit
Create a list of tuples from the combinations of IDs and keywords:
kw = list(zip(df1.IDs, df1.Keywords))
# [(1234, 'APPLE ABCD'), (1234, 'ORANGE'), (1234, 'LEMONS'), (5346, 'ORANGE'), (5346, 'STRAWBERRY'), (5346, 'BLUEBERRY'), (8793, 'TEA COFFEE')]
unique_kw = list(df1['Keywords'].unique())
# ['APPLE ABCD', 'ORANGE', 'LEMONS', 'STRAWBERRY', 'BLUEBERRY', 'TEA COFFEE']

def samp(x):
    # return the first known keyword contained in the name, or None
    for u in unique_kw:
        if u in x:
            return u
    return None

# This fetches the keyword from each name, which is then used for the comparison
df2['indicator'] = df2['Names'].apply(samp)
df2['indicator'] = df2.apply(lambda x: (x['IDs'], x['indicator']) in kw, axis=1)
Output:
   IDs   Names              indicator
0  1234  APPLE ABCD ONE     True
1  5346  APPLE ABCD         False
2  1234  NO STRAWBERRY YES  False
3  8793  ORANGE AVAILABLE   False
4  8793  TEA AVAILABLE      False
5  8793  TEA COFFEE         True
Answer 1 (score: 0)
You can use pandas operations to do this, which will also be more efficient.
# Let there be two DataFrames: kw_df, name_df
# Group all keywords of each ID in a list, associate it with the names
kw_df = kw_df.groupby('IDs').aggregate({'Keywords': list})
merge_df = name_df.join(kw_df, on='IDs')
# Check if any keyword is in the name
def is_match(name, kws):
    return any(kw in name for kw in kws)
merge_df['Indicator'] = merge_df.apply(lambda row: is_match(row['Name'], row['Keywords']), axis=1)
print(merge_df)
The output is as follows:
   IDs   Name              Keywords                         Indicator
0  1234  APPLE ABCD ONE    [APPLE ABCD, ORANGE, LEMONS]     True
1  5346  APPLE ABCD        [ORANGE, STRAWBERRY, BLUEBERRY]  False
2  1234  STRAWBERRY YES    [APPLE ABCD, ORANGE, LEMONS]     False
3  8793  ORANGE AVAILABLE  [TEA COFFEE]                     False
4  8793  TEA AVAILABLE     [TEA COFFEE]                     False
5  8793  TEA COFFEE        [TEA COFFEE]                     True
Answer 2 (score: 0)
You can use merge and groupby together with a lambda, as follows:
>>> df.merge(df2).groupby(['IDs','Name']).apply(lambda x: any(x['Name'].str.contains('|'.join(x['Keywords'])))).rename('Indicator').reset_index()
   IDs   Name              Indicator
0  1234  APPLE ABCD        True
1  1234  STRAWBERRY YES    False
2  5346  APPLE ABCD        False
3  8793  ORANGE AVAILABLE  False
4  8793  TEA AVAILABLE     True
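One caveat with the `str.contains('|'.join(...))` approach: by default the joined string is interpreted as a regular expression, so a keyword containing characters such as `+` or `(` would change the pattern. The following is only a sketch, assuming the question's data and using hypothetical names (`patterns`, `matches_keyword`, `result`). It escapes each keyword with `re.escape`, keeps the original row order of the second frame, and treats an ID with no keywords as False.

```python
import re
import pandas as pd

# Data from the question (the names df1 and df2 are assumptions).
df1 = pd.DataFrame({
    "IDs": [1234, 1234, 1234, 5346, 5346, 5346, 8793],
    "Keywords": ["APPLE ABCD", "ORANGE", "LEMONS", "ORANGE",
                 "STRAWBERRY", "BLUEBERRY", "TEA COFFEE"],
})
df2 = pd.DataFrame({
    "IDs": [1234, 5346, 1234, 8793, 8793, 8793],
    "Name": ["APPLE ABCD ONE", "APPLE ABCD", "STRAWBERRY YES",
             "ORANGE AVAILABLE", "TEA AVAILABLE", "TEA COFFEE"],
})

# One compiled alternation pattern per ID; every keyword is escaped
# so that it is matched literally rather than as regex syntax.
patterns = {
    id_: re.compile("|".join(re.escape(kw) for kw in kws))
    for id_, kws in df1.groupby("IDs")["Keywords"]
}

def matches_keyword(id_, name):
    """Return True if any keyword registered for id_ occurs in name."""
    pattern = patterns.get(id_)
    if pattern is None:  # ID has no keywords at all
        return False
    return pattern.search(name) is not None

# Work on a copy so the input frame stays unchanged.
result = df2.copy()
result["Indicator"] = [
    matches_keyword(id_, name)
    for id_, name in zip(result["IDs"], result["Name"])
]
print(result)
```

Because the indicator is computed row by row instead of through `groupby`, duplicate (ID, Name) pairs are not collapsed and the rows stay in their original order.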