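I have two dataframes, one named 'foo' and one named 'bar'. 'foo' has some columns that are unique to it, and 'bar' also has some columns of its own. They do, however, share one column, 'google'. I am trying to find out whether there is a way to keep all the columns of 'foo' and add one extra column, 'CLRS', that contains 1 if the value in the 'google' column of that 'foo' row appears somewhere in the 'google' column of 'bar'.
More specifically, suppose my dataframes are structured as follows: 'foo' has the columns 'foo_1', 'foo_2', ..., 'google', and 'bar' has the columns 'bar_1', 'bar_2', ..., 'google'. I want to join/merge 'foo' and 'bar' so that 'foo' gains an extra column 'CLRS', which is 1 if the row's 'google' value appears in the 'google' column of 'bar'. I tried the following code: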
'''
# foo examples
foo['foo1'] = ['dijkstra','TSP',...]
foo['foo2'] = ['Oculus','VR', ...]
.
.
.
foo['google'] = ['search','ads', 'A/B Testing', 'UI' ...]
# bar examples
bar['bar1'] = ['dijkstra','TSP',...]
bar['bar2'] = ['search','ads', ...]
.
.
.
# 'A/B Testing' appears in the column somewhere but 'ads' does
# not
bar['google'] = ['search','google_search', 'TDD', 'UI',
...,'A/B Testing', ...]
# my code
foo_merged = foo.join(bar, how='left')
# my result
foo_merged['foo1'] = ['dijkstra','TSP',...]
foo_merged['foo2'] = ['search','ads', ...]
.
.
.
foo_merged['google'] = ['search','ads', ...]
foo_merged['CLRS'] = ['search','google_search', 'TDD', 'UI',
...]
# What I want as an output for foo_merged is:
foo_merged['foo1'] = ['dijkstra','TSP',...]
foo_merged['foo2'] = ['Oculus','VR', ...]
.
.
.
foo_merged['google'] = ['search','ads', 'A/B Testing', 'UI'
...]
foo_merged['CLRS'] = [1,0,1,1,...]
'''
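Unfortunately, after running the join code above, foo_merged contains all the columns of foo plus one extra column that simply holds the contents of the 'google' column from 'bar'. What I want is a df whose extra column 'CLRS' contains 1 if the 'google' value in that 'foo' row occurs among the values of the 'google' column of 'bar', and 0 otherwise.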
Answer 0 (score: 0)
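I believe you are looking for merge with indicator=True.
The indicator adds a _merge column that marks, for each row, whether it is present in both dataframes ('both') or only in the left one ('left_only').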
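import pandas as pd
import numpy as np  # only needed for the np.where alternative
# '_merge' is 'both' when the row's 'google' value also occurs in bar
df = pd.merge(foo, bar, how='left', on='google', indicator=True)
df['CLRS'] = (df['_merge'] == 'both').astype(int)
# or: df['CLRS'] = np.where(df['_merge'] == 'both', 1, 0)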
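One caveat with merging on the full 'bar' frame: it also pulls in bar's other columns, and if a 'google' value occurs more than once in 'bar', the matching 'foo' row is repeated. The following is a minimal sketch, not the answerer's code, of two ways around that. The sample rows are hypothetical, loosely modelled on the question (the values after the question's `...` are invented for illustration), and the names `bar_keys`, `merged` and `foo_flagged` are assumptions.

```python
import pandas as pd

# Hypothetical sample data modelled on the question's examples.
foo = pd.DataFrame({
    "foo1": ["dijkstra", "TSP", "knapsack", "BFS"],
    "google": ["search", "ads", "A/B Testing", "UI"],
})
bar = pd.DataFrame({
    "bar1": ["dijkstra", "TSP", "DFS", "heap", "trie", "sort"],
    # 'search' appears twice on purpose to show the duplicate-key case.
    "google": ["search", "google_search", "TDD", "UI", "A/B Testing", "search"],
})

# Option 1: merge against only the distinct 'google' keys of bar, so no
# bar columns are pulled in and duplicate keys cannot multiply foo's rows.
bar_keys = bar[["google"]].drop_duplicates()
merged = foo.merge(bar_keys, how="left", on="google", indicator=True)
merged["CLRS"] = (merged["_merge"] == "both").astype(int)
merged = merged.drop(columns="_merge")

# Option 2: a plain membership test, with no merge at all.
foo_flagged = foo.assign(CLRS=foo["google"].isin(bar["google"]).astype(int))

print(merged)
print(foo_flagged)
```

Both options keep exactly the rows and columns of 'foo' and add only 'CLRS'. If all you need is the 0/1 flag, the `isin` version is probably the simpler choice; the indicator merge is more useful when you also want some of bar's columns.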