I have two DataFrames. Each row of DataFrame A is a package of products, and DataFrame B maps product IDs to their seller IDs.
DataFrame A:
package_name | product_1 | product_2 | product_3 | product_4
package a | 12 | 15 | NaN | NaN
package b | 17 | 16 | 14 | NaN
package c | 12 | 11 | 17 | 19
DataFrame B:
product_id | seller_id
12 | seller1
15 | seller1
12 | seller2
15 | seller2
17 | seller3
16 | seller3
14 | seller3
(Each product can have multiple sellers, and each seller can have multiple products.)
I want to find out which sellers offer every product in a package (based on DataFrame A). This is the expected result:
DataFrame C:
package_name | product_1 | product_2 | product_3 | product_4 | seller_id
package a | 12 | 15 | NaN | NaN | seller1
package a | 12 | 15 | NaN | NaN | seller2
package b | 17 | 16 | 14 | NaN | seller3
seller1 and seller2 each have "all" of the products in package a, and seller3 has "all" of the products in package b.
How can I produce DataFrame C?
Answer 0 (score: 2)
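The idea is to use DataFrame.merge with a right join against a helper DataFrame built by checking which package product sets are subsets of each seller's product set. The sample B below also contains an extra seller4, who sells products 12, 15 and 14: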
print (B)
product_id seller_id
0 12 seller1
1 15 seller1
2 12 seller2
3 15 seller2
4 17 seller3
5 16 seller3
6 14 seller3
7 12 seller4
8 15 seller4
9 14 seller4
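# Index by package so that each row holds only product IDs
A = A.set_index('package_name')
# Set of product IDs per package; y == y is False for NaN, so missing slots are dropped
f = lambda x: set([int(y) for y in x if y == y])
a = A.apply(f, axis=1).to_dict()
#print (a)
# Set of product IDs offered by each seller
b = B.groupby('seller_id')['product_id'].apply(set).to_dict()
#print (b)
# Keep (package, seller) pairs where the seller offers every product in the package
c = [(k, k1) for k, v in a.items() for k1,v1 in b.items() if v.issubset(v1)]
#print (c)
C1 = pd.DataFrame(c, columns=['package_name','seller_id'])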
print (C1)
package_name seller_id
0 package a seller1
1 package a seller2
2 package a seller4
3 package b seller3
C = A.merge(C1, on='package_name', how='right')
print (C)
package_name product_1 product_2 product_3 product_4 seller_id
0 package a 12 15 NaN NaN seller1
1 package a 12 15 NaN NaN seller2
2 package a 12 15 NaN NaN seller4
3 package b 17 16 14.0 NaN seller3
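For reference, a self-contained sketch of the same subset-then-merge approach, run on the question's original data (without seller4), might look like the following. The function name `sellers_per_package` and its parameter names are made up for illustration, and the sketch assumes product IDs are integers with NaN marking empty slots.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (DataFrame A and DataFrame B).
A = pd.DataFrame({
    "package_name": ["package a", "package b", "package c"],
    "product_1": [12, 17, 12],
    "product_2": [15, 16, 11],
    "product_3": [np.nan, 14, 17],
    "product_4": [np.nan, np.nan, 19],
})
B = pd.DataFrame({
    "product_id": [12, 15, 12, 15, 17, 16, 14],
    "seller_id": ["seller1", "seller1", "seller2", "seller2",
                  "seller3", "seller3", "seller3"],
})


def sellers_per_package(packages, offers):
    """Hypothetical helper: pair each package with the sellers offering all of its products."""
    indexed = packages.set_index("package_name")
    # package name -> set of product IDs, with NaN slots dropped
    needed = {
        name: {int(v) for v in row.dropna()}
        for name, row in indexed.iterrows()
    }
    # seller ID -> set of product IDs that seller offers
    offered = offers.groupby("seller_id")["product_id"].apply(set).to_dict()
    # keep (package, seller) pairs where the package's products are a subset
    pairs = [
        (package, seller)
        for package, need in needed.items()
        for seller, have in offered.items()
        if need.issubset(have)
    ]
    matches = pd.DataFrame(pairs, columns=["package_name", "seller_id"])
    # right join so that only matched packages appear, once per seller
    return packages.merge(matches, on="package_name", how="right")


C = sellers_per_package(A, B)
print(C)
```

Because package c has no seller covering all four of its products, it does not appear in `C`. Note that a package with no products at all would match every seller, since the empty set is a subset of any set.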