How to use partitionBy over three fields to pair up all the values in a column

Date: 2019-09-06 17:25:18

Tags: python pyspark

I have been trying to work out how to transform this DataFrame

df1 = sqlContext.createDataFrame(
    [
        ('312','Pbge', '06/20/2011', '95951', 359.93),
        ('312','Pbge', '06/20/2011', '95957',60.10),
        ('591','iTW', '11/13/2011', '92341',75.87),
        ('591','iTW', '11/13/2011', 'v2020',23.77),                                     
        ('591','iTW', '11/13/2011', 'v2211',66.02),
        ('195','Y1b', '08/25/2011', '71020',9.03),
        ('195','Y1b', '08/25/2011', '94060',44.60),
        ('195','Y1b', '08/25/2011', '94640',15.53),
        ('195','Y1b', '08/25/2011', '99213',36.63)
    ],
    schema=('bene_id','rend', 'date', 'code','amt')
)

into this one:

df2 = sqlContext.createDataFrame(
    [
        ('312','Pbge', '06/20/2011', '95951', 359.93,'95951', '95957'),
        ('312','Pbge', '06/20/2011', '95957',60.10, '95957','95951'),
        ('591','iTW', '11/13/2011', '92341',75.87,'92341','v2020'),
        ('591','iTW', '11/13/2011', '92341',75.87,'92341','v2211'),
        ('591','iTW', '11/13/2011', 'v2020',23.77,'v2020','92341'),
        ('591','iTW', '11/13/2011', 'v2020',23.77,'v2020','v2211'),
        ('591','iTW', '11/13/2011', 'v2211',66.02,'v2211','92341'),
        ('591','iTW', '11/13/2011', 'v2211',66.02,'v2211','v2020'),
        ('195','Y1b', '08/25/2011', '71020',9.03,'71020','94060'),
        ('195','Y1b', '08/25/2011', '71020',9.03,'71020','94640'),
        ('195','Y1b', '08/25/2011', '71020',9.03,'71020','99213'),
        ('195','Y1b', '08/25/2011', '94060',44.6,'94060','71020'),
        ('195','Y1b', '08/25/2011', '94060',44.6,'94060','94640'),
        ('195','Y1b', '08/25/2011', '94060',44.6,'94060','99213'),
        ('195','Y1b', '08/25/2011', '94640',15.53,'94640','71020'),
        ('195','Y1b', '08/25/2011', '94640',15.53,'94640','94060'),
        ('195','Y1b', '08/25/2011', '94640',15.53,'94640','99213'),
        ('195','Y1b', '08/25/2011', '99213',36.63,'99213','71020'),
        ('195','Y1b', '08/25/2011', '99213',36.63,'99213','94060'),
        ('195','Y1b', '08/25/2011', '99213',36.63,'99213','94640')
    ],
    schema=('bene_id','rend', 'date', 'code','amt','col1', 'col2')
)

Partitioning on bene_id, rend, and date, I want to pair every code in a group with every other code in the same group, placing the pairs in col1 and col2. amt should follow col1, since that is its value in df1. The result is df2. This will be applied to very large data. I need help.
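
For reference, one way to express this pairing in PySpark without a window/partitionBy approach is a self-join on the three grouping columns. This is only a minimal sketch, assuming code values are unique within each (bene_id, rend, date) group:

from pyspark.sql import functions as F

# Second copy of the codes, renamed so the joined result is unambiguous.
codes = df1.select('bene_id', 'rend', 'date', F.col('code').alias('col2'))

df2 = (
    df1.join(codes, on=['bene_id', 'rend', 'date'])  # every code combination per group
       .where(F.col('code') != F.col('col2'))        # drop the self-pairs
       .withColumn('col1', F.col('code'))            # amt already belongs to col1's code
       .select('bene_id', 'rend', 'date', 'code', 'amt', 'col1', 'col2')
)

Because the join keys are exactly the grouping columns, the cross pairing stays inside each group and Spark can distribute the work instead of collecting the data to the driver.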

1 Answer:

Answer 0 (score: 0)

IIUC, you can use itertools.product inside a function that returns a DataFrame of code pairs, then merge back on the common column:

import itertools
import pandas as pd

# df here is the pandas equivalent of df1 (e.g. df = df1.toPandas())
f = lambda x: pd.DataFrame([(a, b) for a, b in itertools.product(x, x) if a != b],
                           columns=['col1', 'col2'])
cols = ['bene_id', 'rend', 'date']
m = df.groupby(cols, sort=False)['code'].apply(f).reset_index([0, 1, 2])
final = m.merge(df[['code', 'amt']], left_on='col1', right_on='code').reindex(
                               df.columns.union(['col1', 'col2'], sort=False), axis=1)

   bene_id  rend        date   code     amt   col1   col2
0      312  Pbge  06/20/2011  95951  359.93  95951  95957
1      312  Pbge  06/20/2011  95957   60.10  95957  95951
2      591   iTW  11/13/2011  92341   75.87  92341  v2020
3      591   iTW  11/13/2011  92341   75.87  92341  v2211
4      591   iTW  11/13/2011  v2020   23.77  v2020  92341
5      591   iTW  11/13/2011  v2020   23.77  v2020  v2211
6      591   iTW  11/13/2011  v2211   66.02  v2211  92341
7      591   iTW  11/13/2011  v2211   66.02  v2211  v2020
8      195   Y1b  08/25/2011  71020    9.03  71020  94060
9      195   Y1b  08/25/2011  71020    9.03  71020  94640
10     195   Y1b  08/25/2011  71020    9.03  71020  99213
11     195   Y1b  08/25/2011  94060   44.60  94060  71020
12     195   Y1b  08/25/2011  94060   44.60  94060  94640
13     195   Y1b  08/25/2011  94060   44.60  94060  99213
14     195   Y1b  08/25/2011  94640   15.53  94640  71020
15     195   Y1b  08/25/2011  94640   15.53  94640  94060
16     195   Y1b  08/25/2011  94640   15.53  94640  99213
17     195   Y1b  08/25/2011  99213   36.63  99213  71020
18     195   Y1b  08/25/2011  99213   36.63  99213  94060
19     195   Y1b  08/25/2011  99213   36.63  99213  94640