I have been trying to work out how to transform this DataFrame:
df1 = sqlContext.createDataFrame(
[
('312','Pbge', '06/20/2011', '95951', 359.93),
('312','Pbge', '06/20/2011', '95957',60.10),
('591','iTW', '11/13/2011', '92341',75.87),
('591','iTW', '11/13/2011', 'v2020',23.77),
('591','iTW', '11/13/2011', 'v2211',66.02),
('195','Y1b', '08/25/2011', '71020',9.03),
('195','Y1b', '08/25/2011', '94060',44.60),
('195','Y1b', '08/25/2011', '94640',15.53),
('195','Y1b', '08/25/2011', '99213',36.63)
],
schema=('bene_id','rend', 'date', 'code','amt')
)
into this one:
df2 = sqlContext.createDataFrame(
[
('312','Pbge', '06/20/2011', '95951', 359.93,'95951', '95957'),
('312','Pbge', '06/20/2011', '95957',60.10, '95957','95951'),
('591','iTW', '11/13/2011', '92341',75.87,'92341','v2020'),
('591','iTW', '11/13/2011', '92341',75.87,'92341','v2211'),
('591','iTW', '11/13/2011', 'v2020',23.77,'v2020','92341'),
('591','iTW', '11/13/2011', 'v2020',23.77,'v2020','v2211'),
('591','iTW', '11/13/2011', 'v2211',66.02,'v2211','92341'),
('591','iTW', '11/13/2011', 'v2211',66.02,'v2211','v2020'),
('195','Y1b', '08/25/2011', '71020',9.03,'71020','94060'),
('195','Y1b', '08/25/2011', '71020',9.03,'71020','94640'),
('195','Y1b', '08/25/2011', '71020',9.03,'71020','99213'),
('195','Y1b', '08/25/2011', '94060',44.6,'94060','71020'),
('195','Y1b', '08/25/2011', '94060',44.6,'94060','94640'),
('195','Y1b', '08/25/2011', '94060',44.6,'94060','99213'),
('195','Y1b', '08/25/2011', '94640',15.53,'94640','71020'),
('195','Y1b', '08/25/2011', '94640',15.53,'94640','94060'),
('195','Y1b', '08/25/2011', '94640',15.53,'94640','99213'),
('195','Y1b', '08/25/2011', '99213',36.63,'99213','71020'),
('195','Y1b', '08/25/2011', '99213',36.63,'99213','94060'),
('195','Y1b', '08/25/2011', '99213',36.63,'99213','94640')
],
schema=('bene_id','rend', 'date', 'code','amt','col1', 'col2')
)
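Partitioning on bene_id, rend and date, I want to pair every item in code with every other item in the same partition, giving col1 and col2. The amt value should be the one that belongs to col1, as it appears in df1. The result is df2. This will be applied to very large data. The two DataFrames above show the input and the expected result.
Any help would be appreciated.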
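To make the pairing rule concrete, here is a minimal plain-Python sketch on the first two partitions of df1. The helper name pair_codes is made up for illustration, and the sketch assumes each code appears at most once per partition. It only illustrates the expected result and is not meant to scale.

```python
from itertools import groupby, permutations

# The first two partitions of df1: (bene_id, rend, date, code, amt).
rows = [
    ('312', 'Pbge', '06/20/2011', '95951', 359.93),
    ('312', 'Pbge', '06/20/2011', '95957', 60.10),
    ('591', 'iTW', '11/13/2011', '92341', 75.87),
    ('591', 'iTW', '11/13/2011', 'v2020', 23.77),
    ('591', 'iTW', '11/13/2011', 'v2211', 66.02),
]


def pair_codes(rows):
    """Hypothetical helper: pair each row with every other code in its partition."""
    key = lambda r: r[:3]  # partition key: (bene_id, rend, date)
    out = []
    # groupby only groups adjacent rows, so sort by the key first.
    for _, grp in groupby(sorted(rows, key=key), key=key):
        # Ordered pairs of distinct rows within the partition.
        for left, right in permutations(list(grp), 2):
            # Keep the left row (including its amt), then col1 and col2.
            out.append(left + (left[3], right[3]))
    return out


paired = pair_codes(rows)
```

A partition with n codes produces n * (n - 1) rows, so the two partitions here give 2 + 6 = 8 rows.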
Answer 0 (score: 0)
IIUC, you can use itertools.product inside a function that returns a DataFrame of the pairs for each group, then merge back on the common column:
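import itertools
import pandas as pd
df = df1.toPandas()  # assumption: df is a pandas copy of df1 that fits in driver memory
# All ordered pairs of distinct codes within one group.
f = lambda x: pd.DataFrame([(a, b) for a, b in itertools.product(x, x) if a != b],
                           columns=['col1', 'col2'])
cols = ['bene_id', 'rend', 'date']
m = df.groupby(cols, sort=False)['code'].apply(f).reset_index([0, 1, 2])
# Attach code and amt for col1, then restore the column order of df2.
final = m.merge(df[['code', 'amt']], left_on='col1', right_on='code').reindex(
    df.columns.union(['col1', 'col2'], sort=False), axis=1)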
   bene_id  rend        date   code     amt   col1   col2
 0     312  Pbge  06/20/2011  95951  359.93  95951  95957
 1     312  Pbge  06/20/2011  95957   60.10  95957  95951
 2     591   iTW  11/13/2011  92341   75.87  92341  v2020
 3     591   iTW  11/13/2011  92341   75.87  92341  v2211
 4     591   iTW  11/13/2011  v2020   23.77  v2020  92341
 5     591   iTW  11/13/2011  v2020   23.77  v2020  v2211
 6     591   iTW  11/13/2011  v2211   66.02  v2211  92341
 7     591   iTW  11/13/2011  v2211   66.02  v2211  v2020
 8     195   Y1b  08/25/2011  71020    9.03  71020  94060
 9     195   Y1b  08/25/2011  71020    9.03  71020  94640
10     195   Y1b  08/25/2011  71020    9.03  71020  99213
11     195   Y1b  08/25/2011  94060   44.60  94060  71020
12     195   Y1b  08/25/2011  94060   44.60  94060  94640
13     195   Y1b  08/25/2011  94060   44.60  94060  99213
14     195   Y1b  08/25/2011  94640   15.53  94640  71020
15     195   Y1b  08/25/2011  94640   15.53  94640  94060
16     195   Y1b  08/25/2011  94640   15.53  94640  99213
17     195   Y1b  08/25/2011  99213   36.63  99213  71020
18     195   Y1b  08/25/2011  99213   36.63  99213  94060
19     195   Y1b  08/25/2011  99213   36.63  99213  94640
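One caveat with the merge above: it joins only on col1 == code, so if the same code appears under more than one (bene_id, rend, date) group, rows would be duplicated and pick up amt values from other groups. A possible alternative, sketched here on the assumption that the data is a pandas DataFrame with the same columns, is a self-merge on the group keys followed by a filter. The names partners, pairs and result are illustrative only.

```python
import pandas as pd

# A subset of the sample data, as a pandas DataFrame.
df = pd.DataFrame(
    [
        ('312', 'Pbge', '06/20/2011', '95951', 359.93),
        ('312', 'Pbge', '06/20/2011', '95957', 60.10),
        ('591', 'iTW', '11/13/2011', '92341', 75.87),
        ('591', 'iTW', '11/13/2011', 'v2020', 23.77),
        ('591', 'iTW', '11/13/2011', 'v2211', 66.02),
    ],
    columns=['bene_id', 'rend', 'date', 'code', 'amt'],
)

cols = ['bene_id', 'rend', 'date']

# Right side of the self-merge: the group keys plus the partner code.
partners = df[cols + ['code']].rename(columns={'code': 'col2'})

# Merging on the group keys yields every (row, partner) combination
# within a group; amt stays attached to the left-hand row.
pairs = df.merge(partners, on=cols)

# Drop rows paired with themselves and add col1 as a copy of code.
pairs = pairs[pairs['code'] != pairs['col2']].assign(col1=lambda d: d['code'])

result = pairs[cols + ['code', 'amt', 'col1', 'col2']].reset_index(drop=True)
```

Because amt is never looked up by code alone, repeated codes across groups do not interfere. The same join-then-filter shape should also be expressible as a Spark DataFrame self-join, which may matter for very large data, but that is not shown or tested here.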