I have a Spark DataFrame:
+-----------------+------------+--------------------+------------------+------------------+
|opp_id__reference|oplin_status| stage| std_amount| std_line_amount|
+-----------------+------------+--------------------+------------------+------------------+
|OP-180618-7456377| Pending|7 - Deliver & Val...|31395.462999391966|13072.069816517043|
|OP-180618-7456377| Pending|7 - Deliver & Val...|31395.462999391966| 13.85958009943131|
+-----------------+------------+--------------------+------------------+------------------+
I want to assign GREAT to each oppt_line whose std_line_amount >= 30% of std_amount.
Expected output:
542 OP-180112-6925769 Pending 7 - Deliver & Validate 363802.836296 31261.159197 False
543 OP-180112-6925769 Pending 7 - Deliver & Validate 363802.836296 46832.656747 False
544 OP-180112-6925769 Pending 7 - Deliver & Validate 363802.836296 118542.329840 False
359 OP-180222-7065558 Pending 7 - Deliver & Validate 2.434888e+05 670.785793 False
389 OP-160712-5051474 Pending 7 - Deliver & Validate 1.288711e+05 1288.780000 False
770 OP-180720-7563258 Pending 7 - Deliver & Validate 1.366182e+05 13.859580 False
To do this, I used the following operation on a pandas DataFrame:
DF_BR6['greater']=DF_BR6.std_line_amount.gt(DF_BR6.groupby('opp_id__reference').std_amount.transform('sum')*0.3)
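For reference, here is a minimal sketch of what that pandas line computes, run on made-up data. Only the column names come from the question; the rows and the variable names `df` and `group_total` are hypothetical. Note that `transform('sum')` compares each line against 30% of the per-opportunity sum of std_amount, and `gt` is a strict `>`, so this may not match the ">= 30% of std_amount" wording exactly.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "opp_id__reference": ["A", "A", "B", "B"],
    "std_amount": [100.0, 100.0, 50.0, 50.0],
    "std_line_amount": [70.0, 10.0, 40.0, 20.0],
})

# Sum of std_amount per opportunity, broadcast back onto every row of its group.
group_total = df.groupby("opp_id__reference")["std_amount"].transform("sum")

# Strict comparison: line amount > 30% of the group's summed std_amount.
df["greater"] = df["std_line_amount"].gt(group_total * 0.3)

print(df)
```

Group A sums to 200 (threshold 60) and group B sums to 100 (threshold 30), so only the 70.0 and 40.0 lines are flagged.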
Could you help me implement this on a Spark DataFrame?
Thanks,
Best