Question

I have a Spark DataFrame:

+-----------------+------------+--------------------+------------------+------------------+
|opp_id__reference|oplin_status|               stage|        std_amount|   std_line_amount|
+-----------------+------------+--------------------+------------------+------------------+
|OP-180618-7456377|     Pending|7 - Deliver & Val...|31395.462999391966|13072.069816517043|
|OP-180618-7456377|     Pending|7 - Deliver & Val...|31395.462999391966| 13.85958009943131|
+-----------------+------------+--------------------+------------------+------------------+

I want to flag as "greater" every oppt_line whose std_line_amount is >= 30% of std_amount.

Expected output:

     opp_id__reference  oplin_status  stage                   std_amount     std_line_amount  greater
542  OP-180112-6925769  Pending       7 - Deliver & Validate  363802.836296  31261.159197     False
543  OP-180112-6925769  Pending       7 - Deliver & Validate  363802.836296  46832.656747     False
544  OP-180112-6925769  Pending       7 - Deliver & Validate  363802.836296  118542.329840    False
359  OP-180222-7065558  Pending       7 - Deliver & Validate  2.434888e+05   670.785793       False
389  OP-160712-5051474  Pending       7 - Deliver & Validate  1.288711e+05   1288.780000      False
770  OP-180720-7563258  Pending       7 - Deliver & Validate  1.366182e+05   13.859580        False

To do this, I ran the following on a pandas DataFrame:

DF_BR6['greater']=DF_BR6.std_line_amount.gt(DF_BR6.groupby('opp_id__reference').std_amount.transform('sum')*0.3)

Could you help me implement this on a Spark DataFrame?

Thanks

Best

Assigning values in a Spark DataFrame

0 answers: