Question

Input DF:

col1   col2 ..... coln
 1      1
 1      2
 1      3
 2      1
 2      2

I am trying to add a new column whose value should be:

1. "max" for every row whose col2 is the largest col2 within its col1 group, i.e. the combination (col1, max(col2), ... coln)
2. "not_max" otherwise

Output DF:

col1   col2   new_col ..... coln
 1      1     not_max
 1      2     not_max
 1      3     max
 2      1     not_max
 2      2     max

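I was able to do this by using groupBy to create a new DF containing the new column and then joining it back to the original DF. Any suggestions on how to achieve this directly? Thanks.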

Answer 1

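You can do this in a single pass by using max as a SQL window function and comparing the computed maximum with col2:

// The trailing space before the closing quote keeps the two string pieces separate SQL tokens.
df.selectExpr("*",
    "case when col2 = max(col2) over (partition by col1) " +
    "then 'max' else 'not max' end as new_col"
).show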
+----+----+----+-------+
|col1|col2|col3|new_col|
+----+----+----+-------+
|   1|   1|   1|not max|
|   1|   2|   2|not max|
|   1|   3|   1|    max|
|   2|   1|   1|not max|
|   2|   2|   3|    max|
+----+----+----+-------+
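
For comparison, the same idea can be written with the DataFrame API instead of a SQL expression string. This is only a sketch under stated assumptions: the object name MaxFlagExample, the helper withMaxFlag, the local SparkSession, and the sample data (copied from the output above) are illustrative and not part of the original post. It uses the label "not_max" from the question rather than "not max".

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, when}

// Hypothetical wrapper object, used only for illustration.
object MaxFlagExample {

  // Adds new_col: "max" where col2 equals the largest col2 in the same
  // col1 partition, "not_max" otherwise. No groupBy or join is needed.
  def withMaxFlag(df: DataFrame): DataFrame = {
    val byCol1 = Window.partitionBy("col1")
    df.withColumn(
      "new_col",
      when(col("col2") === max(col("col2")).over(byCol1), "max")
        .otherwise("not_max")
    )
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("max-flag")
      .getOrCreate()
    import spark.implicits._

    // Sample rows mirroring the table shown in the answer.
    val df = Seq((1, 1, 1), (1, 2, 2), (1, 3, 1), (2, 1, 1), (2, 2, 3))
      .toDF("col1", "col2", "col3")

    withMaxFlag(df).orderBy("col1", "col2").show()
    spark.stop()
  }
}
```

Note that if several rows in a col1 group tie for the largest col2, all of them are labeled "max", and rows with a null col2 fall through to "not_max".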
