Question

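I have a Spark DataFrame and I'm trying to create a new column based on existing columns. The hard part for me is that the value has to be computed row-wise. For example: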

col1 | col2 | col3

1 | 2 | 3

4 | 5 | 0

3 | 1 | 1

So, I'd like a new column that contains, for each row, the name of the column that holds max(col1, col2, col3). The desired output:

col1 | col2 | col3 | col4

1 | 2 | 3 | 'COL3'

4 | 5 | 0 | 'COL2'

3 | 1 | 1 | 'COL1'

Is there any way to do this in PySpark?

Answer 1

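This isn't an ideal answer, because it forces you back to an RDD. If I find a better one that lets you stay in the DataFrame universe, I'll update my answer. But for now this should work.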

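# Assumes an active SparkContext `sc` (e.g. the pyspark shell).
a = sc.parallelize([[1,2,3],[4,5,0],[3,1,1]])
headers = ["col1", "col2", "col3"]

# Append the name of the column holding each row's maximum.
b = a.map(lambda x: (x[0], x[1], x[2], headers[x.index(max(x))]))

# list.append returns None, so build the new header list with + instead.
b.toDF(headers + ["max_col"]).show()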

This basically lets you use Python's max on each row by mapping over the RDD. It then finds the right column name by indexing into the headers list.
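To see what the lambda does to each row without needing a Spark session, here is a minimal plain-Python sketch of the same lookup. The helper name `name_of_max` is hypothetical and only for illustration; the data and `headers` list are the ones from the question.

```python
headers = ["col1", "col2", "col3"]
rows = [[1, 2, 3], [4, 5, 0], [3, 1, 1]]


def name_of_max(row, names):
    """Return the name of the first column that holds the row's maximum.

    Hypothetical helper: it mirrors headers[x.index(max(x))] from the lambda.
    """
    return names[row.index(max(row))]


# Same shape as the mapped RDD rows: the original values plus the column name.
labelled = [tuple(row) + (name_of_max(row, headers),) for row in rows]

for r in labelled:
    print(r)
```

Note that `list.index` returns the first match, so when two columns tie for the maximum, the leftmost column name is reported. The names come back in the same case as `headers`; if you need 'COL3' as in the question, calling `.upper()` on the result should do it.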

Again, I'm not sure this is the best way to do it, and I hope to find a better one.

Adding a column based on row-wise operations in PySpark
