Update a grouper column based on a specific column value in Scala / Apache Spark

Asked: 2018-06-17 03:16:28

Tags: scala apache-spark

Here is what I am trying to achieve using Spark, or Scala + Spark:

Each instance of "A" in column_1 marks the start of a new group, which runs until the next "A". I am trying to fill in the "grouper" column. I know I could do this in a fairly convoluted way with nested lists or loops, but I think it should be faster with Spark, or with some simple combination of Scala and Spark that I haven't thought of or don't know about.

The code below is how I do it in MySQL:

Before:

column_1   grouper
A 
B
C
A
B
C
D
A
B

SELECT @x:=1;
UPDATE table SET grouper=IF(column_1='A',@x:=@x+1,@x);

After:

column_1   grouper
A          2 
B          2
C          2
A          3
B          3
C          3
D          3
A          4
B          4

I tried something similar to the above in Spark, without success:

var group = 1

val mydf4 = mydf3.withColumn("grouper",
  when(col("column_1").equalTo("A"), group = group + 1).otherwise(group))

1 answer:

Answer 0 (score: 0)

How I did it in Scala:

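A minimal sketch of how this can be expressed with a Spark window function: flag every row where column_1 is "A", then take a running sum of that flag. The ordering column order_id is an assumption here (the sample data has no ordering column, and a DataFrame has no guaranteed row order, so the window needs an explicit column to order by); the + 1 only matches the MySQL output above, which starts counting at 2.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, when}

// order_id is a hypothetical column that preserves the original row order.
val w = Window.orderBy("order_id")

val mydf4 = mydf3.withColumn(
  "grouper",
  // 1 for each row that starts a new group ("A"), 0 otherwise;
  // the running total of those flags is the group number.
  sum(when(col("column_1") === "A", 1).otherwise(0)).over(w) + 1
)

Because the window has no partitionBy, Spark moves every row into a single partition for this step; that is fine for small data but will not scale, so add a partitioning key to the window if the data has one.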