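Here is what I'm trying to achieve using Spark, or Scala + Spark:
Each instance of "A" in column_1 marks the start of a new group, which runs until the next "A". I'm trying to populate the "grouper" column. I know I could do this in a very convoluted way with nested lists or loops, but I think there must be a faster way using Spark, or some simple combination of Scala and Spark that I haven't thought of or don't know about.
The code below is in MySQL:
Before: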
column_1 grouper
A
B
C
A
B
C
D
A
B
SELECT @x:=1;
UPDATE table SET grouper=IF(column_1='A',@x:=@x+1,@x);
After:
column_1 grouper
A 2
B 2
C 2
A 3
B 3
C 3
D 3
A 4
B 4
I tried something similar to the above in Spark, without success:
var group = 1
val mydf4 = mydf3.withColumn("grouper",
when(col("column_1").equalTo("A"),group=group+1).otherwise(group))
Answer 0 (score: 0)
Here is how I did it in Scala:
df <- structure(list(Value = c(100, 200, 350, 200, 150, 100, 120, 180,
100, 300), X1 = c(3L, 4L, 3L, 2L, NA, 2L, NA, 2L, 2L, 1L), X2 = c(2L,
2L, 2L, 1L, 1L, 2L, NA, 3L, NA, NA), X3 = c(1L, 3L, 2L, 2L, 3L,
1L, 4L, 2L, 2L, NA), X4 = c(NA, 3L, 4L, 2L, 4L, 2L, 3L, 4L, 3L,
2L), X5 = c(4L, 1L, 2L, 4L, 3L, 1L, 3L, 3L, 4L, 1L), X6 = c(NA,
NA, 1L, 2L, 1L, 2L, 4L, 2L, 3L, 1L), X7 = c(3L, 1L, NA, 1L, 4L,
1L, 3L, 2L, 1L, 2L), X8 = c(4L, NA, 3L, 1L, 3L, 1L, NA, 2L, 2L,
1L), X9 = c(NA, 1L, NA, 1L, 1L, 1L, 3L, 2L, 1L, 2L), X10 = c(4L,
1L, 3L, 2L, 2L, 1L, 2L, 1L, 1L, 3L)), .Names = c("Value", "X1",
"X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10"), row.names = c(NA,
-10L), class = "data.frame")
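For the grouping described in the question, one possible approach is a running sum over a window: count how many "A" rows have been seen so far and add the starting value of 1 that the MySQL version uses. The sketch below is not the original answer's code, and it rests on an assumption the question does not state: that the rows carry an explicit ordering column. The `id` column, the `GrouperSketch` object, and the `addGrouper` helper are hypothetical names introduced for illustration. Spark DataFrames have no inherent row order, so some such column is needed for "until the next A" to be well defined. A driver-side `var` such as `group` in the attempt above cannot serve as a per-row counter, because `withColumn` builds a column expression once rather than running Scala code for each row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, when}

object GrouperSketch {

  // Adds a "grouper" column: 1 plus the number of "A" rows seen so far,
  // which mirrors the MySQL counter that starts at 1 and increments on each "A".
  // Assumes `df` has an ordering column named "id" (hypothetical, not in the original table).
  def addGrouper(df: DataFrame): DataFrame = {
    // Running window from the first row up to the current row, ordered by id.
    // There is no partitionBy, so Spark moves all rows into a single partition;
    // that is acceptable for a sketch but can be slow on large data.
    val running = Window
      .orderBy("id")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    // 1 for rows that start a group, 0 otherwise.
    val isGroupStart = when(col("column_1") === "A", 1).otherwise(0)

    df.withColumn("grouper", sum(isGroupStart).over(running) + lit(1))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouper-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question, with an assumed id column giving the row order.
    val mydf3 = Seq("A", "B", "C", "A", "B", "C", "D", "A", "B")
      .zipWithIndex
      .map { case (value, index) => (index.toLong, value) }
      .toDF("id", "column_1")

    addGrouper(mydf3).orderBy("id").show()

    spark.stop()
  }
}
```

With the sample rows from the question, the grouper values come out as 2, 2, 2, 3, 3, 3, 3, 4, 4, matching the "After" table. If the data has a natural key to partition by, adding `partitionBy` to the window would avoid pulling every row into one partition.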