I want to clone the values of a column n times while keeping their original order. For example, if I want to duplicate the column below 2 times:
+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
+---+
I am looking for:
+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
| 1 |
| 2 |
| 3 |
+---+
With explode or flatMap I only get:
+---+
| v |
+---+
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
+---+
Code:
%spark
val ds = spark.range(1, 4)                               // values 1, 2, 3
val cloneCount = 2
val clonedDs = ds.flatMap(r => Seq.fill(cloneCount)(r))  // repeats each row in place, interleaving the copies
clonedDs.show()
I could union the dataset ds with itself, but if cloneCount is large, e.g. cloneCount = 200000, is unioning it in a loop that many times really the preferred solution?
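For reference, the self-union alternative mentioned above would look roughly like the sketch below (my own illustration, assuming the same session, ds, and cloneCount as in the snippet); it chains cloneCount - 1 unions, so the query plan grows linearly with cloneCount, which is the concern for very large counts:

// Chain (cloneCount - 1) unions of ds with itself; union concatenates the
// datasets in order, but the plan size grows linearly with cloneCount.
val unionedDs = (1 until cloneCount).foldLeft(ds)((acc, _) => acc.union(ds))
unionedDs.show()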
Answer 0 (score: 1)
You can try this:
// If the column values already form an increasing/decreasing sequence,
// ordering by (clone_index, col_value) restores the original order.
val clonedDs = ds.flatMap(col_value =>
  Range(0, cloneCount).map(clone_index => (clone_index, col_value)))
clonedDs.orderBy("_1", "_2").map(_._2).show()
// If the column values do not follow a sequence, add a rank column
// (monotonically_increasing_id) and order by (clone_index, rank)
// to get the col_values back in the desired order.
import org.apache.spark.sql.functions.monotonically_increasing_id

val clonedDs = ds.withColumn("rank", monotonically_increasing_id())
  .flatMap(row => Range(0, cloneCount).map(clone_index =>
    (clone_index, row.getLong(1), row.getLong(0))))      // (clone_index, rank, value)
clonedDs.orderBy("_1", "_2").map(_._3).show()