I want to clone the values of a column n times while keeping their original order. For example, if I want to duplicate the column below 2 times:
+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
+---+
I am looking for:
+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
| 1 |
| 2 |
| 3 |
+---+
With explode or flatMap I only get:
+---+
| v |
+---+
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
+---+
Code:
%spark
val ds = spark.range(1, 4)                               // values 1, 2, 3
val cloneCount = 2
val clonedDs = ds.flatMap(r => Seq.fill(cloneCount)(r))  // repeats each row in place, interleaving the copies
clonedDs.show()
I could union the dataset ds with itself, but if cloneCount is large, e.g. cloneCount = 200000, is unioning it in a loop that many times really the preferred solution?
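For reference, the self-union alternative mentioned above would look roughly like the sketch below (my own illustration, assuming the same session, ds, and cloneCount as in the snippet); it chains cloneCount - 1 unions, so the query plan grows linearly with cloneCount, which is the concern for very large counts:

// Chain (cloneCount - 1) unions of ds with itself; union concatenates the
// datasets in order, but the plan size grows linearly with cloneCount.
val unionedDs = (1 until cloneCount).foldLeft(ds)((acc, _) => acc.union(ds))
unionedDs.show()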
Answer 0 (score: 1)
You can try this:
// If the column values already form an increasing/decreasing sequence,
// ordering by (clone_index, col_value) restores the original order.
val clonedDs = ds.flatMap(col_value =>
  Range(0, cloneCount).map(clone_index => (clone_index, col_value)))
clonedDs.orderBy("_1", "_2").map(_._2).show()
// If the column values do not follow a sequence, add a rank column
// (monotonically_increasing_id) and order by (clone_index, rank)
// to get the col_values back in the desired order.
import org.apache.spark.sql.functions.monotonically_increasing_id

val clonedDs = ds.withColumn("rank", monotonically_increasing_id())
  .flatMap(row => Range(0, cloneCount).map(clone_index =>
    (clone_index, row.getLong(1), row.getLong(0))))      // (clone_index, rank, value)
clonedDs.orderBy("_1", "_2").map(_._3).show()