How to clone column values in Spark while preserving the original order

Asked: 2018-03-17 17:43:12

Tags: apache-spark spark-dataframe

I want to clone the values of a column n times, keeping their original order. For example, if I want to replicate the column below 2 times:

+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
+---+

I am looking for:

+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
| 1 |
| 2 |
| 3 |
+---+

Using explode or flatMap I can only get:

+---+
| v |
+---+
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
+---+

Code:

%spark
val ds = spark.range(1, 4)
val cloneCount = 2

val clonedDs = ds.flatMap(r => Seq.fill(cloneCount)(r))
clonedDs.show()

I could union the dataset ds with itself, but if cloneCount is large, e.g. cloneCount = 200000, is unioning it that many times in a loop really the preferred solution?
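For illustration, a minimal sketch of the union-in-a-loop alternative I mean (reusing the ds and cloneCount defined above; this is only a sketch of the idea, not necessarily the recommended approach):

%spark
// Sketch only: append ds to itself until there are cloneCount copies.
// Each iteration adds another branch to the logical plan, so the plan
// grows linearly with cloneCount, which is the concern for cloneCount = 200000.
val unionedDs = (1 until cloneCount).foldLeft(ds)((acc, _) => acc.union(ds))
unionedDs.show()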

1 Answer:

Answer 0 (score: 1):

You can try this:

// If the column values already form an increasing sequence (as here),
// ordering by (clone_index, col_value) restores the original order
// within each clone.

val clonedDs = ds.flatMap(col_value => Range(0, cloneCount)
                   .map(clone_index => (clone_index, col_value)))
clonedDs.orderBy("_1", "_2").map(_._2).show()



// If the column values do not follow such a sequence, add a rank column
// and use it in the orderBy together with clone_index
// to get the col_values in the desired order.

import org.apache.spark.sql.functions.monotonically_increasing_id

val clonedDs = ds.withColumn("rank", monotonically_increasing_id())
    .flatMap(row => Range(0, cloneCount).map(
                clone_index => (clone_index, row.getLong(1), row.getLong(0))  // (clone_index, rank, value)
          ))

clonedDs.orderBy("_1", "_2").map(_._3).show()
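As a side note not from the original answer: on Spark 2.4 or later the same result can also be sketched with a pure DataFrame pipeline using array_repeat and posexplode (the column names v and ord below are illustrative):

// Sketch assuming Spark 2.4+ (array_repeat is not available in earlier versions).
import org.apache.spark.sql.functions.{array_repeat, col, monotonically_increasing_id, posexplode}

val cloned = ds.toDF("v")
  .withColumn("ord", monotonically_increasing_id())                    // remember the original row order
  .select(col("ord"), posexplode(array_repeat(col("v"), cloneCount)))  // yields columns ord, pos, col
  .orderBy("pos", "ord")                                               // clone index first, then original order
  .select("col")

cloned.show()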