Question

我想将来自单独数据框架的两列合并到一个数据框架中

我有两个这样的DataFrames

FS

我想要包含两个列Col1和Col2的第三个数据框

val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("Col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("Col2")
ds1.show()

+-----+
| Col1|
+-----+
|    0|
|    1|
|    0|
|    1|
+-----+

ds2.show()
+-----+
| Col2|
+-----+
|  234|
|   43|
|  341|
|   42|
+-----+

我尝试过工会

+-----++-----+
| Col1|| Col2|
+-----++-----+
|    0||  234|
|    1||   43|
|    0||  341|
|    1||   42|
+-----++-----+

但是，它将val ds3 = ds1.union(ds2)的所有行添加到ds2。

Answer 1

monotonically_increasing_id <-不是确定性。

因此，不能保证您会得到正确结果

使用 RDD 和创建密钥（使用 zipWithIndex

）更容易

val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("Col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("Col2")

// Convert to RDD with ZIPINDEX < Which will be our key

val ds1Rdd = ds1.rdd.repartition(4).zipWithIndex().map{ case (v,k) => (k,v) }

val ds2Rdd = ds2.as[(Int)].rdd.repartition(4).zipWithIndex().map{ case (v,k) => (k,v) }

// Check How The KEY-VALUE Pair looks

ds1Rdd.collect()

res50: Array[(Long, Int)] = Array((0,0), (1,1), (2,1), (3,0))

res51: Array[(Long, Int)] = Array((0,341), (1,42), (2,43), (3,234))

因此，元组的 First 元素是我们的 Join 键

我们只需加入并重新排列以生成数据框

val joinedRdd = ds1Rdd.join(ds2Rdd)

val resultrdd = joinedRdd.map(x => x._2).map(x => (x._1 ,x._2))

// resultrdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[204] at map at <console>

然后我们将其转换为 DataFrame

 resultrdd.toDF("Col1","Col2").show()
+----+----+
|Col1|Col2|
+----+----+
|   0| 341|
|   1|  42|
|   1|  43|
|   0| 234|
+----+----+

使用Scala在Spark中合并两列不同的DataFrames

1 个答案: