Question

I have two DataFrames whose columns have different types, and I need to join them. Please refer to the example below.

val df1 has
Customer_name 
Customer_phone
Customer_age

val df2 has
Order_name
Order_ID

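The two DataFrames have no column in common, and their row and column counts also differ. I tried adding a new dummy column that holds an increasing row_index value, like this: val dfr = df1.withColumn("row_index", monotonically_increasing_id()).

But I am using Spark 2, and the monotonically_increasing_id method does not work for me. Is there a way to join the two DataFrames so that I can write the values of both into a single Excel sheet?

For example: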

val df1:
Customer_name  Customer_phone  Customer_age
karti          9685684551      24
raja           8595456552      22

val df2:
Order_name  Order_ID
watch       1
cattoy      2

My final Excel sheet should look like this:

Customer_name  Customer_phone  Customer_age  Order_name  Order_ID
karti          9685684551      24            watch       1
raja           8595456552      22            cattoy      2

Answer 1

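Add an index column to both DataFrames with the code below:

import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

// withColumn returns a new DataFrame, so keep the result
val df1WithId = df1.withColumn("id1", monotonically_increasing_id())
val df2WithId = df2.withColumn("id2", monotonically_increasing_id())

Then join the two DataFrames with the code below and drop the index columns:

val joined = df1WithId.join(df2WithId, col("id1") === col("id2"), "inner")
  .drop("id1", "id2")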
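
A caveat worth illustrating: joining on these IDs only pairs rows correctly when both DataFrames end up with matching IDs. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The following is a minimal plain-Scala sketch of that layout, not Spark's own code; `sketchId` and `idsFor` are hypothetical helper names made up for illustration.

```scala
// Sketch of the ID layout documented for monotonically_increasing_id():
// partition ID in the upper 31 bits, row position within the partition
// in the lower 33 bits. These helpers are illustrative, not Spark APIs.
object MonotonicIdSketch {
  def sketchId(partitionId: Int, rowInPartition: Long): Long =
    (partitionId.toLong << 33) + rowInPartition

  // IDs for a dataset whose partitions hold the given numbers of rows.
  def idsFor(rowsPerPartition: Seq[Int]): Seq[Long] =
    for {
      (rows, partitionId) <- rowsPerPartition.zipWithIndex
      row <- 0 until rows
    } yield sketchId(partitionId, row.toLong)

  def main(args: Array[String]): Unit = {
    // Two rows in a single partition: IDs 0 and 1.
    println(idsFor(Seq(2)))
    // Two rows split over two partitions: IDs 0 and 8589934592.
    println(idsFor(Seq(1, 1)))
  }
}
```

Under this layout, a two-row DataFrame held in one partition and a two-row DataFrame spread over two partitions share only the ID 0, so an inner join on the generated IDs would silently drop the second row.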

Answer 2

monotonically_increasing_id() is increasing and unique, but not consecutive.

Instead, you can convert each DataFrame to an RDD, use zipWithIndex, and rebuild the DataFrame from the same schema plus an index column. Do this for both DataFrames.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import spark.implicits._

val df1 = Seq(
  ("karti", "9685684551", 24),
  ("raja", "8595456552", 22)
).toDF("Customer_name", "Customer_phone", "Customer_age")

val df2 = Seq(
  ("watch", 1),
  ("cattoy", 2)
).toDF("Order_name", "Order_ID")

val df11 = spark.sqlContext.createDataFrame(
  df1.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Create schema for index column
  StructType(df1.schema.fields :+ StructField("index", LongType, false))
)

val df22 = spark.sqlContext.createDataFrame(
  df2.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Create schema for index column
  StructType(df2.schema.fields :+ StructField("index", LongType, false))
)

Now join the final DataFrames:

df11.join(df22, Seq("index")).drop("index")

The result has the columns of both DataFrames side by side, with rows paired by their index.
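
Since the question says the two DataFrames can have different row counts, here is a hedged sketch that wraps the zipWithIndex approach in a reusable helper and uses a full outer join, so rows without a partner are kept with nulls instead of being dropped. The names `withRowIndex` and `zipJoin` are hypothetical, and the sketch assumes Spark 2.x with a local session. It also assumes the row order of each input is meaningful, because zipWithIndex numbers rows in whatever order the RDD yields them.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ZipJoinSketch {
  // Hypothetical helper: appends a consecutive 0-based index column
  // by going through RDD.zipWithIndex.
  def withRowIndex(df: DataFrame, name: String = "index"): DataFrame = {
    val indexed = df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) }
    val schema = StructType(df.schema.fields :+ StructField(name, LongType, nullable = false))
    df.sparkSession.createDataFrame(indexed, schema)
  }

  // Hypothetical helper: pairs rows by position. "full_outer" keeps
  // rows that have no partner on the other side, filled with nulls.
  def zipJoin(left: DataFrame, right: DataFrame): DataFrame =
    withRowIndex(left)
      .join(withRowIndex(right), Seq("index"), "full_outer")
      .orderBy("index")
      .drop("index")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("zip-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Customers from the question, but only one order, to show unequal row counts.
    val customers = Seq(
      ("karti", "9685684551", 24),
      ("raja", "8595456552", 22)
    ).toDF("Customer_name", "Customer_phone", "Customer_age")
    val orders = Seq(("watch", 1)).toDF("Order_name", "Order_ID")

    // The second customer row comes back with null order columns.
    zipJoin(customers, orders).show()
    spark.stop()
  }
}
```

Switching "full_outer" to "inner" would reproduce the behaviour of the join above, keeping only positions that exist in both DataFrames.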

Joining two DataFrames without a common column (Spark, Scala)
