Question

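I have two DataFrames (Spark 2.2.0 and Scala 2.11.8). The first DataFrame df1 has one column named col1, and the second DataFrame df2 also has one column, named col2. Both DataFrames have the same number of rows.

How can I merge these two columns into a new DataFrame?

I tried join, but I think there should be some other way to do it.

Also, I tried to apply withColumn, but it does not compile.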

val result = df1.withColumn(col("col2"), df2.col1)

Update

For example:

df1 = 
col1
1
2
3

df2 = 
col2
4
5
6

result = 
col1  col2
1     4
2     5
3     6

Answer 1

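If there is no actual relationship between these two columns, it sounds like you need the union operator, which returns the union of these two DataFrames: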

var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")

df1.union(df2).show

+---+ 
|one| 
+---+ 
| a | 
| b | 
| c | 
| d | 
| e | 
| f | 
+---+

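[Edit] Now that you have clarified that you just need the two columns side by side, with DataFrames you can use the function monotonically_increasing_id() to add a row index to each one and join on that index value: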

import org.apache.spark.sql.functions.monotonically_increasing_id

var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")

df1.withColumn("id", monotonically_increasing_id())
    .join(df2.withColumn("id", monotonically_increasing_id()), Seq("id"))
    .drop("id")
    .show

+---+---+ 
|one|two|
+---+---+ 
| a | d | 
| b | e | 
| c | f |
+---+---+
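
One caveat on the snippet above: monotonically_increasing_id() only guarantees increasing, unique values, not consecutive ones, so the two id columns line up only when both DataFrames are partitioned the same way. A minimal sketch of one possible workaround, assuming the data is small enough to be ordered in a single partition, is to turn the ids into consecutive row numbers first. The names ZipByRowNumber, withRowNumber, zipColumns and rn are hypothetical, chosen for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

object ZipByRowNumber {
  // Hypothetical helper: adds a consecutive 1-based "rn" column that follows
  // the current row order. A window without partitionBy moves all rows to a
  // single partition, so this is only reasonable for modest data sizes.
  def withRowNumber(df: DataFrame): DataFrame =
    df.withColumn("rn", row_number().over(Window.orderBy(monotonically_increasing_id())))

  // Joins the two DataFrames position by position and drops the helper column.
  def zipColumns(df1: DataFrame, df2: DataFrame): DataFrame =
    withRowNumber(df1).join(withRowNumber(df2), Seq("rn")).orderBy("rn").drop("rn")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("zip-by-row-number").getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3).toDF("col1")
    val df2 = Seq(4, 5, 6).toDF("col2")
    zipColumns(df1, df2).show()

    spark.stop()
  }
}
```

The trade-off is that Spark will warn about the unpartitioned window; for large inputs an index built per partition is probably the better choice.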

Answer 2

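It depends on what you want to do.

If you want to merge two DataFrames, you should use a join. Spark offers the same join types as relational algebra (or any DBMS).

You said that your DataFrames have just one column each.

In that case you might want a cross join (Cartesian product), which gives you a two-column table with all possible combinations of col1 and col2, or you might want a union (as mentioned by @Chondrops), which gives you a one-column table with all the elements.

I think that, in this case (two DataFrames with one column each), everything the other join types would do can be achieved with specialized operations in Spark.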
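
To make the difference between those two options concrete, here is a small sketch using the sample data from the question. It assumes a local SparkSession; the object and method names (CrossJoinVsUnion, allCombinations, stacked) are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CrossJoinVsUnion {
  // Cartesian product: keeps both columns, one row per (col1, col2) pair.
  def allCombinations(df1: DataFrame, df2: DataFrame): DataFrame = df1.crossJoin(df2)

  // Union: stacks the rows of df2 under df1; the column name comes from df1.
  def stacked(df1: DataFrame, df2: DataFrame): DataFrame = df1.union(df2)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cross-join-vs-union").getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3).toDF("col1")
    val df2 = Seq(4, 5, 6).toDF("col2")

    allCombinations(df1, df2).show() // two columns, 3 * 3 = 9 rows
    stacked(df1, df2).show()         // one column, 3 + 3 = 6 rows

    spark.stop()
  }
}
```

Neither result matches the three-row, two-column table in the question, which is why a positional pairing needs some kind of index.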

Answer 3

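As far as I know, the only way to do what you want with DataFrames is to add an index column to each one using RDD.zipWithIndex and then join on the index column. Code for doing zipWithIndex on a DataFrame can be found in this SO answer.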
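
Since the linked code is not reproduced here, the following is only a rough sketch of how the zipWithIndex idea could look, not the code from that answer. The names ZipWithIndexJoin, withIndex, zipColumns and idx are assumptions made for illustration:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ZipWithIndexJoin {
  // Hypothetical helper: appends a consecutive 0-based "idx" column by going
  // through the underlying RDD and rebuilding the DataFrame with a wider schema.
  def withIndex(df: DataFrame): DataFrame = {
    val indexed = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
    val schema = StructType(df.schema.fields :+ StructField("idx", LongType, nullable = false))
    df.sparkSession.createDataFrame(indexed, schema)
  }

  // Joins on the index so that row i of df1 is paired with row i of df2.
  def zipColumns(df1: DataFrame, df2: DataFrame): DataFrame =
    withIndex(df1).join(withIndex(df2), Seq("idx")).orderBy("idx").drop("idx")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("zip-with-index-join").getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3).toDF("col1")
    val df2 = Seq(4, 5, 6).toDF("col2")
    zipColumns(df1, df2).show()

    spark.stop()
  }
}
```

Unlike monotonically_increasing_id(), zipWithIndex produces consecutive indices regardless of how the data is partitioned, at the cost of an extra Spark job when the RDD has more than one partition.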

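However, if the DataFrames are small, it would be much simpler to collect both of them in the driver, zip them together, and turn the result into a new DataFrame.

[Updated with an example of the in-driver collect/zip]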

val df3 = spark.createDataFrame(df1.collect() zip df2.collect()).withColumnRenamed("_1", "col1").withColumnRenamed("_2", "col2")
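
The one-liner above relies on createDataFrame inferring a schema from pairs of Row objects, which may not work on every Spark version. A slightly longer variant of the same collect/zip idea, assuming both DataFrames fit comfortably in driver memory, builds the combined rows and schema explicitly. CollectAndZip and zipOnDriver are hypothetical names:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

object CollectAndZip {
  // Pulls both DataFrames to the driver, pairs their rows by position,
  // and rebuilds a DataFrame whose schema is df1's fields followed by df2's.
  def zipOnDriver(spark: SparkSession, df1: DataFrame, df2: DataFrame): DataFrame = {
    val rows = df1.collect().zip(df2.collect()).map {
      case (left, right) => Row.fromSeq(left.toSeq ++ right.toSeq)
    }
    val schema = StructType(df1.schema.fields ++ df2.schema.fields)
    spark.createDataFrame(spark.sparkContext.parallelize(rows.toSeq), schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("collect-and-zip").getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3).toDF("col1")
    val df2 = Seq(4, 5, 6).toDF("col2")
    zipOnDriver(spark, df1, df2).show()

    spark.stop()
  }
}
```

Because the column names are taken from the original schemas, no withColumnRenamed calls are needed, and the approach does not depend on the column types.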
