Question

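I have two DataFrames with the same number of rows, but the number of columns differs and is dynamic, depending on the source.

The first DataFrame contains all the columns. The second DataFrame has been filtered and processed, so it does not contain all of the other columns.

I need to pick a specific column from the first DataFrame and add/merge it into the second DataFrame.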

val sourceDf = spark.read.load(parquetFilePath)
val resultDf = spark.read.load(resultFilePath)

val columnName: String = "Col1"

I tried to add it in several ways; here are just a few of them:

val modifiedResult = resultDf.withColumn(columnName, sourceDf.col(columnName))

val modifiedResult = resultDf.withColumn(columnName, sourceDf(columnName))
val modifiedResult = resultDf.withColumn(columnName, labelColumnUdf(sourceDf.col(columnName)))

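None of these work.

Please help me add/merge a column from the first DataFrame into the second DataFrame.

The example below is not my exact data structure, but it is enough to illustrate the problem.

Sample input and output: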

Source DataFrame:
+--------+
|InputGas|
+--------+
|    1000|
|    2000|
|    3000|
|    4000|
+--------+

Result DataFrame:
+----+-------+-----+
|Time|CalcGas|Speed|
+----+-------+-----+
|   0|    111| 1111|
|   0|    222| 2222|
|   1|    333| 3333|
|   2|    444| 4444|
+----+-------+-----+

Expected Output:
+----+-------+-----+--------+
|Time|CalcGas|Speed|InputGas|
+----+-------+-----+--------+
|   0|    111| 1111|    1000|
|   0|    222| 2222|    2000|
|   1|    333| 3333|    3000|
|   2|    444| 4444|    4000|
+----+-------+-----+--------+

Answer 1

Use a join

This is one way to achieve it.

If the two DataFrames share a common column, you can join on that column and get the result you want.

Example:

import sparkSession.sqlContext.implicits._

val df1 = Seq((1, "Anu"), (2, "Suresh"), (3, "Usha"), (4, "Nisha")).toDF("id", "name")
val df2 = Seq((1, 23), (2, 24), (3, 24), (4, 25), (5, 30), (6, 32)).toDF("id", "age")

// Inner join on the shared "id" column; ids 5 and 6 have no match in df1 and are dropped
val df = df1.as("df1")
  .join(df2.as("df2"), df1("id") === df2("id"))
  .select("df1.id", "df1.name", "df2.age")

df.show()

Output:

+---+------+---+
| id|  name|age|
+---+------+---+
|  1|   Anu| 23|
|  2|Suresh| 24|
|  3|  Usha| 24|
|  4| Nisha| 25|
+---+------+---+

Update

If the two DataFrames do not have any unique ID in common, create one in each and join on it.

import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.functions._

var sourceDf = Seq(1000, 2000, 3000, 4000).toDF("InputGas")
var resultDf = Seq((0, 111, 1111), (0, 222, 2222), (1, 333, 3333), (2, 444, 4444)).toDF("Time", "CalcGas", "Speed")

// Add a generated row ID to each DataFrame so there is a column to join on
sourceDf = sourceDf.withColumn("rowId1", monotonically_increasing_id())
resultDf = resultDf.withColumn("rowId2", monotonically_increasing_id())

// Join on the generated IDs, then keep only the original columns
val df = sourceDf.as("df1")
  .join(resultDf.as("df2"), sourceDf("rowId1") === resultDf("rowId2"), "inner")
  .select("df1.InputGas", "df2.Time", "df2.CalcGas", "df2.Speed")

df.show()

Output:

+--------+----+-------+-----+
|InputGas|Time|CalcGas|Speed|
+--------+----+-------+-----+
|    1000|   0|    111| 1111|
|    2000|   0|    222| 2222|
|    3000|   1|    333| 3333|
|    4000|   2|    444| 4444|
+--------+----+-------+-----+
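
One caveat on the generated-ID approach: monotonically_increasing_id() guarantees unique, increasing IDs, but not consecutive ones, because the partition ID is encoded in the upper bits. On small, single-partition data like the sample this lines up, but two DataFrames read from different files may be partitioned differently, and then the IDs may not match row for row. A minimal sketch of an alternative is to number the rows with the RDD zipWithIndex method, which yields consecutive 0-based indices. The helper name withRowIndex, the rowId column name, and the local SparkSession are assumptions made for illustration, and the whole idea assumes both DataFrames really do hold their rows in the same order.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object RowIndexJoin {

  // Hypothetical helper: appends a consecutive 0-based row index as a new Long column.
  def withRowIndex(df: DataFrame, indexCol: String): DataFrame = {
    // zipWithIndex pairs every row with its position across all partitions
    val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    val schema = StructType(df.schema.fields :+ StructField(indexCol, LongType, nullable = false))
    df.sparkSession.createDataFrame(indexedRows, schema)
  }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; reuse your existing session in real code
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-index-join")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question
    val sourceDf = Seq(1000, 2000, 3000, 4000).toDF("InputGas")
    val resultDf = Seq((0, 111, 1111), (0, 222, 2222), (1, 333, 3333), (2, 444, 4444))
      .toDF("Time", "CalcGas", "Speed")

    // Join on the shared index column, restore the original order, then drop the index
    val joined = withRowIndex(resultDf, "rowId")
      .join(withRowIndex(sourceDf.select("InputGas"), "rowId"), Seq("rowId"))
      .orderBy("rowId")
      .drop("rowId")

    joined.show()
    spark.stop()
  }
}
```

Passing Seq("rowId") as the join key keeps a single rowId column in the result, so there is no ambiguity when dropping it afterwards. To pick a column by a dynamic name, as in the question, replace "InputGas" in the select with your columnName variable.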
