I have two DataFrames:
scala> df1.show()
+----+----+----+---+----+
|col1|col2|col3|...|colN|
+----+----+----+---+----+
|   2|null|   3|...|   4|
|   4|   3|   3|...|   1|
|   5|   2|   8|...|   1|
+----+----+----+---+----+

scala> df2.show() // has one row only (avg())
+----+----+----+---+----+
|col1|col2|col3|...|colN|
+----+----+----+---+----+
| 3.6|null| 4.6|...|   2|
+----+----+----+---+----+
and a constant val c: Double = 0.1.

The desired output is a DataFrame df3 given by

df3[i][j] = df1[i][j] * (1 - c) + df2[0][j] * c

for i = 1..n and j = 1..m, where n = numberOfRow and m = numberOfColumn. For the first cell above that would be 2 * (1 - 0.1) + 3.6 * 0.1 = 2.16.
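A minimal sketch of how the example frames above could be built, assuming a SparkSession in scope as spark; the frames are reduced to three concrete columns for illustration:

import spark.implicits._

// df1: the observations (only three of the N columns shown)
val df1 = Seq(
  (Option(2), Option.empty[Int], Option(3)),
  (Option(4), Option(3),         Option(3)),
  (Option(5), Option(2),         Option(8))
).toDF("col1", "col2", "col3")

// df2: a single row holding the per-column averages
val df2 = Seq(
  (Option(3.6), Option.empty[Double], Option(4.6))
).toDF("col1", "col2", "col3")

val c: Double = 0.1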
I have already looked through the list of sql.functions and failed to implement it myself, short of some nested map operations (which I worry about performance-wise). One idea I had was:
val cBc = spark.sparkContext.broadcast(c)

// collect df2's single row on the driver before broadcasting;
// executors cannot call head() on a broadcast DataFrame
val df2Bc = spark.sparkContext.broadcast(df2.head())

df1.rdd.map { row =>
  // yield is required, otherwise the for loop returns Unit
  for (colIdx <- 0 until row.length) yield {
    val correspondingDf2value = df2Bc.value.getDouble(colIdx)
    // getDouble assumes every cell is a non-null Double
    row.getDouble(colIdx) * (1 - cBc.value) + correspondingDf2value * cBc.value
  }
}
Thanks in advance!
Answer 0 (score: 3)
A (cross) join combined with select is more than enough and will be much more efficient than mapping over rows. Required imports:
import org.apache.spark.sql.functions.{broadcast, col, lit}
and the expressions:
val exprs = df1.columns.map { x => (df1(x) * (1 - c) + df2(x) * c).alias(x) }
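Each element of exprs is an ordinary Column; conceptually, for a single column (with c = 0.1 inlined) it is equivalent to this sketch:

(df1("col1") * (1 - 0.1) + df2("col1") * 0.1).alias("col1")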
join and select:
df1.crossJoin(broadcast(df2)).select(exprs: _*)
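Run against the sketch frames from above (three columns, hypothetical setup), this would produce roughly the following; note that null entries stay null, because arithmetic involving null yields null in Spark SQL:

val df3 = df1.crossJoin(broadcast(df2)).select(exprs: _*)

df3.show()
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// |2.16|null|3.16|
// |3.96|null|3.16|
// |4.86|null|7.66|
// +----+----+----+
// (values rounded; exact floating-point output may differ slightly)

If a null on either side should fall back to the other value instead, wrap the operands in coalesce, e.g. coalesce(df1(x), df2(x)), before blending.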