Scala & Spark: Add a value to every cell of every row

Posted: 2017-03-10 10:06:30

Tags: scala apache-spark

I have two DataFrames:

scala> df1.show()
+----+----+----+---+----+
|col1|col2|col3|   |colN|
+----+----+----+   +----+
|   2|null|   3|...|   4|
|   4|   3|   3|   |   1|
|   5|   2|   8|   |   1|
+----+----+----+---+----+

scala> df2.show() // has one row only (avg())
+----+----+----+---+----+
|col1|col2|col3|   |colN|
+----+----+----+   +----+
| 3.6|null| 4.6|...|   2|
+----+----+----+---+----+

and a constant val c : Double = 0.1.

The desired output is df3: DataFrame, with entries given by

df3(i,j) = df1(i,j) * (1 - c) + df2(j) * c,   i = 1..n, j = 1..m

where n = number of rows and m = number of columns.
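For example, with c = 0.1 the first cell of df3 would be 2 * (1 - 0.1) + 3.6 * 0.1 = 2.16.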

I have gone through the list of sql.functions and have not been able to implement this myself other than with some nested map operations (which I worry about performance-wise). One idea I had was:

val cBc = spark.sparkContext.broadcast(c)
// df2 has a single row of averages; collect that row to the driver and broadcast it
val df2Bc = spark.sparkContext.broadcast(df2.head())
df1.rdd.map { row =>
   // assumes every column is a non-null Double
   for (colIdx <- 0 until row.length) yield {
      val correspondingDf2value = df2Bc.value.getDouble(colIdx)
      row.getDouble(colIdx) * (1 - cBc.value) + correspondingDf2value * cBc.value
   }
}
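That map only yields plain value sequences, so a schema still has to be reattached to get a DataFrame back. A minimal sketch of that step, continuing from the snippet above and assuming all columns are (or have been cast to) DoubleType with no nulls; the name blendedRdd is illustrative only:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// wrap each computed sequence in a Row so it can carry a schema again
val blendedRdd = df1.rdd.map { row =>
  Row.fromSeq((0 until row.length).map { i =>
    row.getDouble(i) * (1 - cBc.value) + df2Bc.value.getDouble(i) * cBc.value
  })
}

// every output column is a Double with the same name as in df1
val schema = StructType(df1.columns.map(name => StructField(name, DoubleType, nullable = true)))
val df3 = spark.createDataFrame(blendedRdd, schema)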

Thanks in advance!

1 Answer:

Answer 0 (score: 3):

A (cross) join combined with select is more than enough and will be much more efficient than mapping. Required imports:

import org.apache.spark.sql.functions.{broadcast, col, lit}

and the expression:

val exprs = df1.columns.map { x => (df1(x) * (1 - c) +  df2(x) * c).alias(x) }

join and select:

df1.crossJoin(broadcast(df2)).select(exprs: _*)
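For completeness, a self-contained sketch of this approach run end to end. The toy data and the avg()-based construction of df2 are assumptions based on the example in the question; in spark-shell the SparkSession spark already exists.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, broadcast, col}

val spark = SparkSession.builder().appName("blend-with-averages").master("local[*]").getOrCreate()
import spark.implicits._

// toy data mirroring the question (col2 contains a null)
val df1 = Seq(
  (Some(2), None,    Some(3), Some(4)),
  (Some(4), Some(3), Some(3), Some(1)),
  (Some(5), Some(2), Some(8), Some(1))
).toDF("col1", "col2", "col3", "colN")

// df2: a single row holding the per-column averages of df1
val df2 = df1.select(df1.columns.map(name => avg(col(name)).alias(name)): _*)

val c: Double = 0.1

// one expression per column, aliased so the output keeps the original column names
val exprs = df1.columns.map { x => (df1(x) * (1 - c) + df2(x) * c).alias(x) }

// df2 has exactly one row, so broadcasting it keeps the cross join cheap;
// rows where df1 holds a null yield a null in df3 for that column
val df3 = df1.crossJoin(broadcast(df2)).select(exprs: _*)
df3.show()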