Question

I have two datasets.

dataset1:

id  a    b      c     d
1  0.3  0.1   0.2   0.2
2  0.2  0.3   0.3   0.4
3  0.2  0.4   0.7   0.7
....

dataset2

I want to multiply each column in dataset1 by the "x" column in dataset2, matching rows by id, so that the desired output is:

id   a    b    c    d
1   2.4   0.8  1.6  1.6
2   0.8   1.2  1.2  1.6
3    2     4    7    7

What I did was join dataset1 with dataset2 and then map over each row:

// map needs an implicit encoder in scope (e.g. import spark.implicits._)
val result = dataset1.join(dataset2, Seq("id"))
                     .map(row => (row.getAs[String]("id"),
                                  row.getAs[Double]("a") * row.getAs[Int]("x"),
                                  row.getAs[Double]("b") * row.getAs[Int]("x"),
                                  row.getAs[Double]("c") * row.getAs[Int]("x"),
                                  row.getAs[Double]("d") * row.getAs[Int]("x")))

I feel that writing it this way is a bit redundant. Is there a way to make it cleaner?

Answer 1

All you need is select:

dataset1.join(dataset2, Seq("id")).select(
  $"id", $"a" * $"x", $"b" * $"x", $"c" * $"x", $"d" * $"x"
).toDF("id", "a", "b", "c", "d")

This can be generalized to any number of columns:

val exprs = $"id" +: dataset1.columns.tail.map(c => (col(c) * $"x").alias(c))
dataset1.join(dataset2, Seq("id")).select(exprs: _*)
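For reference, here is a minimal self-contained sketch of the generalized version that can be run locally. The object name `ScaleColumns`, the helper `scaleBy`, and the local `SparkSession` setup are hypothetical, added only for illustration. The contents of dataset2 are not shown in the question, so the `x` values below are an assumption inferred from the desired output. The sketch also assumes dataset2 contains only the key and the factor column, so column names stay unambiguous after the join.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ScaleColumns {
  // Hypothetical helper: multiplies every non-key column of `values`
  // by the `factor` column of `factors`, matching rows on `key`.
  def scaleBy(values: DataFrame, factors: DataFrame,
              key: String, factor: String): DataFrame = {
    // One expression per value column, aliased back to its original name.
    val scaled = values.columns
      .filterNot(_ == key)
      .map(c => (col(c) * col(factor)).alias(c))
    val exprs = col(key) +: scaled
    values.join(factors, Seq(key)).select(exprs: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("scale-columns")
      .getOrCreate()
    import spark.implicits._

    val dataset1 = Seq(
      (1, 0.3, 0.1, 0.2, 0.2),
      (2, 0.2, 0.3, 0.3, 0.4),
      (3, 0.2, 0.4, 0.7, 0.7)
    ).toDF("id", "a", "b", "c", "d")

    // Assumed values: inferred from the desired output, not given in the question.
    val dataset2 = Seq((1, 8), (2, 4), (3, 10)).toDF("id", "x")

    scaleBy(dataset1, dataset2, "id", "x").orderBy("id").show()

    spark.stop()
  }
}
```

Filtering the key column out by name, rather than taking `columns.tail`, means the sketch does not depend on `id` being the first column.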

Efficient way to do column-wise operations in Spark 2.0
