Condition on the row content of a DataFrame in Spark Scala

Asked: 2017-07-05 07:48:04

Tags: scala apache-spark dataframe apache-spark-sql

I have the following DataFrame:

+--------+---------+------+
|  value1| value2  |value3|
+--------+---------+------+
|   a    |  2      |   3  |
+--------+---------+------+
|   b    |  5      |   4  |
+--------+---------+------+
|   b    |  5      |   4  |
+--------+---------+------+
|   c    |  3      |   4  |
+--------+---------+------+

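I want to take the result of value2 / value3 from the row where value1 = b, and then add it to all rows (including rows where value1 is not b) in a new field named "result". This means another column has to be added to the DataFrame. For example, the result of 5/4 (I chose it because it belongs to b) should be added to every row. I know I should use code like this: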

 val dataframe_new = Dataframe.withColumn("result", $"value2" / $"value3")
 dataframe_new.show()

But how can I express the condition so that the value is added to all rows? The output should look like this:

+---+---+---+------+
| v1| v2| v3|result|
+---+---+---+------+
|  a|  2|  3|  1.25|
|  b|  5|  4|  1.25|
|  b|  5|  4|  1.25|
|  c|  3|  4|  1.25|
+---+---+---+------+

Can you help me? Thanks in advance.

2 Answers:

Answer 0 (score: 7)

You can simply use when:

scala> val df = Seq(("a",2,3),("b",5,4),("b",5,4),("c",3,4)).toDF("v1","v2","v3")
df: org.apache.spark.sql.DataFrame = [v1: string, v2: int ... 1 more field]

scala> df.withColumn("result", when($"v1" === "b" , ($"v2"/$"v3"))).show
+---+---+---+------+
| v1| v2| v3|result|
+---+---+---+------+
|  a|  2|  3|  null|
|  b|  5|  4|  1.25|
|  b|  5|  4|  1.25|
|  c|  3|  4|  null|
+---+---+---+------+

You can nest multiple when calls, as follows:

scala> df.withColumn("result", when($"v1" === "b" , ($"v2"/$"v3")).
     |    otherwise(when($"v1" === "a", $"v3"/$"v2"))).show
+---+---+---+------+
| v1| v2| v3|result|
+---+---+---+------+
|  a|  2|  3|   1.5|
|  b|  5|  4|  1.25|
|  b|  5|  4|  1.25|
|  c|  3|  4|  null|
+---+---+---+------+
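
As a side note, when calls can also be chained directly on the returned Column instead of being nested inside otherwise. The following is a minimal sketch, assuming Spark 2.x or later and a local SparkSession; the session setup and the name chained are illustrative and not part of the answer above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("when-chain").getOrCreate()
import spark.implicits._

val df = Seq(("a", 2, 3), ("b", 5, 4), ("b", 5, 4), ("c", 3, 4)).toDF("v1", "v2", "v3")

// Conditions are checked in order; rows matching none of them get null
// because no otherwise(...) is supplied.
val chained = df.withColumn(
  "result",
  when($"v1" === "b", $"v2" / $"v3")
    .when($"v1" === "a", $"v3" / $"v2")
)

chained.show()
```

This should give the same rows as the nested form: a value for a and b, and null for c.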

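Edit: It seems you need something else. Assuming the rows that match the condition on v1 always have the same values of v2 and v3, we can do the following: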

With Spark 2+:

scala> val res = df.filter($"v1" === lit("b")).distinct.select($"v2"/$"v3").as[Double].head
res: Double = 1.25

Before Spark 2:

scala> val res = df.filter($"v1" === lit("b")).distinct.withColumn("result",$"v2"/$"v3").rdd.map(_.getAs[Double]("result")).collect()(0)
res: Double = 1.25                                                              

scala> df.withColumn("v4", lit(res)).show
+---+---+---+----+
| v1| v2| v3|  v4|
+---+---+---+----+
|  a|  2|  3|1.25|
|  b|  5|  4|1.25|
|  b|  5|  4|1.25|
|  c|  3|  4|1.25|
+---+---+---+----+
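
One thing to watch with the snippets above: head (and collect()(0)) throws if no row has v1 equal to "b". The following is a hedged sketch of a variant that tolerates that case, assuming Spark 2.x or later; the names ratio and withResult, and the choice to fall back to a null column, are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("constant-column").getOrCreate()
import spark.implicits._

val df = Seq(("a", 2, 3), ("b", 5, 4), ("b", 5, 4), ("c", 3, 4)).toDF("v1", "v2", "v3")

// take(1) returns an empty array when nothing matches, so headOption is safe.
val ratio: Option[Double] =
  df.filter($"v1" === "b").select($"v2" / $"v3").as[Double].take(1).headOption

// Attach the value as a literal column, or a null double column if there was no match.
val withResult = ratio match {
  case Some(r) => df.withColumn("result", lit(r))
  case None    => df.withColumn("result", lit(null).cast("double"))
}

withResult.show()
```

Whether a null column is the right fallback depends on the use case; failing fast may be preferable when a matching row is always expected.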

Answer 1 (score: 1)

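This answer is almost the same as eliasah's, but with a slightly different flavor. I am writing it so that others can also benefit from this approach.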

import sqlContext.implicits._

val df = Seq(
  ("a", 2, 3),
  ("b", 5, 4),
  ("b", 5, 4),
  ("c", 3, 4)
).toDF("value1", "value2", "value3")

which should give

+------+------+------+
|value1|value2|value3|
+------+------+------+
|a     |2     |3     |
|b     |5     |4     |
|b     |5     |4     |
|c     |3     |4     |
+------+------+------+

df.withColumn("result", lit(df.filter($"value1" === "b").select($"value2"/$"value3").first.get(0)))

should produce the output

+------+------+------+------+
|value1|value2|value3|result|
+------+------+------+------+
|a     |2     |3     |1.25  |
|b     |5     |4     |1.25  |
|b     |5     |4     |1.25  |
|c     |3     |4     |1.25  |
+------+------+------+------+
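
Both answers pull the value to the driver before attaching it with lit. A different technique, not used in either answer, is to keep everything in DataFrames and cross join with a one-row DataFrame. This is only a sketch, assuming Spark 2.1 or later (where crossJoin is available); the names ratioDf and joined are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("cross-join-ratio").getOrCreate()
import spark.implicits._

val df = Seq(("a", 2, 3), ("b", 5, 4), ("b", 5, 4), ("c", 3, 4))
  .toDF("value1", "value2", "value3")

// One-row DataFrame holding value2 / value3 for a row where value1 is "b".
val ratioDf = df
  .filter($"value1" === "b")
  .select(($"value2" / $"value3").as("result"))
  .limit(1)

// Every row of df is paired with that single row.
// Caveat: if no row matches, ratioDf is empty and the join returns no rows at all.
val joined = df.crossJoin(ratioDf)

joined.show()
```

The lit-based versions are simpler when a single scalar is enough; the join form may be easier to extend if several derived values are needed.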