Spark SQL: create a new column based on an expression

Asked: 2019-02-21 19:31:55

Tags: apache-spark apache-spark-sql

I have a DataFrame named ipTraffic with the following schema:

ipTraffic: org.apache.spark.sql.DataFrame = [ip: string, record_count: double]

I am trying to create a new column that takes the maximum value of the "record_count" column and divides it by each row's record_count value.

I have run:

val calc = ipTraffic.agg(max("record_count")) / (ipTraffic("record_count"))
ipTraffic = ipTraffic.withColumn("weight", expr(calc))

val calc = ipTraffic.agg(max("record_count")).divide(ipTraffic("record_count"))
ipTraffic = ipTraffic.withColumn("weight", expr(calc))

and I get the error:

error: value / is not a member of org.apache.spark.sql.DataFrame

This doesn't make sense to me, because Spark surely supports division; I checked https://spark.apache.org/docs/2.3.0/api/sql/ and the list of built-in operators does include "/".

1 Answer:

Answer 0 (score: 1)

You are trying to divide a DataFrame by a Column:

ipTraffic.agg(max("record_count")):

+-----------------+
|max(record_count)| 
+-----------------+
|              3.0|
+-----------------+ 

by:

ipTraffic("record_count"):
+------------+
|record_count|
+------------+
|         1.0|
|         2.0|
|         3.0|
|         1.0|
|         2.0|
|         3.0|
+------------+

Instead, you can first compute the maximum, extract it as a literal value, and then use it in the calculation:

import spark.implicits._
import org.apache.spark.sql.functions.{lit, max}

val maxRecordCount = ipTraffic.agg(max($"record_count")).first.getDouble(0)
val ipTrafficWithWeight = ipTraffic.withColumn("weight", lit(maxRecordCount) / $"record_count")
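As a variation on the answer above, the same weight can be computed without pulling the maximum back to the driver by using a window function over the whole DataFrame. This is a sketch, not from the original answer: it assumes Spark 2.x+, and note that an empty window spec moves all rows into a single partition, so it is only a good fit for small or pre-aggregated data.

```scala
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// An empty partitionBy() makes the window span the entire DataFrame,
// so max("record_count") is attached to every row in one pass.
// Spark will warn that this collapses the data into a single partition.
val allRows = Window.partitionBy()

val ipTrafficWithWeight = ipTraffic.withColumn(
  "weight",
  max($"record_count").over(allRows) / $"record_count"
)
```

With the sample data shown above (max 3.0), a row with record_count 2.0 would get weight 1.5, matching the literal-based approach.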