I am trying to create a column based on the values of several other columns:
val zzz = sc.parallelize(Seq(("2016-06-23", "VFF", "NO"), ("2016-06-23",
null, "NO"), ("2016-01-23", "VFF", "NO"), ("2016-01-23", null, "NO")))
.toDF("last_ts", "fa_disposition", "vfir_scrap")
val newCol = when(to_date(col("last_ts")) >= "2016-06-01" &&
col("fa_disposition").isNull(), 1)
.when(col("fa_disposition")=="VFF" && col("vfir_scrap")=="NO", -1)
.otherwise(0);
val hdd3=zzz.withColumn("failure", newCol)
However, I get this error:
> error: type mismatch;
found : Boolean
required: org.apache.spark.sql.Column
.when(col("fa_disposition")=="VFF" && col("vfir_scrap")=="NO", -1)
I tried searching, reading the Column documentation, and so on, but I don't understand this. Please help!
Answer 0: (score: 1)
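You need to use the === operator of Spark's Column, not Scala's equality sign. Scala's == compares the two objects immediately and returns a Boolean, whereas when expects a Column expression:
.when(col("fa_disposition")==="VFF" && col("vfir_scrap")==="NO", -1)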
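To see why the compiler reports a Boolean where a Column is required, here is a minimal sketch that needs no Spark at all. Expr is a hypothetical toy class standing in for Column; it is not Spark's implementation and only mimics the shape of the === and && operators.

```scala
// Toy stand-in for Spark's Column, used only to illustrate the type difference.
// Expr, render and EqualityDemo are hypothetical names, not Spark APIs.
final case class Expr(sql: String) {
  // Builds a new expression instead of comparing anything right away.
  def ===(other: Any): Expr = Expr(s"($sql = ${Expr.render(other)})")
  // Combines two expressions into one.
  def &&(other: Expr): Expr = Expr(s"($sql AND ${other.sql})")
}

object Expr {
  // Renders a literal or a nested expression as SQL-like text.
  def render(v: Any): String = v match {
    case e: Expr   => e.sql
    case s: String => s"'$s'"
    case x         => x.toString
  }
}

object EqualityDemo {
  def main(args: Array[String]): Unit = {
    val c = Expr("fa_disposition")
    // Plain Scala equality: evaluated immediately, result is a Boolean.
    val plain: Boolean = c == "VFF"
    // Expression equality: nothing is evaluated, result is another Expr.
    val expr: Expr = c === "VFF"
    println(plain)     // false
    println(expr.sql)  // (fa_disposition = 'VFF')
  }
}
```

The Boolean produced by == has no && that accepts a column expression and cannot be passed where an expression is expected, which is the mismatch the error message describes.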
Answer 1: (score: 1)
You have to replace == with === (column equality) and isNull() with isNull:
// Provides when, col, lit and to_date; sc and the toDF implicits come from spark-shell
import org.apache.spark.sql.functions._
val zzz = sc.parallelize(Seq(("2016-06-23", "VFF", "NO"), ("2016-06-23", null, "NO"),
  ("2016-01-23", "VFF", "NO"), ("2016-01-23", null, "NO")))
  .toDF("last_ts", "fa_disposition", "vfir_scrap")
// isNull takes no parentheses; === builds a Column instead of a Boolean
val newCol = when(to_date(col("last_ts")) >= lit("2016-06-01") &&
  col("fa_disposition").isNull, 1)
  .when(col("fa_disposition")==="VFF" && col("vfir_scrap")==="NO", -1)
  .otherwise(0)
val hdd3 = zzz.withColumn("failure", newCol)
Answer 2: (score: 0)
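Below is a solution using a udf function. A udf requires the data to be serialized and deserialized, and is not recommended when the built-in SQL functions are enough for the solution. So @Raphael Roth's answer is the ideal choice for this case.
This solution is given for reference only, to show that the same result can also be obtained with a udf function: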
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val zzz = sc.parallelize(Seq(("2016-06-23", "VFF", "NO"), ("2016-06-23",
null, "NO"), ("2016-01-23", "VFF", "NO"), ("2016-01-23", null, "NO")))
.toDF("last_ts", "fa_disposition", "vfir_scrap")
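// The udf body is plain Scala, so == and null checks behave as usual here.
// Dates in yyyy-MM-dd form compare correctly as strings; >= matches the question's condition.
val failure = udf((last_ts: String, fa_disposition: String, vfir_scrap: String) => {
  if ((last_ts >= "2016-06-01") && fa_disposition == null) 1
  else if ((fa_disposition == "VFF") && vfir_scrap == "NO") -1
  else 0
})
val hdd3 = zzz.withColumn("failure", failure($"last_ts", $"fa_disposition", $"vfir_scrap"))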
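The branching logic inside the udf can be checked without a Spark session by pulling it into a plain Scala function. This is only a sketch: failureFlag and FailureFlagDemo are hypothetical names, the cutoff uses >= to match the condition in the question, and last_ts is assumed to be non-null.

```scala
object FailureFlagDemo {
  // Hypothetical helper, not part of Spark: same branching as the udf body.
  // ISO yyyy-MM-dd strings order correctly under plain string comparison.
  // Assumes lastTs is never null; a null lastTs would throw here.
  def failureFlag(lastTs: String, faDisposition: String, vfirScrap: String): Int =
    if (lastTs >= "2016-06-01" && faDisposition == null) 1
    else if (faDisposition == "VFF" && vfirScrap == "NO") -1
    else 0

  def main(args: Array[String]): Unit = {
    // Same four rows as the sample DataFrame.
    val rows = Seq(
      ("2016-06-23", "VFF", "NO"),
      ("2016-06-23", null, "NO"),
      ("2016-01-23", "VFF", "NO"),
      ("2016-01-23", null, "NO"))
    rows.foreach { case (ts, fa, scrap) =>
      println(s"$ts, $fa, $scrap -> ${failureFlag(ts, fa, scrap)}")
    }
  }
}
```

For the four sample rows this yields -1, 1, -1 and 0 in that order.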