Time: 2017-04-29 19:02:07

Tags: scala apache-spark apache-spark-sql

I have been trying to add a timestamp / current date column to my DataFrame with the following code:

val myDF = dataframe.toDF()
import org.apache.spark.sql.functions.{col, lit, when}
val currDate = new java.util.Date()
myDF.withColumn("CreatedAt", lit(new java.sql.Date(currDate.getDate)))

But after compiling, the spark-submit job fails with the following exception:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed
    at scala.Predef$.require(Predef.scala:221)
    at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
    at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2126)
    at org.apache.spark.sql.DataFrame.select(DataFrame.scala:707)
    at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1188)
    at App_Event_1173$$anonfun$main$1.apply(App_Event_1173.scala:71)
    at App_Event_1173$$anonfun$main$1.apply(App_Event_1173.scala:61)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I am on Java 1.7, Spark 1.6.1, and Scala 2.10.5. What am I doing wrong?

I also tried myDF.withColumn("CreatedAt", lit(current_date)). Nothing works.

1 Answer:

Answer 0 (score: 0):

I tried your code with Java 8, Scala 2.11.8, and Spark 2.1.0, and it ran perfectly. So I suggest you do the following:

1. Confirm that you are running the same Scala and Java versions that your Spark distribution was compiled with. You can get this information simply by launching spark-shell (see the sketch after this list).

2. As @Vidya said, you need to get off Java 7, because Spark prefers Java 8. So even if you solve this particular problem, you will run into other errors down the road.
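
For the check in point 1, here is a minimal sketch (assuming the `sc` that spark-shell predefines) that prints the versions in play:

println(s"Spark: ${sc.version}")                          // version of the Spark build
println(s"Scala: ${util.Properties.versionString}")       // Scala runtime on the driver
println(s"Java:  ${System.getProperty("java.version")}")  // JVM running the driver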

By the way, try to use the latest version of Spark, and build your application with the same Scala and Java versions as the Spark build you downloaded.
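
For example, a minimal build.sbt pinning matching versions could look like the sketch below (2.11.8 / 2.1.0 are just the combination I tested; adjust them to your actual Spark build):

scalaVersion := "2.11.8"

// "provided" keeps the Spark jars out of the assembly, since spark-submit
// supplies them at runtime; their versions must match the cluster's build.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)
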
以下是我的所作所为:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object Test {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val sqlContext = new SQLContext(sc)

    import sqlContext.implicits._
    // A small sample DataFrame to attach the new column to.
    val df = Seq(("David", 25, ""), ("Jean", 20, "")).toDF("Name", "age", "uid")

    val myDF = df.toDF()
    import org.apache.spark.sql.functions.lit
    val currDate = new java.util.Date()
    // getTime returns epoch milliseconds; the deprecated getDate would return
    // only the day of the month and yield a bogus date near 1970-01-01.
    myDF.withColumn("CreatedAt", lit(new java.sql.Date(currDate.getTime))).show()
  }
}
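
As a side note, Spark's built-in current_date and current_timestamp functions already return Columns evaluated at query time, so a sketch like the following avoids constructing a java.sql.Date by hand (no lit wrapper needed):

import org.apache.spark.sql.functions.{current_date, current_timestamp}

// Both are Column-returning functions, evaluated by Spark itself.
myDF.withColumn("CreatedAt", current_date()).show()
myDF.withColumn("CreatedTs", current_timestamp()).show()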