Appending transformed columns to a Spark DataFrame using Scala

Date: 2016-07-03 01:56:22

Tags: scala apache-spark spark-dataframe hivecontext

I am trying to access a Hive table, extract and transform certain columns from the table/DataFrame, and then put those new columns into a new DataFrame. I am trying to do it this way -

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

val hiveDF = sqlContext.sql("select * from table_x")

val system_generated_id = hiveDF("unique_key")
val application_assigned_event_id = hiveDF("event_event_id")

val trnEventDf = sqlContext.emptyDataFrame
trnEventDf.withColumn("system_generated_id",lit(system_generated_id))

It builds with sbt without any errors, but when I try to run it I get the following error -

  

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed
    at scala.Predef$.require(Predef.scala:221)
    at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
    at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2126)
    at org.apache.spark.sql.DataFrame.select(DataFrame.scala:707)
    at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1188)
    at bacon$.main(bacon.scala:31)
    at bacon.main(bacon.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)

I would like to understand what is causing this error, and whether there is another way to accomplish what I am trying to do.

1 Answer:

Answer 0 (score: 1)

Generally, you do not need to create a new DataFrame for this. When you transform the existing DataFrame by adding the unique ID column to it, you already get the DataFrame you want, as sketched below. If you want to keep the result, just save it as a new Hive table.
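A minimal sketch of that approach, reusing the table and column names from the question (the output table name table_x_transformed is only a placeholder), assuming Spark 1.x with a HiveContext and an existing SparkContext sc:

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
val hiveDF = sqlContext.sql("select * from table_x")

// withColumn returns a new DataFrame derived from hiveDF, so there is
// no need to start from sqlContext.emptyDataFrame (which has no schema
// for withColumn's internal select("*") to expand against).
val trnEventDf = hiveDF
  .withColumn("system_generated_id", hiveDF("unique_key"))
  .withColumn("application_assigned_event_id", hiveDF("event_event_id"))

// Optionally persist the result as a new Hive table (placeholder name).
trnEventDf.write.saveAsTable("table_x_transformed")

If only the derived columns are needed, a select with aliases, for example hiveDF.select(hiveDF("unique_key").as("system_generated_id"), hiveDF("event_event_id").as("application_assigned_event_id")), gives the same result without carrying the original columns along.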