With the Spark SQL module of SparkR, I am trying to test window functions. I am using Spark 1.6 and tried to reproduce the example provided by zero323 in two different deploy modes (local and yarn-client):
set.seed(1)  # make rnorm() reproducible
hc <- sparkRHive.init(sc)  # a HiveContext is required for window functions in Spark 1.6
sdf <- createDataFrame(hc, data.frame(x = 1:12, y = 1:3, z = rnorm(12)))
registerTempTable(sdf, "sdf")
# LAG(z) returns the previous z within each y partition, ordered by x
query <- sql(hc, "SELECT x, y, z, LAG(z) OVER (PARTITION BY y ORDER BY x) FROM sdf")
head(query)
## x y z _c3
## 1 1 1 -0.6264538 NA
## 2 4 1 1.5952808 -0.6264538
## 3 7 1 0.4874291 1.5952808
## 4 10 1 -0.3053884 0.4874291
## 5 2 2 0.1836433 NA
## 6 5 2 0.3295078 0.1836433
But in both deploy modes, when I execute the Spark action head(query), I get the same error:
16/01/21 18:03:17 ERROR r.RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
at org.apache.spark.sql.execution.Window.doExecute(Window.scala:245)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.
I ran this HiveQL query directly in Hive and it works fine. Likewise, a "classical" query such as the following works without any problem:

classical_query <- sql(hc, "SELECT * FROM sdf")
head(classical_query)
Thanks!
Answer 0 (score: 0):
I solved my problem: it was just a Spark configuration issue. I removed the /usr/hdp/current/hive-client/lib/hive-exec.jar JAR from the spark.driver.extraClassPath variable in my spark-defaults.conf file. Presumably that uber-JAR bundles its own copies of libraries (Kryo in particular) that conflict with the versions Spark ships, which breaks task serialization.
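
For illustration, the edit to spark-defaults.conf was roughly the following (the /etc/hadoop/conf entry is a hypothetical placeholder for whatever else was on the classpath; only dropping hive-exec.jar matters):

# before: hive-exec.jar on the driver classpath triggered the error
spark.driver.extraClassPath /usr/hdp/current/hive-client/lib/hive-exec.jar:/etc/hadoop/conf

# after: hive-exec.jar removed
spark.driver.extraClassPath /etc/hadoop/conf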