Simple SparkSQL join query losing executors

Time: 2016-10-17 07:54:41

Tags: scala apache-spark apache-spark-sql

I am running a simple SparkSQL query that does a match between 2 datasets, each of which is about 500 GB, so the whole input is roughly 1 TB.

val adreqPerDeviceid = sqlContext.sql("select count(Distinct a.DeviceId) as MatchCount from adreqdata1 a inner join adreqdata2  b ON a.DeviceId=b.DeviceId ")
adreqPerDeviceid.cache()
adreqPerDeviceid.show()

It works fine until the data is loaded (around 10k tasks are assigned). At the .cache line, 200 tasks are assigned, and that is where it fails! I know I am not caching a huge dataset, just a single number, so why does it fail here?

Below are the error details:

at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1314)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1377)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:401)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:362)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:370)
at comScore.DayWiseDeviceIDMatch$.main(DayWiseDeviceIDMatch.scala:62)
at comScore.DayWiseDeviceIDMatch.main(DayWiseDeviceIDMatch.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

2 Answers:

Answer 0 (score: 0)

Most likely the number of distinct device IDs does not fit into a single executor's RAM. Try spark.conf.set("spark.sql.shuffle.partitions", 500) to get 500 tasks instead of the current 200. If the query still performs poorly, double it again.
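Since the question runs through a SQLContext (Spark 1.x), the equivalent knob there is setConf; below is a minimal sketch under that assumption, reusing the temp table names adreqdata1 and adreqdata2 from the question:

// Raise the number of shuffle partitions before running the join, so each
// reduce task handles a smaller slice of the ~1 TB shuffle (Spark 1.x API).
sqlContext.setConf("spark.sql.shuffle.partitions", "500")

val adreqPerDeviceid = sqlContext.sql(
  "select count(distinct a.DeviceId) as MatchCount " +
  "from adreqdata1 a inner join adreqdata2 b on a.DeviceId = b.DeviceId")

adreqPerDeviceid.show()  // the join/aggregation stage now runs with 500 tasks instead of 200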

Another thing that can make the query work better is to sort the data on the key you are joining on, as sketched below.
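A minimal sketch of that idea, assuming Spark 1.6+ (where repartitioning by a column and sortWithinPartitions are available) and that adreqdata1/adreqdata2 are already registered as temp tables:

import org.apache.spark.sql.functions.col

// Pre-partition and sort both sides on the join key so the sort-merge join
// gets co-located, pre-sorted input in each partition.
val left = sqlContext.table("adreqdata1")
  .repartition(col("DeviceId"))
  .sortWithinPartitions("DeviceId")
val right = sqlContext.table("adreqdata2")
  .repartition(col("DeviceId"))
  .sortWithinPartitions("DeviceId")

left.registerTempTable("adreqdata1_sorted")
right.registerTempTable("adreqdata2_sorted")
// Then run the same count(distinct ...) query against the *_sorted tables.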

Answer 1 (score: 0)

Whenever you do a join on huge datasets, i.e. you are looking for an aggregated value out of the join of 2 datasets, the cluster needs a minimum of (Dataset1 + Dataset2) worth of hard disk space, not RAM. Then the job will succeed.
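To act on that, the main thing to check is that Spark's shuffle scratch space points at a volume with enough free disk. A minimal sketch, assuming a standalone or local deployment where spark.local.dir is honored (on YARN the node manager's local-dirs setting takes precedence); the path below is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Point Spark's scratch/shuffle directory at a disk with at least
// Dataset1 + Dataset2 (~1 TB here) of free space.
val conf = new SparkConf()
  .setAppName("DayWiseDeviceIDMatch")
  .set("spark.local.dir", "/mnt/bigdisk/spark-scratch")  // hypothetical mount point

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)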