Slave lost error and very slow join in Spark

Time: 2016-11-16 04:12:39

Tags: performance join apache-spark slave

I joined two DataFrames on a common column and then called the show method:

    df = df1.join(df2, df1.col1 == df2.col2, 'inner')
    df.show()

The join then ran very slowly and finally failed with a "Slave lost" error:

    Py4JJavaError: An error occurred while calling o109.showString.

    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 : ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Slave lost

    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
        at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
        at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
        at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
        at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
        at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
        at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
        at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
        at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
        at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
        at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
        at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
        at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
        at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

After some searching, this seemed to be a memory-related problem, so I increased the repartition count to 3000, increased the executor memory, and increased memoryOverhead, but still had no luck: I got the same Slave lost error. During df.show() I noticed that the shuffle write size of one executor was very high, while the others' were not. Any clues? Thanks.
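
For reference, here is a minimal sketch of the tuning steps described above, together with a quick check of the join-key distribution (df1, df2, col1, and col2 are the names from the question; the memory values are illustrative placeholders; one hot key in the counts would explain a single executor receiving most of the shuffle write):

    # Sketch only: session-level memory settings (values are placeholders)
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .config("spark.executor.memory", "8g")                 # larger executor heap
             .config("spark.yarn.executor.memoryOverhead", "2048")  # extra off-heap headroom (MB)
             .getOrCreate())

    # More partitions spreads the shuffle, but does not help if one key dominates
    df = df1.repartition(3000).join(df2, df1.col1 == df2.col2, 'inner')

    # Count rows per join key on both sides; a single huge count indicates skew
    df1.groupBy('col1').count().orderBy(F.desc('count')).show(20)
    df2.groupBy('col2').count().orderBy(F.desc('count')).show(20)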

1 Answer:

Answer 0 (score: 1):

If you are using Scala, try:

    val df = df1.join(df2, Seq("column name"))

In PySpark:

    df = df1.join(df2, ["columnname"])

    df = df1.join(df2, df1.columnname == df2.columnname)
    display(df)
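
Note that passing the join key as a list (or a Seq in Scala) yields a single columnname column in the output, while the expression form keeps both df1.columnname and df2.columnname, which can trigger ambiguous-column errors downstream. Also, display() is a Databricks notebook helper; in plain PySpark, use df.show() instead.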

If you want to do the same in PySpark using SQL:

    df1.createOrReplaceTempView("left_test_table")
    df2.createOrReplaceTempView("right_test_table")

    left = spark.sql("SELECT * FROM left_test_table")
    right = spark.sql("SELECT * FROM right_test_table")

    # Join the two views, drop the duplicated key column, and inspect the result
    left.join(right, left.columnname == right.columnname).drop(left.columnname).show(5)
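
The snippet above registers the temp views but still performs the join through the DataFrame API. The join itself can also be expressed in SQL; a minimal sketch, with columnname standing in for the real key column:

    df = spark.sql("""
        SELECT *
        FROM left_test_table l
        JOIN right_test_table r
          ON l.columnname = r.columnname
    """)
    df.show()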