I am running a Spark application that does a simple diff between two DataFrames. I execute the jar file in my cluster environment, which is a 94-node cluster. The two datasets are 2 GB and 4 GB and are mapped to DataFrames.
My job works fine for very small files...
I personally feel that saveAsTextFile is what takes most of the time in my application.
Below are my cluster details:
Total Vmem allocated for Containers 394.80 GB
Total VCores allocated for Containers 36
This is how I run my Spark job:
spark-submit --queue root.queue --deploy-mode client --master yarn SparkApplication-SQL-jar-with-dependencies.jar
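For reference, the same resource settings can also be passed on the command line instead of in code; in client mode the driver JVM is already running by the time conf.set is called, so spark.driver.memory in particular only takes effect when given to spark-submit. A sketch using the values from this post (the queue name and sizes are just the ones shown here):

spark-submit --queue root.queue --deploy-mode client --master yarn \
  --driver-memory 32g \
  --executor-memory 32g \
  --conf spark.driver.maxResultSize=4g \
  SparkApplication-SQL-jar-with-dependencies.jar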
Here is my code:
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ StructType, StructField, StringType, DoubleType, IntegerType }
import org.apache.spark.sql.functions._

object TestDiff {

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("WordCount")
    conf.set("spark.executor.memory", "32g")
    conf.set("spark.driver.memory", "32g")
    conf.set("spark.driver.maxResultSize", "4g")

    val sc = new SparkContext(conf) // Creating the Spark context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // Schema for the pipe-delimited input files
    val schema = StructType(Array(
      StructField("filler1", StringType),
      StructField("dunsnumber", StringType),
      StructField("transactionalindicator", StringType)))

    // Read both datasets; split on "|" with limit -1 so trailing empty fields are kept
    val textRdd1 = sc.textFile("/home/cloudera/TRF/PCFP/INCR")
    val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split("\\|", -1)))
    val df1 = sqlContext.createDataFrame(rowRdd1, schema)

    val textRdd2 = sc.textFile("/home/cloudera/TRF/PCFP/MAIN")
    val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split("\\|", -1)))
    val df2 = sqlContext.createDataFrame(rowRdd2, schema)

    // Finding the diff between the two if any of the columns has changed
    val diffAnyColumnDF = df1.except(df2)
    diffAnyColumnDF.rdd.coalesce(1).saveAsTextFile("Diffoutput")
  }
}
It runs for more than 30 minutes and then fails with the exception below. Here is the log:
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
17/09/15 11:55:01 WARN netty.NettyRpcEnv: Ignored message: HeartbeatResponse(false)
17/09/15 11:56:19 WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;@7fe57079,BlockManagerId(1, c755kds.int.int.com, 33507))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:491)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:520)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:520)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:520)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1818)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:520)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
17/09/15 11:56:19 WARN netty.NettyRpcEnv: Ignored message: HeartbeatResponse(false)
Please suggest how to tune my Spark job.
I changed only the executor memory, and then the job completed successfully, but it was very slow.
conf.set("spark.executor.memory", "64g")
But the job is slow... it takes around 15 minutes to complete.
Attaching the DAG visualization.
Even after increasing the time, I am getting the timed-out error below:
executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 175200 ms
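The timeout named in the log is controlled by spark.executor.heartbeatInterval, so one option sometimes tried is raising it together with the overall network timeout. A sketch only, added to the existing conf in main; this treats the symptom of slow heartbeats rather than the cost of funnelling the whole result through one partition, and the heartbeat interval should stay well below spark.network.timeout:

conf.set("spark.executor.heartbeatInterval", "60s")
conf.set("spark.network.timeout", "600s")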
Answer 0 (score: 1)
I think your single-partition output file is very large. Transferring that much data over one TCP channel takes a long time, and the connection cannot stay alive that long and gets reset.
Can you coalesce into more partitions instead?
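For example, writing with more partitions instead of coalescing everything down to 1 keeps the save step parallel. A sketch based on the code above; 50 is an arbitrary partition count, and the resulting part files would need to be merged afterwards if a single output file is truly required:

diffAnyColumnDF.rdd.coalesce(50).saveAsTextFile("Diffoutput")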