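In my HDFS I have collected around 350 CSV files. Each file ranges in size from a few KB to 250 MB. I need to insert the values from these CSV files into a table named RECORD. During the insert I also need to look up values in two other tables (PARAMETER and FRAME_RATE). I use the following queries to do this.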
-- create external table for the csv files in hdfs
CREATE EXTERNAL TABLE TEMP_CSV(
FRAME_RANK BIGINT,
FRATE BIGINT,
SOURCE STRING,
PARAM STRING,
RECORDEDVALUE STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
location '/user/bala/output'
TBLPROPERTIES ("skip.header.line.count"="2");
-- Now insert fresh values into RECORD
INSERT OVERWRITE TABLE RECORD
PARTITION(SESSION)
SELECT DISTINCT
TEMP_CSV.FRAME_RANK,
PARAMETER.K_ID,
FRAME_RATE.K_ID,
CAST(TEMP_CSV.RECORDEDVALUE as FLOAT),
split(reverse(split(reverse(TEMP_CSV.INPUT__FILE__NAME),"/")[0]), "[.]")[0] AS SESSION
FROM TEMP_CSV , PARAMETER, FRAME_RATE
WHERE PARAMETER.NAME = TEMP_CSV.PARAM AND FRAME_RATE.FRATE = TEMP_CSV.FRATE;
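The SESSION partition value in the query above is built from INPUT__FILE__NAME: the nested reverse/split calls keep only the file name, and the final split on "[.]" drops everything from the first dot onward. As a rough illustration only, the same logic can be mirrored in Python; the function name and the sample path below are made up for this sketch and do not come from the original post.

```python
import re


def session_from_path(path):
    """Mirror split(reverse(split(reverse(path), "/")[0]), "[.]")[0]."""
    # Reverse the path and keep the part before the first "/":
    # this is the file name, still reversed.
    reversed_name = path[::-1].split("/")[0]
    # Reverse again to recover the file name itself.
    file_name = reversed_name[::-1]
    # Hive's split() takes a regular expression; "[.]" matches a literal dot.
    return re.split(r"[.]", file_name)[0]


# Hypothetical input path, for illustration only.
print(session_from_path("/user/bala/output/session_042.csv"))  # prints: session_042
```

Note that a file name containing more than one dot is cut at the first dot, so "a.b.csv" yields "a".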
In my small PoC study I had around 50 CSV files, and this query populated the RECORD table successfully in about 500 seconds with the following configuration:
Hive-on-spark
spark standalone
6 nodes in the cluster
4 cores per node / 16gb RAM
spark.executor.memory 2g
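However, when I processed all 350 files, the query failed with a Java heap space error in the executors. So I increased spark.executor.memory to 4g. It failed. I increased it to 6g. It failed. Finally, I increased spark.executor.memory to 12g. It succeeded, but it took about 2 hours 30 minutes. Raising spark.executor.memory to 12g left room for only one executor per node, and therefore only 6 executors in total.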
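The executor counts above follow from simple division. The sketch below is a memory-only upper bound, and it assumes the full 16 GB of each node is available to the Spark worker, which the post does not state; in practice the OS and other daemons take a share, and core settings can limit the count further. The function name is hypothetical.

```python
NODES = 6             # nodes in the cluster (from the post)
RAM_PER_NODE_GB = 16  # RAM per node (from the post); assumed fully usable here


def max_executors_by_memory(executor_memory_gb):
    """Return (executors per node, executors in total), bounded by memory only."""
    per_node = RAM_PER_NODE_GB // executor_memory_gb
    return per_node, per_node * NODES


for mem in (2, 4, 6, 12):
    per_node, total = max_executors_by_memory(mem)
    print(f"{mem}g: at most {per_node} per node, {total} in total")
```

With 12g only one executor fits in 16 GB, which matches the 6 executors reported above.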
This is the log from the failed run with executor memory set to 6g:
******
******
2017-06-12 11:59:09,988 Stage-1_0: 101/101 Finished Stage-2_0: 12/12 Finished Stage-3_0: 0(+12,-2)/12
2017-06-12 11:59:12,997 Stage-1_0: 101/101 Finished Stage-2_0: 12/12 Finished Stage-3_0: 0(+12,-2)/12
2017-06-12 11:59:16,004 Stage-1_0: 101/101 Finished Stage-2_0: 12/12 Finished Stage-3_0: 0(+12,-2)/12
2017-06-12 11:59:19,012 Stage-1_0: 101/101 Finished Stage-2_0: 12/12 Finished Stage-3_0: 0(+12,-2)/12
*****
*****
This is the error log from the executor:
17/06/12 11:58:36 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(5,[Lscala.Tuple2;@e65f7b8,BlockManagerId(5, bndligpu04, 54618))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [50 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:476)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:505)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:505)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:505)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1801)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:505)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [50 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
17/06/12 11:58:36 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 115)
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.orc.impl.OutStream.getNewInputBuffer(OutStream.java:109)
at org.apache.orc.impl.OutStream.write(OutStream.java:130)
at org.apache.orc.impl.RunLengthIntegerWriterV2.writeDeltaValues(RunLengthIntegerWriterV2.java:238)
at org.apache.orc.impl.RunLengthIntegerWriterV2.writeValues(RunLengthIntegerWriterV2.java:186)
at org.apache.orc.impl.RunLengthIntegerWriterV2.write(RunLengthIntegerWriterV2.java:772)
at org.apache.orc.impl.WriterImpl$IntegerTreeWriter.writeBatch(WriterImpl.java:1039)
at org.apache.orc.impl.WriterImpl$StructTreeWriter.writeRootBatch(WriterImpl.java:1977)
at org.apache.orc.impl.WriterImpl.addRowBatch(WriterImpl.java:2759)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushInternalBatch(WriterImpl.java:277)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:296)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:103)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:743)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:97)
at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:309)
at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:267)
at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:49)
at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
at org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
17/06/12 11:58:36 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-1,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.orc.impl.OutStream.getNewInputBuffer(OutStream.java:109)
at org.apache.orc.impl.OutStream.write(OutStream.java:130)
at org.apache.orc.impl.RunLengthIntegerWriterV2.writeDeltaValues(RunLengthIntegerWriterV2.java:238)
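My questions are:
Any help or information toward solving this problem would be very useful. One more detail: the 'SELECT' statement on its own works, and I can see its results in my Hue browser. The query breaks only when I try to insert the rows returned by the 'SELECT'.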
Answer 0 (score: 0)
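After digging deeper into the logs and the tables, I did the following:
I removed the clustering from the RECORD table. Previously, RECORD was bucketed (12 buckets), which produced 12 tasks in the second stage. To increase that number, I removed the bucketing. Now the stage produces 273 tasks. I still don't know the reason behind this. However, this configuration works with 4 GB of executor memory.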
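I also moved to a different Spark deployment configuration. This improved performance. Now I am able to complete the query in 35 minutes.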
However, I can still see one or two opportunities to optimize the query. I will try rewriting it with joins.
Answer 1 (score: -1)
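You can try increasing the executor cores for this job.
Executor cores are the number of concurrent tasks an executor can run. Worker cores are the "CPU cores" made available to the worker process.
In Spark, there is an option to set the number of CPU cores when starting a worker (slave). It defines the total CPU cores that Spark applications are allowed to use on that machine, and it applies to workers only. The default is to use all available cores.
The command to start Spark looks like this: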
./sbin/start-all.sh --cores 2
Or you can try using --executor-cores 2
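To make the relationship between executor cores and concurrent tasks concrete, here is a small back-of-the-envelope sketch. It combines numbers from different parts of this thread (6 executors from the question, 2 cores per executor from the suggestion above, and the 12 and 273 task counts from the other answer), so treat it as an illustration under those assumptions rather than a measurement; the function names are hypothetical.

```python
import math


def task_slots(executors, cores_per_executor):
    """Tasks that can run at the same time across the whole cluster."""
    return executors * cores_per_executor


def waves(num_tasks, slots):
    """Scheduling rounds needed to run num_tasks on the given number of slots."""
    return math.ceil(num_tasks / slots)


slots = task_slots(executors=6, cores_per_executor=2)
print("slots:", slots)
for tasks in (12, 273):
    print(tasks, "tasks need", waves(tasks, slots), "wave(s)")
```

All tasks running concurrently inside one executor share that executor's heap, so more cores per executor also means more tasks competing for the same memory.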