Spark job fails to execute in YARN cluster mode

Date: 2018-01-26 17:55:59

Tags: pyspark

I am using Spark version 1.6.0 and Python version 2.6.6.

I have a pyspark script:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row

conf = SparkConf().setAppName("Log Analysis")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

loadFiles = sc.wholeTextFiles("hdfs:///locations")

# each file comes in as one (path, content) pair; log entries are separated by blank lines
fileWiseData = loadFiles.flatMap(lambda inpFile: inpFile[1].split("\n\n"))
replaceNewLine = fileWiseData.map(lambda lines: lines.replace("\n", ""))
filterLines = replaceNewLine.map(lambda lines: lines.replace("/", " "))
errorEntries = filterLines.filter(lambda errorLines: "Error" in errorLines)

errEntry = errorEntries.map(lambda line: gettingData(line))  # gettingData (defined elsewhere) formats each entry into a tuple

ErrorFiltered = Row('ExecutionTimeStamp', 'ExecutionDate', 'ExecutionTime', 'ExecutionEpoch', 'ErrorNum', 'Message')
errorData = errEntry.map(lambda r: ErrorFiltered(*r))

errorDataDf = sqlContext.createDataFrame(errorData)

When I execute the script after splitting my 1 GB log file into 20 MB pieces (or similarly into 30, 40 ... MB splits), it works fine. The job is submitted with:

  spark-submit --jars /home/hpuser/LogAnaysisPOC/packages/spark-csv_2.10-1.5.0.jar,/home/hpuser/LogAnaysisPOC/packages/commons-csv-1.1.jar \
--master yarn-cluster --driver-memory 6g --executor-memory 6g --conf spark.yarn.driver.memoryOverhead=4096 \
--conf spark.yarn.executor.memoryOverhead=4096 \
/home/user/LogAnaysisPOC/scripts/essbase/Essbaselog.py
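
For reference, the memory-related flags above can also be sketched as SparkConf settings inside the script (same values as the command line; this is only an illustration, not a change to the job):

from pyspark import SparkConf, SparkContext

# Same values as the spark-submit flags above. In yarn-cluster mode the driver
# JVM already exists before this code runs, so driver memory is normally only
# effective when set on spark-submit; this is shown just for completeness.
conf = (SparkConf()
        .setAppName("Log Analysis")
        .set("spark.executor.memory", "6g")
        .set("spark.yarn.driver.memoryOverhead", "4096")
        .set("spark.yarn.executor.memoryOverhead", "4096"))
sc = SparkContext(conf=conf)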

1) If I try to run the script with the 1 GB file as a single input, it fails at errorDataDf = sqlContext.createDataFrame(errorData) (see the first sketch below).

2) I need to join the parsed data with a metadata DataFrame of around 43 MB and then write the result:

dfinal.repartition(1).write.format("com.databricks.spark.csv").save("/user/user/loganalysis")

Again, this works with the split data but fails when run on the whole file in one go (see the second sketch below).
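
Regarding point 1: sc.wholeTextFiles returns one (path, content) pair per file, so an unsplit 1 GB log becomes a single 1 GB record before the split("\n\n") ever runs, which matches the job only surviving on pre-split input. A minimal sketch of an alternative read, assuming the blank line between entries really is the record separator, would let Hadoop split on that delimiter instead:

# Read one log entry per record instead of one whole file per record, so no
# single record has to hold an entire 1 GB file. The "\n\n" delimiter is an
# assumption taken from the split("\n\n") in the script above.
entries = sc.newAPIHadoopFile(
    "hdfs:///locations",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\n\n"}
).map(lambda kv: kv[1])  # keep only the entry text, drop the byte offset

cleaned = entries.map(lambda e: e.replace("\n", "").replace("/", " "))
errorEntries = cleaned.filter(lambda e: "Error" in e)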
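
Regarding point 2: repartition(1) forces the entire joined result through a single task before the CSV is written. A minimal sketch of writing with the existing partitioning instead (dfinal and the output path are the names from point 2; merging the parts afterwards is only one possible follow-up):

# Write without collapsing to a single partition, so no one task has to hold
# the whole result; the output is a directory of part-* files.
(dfinal
 .write
 .format("com.databricks.spark.csv")
 .save("/user/user/loganalysis"))

# If a single CSV file is required afterwards, the parts can be merged outside
# Spark, e.g. with: hadoop fs -getmerge /user/user/loganalysis loganalysis.csv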

The job execution fails with the error: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

The YARN scheduler settings are as follows:

yarn.scheduler.capacity.root.queues=default,hive1,hive2
yarn.scheduler.capacity.root.default.user-limit-factor=1
yarn.scheduler.capacity.root.default.state=RUNNING
yarn.scheduler.capacity.root.default.maximum-capacity=100
yarn.scheduler.capacity.root.default.capacity=50
yarn.scheduler.capacity.root.default.acl_submit_applications=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.maximum-am-resource-percent=0.5
yarn.scheduler.capacity.queue-mappings-override.enable=false
yarn.scheduler.capacity.root.default.minimum-user-limit-percent=25
yarn.scheduler.capacity.root.default.ordering-policy=fifo
yarn.scheduler.capacity.root.hive1.acl_administer_queue=*
yarn.scheduler.capacity.root.hive1.acl_submit_applications=*
yarn.scheduler.capacity.root.hive1.capacity=25
yarn.scheduler.capacity.root.hive1.maximum-capacity=100
yarn.scheduler.capacity.root.hive1.minimum-user-limit-percent=25
yarn.scheduler.capacity.root.hive1.ordering-policy=fifo
yarn.scheduler.capacity.root.hive1.state=RUNNING
yarn.scheduler.capacity.root.hive1.user-limit-factor=1
yarn.scheduler.capacity.root.hive2.acl_administer_queue=*
yarn.scheduler.capacity.root.hive2.acl_submit_applications=*
yarn.scheduler.capacity.root.hive2.capacity=25
yarn.scheduler.capacity.root.hive2.maximum-capacity=100
yarn.scheduler.capacity.root.hive2.minimum-user-limit-percent=25
yarn.scheduler.capacity.root.hive2.ordering-policy=fifo
yarn.scheduler.capacity.root.hive2.state=RUNNING
yarn.scheduler.capacity.root.hive2.user-limit-factor=1
yarn.scheduler.capacity.root.user-limit-factor=1

cluster details

I have asked the same question on the forum as well.

Any kind of suggestion is greatly appreciated.

0 Answers:

There are no answers yet.