GC error when running a Spark job

Time: 2017-06-14 06:33:01

Tags: scala apache-spark apache-spark-sql spark-dataframe

I am running the Spark code below and it runs out of memory every time. The data size is not large, but over time it throws a GC error, which suggests the garbage collector has too many objects to collect. Selecting a few columns and a modest amount of data from a table should not carry much overhead or create that many objects on the heap. It seems I am creating too many immutable objects simply by triggering the select query, but I am not sure why it complains about a GC error.

import org.apache.log4j.LogManager
import org.apache.spark.sql.{DataFrame, SparkSession}

object O {

  def main(args: Array[String]): Unit = {

    val log = LogManager.getRootLogger
    val TGTTBL = "XYZ"
    val outputLocation = "somepath"
    val dql = "select col1, col2, col3 from SOURCETBL where condition"
    val appName = "createDFS"
    val spark = SparkSession.builder()
      .appName(appName)
      .enableHiveSupport()
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .getOrCreate()

    log.warn("Select Query........................")
    log.warn(dql)
    val tblDF = spark.sql(dql)

    // Build the column-definition string for the CREATE TABLE statement:
    // pair each column name with its Spark SQL data type, drop the last three
    // entries, join them with commas, and strip the "Type" suffix so that
    // e.g. StringType becomes String.
    def getCols(df: DataFrame): String = {
      df.columns
        .map(c => s"$c ${df.schema(c).dataType}")
        .dropRight(3)
        .mkString(",")
        .replace("Type", "")
    }

    val colString = getCols(tblDF)
    log.warn("Create Table def........................")
    log.warn(colString)
    spark.sql(s"drop table if exists $TGTTBL")
    spark.sql(s"Create external table if not exists $TGTTBL ($colString)" +
      s" partitioned by (col1 string, col2 string, col3 string) stored as orc location '$outputLocation'")

    tblDF.write.partitionBy("col1", "col2", "col3")
      .format("orc").mode("overwrite").save(outputLocation)
  }
}


**Error** - Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

1 Answer:

Answer 0: (score: 0)

I kept hitting the above error while trying to read the data through Spark SQL. So I created an RDD[Object] from the input file and converted it to a DataFrame with the rdd.toDF() method. That solved my problem.
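
For reference, here is a minimal sketch of that workaround, assuming a comma-delimited text input and a hypothetical Record case class with columns col1, col2, col3 (the answer does not show the actual input format or schema):

import org.apache.spark.sql.SparkSession

// Hypothetical record layout; adjust to the real input schema.
case class Record(col1: String, col2: String, col3: String)

object RddToDF {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("rddToDF")
      .getOrCreate()
    import spark.implicits._ // required for rdd.toDF()

    // Read the raw input as an RDD, parse each line into the case class,
    // then convert it to a DataFrame with toDF().
    val rdd = spark.sparkContext
      .textFile("someInputPath")  // placeholder path
      .map(_.split(","))          // assumes comma-delimited lines
      .map(a => Record(a(0), a(1), a(2)))

    val df = rdd.toDF()
    df.write.partitionBy("col1", "col2", "col3")
      .format("orc").mode("overwrite").save("somepath")
  }
}

Whether this actually reduces GC pressure depends on the input and the cluster configuration; the post does not include enough detail to say why it helped in the answerer's case.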