I am running the following Spark code, and it runs out of memory every time. The data size is not large, but over time it throws a GC error, i.e. the garbage collector has too many objects to collect. Selecting a small number of columns and rows from a table should not carry much overhead or create that many objects on the heap. Am I creating too many immutable objects just by firing the select query? I am not sure why it complains about a GC error.
```scala
import org.apache.log4j.LogManager
import org.apache.spark.sql.{DataFrame, SparkSession}

object O {
  def main(args: Array[String]): Unit = {
    val log = LogManager.getRootLogger
    val TGTTBL = "XYZ"
    val outputLocation = "somepath"
    val dql = "select col1, col2, col3 from SOURCETBL where condition"
    val appName = "createDFS"

    val spark = SparkSession.builder()
      .appName(appName)
      .enableHiveSupport()
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .getOrCreate()

    log.warn("Select Query........................")
    log.warn(dql)
    val tblDF = spark.sql(dql)

    // Builds a "name type" column list from the DataFrame schema,
    // dropping the trailing partition columns and stripping the
    // "Type" suffix (e.g. StringType -> String).
    def getCols(df: DataFrame): String = {
      df.columns
        .map(c => s"$c ${df.schema(c).dataType}")
        .dropRight(3)
        .mkString(",")
        .replace("Type", "")
    }

    val colString = getCols(tblDF)
    log.warn("Create Table def........................")
    log.warn(colString)

    spark.sql(s"drop table if exists $TGTTBL")
    spark.sql(s"Create external table if not exists $TGTTBL ($colString)" +
      s" partitioned by (col1 string, col2 string, col3 string) stored as orc location '$outputLocation'")

    // The table is partitioned by three columns, so all three go into partitionBy.
    tblDF.write.partitionBy("col1", "col2", "col3").format("orc").mode("overwrite").save(outputLocation)
  }
}
```
**Error:**

```
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
```
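For context, a common first check for this error (a minimal sketch only, not what the job above actually ran; the heap sizes and GC flag are assumptions to tune per cluster) is to give the executors more heap or a different collector:

```scala
// Sketch only: memory values and collector choice are placeholders.
val spark = SparkSession.builder()
  .appName("createDFS")
  .enableHiveSupport()
  .config("spark.executor.memory", "4g")                     // more executor heap
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC") // alternative collector
  // spark.driver.memory only takes effect here in cluster mode;
  // in client mode pass it on the spark-submit command line instead.
  .getOrCreate()
```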
**Answer 0 (score: 0)**
I kept running into the above error while trying to read through Spark SQL, so I created an RDD[Object] from the input file and converted it to a DataFrame with the rdd.toDF() method. That solved my problem.
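In code, a minimal sketch of that workaround might look like the following (the case class, input path, and delimiter are assumptions, since the answer does not show them):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for the three selected columns.
case class Rec(col1: String, col2: String, col3: String)

object RddToDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rddToDF").enableHiveSupport().getOrCreate()
    import spark.implicits._ // brings rdd.toDF() into scope

    // Read the raw input file into an RDD[Rec] instead of going through spark.sql(...).
    val rdd = spark.sparkContext
      .textFile("somepath/input")       // assumed input location
      .map(_.split(","))                // assumed delimiter
      .map(a => Rec(a(0), a(1), a(2)))

    // Convert to a DataFrame and write it partitioned, as in the original job.
    rdd.toDF().write.partitionBy("col1", "col2", "col3")
      .format("orc").mode("overwrite").save("somepath")
  }
}
```

Whether this actually reduces GC pressure depends on the job; the main difference is that it bypasses the Hive read path that the original spark.sql(dql) call used.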