I have a Spark SQL statement that inserts data into an external partitioned Hive table. For only about 200k records, the insert takes more than 30 minutes.
I tried increasing spark.executor.memoryOverhead to 4096, but the insert statement still takes the same amount of time.
These are the values supplied to spark-submit:
```
--executor-cores 4 --executor-memory 3G --num-executors 25 --conf spark.executor.memoryOverhead=4096 --driver-memory 4g
```
Spark code:
```
# Register the DataFrame as a temp view so it can be referenced from SQL
Table_1.createOrReplaceTempView(tempViewName)

config = self.context.get_config()

# Build the INSERT statement from config and append the temp view name
insert_query = config['tables']['hive']['1']['insertStatement']
insertStatement = insert_query + tempViewName
self.spark.sql(insertStatement)
self.logger.info("************insert completed************")

# Repair the Hive table so the new partitions are registered
repairTableQuery = config['tables']['hive']['training']['repairtable']
self.spark.sql(repairTableQuery)
self.logger.info("************repair completed************")

end = datetime.now()
```
Would doing a coalesce on the DataFrame before the insert statement help it run faster?
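For reference, this is roughly what I have in mind (a sketch only; the partition count of 10 is an arbitrary placeholder, not a value from my actual job):

```
# Hypothetical sketch: reduce the number of partitions (and output files) before the insert.
coalesced_df = Table_1.coalesce(10)
coalesced_df.createOrReplaceTempView(tempViewName)
self.spark.sql(insertStatement)
```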