I am running Spark SQL with the statements and configuration below, but the dfs.reduce((x, y) => x.union(y)).distinct().coalesce(1)
step apparently takes a long time to execute (around 5 minutes), even though my input Parquet file has only 88 records. Any idea what the problem might be?
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("SparkSessionZipsExample")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .config("spark.master", "local")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
// set new runtime options
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")
spark.conf.set("spark.driver.host", "localhost")
spark.conf.set("spark.cores.max", "8")
// m: the collection of column names to profile
val dfs = m.map(field => spark.sql(
  s"select 'DataProfilerStats' as Table_Name, '$field' as Column_Name, min($field) as min_value from parquetDFTable"))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct().coalesce(1)
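For comparison, below is a minimal sketch of the same per-column minimum computation expressed as a single aggregation, so it runs as one job instead of one query per column followed by N-1 unions and a distinct (which forces a full shuffle). It assumes m is a Seq[String] of column names and that the Parquet data is registered as parquetDFTable, as in the snippet above; note the result comes back as one wide row rather than one row per column.

// Minimal sketch, assuming `m: Seq[String]` holds the column names and
// "parquetDFTable" is already registered as a temp view.
import org.apache.spark.sql.functions.min

val parquetDF = spark.table("parquetDFTable")
// One aggregation over all columns: a single job, no per-column union/distinct.
val mins = parquetDF.agg(min(m.head), m.tail.map(c => min(c)): _*)
mins.show()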
UPDATE
I have a single Parquet file that I am reading into a DataFrame; part of the question is also whether it can be split into smaller chunks.
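Regarding the update, here is a minimal sketch of one way to split a single Parquet file into more partitions after reading it. The path "data/input.parquet" is a placeholder, not taken from the original setup.

// Minimal sketch; "data/input.parquet" is a placeholder path.
val parquetDF = spark.read.parquet("data/input.parquet")
println(parquetDF.rdd.getNumPartitions)  // how many partitions the file was read into
val split = parquetDF.repartition(4)     // redistribute rows across 4 partitions
split.createOrReplaceTempView("parquetDFTable")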