Question

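I am trying to run the query below (on Hive tables), but for some reason it hangs before it even starts executing. It hangs right after I paste it into the REPL, and nothing shows up in the web UI either.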

The Spark shell was started with the following parameters:

$ spark-shell --num-executors=10 --executor-cores=3 --executor-memory=16G --conf spark.sql.adaptive.enabled=true

testtable has some records, while testtable_stg has about 100 million. The real query has 15 columns; for simplicity, only 6 are included here.

{
     {
       spark.table("testtable_stg")
         .selectExpr(
           """md5(concat(coalesce(nullif(test1,'null'),'val'),
             coalesce(nullif(test2,'null'),'val'),
             coalesce(nullif(test3,'null'),'val'),
             coalesce(nullif(test4,'null'),'val'),
             cast(coalesce(test5,'2222-22-22') as date)
           )) as sk""",
           "coalesce(nullif(test1,'null'),'val') as test1",
           "coalesce(nullif(test2,'null'),'val') as test2",
           "coalesce(nullif(test3,'null'),'val') as test3",
           "coalesce(nullif(test4,'null'),'val') as test4",
           "cast(coalesce(test5,'2222-22-22') as date) as test5",
           "CAST(from_unixtime(unix_timestamp()) AS TIMESTAMP) as dt"
         )
     }.join(spark.table("testtable"), Seq("sk"), "leftanti")
       .write
       .format("parquet")
       .mode("Append")
       .saveAsTable("testtable")
}
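
Since the real query repeats the same pattern across 15 columns, the expression strings could also be generated from a list of column names. The following is only a sketch: the helper names `cleaned`, `skExpr` and `selectExprs` are hypothetical (not part of the original query), and it assumes every string column uses the same `coalesce(nullif(...))` pattern shown above.

```scala
// Hypothetical helpers (not from the original post) that build the repeated
// expression strings passed to selectExpr.
val stringCols = Seq("test1", "test2", "test3", "test4") // column names from the simplified query
val dateExpr = "cast(coalesce(test5,'2222-22-22') as date)"

// coalesce(nullif(c,'null'),'val') for a single column
def cleaned(c: String): String = s"coalesce(nullif($c,'null'),'val')"

// md5 over the concatenation of all cleaned columns plus the date column
val skExpr: String =
  (stringCols.map(cleaned) :+ dateExpr).mkString("md5(concat(", ", ", ")) as sk")

// Full argument list: surrogate key, cleaned columns, date column, load timestamp
val selectExprs: Seq[String] =
  (skExpr +: stringCols.map(c => s"${cleaned(c)} as $c")) ++ Seq(
    s"$dateExpr as test5",
    "CAST(from_unixtime(unix_timestamp()) AS TIMESTAMP) as dt"
  )
```

In spark-shell this would be used as `spark.table("testtable_stg").selectExpr(selectExprs: _*)`. It produces the same expressions as the hand-written version, so it is not expected to change how long planning takes; it only makes it easier to add or remove columns while testing.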

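When I paste it into the REPL without the write, it works fine. But when I call any action on this df (e.g. show, save), or even just explain the plan, it freezes.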

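I think I will eventually find a way to run this query. But I am more interested in Spark's behavior here, because it simply hangs. I run thousands of queries on this cluster, and normally I get some feedback (warnings, errors, and so on). With this particular query, though, it just hangs.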

Update

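I started removing columns from the select statement one by one and eventually got a query plan. The fewer the columns, the faster the query plan is generated. With all the columns, however, it takes a very long time, because building the query plan seems to take that long before the query even executes. Is there any way to speed this up? I tried increasing the driver memory, but with no luck. I am using Spark 2.1 and 2.2.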
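
To see which planning phase the time goes into as columns are added, the phases could be forced and timed one at a time on the driver. This is a rough sketch, not a fix: `timed` and `timePlanning` are hypothetical helper names, and it assumes `df` is the joined DataFrame from the query above, taken before the `.write` call.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: run `body`, print the elapsed wall-clock time, return the result.
def timed[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label%-16s ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

// Force each planning phase in turn; no action such as show or save is called.
def timePlanning(df: DataFrame): Unit = {
  val qe = df.queryExecution
  timed("analyzed")(qe.analyzed)           // resolved logical plan
  timed("optimizedPlan")(qe.optimizedPlan) // after optimizer rules
  timed("executedPlan")(qe.executedPlan)   // physical plan
}
```

Calling `timePlanning(df)` with 3, 4, 5, ... columns in the select would show how each phase grows. Because building the DataFrame without the write already returns quickly, the slow step is presumably the optimizer or the physical planner rather than analysis, but that is an assumption to check against the timings.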

Query hangs before executing (query plan takes time)

0 Answers: