Unclear parsing error when converting a complex PSQL query to Spark SQL

Date: 2019-03-25 17:36:33

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

I have been doing some analysis on the Lumen database and recently switched to Spark, since the CSV is over 100 GB and too large for a single machine.
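(For context, my setup looks roughly like this; the file paths and app name are placeholders, not my real ones:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lumen-analysis").getOrCreate()

# Load the two CSV extracts. Without an explicit schema the columns come
# in as strings, hence the explicit casts in the SQL below.
dup = spark.read.csv("/data/lumen/sender_duplicate.csv", header=True)
util = spark.read.csv("/data/lumen/sender_duplicate_utility.csv", header=True)

# Register the names that the SQL refers to.
dup.createOrReplaceTempView("lumen_sender_duplicate")
util.createOrReplaceTempView("lumen_sender_duplicate_utility")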

Most of my queries run fine; however, the following one seems to have a problem:

psql = "select b.*, "\
"(select count(distinct c.notice_sender) from lumen_sender_duplicate_utility c where c.domain_name = b.domain_name and cast(c.num_of_dup_urls as int) = 0 ) num_of_distinct_senders "\
"from (select a.domain_name, sum(a.num_of_url) total_num_urls, "\
"sum(a.num_of_dup_urls) total_num_dup_urls, count(distinct a.notice_sender) total_num_senders " \
"from lumen_sender_duplicate a group by a.domain_name) b"

I have tried a number of changes and hit many errors; the most recent one is below (the full stack trace is at https://pastebin.com/raw/Xk4wVDmD):

Caused by: java.lang.RuntimeException: Couldn't find count(DISTINCT notice_sender)#419L in [domain_name#13,sum(cast(num_of_url#14 as double))#415,sum(cast(num_of_dup_urls#16 as double))#416,count(notice_sender#10)#417L]
     at scala.sys.package$.error(package.scala:27)
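(For what it's worth, the trace points at query planning rather than execution; if that is right, asking Spark for the plan alone should reproduce the error without reading any data:)

# psql is the query string above. explain() builds the physical plan
# without running the job, so a planning failure should surface here too.
spark.sql(psql).explain(True)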

At first I thought it was because some feature (such as DISTINCT aggregates or subqueries) was unavailable, but I am on Spark 2.4, so all of that should be supported. (I also tested each component separately and none of them seemed to have a problem.) If anyone can see where I am going wrong, any help would be much appreciated.
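For reference, one workaround I am considering is to drop the correlated scalar subquery entirely: pre-aggregate the utility table and left-join it back on domain_name. A sketch, using the same temp-view names as above; coalesce() supplies the 0 that the scalar subquery would return for domains with no matching utility rows:

rewritten = """
select b.domain_name,
       b.total_num_urls,
       b.total_num_dup_urls,
       b.total_num_senders,
       coalesce(c.num_of_distinct_senders, 0) as num_of_distinct_senders
from (select a.domain_name,
             sum(a.num_of_url) total_num_urls,
             sum(a.num_of_dup_urls) total_num_dup_urls,
             count(distinct a.notice_sender) total_num_senders
      from lumen_sender_duplicate a
      group by a.domain_name) b
left join (select domain_name,
                  count(distinct notice_sender) as num_of_distinct_senders
           from lumen_sender_duplicate_utility
           where cast(num_of_dup_urls as int) = 0
           group by domain_name) c
  on c.domain_name = b.domain_name
"""
result = spark.sql(rewritten)

The same shape is also straightforward in the DataFrame API (groupBy / countDistinct / join), if the SQL path keeps misbehaving.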

0 answers:

No answers yet.