SQL query gives: Error in SQL statement: package.TreeNodeException: execute, tree:

Time: 2019-04-10 17:59:07

Tags: python sql pyspark apache-spark-sql pyspark-sql

I'm running a moderately complex SQL query and hit this error, which I have not seen before and cannot find a good explanation for:

Error in SQL statement: package.TreeNodeException: execute, tree:

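I'm including the whole query here as part of the question, because I have not been able to isolate a small example that demonstrates the problem: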

WITH visits AS (
  SELECT 
    visitor_key
    , channel_vec AS digital_marketing_channel
    , to_timestamp(date_time, "MM/dd/yyyy HH:mm") AS timestamp
    , (HOUR(to_timestamp(date_time, "MM/dd/yyyy HH:mm")) / 24) + (MINUTE(to_timestamp(date_time, "MM/dd/yyyy HH:mm")) / (24 * 60)) AS days_carried
    , conversion
  FROM vectorised
), conversions_only AS (
  SELECT
    visitor_key 
    , conversion
    , timestamp
    , days_carried
    , RANK() OVER(PARTITION BY visitor_key ORDER BY timestamp) AS conversion_rank
  FROM visits
  WHERE conversion = 1
), all_conversions AS (
  SELECT 
    v.*
    , MIN(conversion_rank) AS path_id
  FROM visits v
  JOIN conversions_only c ON v.visitor_key = c.visitor_key
  WHERE v.timestamp <= c.timestamp
  GROUP BY
    v.visitor_key
    , v.digital_marketing_channel
    , v.timestamp
    , v.days_carried
    , v.conversion
), converted_paths AS (
SELECT 
  a.*
  , CASE 
      WHEN path_id > 1 THEN 1
      ELSE 0 
    END AS previous_conversion
  , DATEDIFF(c.timestamp, a.timestamp) + c.days_carried - a.days_carried AS path_days_remaining
  , 1 AS converted_path
FROM all_conversions a
JOIN conversions_only c ON a.visitor_key = c.visitor_key AND a.path_id = c.conversion_rank
), all_paths AS (
  SELECT 
    visitor_key
    , 0 AS path_id
    , digital_marketing_channel
    , conversion
    , DATEDIFF("2019-04-02", timestamp) - days_carried AS path_days_remaining
    , 0 AS converted_path
    , 0 AS previous_conversion
  FROM visits
  WHERE visitor_key NOT IN (SELECT DISTINCT visitor_key FROM all_conversions)
  UNION ALL 
  SELECT 
    visitor_key
    , path_id
    , digital_marketing_channel
    , conversion
    , path_days_remaining
    , converted_path
    , previous_conversion
  FROM converted_paths
), steps AS (
  SELECT 
    *
    , ROW_NUMBER() OVER(PARTITION BY visitor_key, path_id ORDER BY path_days_remaining, conversion DESC) AS step_id
  FROM all_paths
  WHERE conversion = 0
  ORDER BY visitor_key, path_id, path_days_remaining DESC, conversion
), output AS (
SELECT 
  visitor_key
  , path_id
  , pad_matrix(collect_list(digital_marketing_channel), 10) AS channels
  , collect_list(path_days_remaining) AS days_remaining
  , converted_path
  , previous_conversion
FROM steps
WHERE step_id < 11
GROUP BY 
  visitor_key
  , path_id
  , converted_path
  , previous_conversion
), helper AS (
  SELECT 
    visitor_key
    , path_id
    , converted_path
    , previous_conversion
    , COUNT(*) AS steps
  FROM all_paths
  GROUP BY 
    visitor_key
    , path_id
    , converted_path
    , previous_conversion
)  
SELECT 
  *
FROM helper
WHERE converted_path = 0

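The problem seems to originate in the "helper" table and the final SELECT statement, and appears to be specific to the converted_path = 0 condition; converted_path is a column containing 0s and 1s. To complicate matters, using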

WHERE converted_path = 1

works, whereas

WHERE converted_path != 1

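causes the same error. Moving the WHERE clause up into the "helper" table produces the same problem. I can run exactly the same analysis on other columns without any issue; the problem occurs only with the converted_path column. If I save the output of the unfiltered "helper" table as a table in the database, I can run the filter query on the new table as desired. Likewise, if I save the "all_paths" table that "helper" draws from as a new table, I can run both "helper" and the final filtered SELECT statement against the saved "all_paths" table.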
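
For anyone who wants to try the same workaround (materialize the intermediate result, then run "helper" and the filter against the saved table), here is a minimal sketch of the pattern. It only illustrates the pattern and does not explain or fix the error: it uses Python's built-in sqlite3 as a stand-in engine so that it runs anywhere, and the table name all_paths_saved and the sample rows are hypothetical. On Spark the materialization step would presumably be a CREATE TABLE ... AS SELECT or saving the DataFrame as a table.

```python
import sqlite3

# Assumption: sqlite3 stands in for the Spark SQL engine; all rows are made up.
conn = sqlite3.connect(":memory:")

# Step 1: materialize what the "all_paths" CTE would produce as a real table.
# (all_paths_saved is a hypothetical name, with the same columns as all_paths.)
conn.executescript("""
CREATE TABLE all_paths_saved (
    visitor_key TEXT,
    path_id INTEGER,
    digital_marketing_channel TEXT,
    conversion INTEGER,
    path_days_remaining REAL,
    converted_path INTEGER,
    previous_conversion INTEGER
);
INSERT INTO all_paths_saved VALUES
    ('a', 0, 'email',  0, 3.5, 0, 0),
    ('a', 0, 'search', 0, 1.0, 0, 0),
    ('b', 1, 'social', 0, 2.0, 1, 0),
    ('b', 1, 'email',  1, 0.0, 1, 0);
""")

# Step 2: run "helper" and the final filter against the saved table
# instead of against the CTE chain.
HELPER_QUERY = """
WITH helper AS (
  SELECT visitor_key, path_id, converted_path, previous_conversion,
         COUNT(*) AS steps
  FROM all_paths_saved
  GROUP BY visitor_key, path_id, converted_path, previous_conversion
)
SELECT * FROM helper
WHERE converted_path = ?
ORDER BY visitor_key, path_id
"""

def helper_rows(converted_path):
    """Return the helper rows for one value of converted_path."""
    return conn.execute(HELPER_QUERY, (converted_path,)).fetchall()

unconverted = helper_rows(0)  # the filter that fails in the original query
converted = helper_rows(1)    # the filter that works in the original query
print(unconverted)
print(converted)
```

With the saved table in place, both filter values go through the same query text, which makes it easier to compare the two cases side by side.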

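Clearly this is a problem I can work around, so I'm more concerned that something I fundamentally don't understand may be happening in the all_paths subtable, where the UNION statement lives. If anyone can point me in the right direction on what I'm missing, I'd be very grateful.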

Thanks!

0 answers:

No answers yet.