Are these two Spark SQL queries logically the same?

Time: 2016-07-06 22:03:58

Tags: sql apache-spark apache-spark-sql

I have two queries that I believe should be logically equivalent, yet they return different results. The goal is to find rows, parsed from HTTP requests, whose controller_type matches one of a handful of patterns and whose controller_context_id is neither NULL nor ''. Each row that meets these conditions is assigned a value of 1; those values are then summed and grouped. The first query incorrectly includes rows where controller_context_id is not populated:

    SELECT
        TRUNC(request_timestamp, 'month') AS request_timestamp,
        account_id,
        account_guid,
        cluster_id,
        shard_id,
        unique_id,
        context_id,
        controller_type,
        controller_context_id,
        concat_user_id,
        user_id,
        COUNT(account_id) AS num_page_views,
        SUM(CASE
                WHEN controller_type LIKE 'pages%' AND
                     controller_context_id <> '' AND
                     controller_context_id IS NOT NULL
                THEN 1
                ELSE 0 END) AS num_page_views_pages,
        SUM(CASE
                WHEN controller_type LIKE 'files%' AND
                     controller_context_id <> '' AND
                     controller_context_id IS NOT NULL
                THEN 1
                ELSE 0 END) AS num_page_views_files,
        SUM(CASE
                WHEN controller_type LIKE 'modules%' AND
                     controller_context_id <> '' AND
                     controller_context_id IS NOT NULL
                THEN 1
                ELSE 0 END) AS num_page_views_modules,
        SUM(CASE
                WHEN controller_type LIKE 'assignments%' AND
                     controller_context_id <> '' AND
                     controller_context_id IS NOT NULL
                THEN 1
                ELSE 0 END) AS num_page_views_assignments,
        SUM(CASE
                WHEN controller_type LIKE 'quizzes%' AND
                     controller_context_id <> '' AND
                     controller_context_id IS NOT NULL
                THEN 1
                ELSE 0 END) AS num_page_views_quizzes,
        SUM(CASE
                WHEN controller_type LIKE 'discussion_topics%' AND
                     controller_context_id <> '' AND
                     controller_context_id IS NOT NULL
                THEN 1
                ELSE 0 END) AS num_page_views_discussion_topics,
        SUM(CASE
                WHEN controller_type LIKE 'outcome%' AND
                     controller_context_id <> '' AND
                     controller_context_id IS NOT NULL
                THEN 1
                ELSE 0 END) AS num_page_views_outcomes,
        COUNT(DISTINCT session_id) AS num_sessions
    FROM requests
    GROUP BY
        TRUNC(request_timestamp, 'month'),
        account_id,
        account_guid,
        cluster_id,
        shard_id,
        unique_id,
        context_id,
        context_id,
        controller_type,
        controller_context_id,
        concat_user_id,
        user_id

The second query does the right thing:

    SELECT
        TRUNC(request_timestamp, 'month') AS request_timestamp,
        account_id,
        account_guid,
        cluster_id,
        shard_id,
        unique_id,
        context_id,
        controller_type,
        controller_context_id,
        concat_user_id,
        user_id,
        COUNT(account_id) AS num_page_views,
        COUNT(DISTINCT session_id) AS num_sessions,
        SUM(CASE
                WHEN controller_type LIKE 'pages%' AND
                     (CASE
                          WHEN controller_context_id = '' OR
                               controller_context_id IS NULL
                          THEN 0
                          ELSE 1 END) = 1
                THEN 1
                ELSE 0 END) AS num_page_views_pages,
        SUM(CASE
                WHEN controller_type LIKE 'files%' AND
                     (CASE
                          WHEN controller_context_id = '' OR
                               controller_context_id IS NULL
                          THEN 0
                          ELSE 1 END) = 1
                THEN 1
                ELSE 0 END) AS num_page_views_files,
        SUM(CASE
                WHEN controller_type LIKE 'modules%' AND
                     (CASE
                          WHEN controller_context_id = '' OR
                               controller_context_id IS NULL
                          THEN 0
                          ELSE 1 END) = 1
                THEN 1
                ELSE 0 END) AS num_page_views_modules,
        SUM(CASE
                WHEN controller_type LIKE 'assignments%' AND
                     (CASE
                          WHEN controller_context_id = '' OR
                               controller_context_id IS NULL
                          THEN 0
                          ELSE 1 END) = 1
                THEN 1
                ELSE 0 END) AS num_page_views_assignments,
        SUM(CASE
                WHEN controller_type LIKE 'quizzes%' AND
                     (CASE
                          WHEN controller_context_id = '' OR
                               controller_context_id IS NULL
                          THEN 0
                          ELSE 1 END) = 1
                THEN 1
                ELSE 0 END) AS num_page_views_quizzes,
        SUM(CASE
                WHEN controller_type LIKE 'discussion_topics%' AND
                     (CASE
                          WHEN controller_context_id = '' OR
                               controller_context_id IS NULL
                          THEN 0
                          ELSE 1 END) = 1
                THEN 1
                ELSE 0 END) AS num_page_views_discussion_topics,
        SUM(CASE
                WHEN controller_type LIKE 'outcomes%' AND
                     (CASE
                          WHEN controller_context_id = '' OR
                               controller_context_id IS NULL
                          THEN 0
                          ELSE 1 END) = 1
                THEN 1
                ELSE 0 END) AS num_page_views_outcomes
    FROM requests
    GROUP BY
        TRUNC(request_timestamp, 'month'),
        account_id,
        account_guid,
        cluster_id,
        shard_id,
        unique_id,
        context_id,
        controller_type,
        controller_context_id,
        concat_user_id,
        user_id

As far as I can tell, the two should do the same thing; the second query just uses a nested CASE statement to check for NULL values. I can't see what the logical difference between the two is, if there is any.
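The flat AND predicate of the first query and the nested CASE predicate of the second can be compared in isolation. The sketch below is a minimal check, assuming SQLite (through Python's sqlite3 module) as a stand-in for Spark SQL; the four sample rows and the helper name compare_predicates are hypothetical, and only the 'pages%' branch is reproduced. It shows what standard SQL three-valued logic gives for a populated value, an empty string, and NULL. Spark may behave differently, which is exactly what would need to be checked on the real data.

```python
import sqlite3

# Hypothetical sample rows: (controller_type, controller_context_id)
ROWS = [
    ("pages_show", "42"),   # populated id: expected to count
    ("pages_show", ""),     # empty string: expected not to count
    ("pages_show", None),   # NULL: expected not to count
    ("files_index", "7"),   # other controller_type: not counted as a page
]

# Predicate in the style of the first query: flat AND chain.
FLAT = """
    SUM(CASE
            WHEN controller_type LIKE 'pages%' AND
                 controller_context_id <> '' AND
                 controller_context_id IS NOT NULL
            THEN 1
            ELSE 0 END)
"""

# Predicate in the style of the second query: nested CASE.
NESTED = """
    SUM(CASE
            WHEN controller_type LIKE 'pages%' AND
                 (CASE
                      WHEN controller_context_id = '' OR
                           controller_context_id IS NULL
                      THEN 0
                      ELSE 1 END) = 1
            THEN 1
            ELSE 0 END)
"""


def compare_predicates(rows=ROWS):
    """Return (flat_sum, nested_sum) for the given rows, evaluated by SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests (controller_type TEXT, controller_context_id TEXT)"
    )
    conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)
    result = conn.execute(f"SELECT {FLAT}, {NESTED} FROM requests").fetchone()
    conn.close()
    return result


if __name__ == "__main__":
    print(compare_predicates())
```

On these rows both sums are 1 in SQLite, because NULL <> '' evaluates to NULL and the trailing IS NOT NULL test makes the whole AND chain false. If the same comparison on a small Spark DataFrame gives two different numbers, the difference would point at the engine or the data (for example whitespace-only values) rather than at the boolean logic, though that remains an assumption to verify.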

0 answers:

No answers