What is the best way to reuse classification rules across multiple queries in BigQuery standard SQL?

Date: 2017-03-27 11:37:15

Tags: google-bigquery

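I use BigQuery to analyze Google Analytics data.

I need to classify visits depending on whether they hit particular URLs that indicate the visitor was in the booking process, had made a purchase, and so on.

There is a long list of URLs representing each step, so it would be advantageous to keep the classification in a view and reuse it, with appropriate joins, in any query that needs the classification.

I have the following view, which seems to do what I need: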

SELECT
  fullVisitorId,
  visitID,
  LOWER(h.page.pagePath) AS path,
  CASE
    WHEN
      LOWER(h.page.pagePath) = '/' THEN '/'
    WHEN
      LOWER(h.page.pagePath) LIKE '{path-here}%' OR
      ....  ....  ....
    ELSE 'other'
    END
  AS path_classification,
  _TABLE_SUFFIX AS date
FROM
  `{project-id}.{data-id}.ga_sessions_*`, UNNEST(hits) AS h
WHERE
  REGEXP_CONTAINS(_TABLE_SUFFIX, r'[0-9]{8}')
AND
  h.type = 'PAGE'

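I would like to know whether there is a simpler way to achieve this that does not require selecting from a pre-existing table, since that does not seem necessary for defining the classification. I feel something more direct could be used, but I don't know how to do it.

Does anyone know how to put these definitions into a view without querying a table inside the view?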

2 Answers:

Answer 0: (score: 1)

Let's consider a simple example:

  
#standardSQL
WITH yourTable AS (
  SELECT 1 AS id, '123' AS path UNION ALL
  SELECT 2, '234' UNION ALL
  SELECT 3, '345' UNION ALL
  SELECT 4, '456' 
)
SELECT 
  id,
  path,
  CASE path
    WHEN '123' THEN 'a'
    WHEN '234' THEN 'b'
    WHEN '345' THEN 'c'
    ELSE 'other'
  END AS path_classification
FROM yourTable
ORDER BY id  

The above can be refactored as

#standardSQL
WITH yourTable AS (
  SELECT 1 AS id, '123' AS path UNION ALL
  SELECT 2, '234' UNION ALL
  SELECT 3, '345' UNION ALL
  SELECT 4, '456' 
)
SELECT 
  id, 
  path,
  IFNULL(
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE rr.cpath = path  LIMIT 1), 
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE rr.cpath IS NULL LIMIT 1)
  ) AS path_classification
FROM yourTable, 
  (SELECT ARRAY_AGG(STRUCT<cpath STRING, crule STRING>(path, rule)) AS rules
   FROM `project.dataset.rules`) AS r
ORDER BY id  

which relies on a rules view (`project.dataset.rules`) defined as follows

#standardSQL
SELECT '123' AS path, 'a' AS rule UNION ALL
SELECT '234', 'b' UNION ALL
SELECT '345', 'c' UNION ALL
SELECT NULL, 'other'

As you can see, all classification rules now live only in the rules view!
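For illustration, here is a minimal sketch of how such a view could be created with a DDL statement. It assumes DDL is available in your project and reuses the placeholder name `project.dataset.rules` from the query above. Note that the view body selects only literals, so no underlying table is read.

```sql
#standardSQL
-- Sketch only: `project.dataset.rules` is a placeholder name.
-- The view body consists of literals, so no table is scanned.
CREATE OR REPLACE VIEW `project.dataset.rules` AS
SELECT '123' AS path, 'a' AS rule UNION ALL
SELECT '234', 'b' UNION ALL
SELECT '345', 'c' UNION ALL
SELECT NULL, 'other';  -- the NULL path row acts as the default rule
```

Changing a rule then means editing this one view; queries that read from it stay untouched.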

You can play with this approach using the self-contained query below:

#standardSQL
WITH yourTable AS (
  SELECT 1 AS id, '123' AS path UNION ALL
  SELECT 2, '234' UNION ALL
  SELECT 3, '345' UNION ALL
  SELECT 4, '456' 
),
rules AS (
  SELECT '123' AS path, 'a' AS rule UNION ALL
  SELECT '234', 'b' UNION ALL
  SELECT '345', 'c' UNION ALL
  SELECT NULL, 'other'
)
SELECT 
  id, 
  path,
  IFNULL(
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE rr.cpath = path  LIMIT 1), 
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE rr.cpath IS NULL LIMIT 1)
  ) AS path_classification
FROM yourTable, 
  (SELECT ARRAY_AGG(STRUCT<cpath STRING, crule STRING>(path, rule)) AS rules 
   FROM rules) AS r
ORDER BY id

This can be further "simplified" by moving the ARRAY_AGG into the view, as shown below

#standardSQL
SELECT ARRAY_AGG(STRUCT<cpath STRING, crule STRING>(path, rule)) AS rules 
FROM (
  SELECT '123' AS path, 'a' AS rule UNION ALL
  SELECT '234', 'b' UNION ALL
  SELECT '345', 'c' UNION ALL
  SELECT NULL, 'other'
)

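In this case, the final query is as simple as the one below (with yourTable and rules standing in for your fully qualified table and view names):
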
#standardSQL
SELECT 
  id, 
  path,
  IFNULL(
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE rr.cpath = path  LIMIT 1), 
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE rr.cpath IS NULL LIMIT 1)
  ) AS path_classification
FROM yourTable, rules AS r
ORDER BY id  

Depending on your specific rules, the above can and should be adjusted and optimized accordingly, but I hope this gives you the main direction.

  

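Question from the comments: does your solution support matching with the LIKE keyword or matching with a regular expression?

The original question was: What's the … way of re-using classification rules for multiple queries within big query standard SQL?

So the example in my initial answer simply shows how to achieve that (with the focus on "reuse").

How you use it (matching with the LIKE keyword or with a regular expression) is totally up to you!

See the example below and compare path_classification_exact_match vs path_classification_like_match vs path_classification_regex_match.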

#standardSQL
WITH yourTable AS (
  SELECT 1 AS id, '123' AS path UNION ALL
  SELECT 2, '234' UNION ALL
  SELECT 3, '345' UNION ALL
  SELECT 4, '456' UNION ALL
  SELECT 5, '234abc' UNION ALL
  SELECT 6, '345bcd' UNION ALL
  SELECT 7, '456cde' 
),
rules AS (
  SELECT ARRAY_AGG(STRUCT<cpath STRING, crule STRING>(path, rule)) AS rules 
  FROM (
    SELECT '123' AS path, 'a' AS rule UNION ALL
    SELECT '234', 'b' UNION ALL
    SELECT '345', 'c' UNION ALL
    SELECT NULL, 'other'
  )
)
SELECT 
  id, 
  path,
  IFNULL(
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE rr.cpath = path  LIMIT 1), 
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE rr.cpath IS NULL LIMIT 1)
  ) AS path_classification_exact_match,
  IFNULL(
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE path LIKE CONCAT('%',rr.cpath,'%')  LIMIT 1), 
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE rr.cpath IS NULL LIMIT 1)
  ) AS path_classification_like_match,
  IFNULL(
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE REGEXP_CONTAINS(path, rr.cpath)  LIMIT 1), 
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE rr.cpath IS NULL LIMIT 1)
  ) AS path_classification_regex_match
FROM yourTable, rules AS r
ORDER BY id  

The output is:

id  path    path_classification_exact_match path_classification_like_match  path_classification_regex_match  
1   123     a                               a                               a    
2   234     b                               b                               b    
3   345     c                               c                               c    
4   456     other                           other                           other    
5   234abc  other                           b                               b    
6   345bcd  other                           c                               c    
7   456cde  other                           other                           other
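
One thing to keep in mind with the LIKE and regex variants: if more than one rule can match the same path, LIMIT 1 without an ORDER BY returns an arbitrary match. The sketch below is one possible way to make the choice deterministic. The priority field, the sample paths, and the rule names are assumptions made for illustration; they are not part of the original rules.

```sql
#standardSQL
WITH yourTable AS (
  -- Made-up sample paths, for illustration only
  SELECT 1 AS id, '/booking/step1' AS path UNION ALL
  SELECT 2, '/booking/confirm' UNION ALL
  SELECT 3, '/about'
),
rules AS (
  -- priority is a hypothetical extra field: lower value wins
  SELECT ARRAY_AGG(STRUCT<cpath STRING, crule STRING, priority INT64>(path, rule, priority)) AS rules
  FROM (
    SELECT '/booking/confirm' AS path, 'purchased' AS rule, 1 AS priority UNION ALL
    SELECT '/booking/', 'booking', 2 UNION ALL
    SELECT NULL, 'other', 3
  )
)
SELECT
  id,
  path,
  IFNULL(
    -- Prefix match; ORDER BY makes the pick deterministic when several rules match
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr
      WHERE STARTS_WITH(path, rr.cpath)
      ORDER BY rr.priority LIMIT 1),
    -- Fall back to the default rule (the one with a NULL path)
    ( SELECT rr.crule FROM UNNEST(r.rules) AS rr WHERE rr.cpath IS NULL LIMIT 1)
  ) AS path_classification
FROM yourTable, rules AS r
ORDER BY id
```

Here '/booking/confirm' matches both prefixes, and the lower priority value decides that it is classified as purchased rather than booking.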

Hope this helps :o)

Answer 1: (score: 0)

It sounds like you may be interested in WITH clauses, which let you compose queries without having to use subqueries. For example,

#standardSQL
WITH Sales AS (
  SELECT 1 AS sku, 3.14 AS price UNION ALL
  SELECT 2 AS sku, 1.00 AS price UNION ALL
  SELECT 3 AS sku, 9.99 AS price UNION ALL
  SELECT 2 AS sku, 0.90 AS price UNION ALL
  SELECT 1 AS sku, 3.56 AS price
),
ItemTotals AS (
  SELECT sku, SUM(price) AS total
  FROM Sales
  GROUP BY sku
)
SELECT sku, total
FROM ItemTotals;

If you want to compose expressions, you can use the CREATE TEMP FUNCTION statement to get "macro-like" functionality:

#standardSQL
CREATE TEMP FUNCTION LooksLikeCheese(s STRING) AS (
  LOWER(s) IN ('gouda', 'gruyere', 'havarti')
);

SELECT
  s1,
  LooksLikeCheese(s1) AS s1_is_cheese,
  s2,
  LooksLikeCheese(s2) AS s2_is_cheese
FROM (
  SELECT 'spam' AS s1, 'ham' AS s2 UNION ALL
  SELECT 'havarti' AS s1, 'crackers' AS s2 UNION ALL
  SELECT 'gruyere' AS s1, 'ice cream' AS s2
);
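
Applied to the path classification from the question, the same idea might look like the sketch below. ClassifyPath is a hypothetical function name, and the '/booking/%' pattern and the sample paths are placeholders standing in for the real list of URLs. A temp function only exists for the query that declares it, so the declaration has to be repeated in each query that uses it.

```sql
#standardSQL
-- Hypothetical helper; the '/booking/%' pattern is a placeholder rule.
CREATE TEMP FUNCTION ClassifyPath(p STRING) AS (
  CASE
    WHEN LOWER(p) = '/' THEN '/'
    WHEN LOWER(p) LIKE '/booking/%' THEN 'booking'
    ELSE 'other'
  END
);

SELECT
  path,
  ClassifyPath(path) AS path_classification
FROM (
  -- Made-up sample paths, for illustration only
  SELECT '/' AS path UNION ALL
  SELECT '/Booking/step1' UNION ALL
  SELECT '/about'
);
```

This keeps the CASE logic in one place within a query, and no table has to be read to define it.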