BigQuery Standard SQL: using PARTITION BY with the ARRAY_AGG() function

Asked: 2019-02-13 09:53:34

Tags: sql google-bigquery

I am trying to use a PARTITION BY clause together with the ARRAY_AGG() function to collapse a column into an array.

My Standard SQL query in BigQuery is as follows:

        WITH initial_30days
           AS (
          SELECT 
            date,
            fullvisitorId AS user_id,
            visitNumber, 
            CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
          FROM
            `my-data.XXXXXXX.ga_sessions_*`
            WHERE _TABLE_SUFFIX BETWEEN '20181004' AND  '20181103'
            GROUP BY 1,2,3,4
            )

          SELECT
            date,
            ARRAY_AGG(sessions) OVER (PARTITION BY date ROWS BETWEEN 5 PRECEDING 
            AND CURRENT ROW) AS agg_array
          FROM(

          SELECT
            date,
            user_id,
            COUNT(DISTINCT( session_id))  AS sessions
            FROM initial_30days
            GROUP BY date,user_id) 
            GROUP BY date,sessions

My expected output is:

+----------+--------------------------+
|   date   |        agg_array         |
+----------+--------------------------+
| 20181004 | [34,21,34,21,6,7,4,43]   |
| 20181005 | [1,5,56,76,23,1,3,54,45] |
| 20181006 | [22,67,43,1,2,67,3,24]   |
| 20181007 | [34,21,34,21,6,7,4,43]   |
+----------+--------------------------+

My current output looks like this, taking a single date value as an example:

+----------+------------------------+
|   date   |       agg_array        |
+----------+------------------------+
| 20181004 | [34]                   |
| 20181004 | [34,21]                |
| 20181004 | [34,21,34]             |
| 20181004 | [34,21,34,21]          |
| 20181004 | [34,21,34,21,6]        |
| 20181004 | [34,21,34,21,6,7]      |
| 20181004 | [34,21,34,21,6,7,4]    |
| 20181004 | [34,21,34,21,6,7,4,43] |
+----------+------------------------+

As you can see, partitioning by date produces one incremental row for every value in the array, instead of a single row per date.

The dataset that ARRAY_AGG() is applied to looks like this:

+----------+------------------+----------+
|   date   |     user_id      | sessions |
+----------+------------------+----------+
| 20181004 | 2526262363754747 |       34 |
| 20181004 | 2525626325173256 |       21 |
| 20181004 | 7436783255747736 |       34 |
| 20181004 | 6526241526363536 |       21 |
| 20181004 | 4252636353637423 |        6 |
| 20181004 | 3636325636673563 |        7 |
+----------+------------------+----------+

I suspect this happens because I am grouping by sessions, as shown above, but I only do that because otherwise I get the following validation error:

    SELECT list expression references column sessions which is neither grouped nor aggregated at

2 Answers:

Answer 0: (score: 1)

The following is for BigQuery Standard SQL.

Simply wrap your original query in the following:

  
SELECT date, 
  ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
  ...   
  ...   
)
GROUP BY date   

The whole query will then look like this (it produces the expected result while keeping your idea of using a window function):

#standardSQL
WITH initial_30days AS (
  SELECT 
    date,
    fullvisitorId AS user_id,
    visitNumber, 
    CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
  FROM `my-data.XXXXXXX.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20181004' AND  '20181103'
  GROUP BY 1,2,3,4
)
SELECT date, 
  ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT
    date, 
    ARRAY_AGG(sessions) OVER(PARTITION BY date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS agg_array
  FROM(
    SELECT
      date,
      user_id,
      COUNT(DISTINCT( session_id))  AS sessions
    FROM initial_30days
    GROUP BY date,user_id
  )
  GROUP BY date,sessions
)
GROUP BY date   
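
To see why the wrapper works, here is a minimal self-contained sketch. It assumes the per-user counts from the question are inlined as literal rows in a hypothetical CTE named per_user_sessions, and it makes two changes that are not in the query above: an ORDER BY user_id inside the window so that the result is deterministic, and an UNBOUNDED PRECEDING frame. The second change matters because a frame of ROWS BETWEEN 5 PRECEDING AND CURRENT ROW covers at most six rows, so the longest array it can produce has six elements, however many users a date has.

```sql
#standardSQL
-- Sketch only: per_user_sessions is a hypothetical stand-in for the
-- (date, user_id, sessions) subquery, filled with the sample rows
-- shown in the question.
WITH per_user_sessions AS (
  SELECT '20181004' AS date, '2526262363754747' AS user_id, 34 AS sessions UNION ALL
  SELECT '20181004', '2525626325173256', 21 UNION ALL
  SELECT '20181004', '7436783255747736', 34 UNION ALL
  SELECT '20181004', '6526241526363536', 21 UNION ALL
  SELECT '20181004', '4252636353637423', 6 UNION ALL
  SELECT '20181004', '3636325636673563', 7
),
windowed AS (
  -- One row per user: each row carries the array of all values seen so far
  -- within its date partition, so the arrays grow by one element per row.
  SELECT
    date,
    ARRAY_AGG(sessions) OVER (
      PARTITION BY date
      ORDER BY user_id
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS agg_array
  FROM windowed_source
)
SELECT
  date,
  -- Wrap each array in a STRUCT, keep only the longest one per date,
  -- then unpack the struct back into a plain agg_array column.
  ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM windowed
GROUP BY date;
```

Note that windowed_source above must be read as per_user_sessions; the inner query returns one growing array per row, and the outer query needs the STRUCT wrapper because BigQuery does not allow arrays of arrays.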

Answer 1: (score: 0)

If you want one row per date, you need GROUP BY date rather than a window function:

SELECT date,
       ARRAY_AGG(sessions) AS agg_array
FROM (SELECT date, user_id,
             COUNT(DISTINCT( session_id))  AS sessions
      FROM initial_30days
      GROUP BY date, user_id
     )  du
GROUP BY date;

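If you only want a certain number of values, add LIMIT to ARRAY_AGG(). For example, if you want the session counts of the five users with the smallest IDs, you could use: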

  ARRAY_AGG(sessions ORDER BY user_id LIMIT 5) AS agg_array
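
As a minimal sketch of both variants side by side, the query below inlines the sample rows from the question in a hypothetical CTE named per_user_sessions (the CTE name and the output column names all_sessions and first_five are assumptions made for illustration). The ORDER BY inside ARRAY_AGG() is what makes the element order, and therefore the effect of LIMIT, deterministic.

```sql
#standardSQL
-- Sketch only: per_user_sessions is a hypothetical stand-in for the
-- (date, user_id, sessions) subquery, filled with the sample rows
-- shown in the question.
WITH per_user_sessions AS (
  SELECT '20181004' AS date, '2526262363754747' AS user_id, 34 AS sessions UNION ALL
  SELECT '20181004', '2525626325173256', 21 UNION ALL
  SELECT '20181004', '7436783255747736', 34 UNION ALL
  SELECT '20181004', '6526241526363536', 21 UNION ALL
  SELECT '20181004', '4252636353637423', 6 UNION ALL
  SELECT '20181004', '3636325636673563', 7
)
SELECT
  date,
  -- All session counts for the date, ordered by user_id.
  ARRAY_AGG(sessions ORDER BY user_id) AS all_sessions,
  -- Only the counts of the five users with the smallest user_id values.
  ARRAY_AGG(sessions ORDER BY user_id LIMIT 5) AS first_five
FROM per_user_sessions
GROUP BY date;
```

Because user_id is a string here, "smallest" means smallest in string order. For these six rows the query should return a single row for 20181004, with all_sessions holding six values and first_five holding the first five of them.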