I am trying to use a PARTITION BY clause together with the ARRAY_AGG() function to collapse a column into an array.
My standard SQL in BigQuery is as follows:
WITH initial_30days AS (
  SELECT
    date,
    fullvisitorId AS user_id,
    visitNumber,
    CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
  FROM
    `my-data.XXXXXXX.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20181004' AND '20181103'
  GROUP BY 1, 2, 3, 4
)
SELECT
  date,
  ARRAY_AGG(sessions) OVER (PARTITION BY date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS agg_array
FROM (
  SELECT
    date,
    user_id,
    COUNT(DISTINCT session_id) AS sessions
  FROM initial_30days
  GROUP BY date, user_id
)
GROUP BY date, sessions
My expected output is:
+----------+--------------------------+
| date     | agg_array                |
+----------+--------------------------+
| 20181004 | [34,21,34,21,6,7,4,43]   |
| 20181005 | [1,5,56,76,23,1,3,54,45] |
| 20181006 | [22,67,43,1,2,67,3,24]   |
| 20181007 | [34,21,34,21,6,7,4,43]   |
+----------+--------------------------+
My current output looks like this, taking a single date value as an example:
+----------+------------------------+
| date     | agg_array              |
+----------+------------------------+
| 20181004 | [34]                   |
| 20181004 | [34,21]                |
| 20181004 | [34,21,34]             |
| 20181004 | [34,21,34,21]          |
| 20181004 | [34,21,34,21,6]        |
| 20181004 | [34,21,34,21,6,7]      |
| 20181004 | [34,21,34,21,6,7,4]    |
| 20181004 | [34,21,34,21,6,7,4,43] |
+----------+------------------------+
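As you can see, the array partitioned by date produces an incremental row for each value in the array.
The dataset that ARRAY_AGG() is applied to looks like this: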
+----------+------------------+----------+
| date     | user_id          | sessions |
+----------+------------------+----------+
| 20181004 | 2526262363754747 | 34       |
| 20181004 | 2525626325173256 | 21       |
| 20181004 | 7436783255747736 | 34       |
| 20181004 | 6526241526363536 | 21       |
| 20181004 | 4252636353637423 | 6        |
| 20181004 | 3636325636673563 | 7        |
+----------+------------------+----------+
I suspect this happens because I group by sessions, as shown above, but I only do that because I get the following validation error otherwise:
SELECT list expression references column sessions which is
neither grouped nor aggregated at
Answer 0 (score: 1)
The following is for BigQuery Standard SQL.
Just wrap your original query in the following:
SELECT date,
ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
...
...
)
GROUP BY date
So the whole thing will look like this (it produces the expected result while keeping your idea of using a window function):
#standardSQL
WITH initial_30days AS (
  SELECT
    date,
    fullvisitorId AS user_id,
    visitNumber,
    CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
  FROM `my-data.XXXXXXX.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20181004' AND '20181103'
  GROUP BY 1, 2, 3, 4
)
SELECT date,
  ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT
    date,
    ARRAY_AGG(sessions) OVER(PARTITION BY date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS agg_array
  FROM (
    SELECT
      date,
      user_id,
      COUNT(DISTINCT session_id) AS sessions
    FROM initial_30days
    GROUP BY date, user_id
  )
  GROUP BY date, sessions
)
GROUP BY date
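To see why the wrapper is needed, here is a minimal sketch that can be run without the GA tables. The inline rows, the short user IDs, and the CTE names per_user and windowed are made up for illustration and are not part of the original query. The window function keeps one output row per input row, each with its own running frame, and the outer ARRAY_AGG then keeps only the longest frame per date. An ORDER BY user_id is added inside the window here only to make the frames deterministic. Note that ROWS BETWEEN 5 PRECEDING AND CURRENT ROW caps each frame at 6 values, so the longest array can never hold more than 6 sessions per date.

```sql
#standardSQL
-- Hypothetical inline sample standing in for the per-user session counts.
WITH per_user AS (
  SELECT * FROM UNNEST([
    STRUCT('20181004' AS date, 'u1' AS user_id, 34 AS sessions),
    STRUCT('20181004', 'u2', 21),
    STRUCT('20181004', 'u3', 6),
    STRUCT('20181005', 'u1', 1),
    STRUCT('20181005', 'u4', 5)
  ])
),
windowed AS (
  -- A window function does not collapse rows: every input row is kept
  -- and receives the array built from its own frame.
  SELECT
    date,
    ARRAY_AGG(sessions) OVER (
      PARTITION BY date
      ORDER BY user_id  -- assumption: added so the frame order is deterministic
      ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) AS agg_array
  FROM per_user
)
-- Keep only the longest running array for each date.
SELECT
  date,
  ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM windowed
GROUP BY date
ORDER BY date
```

Running only the windowed CTE on its own should show the incremental rows from the question: three rows for 20181004 and two for 20181005.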
Answer 1 (score: 0)
If you want one row per date, you need GROUP BY date:
SELECT date,
ARRAY_AGG(sessions) AS agg_array
FROM (SELECT date, user_id,
COUNT(DISTINCT( session_id)) AS sessions
FROM initial_30days
GROUP BY date, user_id
) du
GROUP BY date;
If you only need a certain number of values, add LIMIT to ARRAY_AGG(). For example, if you want the session counts for the 5 users with the smallest IDs, you could do:
ARRAY_AGG(sessions ORDER BY user_id LIMIT 5) AS agg_array
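As a rough, self-contained sketch of how that fragment fits into a full query, the example below uses made-up inline rows and a hypothetical per_user CTE in place of the subquery over initial_30days. It assumes user_id is a string, as fullvisitorId is in the GA export, so "smallest" here means smallest in string order rather than numeric order.

```sql
#standardSQL
-- Hypothetical inline sample standing in for the per-user session counts.
WITH per_user AS (
  SELECT * FROM UNNEST([
    STRUCT('20181004' AS date, 'u1' AS user_id, 34 AS sessions),
    STRUCT('20181004', 'u2', 21),
    STRUCT('20181004', 'u3', 6),
    STRUCT('20181005', 'u1', 1),
    STRUCT('20181005', 'u4', 5)
  ])
)
SELECT
  date,
  -- Plain aggregate (no OVER clause): one array, and one row, per date.
  -- ORDER BY fixes the element order; LIMIT caps the array at 5 elements.
  ARRAY_AGG(sessions ORDER BY user_id LIMIT 5) AS agg_array
FROM per_user
GROUP BY date
ORDER BY date
```

Because the aggregation happens in a single GROUP BY date, there is no need to group by sessions, which avoids both the validation error and the incremental rows.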