Using ARRAY_AGG() with DISTINCT and ORDER BY with ORDINAL

Time: 2017-10-06 00:06:20

Tags: google-bigquery

I have some data that I want to aggregate (greatly simplified here). The raw data uses a schema similar to the following:

UserID - STRING
A - RECORD REPEATED
A.Action - STRING
A.Visit - INTEGER
A.Order - INTEGER
MISC - RECORD REPEATED
( other columns omitted here )

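Because of the "MISC" column there are many actual records, but I am only concerned with the first 5 columns shown above. A sample of the raw data is shown below (note that the values shown are only examples; many other values exist, so they cannot be hard-coded into the query):

Table 0: (sample of the raw data)

(The null values under UserID appear as BigQuery displays them; the "A" fields are part of a nested record.)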

Table0

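My query produces the data shown in Table 1 below. I am trying to use ARRAY_AGG with ORDINAL to select the first two "Action" values for each user and restructure them as shown in Table 2.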

SELECT
  UserId,
  ARRAY_AGG(STRUCT(A.Action, A.Visit, A.Order)
            ORDER BY A.Visit, A.Order, A.Action)
FROM
  `table`
LEFT JOIN UNNEST(A) AS A
GROUP BY
  UserId

Table 1: (sample output of the query above)

Table1

Table 2: (desired format)

Table2

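So I need to:

  1. Get the distinct "Action" values for each user
  2. Preserve the ordering (UserID, Visit, Order)
  3. Show only the first and second actions, in a single row

The query strategy I tried was to ORDER BY UserID, Visit, Order and take the DISTINCT values of Action, using something like the following: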

    UserId,
    ARRAY_AGG(DISTINCT Action ORDER BY UserID, Visit, Order) FirstAction,
    ARRAY_AGG(DISTINCT Action ORDER BY UserID, Visit, Order) SecondAction
    

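However, this approach produces the following error:

    Error: An aggregate function that has both DISTINCT and ORDER BY arguments can only ORDER BY columns that are arguments to the function

Any ideas on how to correct this error (or an alternative approach)?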
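To make the restriction in the error message concrete, here is a minimal sketch. The `sample` rows are made up for illustration and are not the asker's data. With DISTINCT, the ORDER BY key has to be the aggregated expression itself, so the only ordering available this way is by Action, not by Visit and Order.

```sql
#standardSQL
-- Inline sample rows, invented for this illustration.
WITH sample AS (
  SELECT 'U001' AS UserId, 'Register' AS Action, 1 AS Visit, 1 AS `Order` UNION ALL
  SELECT 'U001', 'Upgrade', 1, 2 UNION ALL
  SELECT 'U001', 'Share', 1, 4 UNION ALL
  SELECT 'U001', 'Share', 2, 1
)
SELECT
  UserId,
  -- Accepted: the ORDER BY key is the argument of ARRAY_AGG, so the
  -- result is de-duplicated but sorted alphabetically by Action.
  ARRAY_AGG(DISTINCT Action ORDER BY Action) AS ActionsAlphabetical
  -- Rejected with the error quoted above: ORDER BY Visit, `Order`,
  -- because those columns are not arguments of the aggregate.
FROM sample
GROUP BY UserId
```

This is why the attempted query fails: several rows with the same Action can carry different Visit and Order values, so there is no single sort position for the de-duplicated value.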

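2 Answers:

Answer 0: (score: 3)

I am not sure why the original query has DISTINCT, given that the result shown in Table 2 does not require de-duplication.

With that said: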

#standardSQL
WITH sample AS (
  SELECT actor.login userid, type action
    , EXTRACT(HOUR FROM created_at) visit
    , EXTRACT(MINUTE FROM created_at) `order`
  FROM `githubarchive.day.20171005` 
)

SELECT userid, actions[OFFSET(0)] firstaction, actions[SAFE_OFFSET(1)] secondaction
FROM (
  SELECT userid, ARRAY_AGG(action ORDER BY visit, `order` LIMIT 2) actions
  FROM sample
  GROUP BY 1
  ORDER BY 1
  LIMIT 100
)
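The query above mixes `OFFSET` and `SAFE_OFFSET`, while the question mentions ORDINAL. As a small sketch of how these subscripts differ, using a made-up one-element array: `OFFSET` is zero-based, `ORDINAL` is one-based, and the `SAFE_` variants return NULL instead of raising an error when the index is out of range.

```sql
#standardSQL
SELECT
  actions[OFFSET(0)]      AS first_by_offset,   -- zero-based index
  actions[ORDINAL(1)]     AS first_by_ordinal,  -- one-based index, same element
  actions[SAFE_OFFSET(1)] AS second_or_null     -- NULL here: the array has no second element
FROM (SELECT ['Register'] AS actions)           -- illustrative literal, not real data
```

Using the `SAFE_` form for the second action matters for users who have only one action, since a plain `OFFSET(1)` on a one-element array raises an out-of-range error.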


Answer 1: (score: 1)


Try the following.

#standardSQL
SELECT UserId, 
  ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(1)] AS FirstAction, 
  ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(2)] AS SecondAction 
FROM `project.dataset.table`
LEFT JOIN UNNEST(A) AS A 
GROUP BY UserId
-- ORDER BY UserId

You can test and play with it using the dummy data from the question:

#standardSQL
WITH `table` AS (
  SELECT 'U001' AS UserId, [STRUCT<Action STRING, Visit INT64, `Order` INT64 >
    ('Register', 1, 1),('Upgrade', 1, 2),('Feedback', 1, 3),('Share', 1, 4),('Share', 2, 1)] AS A UNION ALL
  SELECT 'U002', [STRUCT<Action STRING, Visit INT64, `Order` INT64 >
    ('Share', 7, 1),('Share', 7, 2),('Refer', 8, 1),('Feedback', 8, 2),('Feedback', 8, 3)] UNION ALL
  SELECT 'U003', [STRUCT<Action STRING, Visit INT64, `Order` INT64 >
    ('Register', 1, 1),('Share', 1, 2),('Share', 1, 3),('Share', 2, 1),('Share', 2, 2),('Share', 3, 1),('Share', 3, 2)] 
)
SELECT UserId, 
  ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(1)] AS FirstAction, 
  ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(2)] AS SecondAction 
FROM `table`
LEFT JOIN UNNEST(A) AS A 
GROUP BY UserId
ORDER BY UserId
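If de-duplication is actually wanted (requirement 1 in the question), one possible variation is to keep only the first occurrence of each Action per user before aggregating. This is a sketch under that assumption, not part of either answer. The `first_seen` CTE, the `act` alias and the `rn` column are names introduced here for illustration, and the sample row is a shortened copy of the U002 dummy data.

```sql
#standardSQL
WITH `table` AS (
  SELECT 'U002' AS UserId, [STRUCT<Action STRING, Visit INT64, `Order` INT64>
    ('Share', 7, 1),('Share', 7, 2),('Refer', 8, 1),('Feedback', 8, 2)] AS A
),
first_seen AS (
  -- Number the occurrences of each Action per user in (Visit, Order) sequence.
  SELECT UserId, act.Action, act.Visit, act.`Order`,
    ROW_NUMBER() OVER (PARTITION BY UserId, act.Action
                       ORDER BY act.Visit, act.`Order`) AS rn
  FROM `table`
  LEFT JOIN UNNEST(A) AS act
)
SELECT UserId,
  ARRAY_AGG(Action ORDER BY Visit, `Order` LIMIT 2)[SAFE_ORDINAL(1)] AS FirstAction,
  ARRAY_AGG(Action ORDER BY Visit, `Order` LIMIT 2)[SAFE_ORDINAL(2)] AS SecondAction
FROM first_seen
WHERE rn = 1  -- keep only the earliest row of each distinct Action
GROUP BY UserId
ORDER BY UserId
```

Because the duplicates are removed before ARRAY_AGG runs, the aggregate needs no DISTINCT and can ORDER BY Visit and Order freely. For this sample row the second action should come out as Refer rather than a repeated Share.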