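Flattening event data in BigQuery using one query

Time: 2013-06-28 12:08:41

Tags: statistics aggregate-functions bigdata google-bigquery

We have over 1 million rows of analytics data in BigQuery. Each record is an event attached to an ID.

Simplified: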

ID  EventId  Timestamp

Is it possible to flatten this into a table with rows like the following:

ID timestamp-period event1 event2 event3 event4

where each event column contains the number of events of that type for that ID within that time period?

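So far I have managed to do this on a small dataset using 2 queries: one to create rows containing the count for a single event id, and another to flatten those rows into one row afterwards. I have not been able to do this across the whole dataset because BigQuery runs out of resources, and I am not entirely sure why.

The first of the two queries looks like this: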

SELECT 
VideoId,
date_1,
IF(EventId = 1, INTEGER(count), 0) AS user_play,
IF(EventId = 2, INTEGER(count), 0) AS auto_play,
IF(EventId = 3, INTEGER(count), 0) AS pause,
IF(EventId = 4, INTEGER(count), 0) AS replay,
IF(EventId = 5, INTEGER(count), 0) AS stop,
IF(EventId = 6, INTEGER(count), 0) AS seek,
IF(EventId = 7, INTEGER(count), 0) AS resume,
IF(EventId = 11, INTEGER(count), 0) AS progress_25,
IF(EventId = 12, INTEGER(count), 0) AS progress_50,
IF(EventId = 13, INTEGER(count), 0) AS progress_75,
IF(EventId = 14, INTEGER(count), 0) AS progress_90,
IF(EventId = 15, INTEGER(count), 0) AS data_loaded,
IF(EventId = 16, INTEGER(count), 0) AS playback_complete,
IF(EventId = 30, INTEGER(count), 0) AS object_click,
IF(EventId = 31, INTEGER(count), 0) AS object_rollover,
IF(EventId = 32, INTEGER(count), 0) AS object_clickthrough,
IF(EventId = 33, INTEGER(count), 0) AS object_shown,
IF(EventId = 34, INTEGER(count), 0) AS object_close,
IF(EventId = 40, INTEGER(count), 0) AS logo_clickthrough,
IF(EventId = 41, INTEGER(count), 0) AS endframe_clickthrough,
IF(EventId = 42, INTEGER(count), 0) AS startframe_clickthrough,
IF(EventId = 61, INTEGER(count), 0) AS share_facebook,
IF(EventId = 62, INTEGER(count), 0) AS share_twitter,
IF(EventId = 63, INTEGER(count), 0) AS open_social_panel,
IF(EventId = 70, INTEGER(count), 0) AS embed_code_requested,
IF(EventId = 80, INTEGER(count), 0) AS player_impression,
IF(EventId = 81, INTEGER(count), 0) AS player_loaded,
IF(EventId = 90, INTEGER(count), 0) AS html5_impression,
IF(EventId = 91, INTEGER(count), 0) AS html5_load,
IF(EventId = 95, INTEGER(count), 0) AS fallback_impression,
IF(EventId = 96, INTEGER(count), 0) AS fallback_load,
IF(EventId = 152, INTEGER(count), 0) AS object_impression,
IF(EventId = 200, INTEGER(count), 0) AS ping,
IF(EventId = 250, INTEGER(count), 0) AS facebook_clickthrough,
IF(EventId = 251, INTEGER(count), 0) AS twitter_clickthrough,
IF(EventId = 252, INTEGER(count), 0) AS other_clickthrough,
IF(EventId = 253, INTEGER(count), 0) AS qr_clickthrough,
IF(EventId = 254, INTEGER(count), 0) AS banner_clickthrough,
IF(EventId = 280, INTEGER(count), 0) AS banner_impression,
IF(EventId = 281, INTEGER(count), 0) AS banner_loaded,
IF(EventId = 282, INTEGER(count), 0) AS banner_data_loaded,
IF(EventId = 284, INTEGER(count), 0) AS banner_forward,
IF(EventId = 285, INTEGER(count), 0) AS banner_back,
IF(EventId = 300, INTEGER(count), 0) AS mobile_preview_loaded,
IF(EventId = 301, INTEGER(count), 0) AS mobile_preview_clickthrough,
IF(EventId = 302, INTEGER(count), 0) AS mobile_preview_clickthrough_back,
IF(EventId = 310, INTEGER(count), 0) AS product_search_click,
IF(EventId = 311, INTEGER(count), 0) AS promo_code_click,
IF(EventId = 320, INTEGER(count), 0) AS player_share_facebook,
IF(EventId = 321, INTEGER(count), 0) AS player_share_twitter,
IF(EventId = 322, INTEGER(count), 0) AS player_share_googleplus,
IF(EventId = 323, INTEGER(count), 0) AS player_share_email,
IF(EventId = 324, INTEGER(count), 0) AS player_share_embed,
IF(EventId = 401, INTEGER(count), 0) AS youtube_error_2,
IF(EventId = 402, INTEGER(count), 0) AS youtube_error_100,
IF(EventId = 403, INTEGER(count), 0) AS youtube_error_101,
FROM
(
SELECT 
  VideoId, EventId, count(*) as count, Date(timestamp) as date_1  
FROM [data.data_1]
GROUP EACH BY VideoId, EventId, date_1
)
ORDER BY data_loaded DESC;

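The second query is then just a GROUP BY on the id and the timestamp, which creates the fully aggregated table.

Am I going about this the right way? Do I just need to run it on smaller partitions of the dataset, or is there a better way to do the aggregation that uses BigQuery more efficiently?

Thanks in advance, Mat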

1 Answer:

Answer 0: (score: 1)

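My guess is that you are running out of resources because of the ORDER BY at the end. Everything else should be able to run in parallel. Also note that if you drop the ORDER BY, you can use the "allow large results" flag and write the results out to a large table (needed if the results are larger than 128MB).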
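
Since the title asks about doing this in one query, here is a minimal sketch of how the two steps might be folded into a single grouped pass with conditional aggregation. It is an illustration under assumptions, not something tested against the real table: it runs against an in-memory SQLite database from the Python standard library as a stand-in for BigQuery, the `events` table name and the sample rows are made up, and only four of the event columns from the question are included. In BigQuery legacy SQL the per-column expression would be written as `SUM(IF(EventId = 1, 1, 0))`, with the query grouped by VideoId and date_1 only and no final ORDER BY.

```python
import sqlite3

# Hypothetical sample rows: (VideoId, EventId, timestamp).
ROWS = [
    ("v1", 1, "2013-06-27 10:00:00"),
    ("v1", 1, "2013-06-27 11:30:00"),
    ("v1", 3, "2013-06-27 11:31:00"),
    ("v1", 15, "2013-06-28 09:00:00"),
    ("v2", 2, "2013-06-27 08:00:00"),
    ("v2", 15, "2013-06-27 08:00:01"),
    ("v2", 15, "2013-06-27 08:05:00"),
]

# A subset of the event-id-to-column mapping used in the question.
EVENT_COLUMNS = {1: "user_play", 2: "auto_play", 3: "pause", 15: "data_loaded"}


def build_pivot_sql(event_columns):
    """Build one SELECT that counts each event id into its own column."""
    counters = ",\n  ".join(
        f"SUM(CASE WHEN EventId = {int(event_id)} THEN 1 ELSE 0 END) AS {name}"
        for event_id, name in sorted(event_columns.items())
    )
    return (
        "SELECT VideoId, DATE(timestamp) AS date_1,\n  "
        + counters
        + "\nFROM events\nGROUP BY VideoId, DATE(timestamp)"
    )


def flatten_events(rows, event_columns=EVENT_COLUMNS):
    """Return {(VideoId, date): {column: count}} from a single grouped query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE events (VideoId TEXT, EventId INTEGER, timestamp TEXT)"
        )
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        cursor = conn.execute(build_pivot_sql(event_columns))
        # The first two result columns are the grouping keys; the rest are counters.
        names = [d[0] for d in cursor.description][2:]
        return {(r[0], r[1]): dict(zip(names, r[2:])) for r in cursor}
    finally:
        conn.close()


if __name__ == "__main__":
    for key, counts in sorted(flatten_events(ROWS).items()):
        print(key, counts)
```

The point of the sketch is that each (VideoId, date) pair comes out as exactly one row, so the intermediate per-event rows and the second flattening query are not needed.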