UNNEST and totals.timeOnSite (BigQuery and Google Analytics data)

Time: 2017-07-21 07:37:51

Tags: google-analytics google-bigquery

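I want to calculate the total timeOnSite for all visitors to a website (dividing it by 3600, because the raw data stores it in seconds), and then break it down by content_group and by a custom variable called content_level.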

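The problem arises because content_group and content_level are both nested in arrays, whereas timeOnSite is a session-level field stored under totals., so it gets inflated when it is used in a query that contains an UNNEST. (content_group is an ordinary field nested under hits., while content_level is nested in customDimensions, which is itself nested in hits, making it a second-level nested variable.) (Will and Thomas C explain why this happens in the question "Google Analytics Metrics are inflated when extracting hit level data using BigQuery", but I have not been able to apply their advice to the totals.timeOnSite metric.)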
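To make the inflation concrete, here is a minimal sketch that does not touch a real export table. The inline `sessions` row, its values, and the `h` alias are made up for illustration; only the shape (one `totals` struct plus a repeated `hits` array per session) mirrors the Google Analytics export.

```sql
#standardSQL
-- Hypothetical inline data: one session with timeOnSite = 120 s and three hits.
WITH sessions AS (
  SELECT
    'visitor_a' AS fullVisitorId,
    1 AS visitId,
    STRUCT(120 AS timeOnSite) AS totals,
    [STRUCT(1 AS hitNumber, 'news' AS contentGroup1),
     STRUCT(2 AS hitNumber, 'news' AS contentGroup1),
     STRUCT(3 AS hitNumber, 'sport' AS contentGroup1)] AS hits
)
SELECT
  -- Read once per session row: 120
  (SELECT SUM(totals.timeOnSite) FROM sessions) AS true_seconds,
  -- After the fan-out, the session value is repeated on each of the 3 hit rows: 360
  (SELECT SUM(totals.timeOnSite) FROM sessions, UNNEST(hits) AS h) AS inflated_seconds
```

The comma join with UNNEST(hits) produces one output row per hit, and every one of those rows carries the same session-level totals struct, which is why the second sum is three times the first.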

#StandardSQL
SELECT
 iso_date,
 content_group,
 content_level,
 SUM(sessions) AS sessions,
 SUM(sessions2) AS sessions2,
 SUM(time_on_site) AS time_on_site
FROM (
     SELECT
       date AS iso_date,
       hits.contentGroup.contentGroup1 AS content_group,
       (SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,
       SUM(totals.visits) AS sessions,
       COUNT(DISTINCT CONCAT(CAST(visitId AS STRING), fullVisitorId)) AS sessions2,
       -- Inflated: totals.timeOnSite is repeated on every unnested hit row
       SUM(totals.timeOnSite)/3600 AS time_on_site
     FROM `projectname.123456789.ga_sessions_20170101`,
       UNNEST(hits) AS hits
     GROUP BY
       iso_date, content_group, content_level
     ORDER BY
       iso_date, content_group, content_level
    )
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level

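(I use a subquery because I plan to pull data from several tables with UNION ALL, but I left that syntax out because I don't think it is relevant to the question.)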

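Questions:

* Is it possible to do a "local unnest" of both hits. and hits.customDimensions, so that totals.timeOnSite can be used in my query without being inflated?

* Is it possible to build a workaround for time on site, like the one I built with sessions and sessions2?

* Is there a third, hidden solution to this problem?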

2 answers:

Answer 0: (score: 0)

I couldn't fully test this, but it seems to work against my dataset:

SELECT
  DATE,
  COUNT(DISTINCT CONCAT(fv, CAST(v AS STRING))) sessions,
  AVG(tos) avg_time_on_site,
  content_group,
  content_level
FROM(
  SELECT   
   date AS date,   
   fullvisitorid fv,
   visitid v,
   ARRAY(SELECT DISTINCT contentGroup.contentGroup1 FROM UNNEST(hits)) AS content_group,   
   ARRAY(SELECT DISTINCT value FROM UNNEST(hits) AS hits, UNNEST(hits.customDimensions) AS custd WHERE index = 51) AS content_level,   
   totals.timeOnSite / 3600 AS tos 
  FROM `dataset_id.ga_sessions_20170101`
  WHERE totals.timeOnSite IS NOT NULL
  )
CROSS JOIN UNNEST(content_group) content_group
LEFT JOIN UNNEST(content_level) content_level
GROUP BY
  DATE, content_group, content_level

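What I tried to do is avoid running UNNEST(hits) over the whole dataset in the first place. So, in the first SELECT statement, content_group and content_level are stored as ARRAYs.

In the next SELECT, I unnest both ARRAYs and compute the total number of sessions and the average time on site, grouping by the required fields. (I used the average here because it seemed to make more sense for time on site, but if you need the total, you only have to change AVG to SUM.)

You won't run into duplicated timeOnSite values in this query, because the outer UNNEST(hits) has been avoided. When UNNEST(content_group) and UNNEST(content_level) are applied, each value in those ARRAYs is associated with its corresponding time_on_site only once, so no duplication occurs.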
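The idea can be tried on inline data before running it against a real export. The sketch below is an illustration under assumptions, not the answer's exact query: the two sessions and their values are made up, only content_group is broken out, and the DISTINCT array is built directly inside the FROM clause.

```sql
#standardSQL
-- Hypothetical inline data: two sessions, the first one touching two content groups.
WITH sessions AS (
  SELECT
    'visitor_a' AS fullVisitorId,
    1 AS visitId,
    STRUCT(120 AS timeOnSite) AS totals,
    [STRUCT('news' AS contentGroup1),
     STRUCT('news' AS contentGroup1),
     STRUCT('sport' AS contentGroup1)] AS hits
  UNION ALL
  SELECT
    'visitor_b',
    2,
    STRUCT(60 AS timeOnSite),
    [STRUCT('news' AS contentGroup1)]
)
SELECT
  content_group,
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS sessions,
  SUM(totals.timeOnSite) AS seconds_on_site
FROM sessions AS s
-- One row per session and DISTINCT content group, instead of one row per hit
CROSS JOIN UNNEST(ARRAY(SELECT DISTINCT contentGroup1 FROM UNNEST(s.hits))) AS content_group
GROUP BY content_group
ORDER BY content_group
-- news:  2 sessions, 180 seconds
-- sport: 1 session,  120 seconds
```

Within each group a session's time is counted once, however many hits it had there. Note that a session spanning several content groups still contributes its full timeOnSite to each of them, so the per-group figures should not be added up to get a site-wide total.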

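Answer 1: (score: 0)

It feels odd to answer my own question like this, but a contact outside Stack Overflow helped me solve the problem, so it is really his answer rather than mine.

The problem with session_duration can be solved by using a window function. (You can read more about window functions in the BigQuery documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions)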

#StandardSQL
SELECT
 iso_date,
 content_group,
 content_level,
 COUNT(DISTINCT SessionId) AS sessions,
 SUM(session_duration) AS session_duration
FROM (
     SELECT
       date AS iso_date,
       hits.contentGroup.contentGroup1 AS content_group,
       (SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,
       CONCAT(CAST(fullVisitorId AS STRING), CAST(visitId AS STRING)) AS SessionId,
       -- Time until the next hit in the same session, converted from milliseconds to hours
       (LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) - hits.time) / 3600000 AS session_duration
     -- Wildcard table, so that _TABLE_SUFFIX can select the daily tables for January 2017
     FROM `projectname.123456789.ga_sessions_*`,
       UNNEST(hits) AS hits
     WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20170131"
       AND (SELECT
              MAX(IF(index=51, value, NULL))
            FROM
              UNNEST(hits.customDimensions)
            WHERE
              value IN ("web", "phone", "tablet")
            ) IS NOT NULL
    )
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level

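Both the LEAD ... OVER (PARTITION BY ...) expression in the subquery and the sub-select in the WHERE clause are needed for the window function to work properly.

This also gives a more accurate way of counting sessions.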
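How the LEAD-based duration behaves can be checked on a few inline rows. This is a minimal sketch with made-up values: `hit_rows`, `hit_time_ms` and `ms_until_next_hit` are hypothetical names standing in for already-unnested hits and hits.time, and the result is reported in seconds rather than hours to keep the numbers readable.

```sql
#standardSQL
-- Hypothetical inline data: one session with hits at 0 s, 30 s and 90 s.
WITH hit_rows AS (
  SELECT 'visitor_a' AS fullVisitorId, 1 AS visitId, 'news' AS content_group, 0 AS hit_time_ms
  UNION ALL SELECT 'visitor_a', 1, 'news', 30000
  UNION ALL SELECT 'visitor_a', 1, 'sport', 90000
)
SELECT
  content_group,
  SUM(ms_until_next_hit) / 1000 AS seconds
FROM (
  SELECT
    content_group,
    -- Time from this hit to the next hit of the same session; NULL on the last hit
    LEAD(hit_time_ms) OVER (
      PARTITION BY fullVisitorId, visitId
      ORDER BY hit_time_ms
    ) - hit_time_ms AS ms_until_next_hit
  FROM hit_rows
)
GROUP BY content_group
ORDER BY content_group
-- news:  90.0
-- sport: NULL
```

Each hit is credited only with the time until the next hit, so the session's time is split across hits instead of being repeated on every row. The last hit of a session has no successor and gets NULL, which is why the 'sport' group receives no time in this example.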