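I want to compute the total timeOnSite for all visitors to a website (divided by 3600, because the raw data stores it in seconds), and then break it down by content_group and by a custom variable called content_level.
The problem arises because content_group and content_level are both nested in arrays, whereas timeOnSite is stored under totals., so it gets inflated when it is used in a query that unnests those arrays. (content_group is an ordinary hits.-nested field, while content_level is nested in customDimensions, which is itself nested in hits, making it a second-level nested field.) Will and Thomas C explain why this happens in the question "Google Analytics Metrics are inflated when extracting hit level data using BigQuery", but I was unable to apply their advice to the totals.timeOnSite metric.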
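To make the inflation concrete, here is a minimal sketch on made-up inline data rather than a real export table. The `sessions` CTE and its values are hypothetical; only the field names `totals.timeOnSite` and `hits` follow the schema described above. It holds a single session of 600 seconds with three hits.

```sql
#standardSQL
-- Hypothetical inline data: one session lasting 600 seconds, with three hits.
WITH sessions AS (
  SELECT
    'A' AS fullVisitorId,
    1 AS visitId,
    STRUCT(600 AS timeOnSite) AS totals,
    [STRUCT(1 AS hitNumber), STRUCT(2 AS hitNumber), STRUCT(3 AS hitNumber)] AS hits
)
SELECT
  -- One row per session: the session total is counted once.
  (SELECT SUM(totals.timeOnSite) FROM sessions) AS session_level_seconds,
  -- One row per hit after the unnest: the session total is repeated on every hit row.
  (SELECT SUM(totals.timeOnSite) FROM sessions, UNNEST(hits) AS hits) AS unnested_seconds
```

With this data the first column should come out as 600 and the second as 1800, because the unnest turns the one session row into three rows that each carry the full session total.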
#StandardSQL
SELECT
    iso_date,
    content_group,
    content_level,
    SUM(sessions) AS sessions,
    SUM(sessions2) AS sessions2,
    SUM(time_on_site) AS time_on_site
FROM (
    SELECT
        date AS iso_date,
        hits.contentGroup.contentGroup1 AS content_group,
        (SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,
        SUM(totals.visits) AS sessions,
        COUNT(DISTINCT CONCAT(CAST(visitId AS STRING), fullVisitorId)) AS sessions2,
        SUM(totals.timeOnSite)/3600 AS time_on_site
    FROM `projectname.123456789.ga_sessions_20170101`,
        UNNEST(hits) AS hits
    GROUP BY
        iso_date, content_group, content_level
    ORDER BY
        iso_date, content_group, content_level
)
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level
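(I use a subquery because I plan to pull data from several tables with UNION ALL, but I left that syntax out because I don't think it is relevant to the question.)
Questions:
* Is it possible to "locally unnest" both hits. and hits.customDimensions, so that totals.timeOnSite can be used in my query without being inflated?
* Is it possible to build a workaround for time on site, like the one I built with sessions and sessions2?
* Is there a third, hidden solution to this problem?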
Answer 0 (score: 0)
I couldn't fully test this, but it seems to work on my dataset:
SELECT
    date,
    COUNT(DISTINCT CONCAT(fv, CAST(v AS STRING))) AS sessions,
    AVG(tos) AS avg_time_on_site,
    content_group,
    content_level
FROM (
    -- One row per session: the distinct values are collected into arrays
    -- instead of unnesting hits across the whole table.
    SELECT
        date AS date,
        fullVisitorId AS fv,
        visitId AS v,
        ARRAY(SELECT DISTINCT contentGroup.contentGroup1 FROM UNNEST(hits)) AS content_group,
        ARRAY(SELECT DISTINCT value FROM UNNEST(hits) AS hits, UNNEST(hits.customDimensions) AS custd WHERE index = 51) AS content_level,
        totals.timeOnSite / 3600 AS tos
    FROM `dataset_id.ga_sessions_20170101`
    WHERE totals.timeOnSite IS NOT NULL
)
CROSS JOIN UNNEST(content_group) AS content_group
LEFT JOIN UNNEST(content_level) AS content_level
GROUP BY
    date, content_group, content_level
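What I tried to do is avoid running UNNEST(hits) over the whole dataset in the first place. So in the inner SELECT statement, content_group and content_level are stored as ARRAYs, one pair of arrays per session.
In the outer SELECT, I unnest both ARRAYs and compute the total number of sessions and the average time on site, grouping by the required fields. (I used an average here because it seems to make more sense for time on site, but if you need the total you can simply change AVG to SUM.)
You won't run into duplicated timeOnSite values in this query, because the outer UNNEST(hits) has been avoided. When UNNEST(content_group) and UNNEST(content_level) are applied, each distinct value in those ARRAYs is paired with its session's time on site (tos) only once, so a session is never counted twice for the same value.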
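The array-per-session idea described above can be sketched on made-up inline data. Everything in the `sessions` CTE is hypothetical, and the schema is deliberately flattened (a plain `timeOnSite` column and a single `content_group` field per hit) to keep the example short; it is an illustration of the technique, not the export schema.

```sql
#standardSQL
-- Hypothetical inline data: session A has three hits (two of them in 'news'),
-- session B has a single hit.
WITH sessions AS (
  SELECT 'A' AS fullVisitorId, 1 AS visitId, 3600 AS timeOnSite,
    [STRUCT('news' AS content_group),
     STRUCT('news' AS content_group),
     STRUCT('sport' AS content_group)] AS hits
  UNION ALL
  SELECT 'B' AS fullVisitorId, 2 AS visitId, 1800 AS timeOnSite,
    [STRUCT('news' AS content_group)] AS hits
),
per_session AS (
  -- Still one row per session: only the distinct groups are kept, as an array.
  SELECT
    fullVisitorId,
    visitId,
    timeOnSite / 3600 AS tos,
    ARRAY(SELECT DISTINCT content_group FROM UNNEST(hits)) AS content_groups
  FROM sessions
)
SELECT
  content_group,
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS sessions,
  SUM(tos) AS hours_on_site
FROM per_session
CROSS JOIN UNNEST(content_groups) AS content_group
GROUP BY content_group
ORDER BY content_group
```

On this data 'news' should show 2 sessions and 1.5 hours, and 'sport' 1 session and 1.0 hour. Unnesting the hits directly would have counted session A's hour twice for 'news', once per hit.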
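Answer 1 (score: 0)
It feels odd to answer my own question like this, but a contact outside Stack Overflow helped me solve the problem, so it is really his answer rather than mine.
The session_duration problem can be solved with window functions (you can read more about window functions in the BigQuery documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions)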
#StandardSQL
SELECT
    iso_date,
    content_group,
    content_level,
    COUNT(DISTINCT SessionId) AS sessions,
    SUM(session_duration) AS session_duration
FROM (
    -- One row per hit; the aggregation happens in the outer query.
    SELECT
        date AS iso_date,
        hits.contentGroup.contentGroup1 AS content_group,
        (SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,
        CONCAT(CAST(fullVisitorId AS STRING), CAST(visitId AS STRING)) AS SessionId,
        -- Time from this hit to the next hit of the same session, in hours (hits.time is in milliseconds).
        (LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) - hits.time) / 3600000 AS session_duration
    -- _TABLE_SUFFIX is only available when querying a wildcard table.
    FROM `projectname.123456789.ga_sessions_*`,
        UNNEST(hits) AS hits
    WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20170131"
        AND (SELECT
                MAX(IF(index=51, value, NULL))
            FROM
                UNNEST(hits.customDimensions)
            WHERE
                value IN ("web", "phone", "tablet")
            ) IS NOT NULL
)
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level
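Both the LEAD ... OVER (PARTITION BY ...) expression in the subquery and the subselect in the WHERE clause are needed for the window function to work properly.
The query also gives a more accurate way of counting sessions.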
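As a small illustration of the LEAD technique on its own, here is a sketch over made-up inline hit rows. The `hits` CTE, the `time_ms` column and the values are hypothetical and already flattened to one row per hit, so no UNNEST is involved.

```sql
#standardSQL
-- Hypothetical inline data: one session with four hits; time_ms is the offset
-- from the start of the session in milliseconds.
WITH hits AS (
  SELECT 'A' AS fullVisitorId, 1 AS visitId, 0 AS time_ms, 'news' AS content_group UNION ALL
  SELECT 'A', 1, 60000, 'news' UNION ALL
  SELECT 'A', 1, 180000, 'sport' UNION ALL
  SELECT 'A', 1, 240000, 'sport'
),
gaps AS (
  SELECT
    content_group,
    -- Gap to the next hit of the same session; NULL for the session's last hit.
    LEAD(time_ms) OVER (PARTITION BY fullVisitorId, visitId ORDER BY time_ms) - time_ms AS ms_until_next_hit
  FROM hits
)
SELECT
  content_group,
  SUM(ms_until_next_hit) / 60000 AS minutes
FROM gaps
GROUP BY content_group
ORDER BY content_group
```

Here 'news' should total 3.0 minutes and 'sport' 1.0 minute. The two add up to the 4 minutes between the first and last hit, so each interval is attributed to exactly one content group and nothing is counted twice. The last hit of a session has no following hit, so its own duration stays unknown and SUM ignores it.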