我尝试从Google Analytics数据的大查询表中查询每个来源的访问次数,但需要在着陆页级别过滤掉一些会话。因此,我通过登录页面预先查询visitID并重新加入会话数据,如下所示:
#StandardSQL
WITH landingpages AS (
SELECT
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
# filters to be added here
)
SELECT
sessions.trafficSource.source,
SUM(sessions.totals.visits) AS visits
FROM `project.dataset.ga_sessions_*` AS sessions
JOIN
landingpages
ON
landingpages.visitID = sessions.visitID
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
GROUP BY
trafficSource.source
ORDER BY
visits DESC
这大致是GA报告的每个来源的会话数量的两倍。
有谁可以指出我做错了什么? (我怀疑这是非常明显的)
我已经尝试检查第一个查询的数据输出,除了很小比例的重复访问ID之外,它无法找到任何错误。我现在也尝试了各种不同类型的JOIN,现在都可以使用。
答案 0 :(得分:1)
从GBQ查询ga数据时,必须知道并记住,fullVisitorID和visitID都表示唯一访问。只有两者的双连接才会返回有意义的数据集。
这是我应该写的:
#StandardSQL
WITH landingpages AS (
SELECT
fullVisitorId,
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
),
session_data AS (
SELECT
date AS ga_date, trafficSource.source AS source, fullVisitorId, visitID, SUM(totals.visits) AS visits
FROM
`project.dataset.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
AND
totals.visits > 0
GROUP BY ga_date, source, fullVisitorId, visitID
)
SELECT
ga_date, source, SUM(visits) AS Sessions
FROM
landingpages
JOIN
session_data
ON
landingpages.VisitID = session_data.VisitID
AND
landingpages.fullVisitorId = session_data.fullVisitorId
GROUP BY
ga_date, source
ORDER BY
Sessions DESC