BigQuery unnesting hits - duplicate values

Time: 2017-06-23 14:28:33

Tags: google-bigquery

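I am trying to build a master view over a group of properties imported into BigQuery, but it seems that my (apparently incorrect) use of UNNEST(hits) in the SQL duplicates data, which makes values such as revenue inaccurate.

I have tried to understand why unnesting would cause this, but I can't figure it out.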

SELECT Date, hostname, channelGrouping, sum(transactionRevenue) as Revenue, sum(Shipping) as Shipping, sum(visits) as Sessions, sum(bounces) as Bounces, sum(transactions) as Transactions
    FROM
        (SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
        FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
        WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
        UNION ALL
        SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
        FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
        WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
        UNION ALL
        SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
        FROM `102674002.ga_sessions_*`, UNNEST(hits) AS h
        WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
        UNION ALL
        SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
        FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
        WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
        UNION ALL
        SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
        FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
        WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509')
    Group By Date, hostname, channelGrouping
    Order by Date

1 answer:

Answer 0 (score: 3)

This might do the trick:

SELECT
  date,
  channelGrouping,
  SUM(Revenue) Revenue,
  SUM(Shipping) Shipping,
  SUM(bounces) bounces,
  SUM(transactions) transactions,
  hostname,
  COUNT(date) sessions
FROM(
  SELECT 
    date,
    channelGrouping,
    totals.totaltransactionrevenue / 1e6 Revenue,
    ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL)) hostnames,
    (SELECT SUM(hits.transaction.transactionshipping) / 1e6 FROM UNNEST(hits) hits) Shipping,
    totals.bounces bounces,
    totals.transactions transactions
  FROM `project_id.dataset_id.ga_sessions_*`
  WHERE 1 = 1
  AND ARRAY_LENGTH(ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL))) > 0
  AND _TABLE_SUFFIX BETWEEN '20170601' AND '20170609'

  UNION ALL
  (...)

),
UNNEST(hostnames) hostname
GROUP BY
  date, channelGrouping, hostname

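Note that in this query I avoid applying the UNNEST operation to the hits field at the outer level; I only do so inside subselects.

To understand why the duplication happens, you have to understand how GA data is aggregated into BigQuery. We basically have two types of data: session-level data and hit-level data. Each visit a customer makes to your website ends up generating one row in BigQuery, such as: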

{fullvisitorid: 1, visitid: 1, date: '20170601', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 0, bounces: 0}}

If the same customer comes back a day later, that generates another row in BQ, for example:

{fullvisitorid: 1, visitid: 2, date: '20170602', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 50000000, bounces: 2}}

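As you can see, the fields outside the hits key relate to the session level, while every hit (that is, every interaction the customer has with your website) adds another entry inside hits. When you apply UNNEST, you are basically cross-joining every value inside the array with the outer fields.

And this is where the duplication happens!

Given the previous example, if we apply UNNEST to the hits field, you end up with something like this: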

fullvisitorid    visitid    totals.totalTransactionRevenue    hits.hitNumber
1                1          0                                 1
1                1          0                                 2
1                2          50000000                          1
1                2          50000000                          2

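Notice that each hit inside the hits field causes the outer fields (such as totals.totalTransactionRevenue) to be repeated for every hitNumber that occurs inside the hits ARRAY.

So if you later apply an operation such as SUM(totals.totalTransactionRevenue), you end up summing this field multiplied by the number of hits the customer had in that visitid.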
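The inflation can be reproduced outside BigQuery. The following is a minimal Python sketch, not BigQuery code: it models the two example session rows as dictionaries and imitates `FROM table, UNNEST(hits) AS h` with a nested loop. The helper name `unnest_hits` is hypothetical and exists only for this illustration.

```python
# Two session rows shaped like the example records above (values copied from them).
sessions = [
    {"fullvisitorid": 1, "visitid": 1,
     "hits": [{"hitNumber": 1}, {"hitNumber": 2}],
     "totals": {"totalTransactionRevenue": 0}},
    {"fullvisitorid": 1, "visitid": 2,
     "hits": [{"hitNumber": 1}, {"hitNumber": 2}],
     "totals": {"totalTransactionRevenue": 50000000}},
]


def unnest_hits(rows):
    """Imitate `FROM table, UNNEST(hits) AS h`: one output row per (session, hit) pair."""
    return [
        {"visitid": s["visitid"],
         "revenue": s["totals"]["totalTransactionRevenue"],
         "hitNumber": h["hitNumber"]}
        for s in rows
        for h in s["hits"]
    ]


flattened = unnest_hits(sessions)

# Summing the session-level field over the flattened rows counts it once per hit.
inflated_revenue = sum(r["revenue"] for r in flattened)

# Summing over the original rows counts it once per session.
session_revenue = sum(s["totals"]["totalTransactionRevenue"] for s in sessions)

print(len(flattened), inflated_revenue, session_revenue)
```

With two hits per session, the flattened sum is twice the session-level sum. The same effect applies to any other session-level field summed after the cross-join, such as totals.visits or totals.bounces in the question's query.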

I tend to avoid the UNNEST operation on the hits field and only do it inside subqueries (where the unnesting happens only at the row level, which does not duplicate data).
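As a rough illustration of that row-level approach, here is a Python sketch that mirrors the shape of the answer's query: each session is first reduced to a single summary (distinct hostnames, summed shipping, revenue), and only the short hostname list is expanded afterwards. The shipping values and the helper name `summarize` are assumptions made up for this example; they do not appear in the original data.

```python
from collections import defaultdict

# Session rows modelled on the example records; the "shipping" values are hypothetical.
sessions = [
    {"date": "20170601", "revenue": 0,
     "hits": [{"hostname": "yourserverhostname", "shipping": None},
              {"hostname": "yourserverhostname", "shipping": None}]},
    {"date": "20170602", "revenue": 50000000,
     "hits": [{"hostname": "yourserverhostname", "shipping": None},
              {"hostname": "yourserverhostname", "shipping": 5000000}]},
]


def summarize(session):
    """Reduce one session to one row, like the subselects over UNNEST(hits)."""
    hostnames = sorted({h["hostname"] for h in session["hits"]
                        if h["hostname"] is not None})
    # Unlike SQL SUM, which returns NULL when every input is NULL, this yields 0.0.
    shipping = sum(h["shipping"] for h in session["hits"]
                   if h["shipping"] is not None) / 1e6
    return {"date": session["date"], "hostnames": hostnames,
            "revenue": session["revenue"] / 1e6, "shipping": shipping}


# Group by (date, hostname) after expanding only the distinct hostnames.
report = defaultdict(lambda: {"revenue": 0.0, "shipping": 0.0, "sessions": 0})
for s in sessions:
    row = summarize(s)
    for hostname in row["hostnames"]:
        bucket = report[(row["date"], hostname)]
        bucket["revenue"] += row["revenue"]
        bucket["shipping"] += row["shipping"]
        bucket["sessions"] += 1

for key in sorted(report):
    print(key, report[key])
```

Because every session here has a single distinct hostname, each session contributes exactly once. A session that touched several hostnames would still be counted once per hostname, so that case deserves a check against your own data.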