Undesired flattening happening

Time: 2015-11-25 22:06:10

Tags: google-analytics google-bigquery

I am using BigQuery on exported GA data (see the schema here).

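Looking at the documentation, I see that when I select a field within a record, it automatically flattens that record and duplicates the surrounding columns.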

So I tried to create a denormalized table that I can query with a more SQL-like mindset:

SELECT
  CONCAT( date, " ", if (hits.hour < 10,
      CONCAT("0", STRING(hits.hour)),
      STRING(hits.hour)), ":", IF(hits.minute < 10, CONCAT("0", STRING(hits.minute)), STRING(hits.minute)) ) AS hits.date__STRING,
  CONCAT(fullVisitorId, STRING(visitId)) AS session_id__STRING,
  fullVisitorId AS google_identity__STRING,
  MAX(IF(hits.customDimensions.index=7, hits.customDimensions.value,NULL)) WITHIN RECORD AS customer_id__LONG,
  hits.hitNumber AS hit_number__INT,
  hits.type AS hit_type__STRING,
  hits.isInteraction AS hit_is_interaction__BOOLEAN,
  hits.isEntrance AS hit_is_entrance__BOOLEAN,
  hits.isExit AS hit_is_exit__BOOLEAN,
  hits.promotion.promoId AS promotion_id__STRING,
  hits.promotion.promoName AS promotion_name__STRING,
  hits.promotion.promoCreative AS promotion_creative__STRING,
  hits.promotion.promoPosition AS promotion_position__STRING,
  hits.eventInfo.eventCategory AS event_category__STRING,
  hits.eventInfo.eventAction AS event_action__STRING,
  hits.eventInfo.eventLabel AS event_label__STRING,
  hits.eventInfo.eventValue AS event_value__INT,
  device.language AS device_language__STRING,
  device.screenResolution AS device_resolution__STRING,
  device.deviceCategory AS device_category__STRING,
  device.operatingSystem AS device_os__STRING,
  geoNetwork.country AS geo_country__STRING,
  geoNetwork.region AS geo_region__STRING,
  hits.page.searchKeyword AS hit_search_keyword__STRING,
  hits.page.searchCategory AS hits_search_category__STRING,
  hits.page.pageTitle AS hits_page_title__STRING,
  hits.page.pagePath AS page_path__STRING,
  hits.page.hostname AS page_hostname__STRING,
  hits.eCommerceAction.action_type AS commerce_action_type__INT,
  hits.eCommerceAction.step AS commerce_action_step__INT,
  hits.eCommerceAction.option AS commerce_action_option__STRING,
  hits.product.productSKU AS product_sku__STRING,
  hits.product.v2ProductName AS product_name__STRING,
  hits.product.productRevenue AS product_revenue__INT,
  hits.product.productPrice AS product_price__INT,
  hits.product.productQuantity AS product_quantity__INT,
  hits.product.productRefundAmount AS hits.product.product_refund_amount__INT,
  hits.product.v2ProductCategory AS product_category__STRING,
  hits.transaction.transactionId AS transaction_id__STRING,
  hits.transaction.transactionCoupon AS transaction_coupon__STRING,
  hits.transaction.transactionRevenue AS transaction_revenue__INT,
  hits.transaction.transactionTax AS transaction_tax__INT,
  hits.transaction.transactionShipping AS transaction_shipping__INT,
  hits.transaction.affiliation AS transaction_affiliation__STRING,
  hits.appInfo.screenName AS app_current_name__STRING,
  hits.appInfo.screenDepth AS app_screen_depth__INT,
  hits.appInfo.landingScreenName AS app_landing_screen__STRING,
  hits.appInfo.exitScreenName AS app_exit_screen__STRING,
  hits.exceptionInfo.description AS exception_description__STRING,
  hits.exceptionInfo.isFatal AS exception_is_fatal__BOOLEAN
FROM
  [98513938.ga_sessions_20151112]
 HAVING
  customer_id__LONG IS NOT NULL
  AND customer_id__LONG != 'NA'
  AND customer_id__LONG != ''

I write the results of this query to another table, denorm (flattened, large data set).

When I query denorm with the clause

WHERE session_id__STRING = "100001897901013346771447300813"

I get different results than when I wrap the above query (which produces the desired result):

SELECT * FROM (_above query_) AS foo WHERE session_id__STRING = "100001897901013346771447300813"

I'm sure this is by design, but it would be very helpful if someone could explain the difference between these two approaches.

2 answers:

Answer 0 (score: 0)

I take it you did check the "Flatten Results" box when creating the output table? And I assume from your question that session_id__STRING is a repeated field?

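If those assumptions are correct, then what you are seeing is exactly the behavior you quoted from the documentation above. You asked BigQuery to "Flatten Results", so it turned your repeated field into a non-repeated one and duplicated all the fields around it, giving you a flat table (that is, one with no repeated data).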
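To make the duplication concrete, here is a minimal Python sketch of what flattening does to one record. It is only a model of the behavior described above, not BigQuery code: the `flatten` helper, the `session` record, and its field names and values are all hypothetical.

```python
def flatten(record, repeated):
    """Emit one flat row per value of the repeated field, copying every other field."""
    outer = {k: v for k, v in record.items() if k != repeated}
    return [{**outer, repeated: value} for value in record[repeated]]

# Hypothetical session with one repeated field; the values are made up.
session = {"session_id": "s1", "country": "US", "hit_number": [1, 2, 3]}

rows = flatten(session, "hit_number")
for row in rows:
    print(row)
```

One nested record becomes three flat rows, each carrying its own copy of `session_id` and `country`.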

If the behavior you want is the one you see when querying the subquery, uncheck that box when creating the table.

Answer 1 (score: 0)

  

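"Looking at the documentation, I see that when I select a field within a record, it automatically flattens that record and duplicates the surrounding columns."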

That is not correct. By the way, could you point to that documentation? It needs to be improved.

Selecting a field does not flatten the record. So if you have a table T with a single record {a = 1, b = (2, 2, 3)}, then with

SELECT * FROM T WHERE b = 2

you still get a single record {a = 1, b = (2, 2)}. A SELECT COUNT(a) over this subquery returns 1.

But once you write the results of this query with flatten = on, you get two records: {a = 1, b = 2} and {a = 1, b = 2}. A SELECT COUNT(a) from the flattened table returns 2.
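The two counts can be reproduced with a small Python model of table T. This is a sketch of the semantics described in this answer, not BigQuery itself; the helper names `where_b_equals` and `flatten_b` are hypothetical.

```python
# The single record of table T from the example above.
T = [{"a": 1, "b": [2, 2, 3]}]

def where_b_equals(table, value):
    """Keep only the matching values of b inside each record; drop records with no match."""
    out = []
    for rec in table:
        kept = [x for x in rec["b"] if x == value]
        if kept:
            out.append({"a": rec["a"], "b": kept})
    return out

def flatten_b(table):
    """Emit one flat row per value of b, copying a alongside it."""
    return [{"a": rec["a"], "b": x} for rec in table for x in rec["b"]]

nested = where_b_equals(T, 2)  # still one record: {a = 1, b = (2, 2)}
flat = flatten_b(nested)       # two rows once the result is flattened

print(len(nested), len(flat))  # prints "1 2"
```

Counting `a` over `nested` corresponds to querying the subquery directly, while counting over `flat` corresponds to querying the table that was written with flatten = on.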