Generalized pivot table with deep sorting in Google BigQuery

Asked: 2017-02-17 20:44:24

Tags: sql google-bigquery

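This is a follow-up to "Multi-level pivot in Google BigQuery", where I asked whether a nested pivot table could be built in Google BigQuery with a single query. It can, so in this follow-up question I'd like to explore the general case.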

Here is a sample of the data I'm working with (also included in this shared Google Sheet):

[image: sample of the data]

Now I'd like to build a pivot table with the following properties:

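  • Nesting at both the row and the column level (the previous question only had nested columns)
  • Subtotals in both rows and columns (the previous one only had grand totals)
  • Multiple metrics (previously there was only one metric)
  • Multiple sorts: deep (by a metric) and alphabetical (previously there were no sort criteria)
  • Limits (previously there were no limits)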

Here is the pivot as built in Google Sheets:

[image: pivot table built in Google Sheets]

The conceptual SQL statement here is:

SELECT
    SUM(price),
    COUNT(price) 
BROKEN DOWN BY
    Studio (row),
    Title (row),
    Territory ID (col),
    Type (col)
SORTED/LIMITED BY
    Studio ==> A-Z, LIMIT 3,
    Title ==> SUM(price) in GRAND TOTAL DESC, LIMIT 4,
    Territory ID ==> COUNT(price) in Paramount TOTAL, LIMIT 2,
    Type ==> A-Z, NO LIMIT

I'm not sure how to express subtotals conceptually, but we should be able to specify them for each breakdown field.

Is it possible to do the above in a single SQL statement in Google BigQuery? What would be the steps to produce it?

1 answer:

Answer 0 (score: 2)

  

"What if we do an aggregation and get 10M results? Unless we apply the limits etc. in BigQuery, the amount of data transferred will be huge..."

  

Let's clarify the challenge here:

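Typically, you would run something like the query below on the back end and pull the result into a visualization tool (the front end) for further manipulation such as sorting, limiting, pivoting and so on.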

#standardSQL
SELECT
  Studio, 
  Title, 
  TerritoryID,
  Type, 
  SUM(Price) AS Price, 
  COUNT(1) AS Volume
FROM yourTable
GROUP BY Studio, Title, TerritoryID, Type   

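As you mentioned, in your case such a result can easily run to 10M+ rows, and you want to reduce its size without compromising your ability to still present the final data in the pivot/visualization on the front end.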

  

A. Proposal / Solution

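The following shows how to achieve this by applying the sorting and limits on the back end (so the result size shrinks dramatically) without losing the ability to pivot and still show totals, etc.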

Let's start with a simplified version of the final query.

  • Initial query (skeleton)

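Let's assume that, based on known criteria, we know in advance which studios, titles, territories and types should be selected. In that case, the query below returns the needed data: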

#standardSQL
WITH Studios AS (
  SELECT 'Fox' AS Studio
  UNION ALL SELECT 'Paramount'
),
Titles AS (
  SELECT 'Fox' AS Studio,'Best Laid Plans' AS Title
  UNION ALL SELECT 'Fox','Homecoming'
  UNION ALL SELECT 'Paramount','Titanic'
  UNION ALL SELECT 'Paramount','Homecoming'
),
Territories AS (
  SELECT 'US' AS TerritoryID
  UNION ALL SELECT 'GB'
),
Totals AS (
  SELECT 
    IFNULL(b.Studio,'Other') AS Studio, 
    IFNULL(b.Title,'Other') AS Title, 
    IFNULL(c.TerritoryID,'Other') AS TerritoryID, 
    Type,
    ROUND(SUM(Price), 2) AS Price, COUNT(1) AS Volume
  FROM yourTable AS a 
  LEFT JOIN Titles AS b ON a.Studio = b.Studio AND a.Title = b.Title
  LEFT JOIN Territories AS c ON a.TerritoryID = c.TerritoryID
  GROUP BY Studio, Title, TerritoryID, Type
)
SELECT * FROM Totals
ORDER BY Studio, Title, TerritoryID, Type

The output will look like this:

Studio      Title           TerritoryID Type        Price    Volume  
Fox         Best Laid Plans GB          Movie         87.32    18    
Fox         Best Laid Plans GB          TV Episode    50.17    23    
Fox         Best Laid Plans Other       TV Episode  1131.0      2    
Fox         Best Laid Plans US          Movie        120.82    18    
Fox         Best Laid Plans US          TV Episode    53.76    24    
Fox         Homecoming      GB          TV Episode    60.22    28    
Fox         Homecoming      Other       TV Episode  2262.0      4    
Fox         Homecoming      US          TV Episode   128.45    58    
Other       Other           GB          Movie        142.71    29    
Other       Other           GB          TV Episode    84.8     40    
Other       Other           Other       Movie       3292.0      4    
Other       Other           Other       TV Episode  3282.0     16    
Other       Other           US          Movie         52.92     8    
Other       Other           US          TV Episode   233.05   101    
Paramount   Homecoming      GB          Movie         18.96     4    
Paramount   Homecoming      US          Movie        124.84    16    
Paramount   Titanic         GB          Movie         41.92     8    
Paramount   Titanic         Other       Movie         12.0      4    
Paramount   Titanic         US          Movie        139.84    16   

You can easily feed this back to your UI and visualize it in whatever way you need.
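
As a rough illustration of what "feeding it to the UI" could look like, here is a minimal Python sketch that turns a few of the long-format rows above into a nested row/column structure. The function name `pivot` and the choice of plain tuples and dictionaries are assumptions made for this example; a real front end would use whatever pivot component it already has.

```python
from collections import defaultdict

# A few rows copied from the output above:
# (Studio, Title, TerritoryID, Type, Price, Volume)
rows = [
    ("Fox", "Homecoming", "GB", "TV Episode", 60.22, 28),
    ("Fox", "Homecoming", "US", "TV Episode", 128.45, 58),
    ("Paramount", "Titanic", "GB", "Movie", 41.92, 8),
    ("Paramount", "Titanic", "US", "Movie", 139.84, 16),
]


def pivot(rows):
    """Map each (Studio, Title) row key to {(TerritoryID, Type): (Price, Volume)}."""
    table = defaultdict(dict)
    for studio, title, territory, type_, price, volume in rows:
        table[(studio, title)][(territory, type_)] = (price, volume)
    return dict(table)


pivoted = pivot(rows)
for row_key, cells in sorted(pivoted.items()):
    print(row_key, cells)
```

Each outer key is one pivot row (Studio, Title) and each inner key is one pivot column (TerritoryID, Type), so both metrics stay available per cell.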

  • "Final" query

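Now, instead of hard-coded values in all the involved dimensions, let's implement the actual criteria for each dimension.
So the only change in the query below (compared with the skeleton query above) is in the following CTEs: Studios, Titles and Territories.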

#standardSQL
WITH Studios AS (
  SELECT DISTINCT Studio 
  FROM yourTable 
  ORDER BY Studio LIMIT 3
),
Titles AS (
  SELECT Studio, Title 
  FROM (
    SELECT Studio, Title, ROW_NUMBER() OVER(PARTITION BY Studio ORDER BY PRICE DESC) AS pos
    FROM (SELECT Studio, Title, SUM(Price) AS Price FROM yourTable GROUP BY Studio, Title)
  ) WHERE pos <= 4
),
Territories AS (
  SELECT TerritoryID FROM yourTable  
  WHERE Studio = 'Paramount' GROUP BY TerritoryID
  ORDER BY COUNT(1) DESC LIMIT 2
),
Totals AS (
  SELECT 
    IFNULL(b.Studio,'Other') AS Studio, 
    IFNULL(b.Title,'Other') AS Title, 
    IFNULL(c.TerritoryID,'Other') AS TerritoryID, 
    Type,
    ROUND(SUM(Price), 2) AS Price, COUNT(1) AS Volume
  FROM yourTable AS a 
  LEFT JOIN Titles AS b ON a.Studio = b.Studio AND a.Title = b.Title
  LEFT JOIN Territories AS c ON a.TerritoryID = c.TerritoryID
  GROUP BY Studio, Title, TerritoryID, Type
)
SELECT * FROM Totals
WHERE TerritoryID != 'Other'
ORDER BY Studio, TerritoryID DESC, Type, Price DESC, Title

The result looks like this:

Studio      Title           TerritoryID Type        Price  Volume    
Fox         Best Laid Plans         US  Movie       120.82  18   
Fox         Titanic                 US  Movie        52.92   8   
Fox         1:00 P.M. - 2:00 P.M.   US  TV Episode  187.25  81   
Fox         Homecoming              US  TV Episode  128.45  58   
Fox         Best Laid Plans         US  TV Episode   53.76  24   
Fox         Best Laid Plans         GB  Movie        87.32  18   
Fox         Titanic                 GB  Movie        78.84  16   
Fox         1:00 P.M. - 2:00 P.M.   GB  TV Episode   61.42  28   
Fox         Homecoming              GB  TV Episode   60.22  28   
Fox         Best Laid Plans         GB  TV Episode   50.17  23   
Paramount   Titanic                 US  Movie       139.84  16   
Paramount   Homecoming              US  Movie       124.84  16   
Paramount   Titanic                 GB  Movie        41.92   8   
Paramount   Homecoming              GB  Movie        18.96   4   
Sony        Best Laid Plans         US  TV Episode   22.9   10   
Sony        Homecoming              US  TV Episode   22.9   10   
Sony        Best Laid Plans         GB  Movie        63.87  13   
Sony        Homecoming              GB  TV Episode   18.81   9   
Sony        Best Laid Plans         GB  TV Episode    4.57   3       
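
To make the logic of the Titles CTE easier to follow, here is a small Python sketch of the same idea: sum the price per (Studio, Title), rank titles within each studio by that sum in descending order, and keep the first N. The function name `top_titles_per_studio` and the sample values are hypothetical and only illustrate the ranking; they are not taken from the data above.

```python
from collections import defaultdict


def top_titles_per_studio(rows, limit):
    """rows: iterable of (studio, title, price).

    Returns (studio, title) pairs for the `limit` best titles per studio,
    ranked by summed price, highest first.
    """
    # Equivalent of: SELECT Studio, Title, SUM(Price) ... GROUP BY Studio, Title
    totals = defaultdict(float)
    for studio, title, price in rows:
        totals[(studio, title)] += price

    by_studio = defaultdict(list)
    for (studio, title), total in totals.items():
        by_studio[studio].append((total, title))

    # Equivalent of: ROW_NUMBER() OVER (PARTITION BY Studio ORDER BY Price DESC) <= limit
    selected = []
    for studio in sorted(by_studio):
        ranked = sorted(by_studio[studio], key=lambda pair: pair[0], reverse=True)
        selected.extend((studio, title) for _, title in ranked[:limit])
    return selected


# Hypothetical sample values, for illustration only.
sample = [
    ("Fox", "A", 10.0),
    ("Fox", "B", 5.0),
    ("Fox", "A", 2.0),
    ("Fox", "C", 7.0),
    ("Paramount", "X", 1.0),
]
selected = top_titles_per_studio(sample, 2)
print(selected)
```

Note that this ranks titles inside each studio, as the query does. Ranking by the grand total across all studios would need the partitioning removed.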

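The point here is this: while BigQuery is very efficient at analyzing billions of rows and extracting the needed information, it is quite inefficient to use BigQuery to tailor the result data to mirror how that result will actually be rendered in the presentation layer of the client UI. Instead, you should pass this data to the UI and process it there with visualization code.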
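
Subtotals are one example of work that can be left to the client. The sketch below, which assumes the reduced result has already been fetched as plain tuples, computes a subtotal per (Studio, TerritoryID) plus a grand total in Python. The function name `add_totals` is hypothetical, and the rows are the Paramount rows from the result above.

```python
from collections import defaultdict

# Paramount rows copied from the result above:
# (Studio, Title, TerritoryID, Type, Price, Volume)
rows = [
    ("Paramount", "Titanic", "US", "Movie", 139.84, 16),
    ("Paramount", "Homecoming", "US", "Movie", 124.84, 16),
    ("Paramount", "Titanic", "GB", "Movie", 41.92, 8),
    ("Paramount", "Homecoming", "GB", "Movie", 18.96, 4),
]


def add_totals(rows):
    """Return ({(Studio, TerritoryID): (price, volume)}, (grand_price, grand_volume))."""
    by_studio_territory = defaultdict(lambda: [0.0, 0])
    grand_total = [0.0, 0]
    for studio, _title, territory, _type, price, volume in rows:
        cell = by_studio_territory[(studio, territory)]
        cell[0] += price
        cell[1] += volume
        grand_total[0] += price
        grand_total[1] += volume
    # Round prices to two decimals, as the query does with ROUND(SUM(Price), 2).
    rounded = {key: (round(p, 2), v) for key, (p, v) in by_studio_territory.items()}
    return rounded, (round(grand_total[0], 2), grand_total[1])


subtotals, grand_total = add_totals(rows)
print(subtotals)
print(grand_total)
```

Because the rows are already aggregated and limited by BigQuery, this pass only touches the small result set, not the underlying table.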