Performance improvements in Google BigQuery SQL

Date: 2017-03-10 18:28:26

Tags: sql google-bigquery

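In the Google BigQuery query below, I join two tables, "Data" and "Location", on Id, StartTime and StopTime.

Because the data is partitioned by date, the WHERE clause has a condition on the partition time (_PARTITIONTIME).

The query runs for a long time (about 20 minutes), and I am wondering whether I am missing some performance technique that would make it more efficient.

Any help would be appreciated. Thanks!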

  SELECT
    *
  FROM (
      SELECT
          A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
      FROM
          `Data` AS A
      JOIN
        (SELECT * FROM `Location` WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
        "19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )) AS B
      ON
        A.StartTime < B.DateTime
        AND A.StopTime >= B.DateTime
        AND A.Id = B.Id
  WHERE
    (A._PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01')
      AND TIMESTAMP('2016-11-30'))
  ORDER BY
    B.Id,
    A.Id1,
    B.DateTime )
ORDER BY
  Id,
  Id1,
  DateTime

2 answers:

Answer 0 (score: 1)

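A few thoughts:

  • The inner ORDER BY is not needed, because only the top-level ORDER BY affects the query result.
  • If you want to query every suffix except "25", you can use _TABLE_SUFFIX BETWEEN "01" AND "31" AND _TABLE_SUFFIX != "25"
  • Depending on the type of JOIN, the filter on _PARTITIONTIME may not be "pushed down" automatically to avoid reading extra data, for example if you are actually using a RIGHT JOIN. If that is the case, use a subquery such as (SELECT * FROM YourTable WHERE _PARTITIONTIME BETWEEN ...) AS A RIGHT JOIN ...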
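
The points above could be combined roughly as in the sketch below. It is only an illustration: `Data` and `Location` are the placeholder table names from the question, and the plain JOIN is kept from the question (the same subquery shape would apply to a RIGHT JOIN). Nothing here has been run against the asker's data.

```sql
#standardSQL
SELECT
  A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
FROM (
  -- Filter partitions inside the subquery so the restriction does not
  -- depend on being pushed down through the join.
  SELECT Id, Id1, StartTime, StopTime
  FROM `Data`
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-30')
) AS A
JOIN (
  -- Range filter in place of the long IN list; "25" is still excluded.
  SELECT Id, Latitude, Longitude, DateTime
  FROM `Location`
  WHERE _TABLE_SUFFIX BETWEEN "01" AND "31"
    AND _TABLE_SUFFIX != "25"
) AS B
ON
  A.StartTime < B.DateTime
  AND A.StopTime >= B.DateTime
  AND A.Id = B.Id
-- A single top-level ORDER BY; the inner one from the question is dropped.
ORDER BY
  A.Id, A.Id1, B.DateTime
```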

If you would like a BigQuery engineer to look at the timing in more detail, you can add a sample job ID to your question and someone can help.

Answer 1 (score: 0)

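  • I would also remove the outer ORDER BY, as I think it is the main killer of your query's performance.
  • Moving the _PARTITIONTIME filter into the respective table's subquery is another thing to consider.
  • Using SELECT * in sub-selects does not affect performance or cost, because it is the final outer SELECT that defines which columns are used, in addition to those used in WHERE and other clauses. As good practice, though, I think it is better to list the needed columns/fields explicitly.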

  
#standardSQL
SELECT
  A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
FROM (
  SELECT Id, Id1, StartTime, StopTime 
  FROM `Data` 
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-30')
) AS A
JOIN (
  SELECT Id, Latitude, Longitude, DateTime  -- Id is needed for the join condition A.Id = B.Id
  FROM `Location` 
  WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )
) AS B
ON  A.StartTime < B.DateTime
AND A.StopTime >= B.DateTime
AND A.Id = B.Id   

You could also consider "compressing" the statement below, as Elliott suggested:

WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )  

Be careful, though: a range filter can pull in unwanted tables if your dataset has any, for example tables whose suffix is '011' or '046'.
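
The reason is that the range is compared as strings, not as numbers. The self-contained sketch below uses made-up suffix values (not taken from any real dataset) to show which ones a plain BETWEEN accepts. The `LENGTH(...) = 2` guard is just one possible way to reject longer suffixes, and the sketch does not check whether such a guard still limits which tables are scanned.

```sql
#standardSQL
SELECT
  suffix,
  -- Lexicographic comparison: "011" and "046" sort between "01" and "31".
  suffix BETWEEN "01" AND "31" AS in_range,
  -- Hypothetical guard: additionally require exactly two characters.
  suffix BETWEEN "01" AND "31" AND LENGTH(suffix) = 2 AS in_range_two_chars
FROM
  UNNEST(["01", "011", "046", "25", "31", "32"]) AS suffix
```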

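Another option: there may be some logical relationship between the partitions in Data and the suffixes in Location. If so, you can use it to narrow down the JOIN and make it more performant.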
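
As a purely hypothetical illustration of that idea, assume each Location suffix is the two-digit day of the month and that a Data row only ever matches Location rows from the same day as its partition. Neither assumption is confirmed by the question, and intervals that cross midnight would be lost under the second one. Under those assumptions the join could carry one extra equality condition:

```sql
#standardSQL
SELECT
  A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
FROM (
  SELECT
    Id, Id1, StartTime, StopTime,
    -- Assumption: the partition day lines up with the Location suffix.
    FORMAT_TIMESTAMP('%d', _PARTITIONTIME) AS PartitionDay
  FROM `Data`
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-30')
) AS A
JOIN (
  SELECT Id, Latitude, Longitude, DateTime, _TABLE_SUFFIX AS Suffix
  FROM `Location`
  WHERE _TABLE_SUFFIX BETWEEN "01" AND "31"
    AND _TABLE_SUFFIX != "25"
) AS B
ON
  A.Id = B.Id
  AND A.PartitionDay = B.Suffix  -- hypothetical extra condition narrowing the join
  AND A.StartTime < B.DateTime
  AND A.StopTime >= B.DateTime
```

`PartitionDay` and `Suffix` are names introduced only for this sketch.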