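BigQuery: select __TABLES__ from all datasets in a project?

Asked: 2017-04-17 18:53:15

Tags: sql google-bigquery

With BigQuery, can I select __TABLES__ from every dataset in my project? I tried SELECT * FROM '*.__TABLES' but BigQuery does not allow that. Any help would be great, thanks!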

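7 answers:

Answer 0 (score: 1)

The __TABLES__ syntax is supported only for a specific dataset; it does not work across datasets.

What you can do is something like the following: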

  
#standardSQL
WITH ALL__TABLES__ AS (
  SELECT * FROM `bigquery-public-data.1000_genomes.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.baseball.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.bls.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.census_bureau_usa.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.cloud_storage_geo_index.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.cms_codes.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.common_us.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.fec.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.genomics_cannabis.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.ghcn_d.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.ghcn_m.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.github_repos.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.hacker_news.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.irs_990.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.medicare.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.new_york.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.nlm_rxnorm.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.noaa_gsod.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.open_images.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.samples.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.san_francisco.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.stackoverflow.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.usa_names.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.utility_us.__TABLES__` 
)
SELECT *
FROM ALL__TABLES__

In this case you need to know the list of datasets in advance, which you can easily get via the Datasets: list API or the corresponding bq ls command.
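
Once you have the dataset IDs, the UNION ALL statement does not have to be typed by hand. A minimal Python sketch, assuming the IDs are already in a list (for example copied from bq ls output); build_union_query is a hypothetical helper name, not part of any BigQuery library, and it only builds the query text:

```python
def build_union_query(project, dataset_ids):
    """Return a query that unions the __TABLES__ view of every given dataset.

    Hypothetical helper: it only builds a string and never contacts BigQuery.
    """
    if not dataset_ids:
        raise ValueError("at least one dataset id is required")
    selects = [
        "SELECT * FROM `{}.{}.__TABLES__`".format(project, dataset_id)
        for dataset_id in dataset_ids
    ]
    # Join with UNION ALL so the last SELECT has no dangling keyword.
    return " UNION ALL\n".join(selects)


if __name__ == "__main__":
    print(build_union_query("bigquery-public-data", ["baseball", "bls", "samples"]))
```

The printed text can be pasted into the WITH clause shown above.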

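Please note: the approach above works only for datasets whose data is in the same location. If your datasets hold data in different locations, you need to query them in separate queries.

For example: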

#standardSQL
WITH ALL_EU__TABLES__ AS (
  SELECT * FROM `bigquery-public-data.common_eu.__TABLES__` UNION ALL
  SELECT * FROM `bigquery-public-data.utility_eu.__TABLES__` 
)
SELECT *
FROM ALL_EU__TABLES__
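
The split by location can also be done mechanically. A minimal Python sketch, assuming you already know each dataset's location as (dataset_id, location) pairs; build_queries_by_location is a hypothetical helper, and how the locations are looked up is left out on purpose:

```python
from collections import defaultdict


def build_queries_by_location(project, dataset_locations):
    """Map each location to one UNION ALL query over the datasets stored there.

    dataset_locations is an iterable of (dataset_id, location) pairs.
    Hypothetical helper: it only builds query strings.
    """
    grouped = defaultdict(list)
    for dataset_id, location in dataset_locations:
        grouped[location].append(dataset_id)
    return {
        location: " UNION ALL\n".join(
            "SELECT * FROM `{}.{}.__TABLES__`".format(project, dataset_id)
            for dataset_id in dataset_ids
        )
        for location, dataset_ids in grouped.items()
    }


if __name__ == "__main__":
    pairs = [("common_us", "US"), ("common_eu", "EU"), ("utility_eu", "EU")]
    for location, query in build_queries_by_location("bigquery-public-data", pairs).items():
        print("--", location)
        print(query)
```

Each resulting query touches a single location, so each one can be run on its own.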

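Answer 1 (score: 1)

I know you asked for a BigQuery solution, but I wrote a Python script that gets the information you asked for; maybe it will help other coders:

Pip install: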

!pip install google-cloud-bigquery
!pip install google-api-python-client
!pip install oauth2client

Code:

import threading

from google.cloud import bigquery

def _worker_query(project, dataset_id, results_scan):
    # Count the rows of one dataset's __TABLES__ view (one row per table).
    query = 'SELECT * FROM `{}.{}.__TABLES__`'.format(project, dataset_id)
    rows = client.query(query).result()
    count = sum(1 for _ in rows)
    # Every thread appends its result to the same shared list.
    results_scan.append({'dataset_id': dataset_id, 'count': count})

def main_execute():
    project = 'bigquery-public-data'
    # List every dataset in the project; each one is queried in its own thread.
    dataset = client.list_datasets(project)

    threads_project = []
    results_scan = []

    for d in dataset:
        t = threading.Thread(target=_worker_query, args=(project,d.dataset_id, results_scan))
        threads_project.append(t)
        t.start()

    for t in threads_project:
        t.join()

    total_count = 0
    for result in results_scan:
        print(result)
        total_count =  total_count + result['count']

    print('\n\nTOTAL TABLES: "{}"'.format(total_count))

JSON_FILE_NAME = 'sa_bq.json'
client = bigquery.Client.from_service_account_json(JSON_FILE_NAME)
main_execute()
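
One thread per dataset can mean a lot of threads in a large project. A possible variation, sketched here under assumptions, uses a bounded pool from concurrent.futures instead. count_tables_per_dataset and its count_tables argument are hypothetical names; in the script above, count_tables would wrap the __TABLES__ query, and max_workers=8 is an arbitrary choice:

```python
from concurrent.futures import ThreadPoolExecutor


def count_tables_per_dataset(dataset_ids, count_tables, max_workers=8):
    """Run count_tables(dataset_id) for every dataset on a bounded thread pool.

    count_tables is any callable returning the number of tables in a dataset.
    """
    dataset_ids = list(dataset_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so counts line up with dataset_ids.
        counts = list(pool.map(count_tables, dataset_ids))
    return dict(zip(dataset_ids, counts))


if __name__ == "__main__":
    # Stand-in counter with made-up numbers so the sketch runs without credentials.
    fake_counts = {"dataset_a": 3, "dataset_b": 7}
    result = count_tables_per_dataset(fake_counts, fake_counts.get)
    print(result, "TOTAL TABLES:", sum(result.values()))
```

Passing the counter in as a callable keeps the pooling logic separate from the BigQuery client, which also makes it easy to try out offline.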

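Answer 2 (score: 1)

Mikhail Berlyant's answer is good. I would like to add that in some cases there is a cleaner way.

If you have only one dataset, that is, the tables all live in the same dataset, and their names follow a pattern, you can query them with a wildcard table.

Suppose you want to query the noaa_gsod dataset (whose tables are named gsod1929, gsod1930, ... 2018, 2019); then simply use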

FROM
  `bigquery-public-data.noaa_gsod.gsod*`

This matches every table in the noaa_gsod dataset whose name starts with the string gsod.
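
To check which tables a prefix wildcard would pick up, the matching rule can be imitated locally. A minimal Python sketch; match_wildcard is a hypothetical helper, the table list is made up for illustration, and the suffix it returns corresponds to what BigQuery exposes as the _TABLE_SUFFIX pseudo-column in wildcard queries:

```python
def match_wildcard(table_names, pattern):
    """Return {table_name: suffix} for tables matched by a trailing-* pattern.

    Hypothetical helper that imitates prefix matching; it does not query BigQuery.
    """
    if not pattern.endswith("*"):
        raise ValueError("this sketch only handles a single trailing *")
    prefix = pattern[:-1]
    return {
        name: name[len(prefix):]
        for name in table_names
        if name.startswith(prefix)
    }


if __name__ == "__main__":
    tables = ["gsod1929", "gsod1930", "other_table"]
    print(match_wildcard(tables, "gsod*"))
```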

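Answer 3 (score: 1)

Building on @mikhail-berlyant's nice solution above, it is now possible to use BigQuery's scripting features to automate both collecting the dataset list and retrieving the table metadata. Just replace the *_name variables to generate a metadata view covering all tables in a given project.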

DECLARE project_name STRING;
DECLARE dataset_name STRING;
DECLARE table_name STRING;
DECLARE view_name STRING;
DECLARE generate_metadata_query_for_all_datasets STRING;
DECLARE retrieve_table_metadata STRING;
DECLARE persist_table_metadata STRING;
DECLARE create_table_metadata_view STRING;

SET project_name = "your-project";
SET dataset_name = "your-dataset";
SET table_name = "your-table";
SET view_name = "your-view";

SET generate_metadata_query_for_all_datasets = CONCAT("SELECT STRING_AGG( CONCAT(\"select * from `",project_name,".\", schema_name, \".__TABLES__` \"), \"union all \\n\" ) AS datasets FROM `",project_name,"`.INFORMATION_SCHEMA.SCHEMATA");
SET
  retrieve_table_metadata = generate_metadata_query_for_all_datasets;
SET create_table_metadata_view = CONCAT(
"""
 CREATE VIEW IF NOT EXISTS
`""",project_name,".",dataset_name,".",view_name,"""`
AS
 SELECT
  project_id
  ,dataset_id
  ,table_id
  ,DATE(TIMESTAMP_MILLIS(creation_time)) AS created_date
  ,TIMESTAMP_MILLIS(creation_time) AS created_at
  ,DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
  ,TIMESTAMP_MILLIS(last_modified_time) AS last_modified_at
  ,row_count
  ,size_bytes
  ,round(safe_divide(size_bytes, (1000*1000)),1) as size_mb
  ,round(safe_divide(size_bytes, (1000*1000*1000)),2) as size_gb
  ,CASE
    WHEN type = 1 THEN 'native table'
    WHEN type = 2 THEN 'view'
    WHEN type = 3 THEN 'external table'
    ELSE 'unknown'
  END AS type
 FROM `""",project_name,".",dataset_name,".",table_name,"""`
 ORDER BY dataset_id, table_id asc""");
EXECUTE IMMEDIATE retrieve_table_metadata INTO persist_table_metadata;
EXECUTE IMMEDIATE CONCAT("CREATE OR REPLACE TABLE `",project_name,".",dataset_name,".",table_name,"` AS (",persist_table_metadata,")");
EXECUTE IMMEDIATE create_table_metadata_view;

After that you can query your new view.

    SELECT * FROM `[PROJECT ID].[DATASET ID].[VIEW NAME]`

Answer 4 (score: 0)

You can use this SQL query to generate one SELECT statement per dataset in your project:

select  string_agg(
      concat("select * from `[PROJECT ID].", schema_name, ".__TABLES__` ")
    , "union all \n"
)
from `[PROJECT ID]`.INFORMATION_SCHEMA.SCHEMATA;

You will get a list like this:

select * from `[PROJECT ID].[DATASET ID 1].__TABLES__` union all 
select * from `[PROJECT ID].[DATASET ID 2].__TABLES__` union all 
select * from `[PROJECT ID].[DATASET ID 3].__TABLES__` union all 
select * from `[PROJECT ID].[DATASET ID 4].__TABLES__` 
...

Then paste that list into this query:

SELECT 
    table_id
    ,DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date
    ,DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
    ,row_count
    ,size_bytes
    ,round(safe_divide(size_bytes, (1000*1000)),1) as size_mb
    ,round(safe_divide(size_bytes, (1000*1000*1000)),2) as size_gb
    ,CASE
        WHEN type = 1 THEN 'table'
        WHEN type = 2 THEN 'view'
        WHEN type = 3 THEN 'external'
        ELSE '?'
     END AS type
    ,TIMESTAMP_MILLIS(creation_time) AS creation_time
    ,TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time
    ,FORMAT_TIMESTAMP("%Y-%m", TIMESTAMP_MILLIS(last_modified_time)) as last_modified_month
    ,dataset_id
    ,project_id
FROM 
(   
    select * from `[PROJECT ID].[DATASET ID 1].__TABLES__` union all 
    select * from `[PROJECT ID].[DATASET ID 2].__TABLES__` union all 
    select * from `[PROJECT ID].[DATASET ID 3].__TABLES__` union all 
    select * from `[PROJECT ID].[DATASET ID 4].__TABLES__`
)
ORDER BY dataset_id, table_id asc 
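
The derived columns in that query are plain unit conversions. A minimal Python sketch of the same arithmetic on a single row, assuming the row is available as a dict with the __TABLES__ column names; describe_tables_row is a hypothetical helper, and Python's round may differ from SQL ROUND on exact ties:

```python
from datetime import datetime, timezone

# Type codes as mapped by the CASE expression in the query above.
TABLE_TYPES = {1: "table", 2: "view", 3: "external"}


def describe_tables_row(row):
    """Derive the readable columns from one __TABLES__ row given as a dict.

    creation_time and last_modified_time are milliseconds since the Unix epoch.
    """
    created = datetime.fromtimestamp(row["creation_time"] / 1000, tz=timezone.utc)
    modified = datetime.fromtimestamp(row["last_modified_time"] / 1000, tz=timezone.utc)
    return {
        "creation_date": created.date(),
        "last_modified_date": modified.date(),
        "last_modified_month": modified.strftime("%Y-%m"),
        # Decimal units, as in the query: 1 MB = 1000 * 1000 bytes.
        "size_mb": round(row["size_bytes"] / (1000 * 1000), 1),
        "size_gb": round(row["size_bytes"] / (1000 * 1000 * 1000), 2),
        "type": TABLE_TYPES.get(row["type"], "?"),
    }


if __name__ == "__main__":
    # Made-up row for illustration only.
    sample = {"creation_time": 0, "last_modified_time": 86400000,
              "size_bytes": 1500000000, "type": 2}
    print(describe_tables_row(sample))
```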

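Answer 5 (score: 0)

You can extend Mikhail Berlyant's answer and generate the SQL automatically with a single script.

INFORMATION_SCHEMA.SCHEMATA lists all datasets. You can use a WHILE loop to generate all the UNION ALL statements dynamically, like this: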

DECLARE schemas ARRAY<string>;
DECLARE query string;
DECLARE i INT64 DEFAULT 0;
DECLARE arrSize INT64;

SET schemas = ARRAY(select schema_name from <your_project>.INFORMATION_SCHEMA.SCHEMATA);
SET query = "SELECT * FROM (";
SET arrSize = ARRAY_LENGTH(schemas);

WHILE i < arrSize - 1 DO
  SET query = CONCAT(query, "SELECT '", schemas[OFFSET(i)], "', table_id, row_count, size_bytes FROM `<your_project>.", schemas[OFFSET(i)], ".__TABLES__` UNION ALL ");
  SET i = i + 1;
END WHILE;

SET query = CONCAT(query, "SELECT '", schemas[ORDINAL(arrSize)], "', table_id, row_count, size_bytes FROM `<your_project>.", schemas[ORDINAL(arrSize)], ".__TABLES__` )");

EXECUTE IMMEDIATE query;

Answer 6 (score: 0)

Maybe you can use INFORMATION_SCHEMA instead of __TABLES__:

SELECT * FROM region-us.INFORMATION_SCHEMA.TABLES;

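Just replace region-us with the region where your datasets live. If you have multiple regions you need to use UNION ALL, but it is still simpler than using UNION across all datasets.

Alternatively, you can use a query to generate all the unions, like this: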

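WITH SelectTable AS (
  -- One SELECT statement per dataset that contains at least one table.
  SELECT 'SELECT * FROM ' || table_schema || '.__TABLES__' AS SelectColumn
  FROM region-us.INFORMATION_SCHEMA.TABLES
  GROUP BY table_schema
)
-- Use UNION ALL as the separator so the last line has no dangling UNION ALL.
SELECT STRING_AGG(SelectColumn, ' UNION ALL\n') FROM SelectTable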