How to list the sizes of all tables in all datasets in Google BigQuery

Time: 2019-01-21 04:22:15

Tags: python sql r python-3.x google-bigquery

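I am trying to figure out how to list the sizes of all tables across all projects in Google BigQuery. It could perhaps be done with a SQL union over multiple tables, but I am looking at a lot of tables here, so I would like some kind of automated solution. I could use R code for this task, or even Python. If anyone here has a solution that lists some metrics (mainly the size of each object, i.e. each table) along with other relevant metrics, please share it here. Many thanks!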

3 answers:

Answer 0 (score: 2)

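@Enle Lin, I actually found a problem in your code: it does not handle the exception raised when a listed project does not have the BigQuery API enabled, and it uses the wrong variable, name instead of projectId. So I adjusted your code and also converted the bytes to GiB (I just thought that was more relevant). See below: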

from google.cloud import bigquery
from google.cloud.bigquery import Dataset
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

#Leverage the Application Default Credentials for authentication
credentials = GoogleCredentials.get_application_default()
service = discovery.build('cloudresourcemanager', 'v1', credentials=credentials)

#List projects
request = service.projects().list()
response = request.execute()

#Main loop to list projects
for project in response.get('projects', []):
  try:
    client = bigquery.Client(project['projectId']) # Start the client in the right project

    #Loop to list datasets
    datasets = list(client.list_datasets())
    if datasets: # If there is some BQ dataset
        print('Datasets in project {}:'.format(project['projectId']))
        #Loop to list the tables in each dataset
        for dataset in datasets:
            print(' - {}'.format(dataset.dataset_id))
            get_sizeGiB = client.query("select table_id, (size_bytes /1073741824) as sizeGiB from "+dataset.dataset_id+".__TABLES__") # This query retrieves all the tables in the dataset and the size in GiB. It can be modified to pull more fields.
            tables = get_sizeGiB.result() # Get the result
            #Loop to list the tables and print the size
            for table in tables:
                print('\t{} sizeGiB: {}'.format(table.table_id,table.sizeGiB))
    else:
        print('{} project does not contain any datasets.'.format(project['projectId']))
  except Exception:
    # Skip projects that cannot be queried, e.g. where the BigQuery API is not enabled
    pass
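
If you would rather keep size_bytes raw in the query and format it on the Python side, a small helper along these lines could work. This is only a sketch: format_size is a hypothetical name, not part of the BigQuery client, and it assumes binary units (1 GiB = 1073741824 bytes, the same divisor used in the query above).

```python
def format_size(num_bytes):
    """Return a human-readable size string using binary (1024-based) units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit
        if size < 1024 or unit == units[-1]:
            return "{:.2f} {}".format(size, unit)
        size /= 1024
```

For instance, print('\t{} size: {}'.format(table.table_id, format_size(table.size_bytes))) would work if the query selects size_bytes directly.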

Answer 1 (score: 0)

Option 1

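As of now, the available option is to use the Google API to fetch the project/dataset/table information and store it in a local table. Since you mention a lot of datasets and tables, I suggest a serverless approach for scalability and processing speed.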

List Project

List dataset

List Table
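
A minimal sketch of this option using the google-cloud-bigquery Python client rather than raw REST calls, assuming Application Default Credentials are set up. The names collect_table_sizes and main are hypothetical; list_datasets, list_tables and get_table are client methods. It covers only the client's default project, so looping over projects and storing the rows somewhere would still have to be added.

```python
def collect_table_sizes(client):
    """Return one dict per table in the client's default project.

    `client` is expected to behave like google.cloud.bigquery.Client.
    """
    rows = []
    for dataset in client.list_datasets():
        for item in client.list_tables(dataset.reference):
            # list_tables returns lightweight items; get_table fetches the
            # full metadata (one API call per table), including the size
            table = client.get_table(item.reference)
            rows.append({
                "project": table.project,
                "dataset": table.dataset_id,
                "table": table.table_id,
                "num_rows": table.num_rows,
                "num_bytes": table.num_bytes,
            })
    return rows


def main():
    # Needs the google-cloud-bigquery package and valid credentials
    from google.cloud import bigquery

    client = bigquery.Client()
    rows = collect_table_sizes(client)
    # Largest tables first; num_bytes can be None for some table types
    for row in sorted(rows, key=lambda r: r["num_bytes"] or 0, reverse=True):
        print("{project}.{dataset}.{table}: {num_bytes} bytes, {num_rows} rows".format(**row))
```

Call main() to run it against a real project. Because each table costs one API request, this can be slow with thousands of tables, which is where the parallel or serverless approach mentioned above may help.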

Option 2

BigQuery now offers access to INFORMATION_SCHEMA through its beta program. Check it out, as it may save you time and effort.

select * from `DATASET.INFORMATION_SCHEMA.TABLES`

select * from `DATASET.INFORMATION_SCHEMA.COLUMNS`

Option 3

You can query __TABLES__ in each dataset to get table information, including size_bytes and row_count:

select * from `project.dataset.__TABLES__`
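
Since __TABLES__ exists per dataset, covering several datasets in one query means a UNION ALL, as the question suggests. Here is a rough sketch that generates such a query. build_tables_union_query is a hypothetical helper name, and the project and dataset IDs are placeholders you would supply yourself. Note that a single query cannot combine datasets stored in different locations.

```python
def build_tables_union_query(project_id, dataset_ids):
    """Build one Standard SQL query that unions __TABLES__ across datasets."""
    if not dataset_ids:
        raise ValueError("dataset_ids must not be empty")
    selects = [
        "SELECT project_id, dataset_id, table_id, row_count, size_bytes "
        "FROM `{}.{}.__TABLES__`".format(project_id, dataset_id)
        for dataset_id in dataset_ids
    ]
    # The trailing ORDER BY applies to the combined result
    return "\nUNION ALL\n".join(selects) + "\nORDER BY size_bytes DESC"
```

The resulting string can be passed to client.query(...) like the queries in the other answers, with the dataset IDs taken from client.list_datasets().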

Answer 2 (score: 0)

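This Python example lists all tables in all projects along with their sizes in bytes. You can use it as a starting point to build a script that fits your use case: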

from google.cloud import bigquery
from google.cloud.bigquery import Dataset
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

# credentials to list project
credentials = GoogleCredentials.get_application_default()
service = discovery.build('cloudresourcemanager', 'v1', credentials=credentials)

# list project
request = service.projects().list()
response = request.execute()

# Main loop for project
for project in response.get('projects', []):
    client = bigquery.Client(project['projectId']) # Start the client in the right project

    # list dataset
    datasets = list(client.list_datasets())
    if datasets: # If there is some BQ dataset
        print('Datasets in project {}:'.format(project['name']))
        # Second loop to list the tables in the dataset
        for dataset in datasets: 
            print(' - {}'.format(dataset.dataset_id))
            get_size = client.query("select table_id, size_bytes as size from "+dataset.dataset_id+".__TABLES__") # This query retrieves all the tables in the dataset and their sizes in bytes. It can be modified to get more fields.
            tables = get_size.result() # Get the result
            # Third loop to list the tables and print the result
            for table in tables:
                print('\t{} size: {}'.format(table.table_id,table.size))

References:

Listing projects:
https://cloud.google.com/resource-manager/reference/rest/v1/projects/list#embedded-explorer

Listing datasets:
https://cloud.google.com/bigquery/docs/datasets#bigquery-list-datasets-python