Question

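I want to copy a set of tables from one dataset to another within the same project. I am running the code in an IPython notebook.

I use the following code to get the names of the tables to copy into the variable "value":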

list = bq.DataSet('test:TestDataset')

for x in list.tables():
   if(re.match('table1(.*)',x.name.table_id)):
     value = 'test:TestDataset.'+ x.name.table_id

Then I tried to use the "bq cp" command to copy the table from one dataset to the other, but I cannot get the bq command to run in the notebook.

!bq cp $value proj1:test1.table1_20162020

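Note:

I looked through the bigquery commands to see whether there is a copy command associated with them, but could not find one.

Any help would be appreciated!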

Answer 1

If you are using the BigQuery API with Python, you can run a copy job:

https://cloud.google.com/bigquery/docs/tables#copyingtable

The Python example from the docs:

import pprint
import time

from googleapiclient.errors import HttpError


def copyTable(service):
    try:
        sourceProjectId = input("What is your source project? ")
        sourceDatasetId = input("What is your source dataset? ")
        sourceTableId = input("What is your source table? ")

        targetProjectId = input("What is your target project? ")
        targetDatasetId = input("What is your target dataset? ")
        targetTableId = input("What is your target table? ")

        jobCollection = service.jobs()
        jobData = {
            "projectId": sourceProjectId,
            "configuration": {
                "copy": {
                    "sourceTable": {
                        "projectId": sourceProjectId,
                        "datasetId": sourceDatasetId,
                        "tableId": sourceTableId,
                    },
                    "destinationTable": {
                        "projectId": targetProjectId,
                        "datasetId": targetDatasetId,
                        "tableId": targetTableId,
                    },
                    "createDisposition": "CREATE_IF_NEEDED",
                    "writeDisposition": "WRITE_TRUNCATE"
                }
            }
        }

        insertResponse = jobCollection.insert(projectId=targetProjectId, body=jobData).execute()

        # Ping for status until it is done, with a short pause between calls.
        while True:
            status = jobCollection.get(projectId=targetProjectId,
                                       jobId=insertResponse['jobReference']['jobId']).execute()
            if 'DONE' == status['status']['state']:
                break
            print('Waiting for the import to complete...')
            time.sleep(10)

        if 'errors' in status['status']:
            print('Error loading table:')
            pprint.pprint(status)
            return

        print('Loaded the table:')
        pprint.pprint(status)

        # Now query and print out the generated results table.
        # queryTableData is not defined in this snippet; it has to come from elsewhere.
        queryTableData(service, targetProjectId, targetDatasetId, targetTableId)

    except HttpError as err:
        print('Error in loadTable:')
        pprint.pprint(err.resp)

The bq cp command does basically the same thing internally (you could also call that function directly, depending on which bq you are importing).
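Since the question already has table names in the "project:dataset.table" form, a small helper can turn such a string into the copy-job body used above instead of prompting for each part. This is only a sketch: `parse_table_spec` and `build_copy_job` are hypothetical names, and the parsing assumes a plain project ID (a domain-scoped ID that itself contains ":" or "." would need extra handling).

```python
def parse_table_spec(spec):
    """Split 'project:dataset.table' into the three IDs the API expects."""
    project, _, rest = spec.partition(":")
    dataset, _, table = rest.partition(".")
    if not (project and dataset and table):
        raise ValueError("expected 'project:dataset.table', got %r" % spec)
    return {"projectId": project, "datasetId": dataset, "tableId": table}


def build_copy_job(source_spec, dest_spec):
    """Build a jobs.insert request body with the same copy settings as above."""
    return {
        "configuration": {
            "copy": {
                "sourceTable": parse_table_spec(source_spec),
                "destinationTable": parse_table_spec(dest_spec),
                "createDisposition": "CREATE_IF_NEEDED",
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }


# Table names taken from the question; the source table name is illustrative.
body = build_copy_job("test:TestDataset.table1_a", "proj1:test1.table1_20162020")
print(body["configuration"]["copy"]["destinationTable"])
```

The resulting dictionary can be passed as `body=` to `jobCollection.insert(...)` in place of the hand-written `jobData`.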

Answer 2

I am not sure why it does not work for you, because it works fine for me.

projectFrom = 'project1'
datasetFrom = 'dataset1'
tableSearchString = 'test1'

projectTo = 'project2'
datasetTo = 'dataset2'

tables = bq.DataSet(projectFrom + ':' + datasetFrom).tables()

for table in tables:
  if tableSearchString in table.name.table_id:

    tableFrom = projectFrom + ':' + datasetFrom + '.' + table.name.table_id
    tableTo = projectTo + ':' + datasetTo + '.' + table.name.table_id

    !bq cp $tableFrom $tableTo

Try this in your notebook, since it works for me.
Just wondering, what error code does your script return?
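If the `!bq` shell escape is what fails in your notebook, one alternative is to call the bq CLI through `subprocess`, which also gives you the exit code for each copy. This is a rough sketch, assuming bq is installed and on the PATH of the notebook kernel; `bq_cp_command` and `copy_tables` are hypothetical helper names, and the `runner` argument exists only so the function can be tried without actually calling bq.

```python
import subprocess


def bq_cp_command(source, destination):
    """Return the argument list for: bq cp <source> <destination>."""
    return ["bq", "cp", source, destination]


def copy_tables(table_ids, source_prefix, dest_prefix, runner=subprocess.run):
    """Run one bq cp per table ID and return {table_id: exit code}."""
    results = {}
    for table_id in table_ids:
        cmd = bq_cp_command(source_prefix + table_id, dest_prefix + table_id)
        completed = runner(cmd, capture_output=True, text=True)
        results[table_id] = completed.returncode
        if completed.returncode != 0:
            # Show whatever bq printed so the failure can be diagnosed.
            print("bq cp failed for %s: %s"
                  % (table_id, completed.stderr or completed.stdout))
    return results


# Example call (commented out because it really invokes bq):
# copy_tables(["table1_a", "table1_b"], "project1:dataset1.", "project2:dataset2.")
```

Note that bq cp may ask for confirmation when the destination table already exists; the CLI has `-f` (force) and `-n` (no clobber) flags for that case, which you could add to the argument list.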

Answer 3

I think this will help you.

    tables = source_dataset.list_tables()
    for table in tables:
        #print table.name
        job_id = str(uuid.uuid4())
        dest_table = dest_dataset.table(table.name)
        source_table = source_dataset.table(table.name)
        if not dest_table.exists():
            job = self.bigquery_client.copy_table(job_id, dest_table, source_table)
            job.create_disposition = (google.cloud.bigquery.job.CreateDisposition.CREATE_IF_NEEDED)
            job.begin()
            job.result()

Answer 4

I wrote the following script, which copies all tables from one dataset to another with a couple of validations.

from google.cloud import bigquery

client = bigquery.Client()

projectFrom = 'source_project_id'
datasetFrom = 'source_dataset'

projectTo = 'destination_project_id'
datasetTo = 'destination_dataset'

# Creating dataset references from the google bigquery client
dataset_from = client.dataset(dataset_id=datasetFrom, project=projectFrom)
dataset_to = client.dataset(dataset_id=datasetTo, project=projectTo)

for source_table_ref in client.list_dataset_tables(dataset=dataset_from):
    # Destination table reference
    destination_table_ref = dataset_to.table(source_table_ref.table_id)

    job = client.copy_table(
      source_table_ref,
      destination_table_ref)

    job.result()
    assert job.state == 'DONE'

    dest_table = client.get_table(destination_table_ref)
    source_table = client.get_table(source_table_ref)

    assert dest_table.num_rows > 0 # validation 1  
    assert dest_table.num_rows == source_table.num_rows # validation 2

    print ("Source - table: {} row count {}".format(source_table.table_id,source_table.num_rows ))
    print ("Destination - table: {} row count {}".format(dest_table.table_id, dest_table.num_rows))
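As far as I can tell, newer releases of google-cloud-bigquery expose `client.list_tables` instead of `list_dataset_tables`, and both `list_tables` and `copy_table` accept plain "project.dataset" / "project.dataset.table" strings. Under that assumption, the loop above combined with the name filter from the question could look roughly like this. `copy_matching_tables` is a hypothetical helper; it takes the client as an argument so it does not depend on how you authenticate.

```python
import re


def copy_matching_tables(client, source_dataset, dest_dataset, pattern=".*"):
    """Copy every table whose ID matches `pattern` from one dataset to another.

    client: a google.cloud.bigquery.Client (assumed to provide list_tables
            and copy_table as described above).
    source_dataset, dest_dataset: strings of the form "project.dataset".
    Returns the list of destination table names that were copied.
    """
    regex = re.compile(pattern)
    copied = []
    for item in client.list_tables(source_dataset):
        if not regex.match(item.table_id):
            continue
        source = "{}.{}".format(source_dataset, item.table_id)
        destination = "{}.{}".format(dest_dataset, item.table_id)
        job = client.copy_table(source, destination)
        job.result()  # wait for the copy job to finish; raises if it failed
        copied.append(destination)
    return copied


# Example call (needs credentials, so it is left commented out):
# from google.cloud import bigquery
# copy_matching_tables(bigquery.Client(), "test.TestDataset", "proj1.test1", r"table1")
```

The row-count assertions from the script above can be added after `job.result()` if you want the same validation.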

Answer 5

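Assuming you want to copy most of the tables, you can first copy the entire BigQuery dataset and then delete the tables you do not want.

The copy dataset UI is similar to the copy table UI. Just click the "Copy Dataset" button on the source dataset and specify the destination dataset in the pop-up form. You can copy a dataset to another project or to another region. See the screenshots below for how to copy a dataset.

Copy dataset button

Copy dataset form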

Answer 6

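Dataset copy is now available in the BigQuery Data Transfer Service. In the BigQuery web console, select Transfers, fill in the source and destination details, and either run it on demand or schedule it at a specified interval.

Or simply run the following bq command (bq ships with the Google Cloud SDK):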

bq mk --transfer_config --project_id=[PROJECT_ID] --data_source=[DATA_SOURCE] --target_dataset=[DATASET] --display_name=[NAME] --params='[PARAMETERS]'
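To make the placeholders more concrete, here is a sketch that assembles that command for a dataset copy. The data source ID `cross_region_copy` and the parameter names `source_project_id`, `source_dataset_id` and `overwrite_destination_table` are what I recall from the dataset copy documentation, so treat them as assumptions and check them against the current docs. `dataset_copy_transfer_command` is a hypothetical helper, and the project and dataset names in the example are illustrative.

```python
import json
import shlex


def dataset_copy_transfer_command(source_project, source_dataset,
                                  dest_project, dest_dataset,
                                  display_name, overwrite=False):
    """Build the bq mk --transfer_config argument list for a dataset copy."""
    # Parameter names are assumed from the dataset copy docs; verify them.
    params = {
        "source_project_id": source_project,
        "source_dataset_id": source_dataset,
        "overwrite_destination_table": "true" if overwrite else "false",
    }
    return [
        "bq", "mk", "--transfer_config",
        "--project_id=" + dest_project,
        "--data_source=cross_region_copy",  # assumed data source ID
        "--target_dataset=" + dest_dataset,
        "--display_name=" + display_name,
        "--params=" + json.dumps(params),
    ]


cmd = dataset_copy_transfer_command(
    "project1", "dataset1", "project2", "dataset2", "Copy dataset1")
# Print a shell-quoted version that can be pasted into a terminal.
print(" ".join(shlex.quote(part) for part in cmd))
```

The list form can also be passed directly to `subprocess.run` if you prefer to launch the transfer from Python.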
