Question

我目前正在编写一个软件，用于导出大量BigQuery数据并将查询结果作为CSV文件存储在本地。我使用Python 3和谷歌提供的客户端。我做了配置和验证，但问题是，我无法在本地存储数据。每次执行时，我都会收到错误消息：

googleapiclient.errors.HttpError：https：//www.googleapis.com/bigquery/v2/projects/round-office-769/jobs ?alt = json返回“无效的提取目标URI”响应/文件名 - * .csv'。必须是有效的Google存储路径。“＆gt;

这是我的工作配置：

def export_table(service, cloud_storage_path,
             projectId, datasetId, tableId, sqlQuery,
             export_format="CSV",
             num_retries=5):

# Generate a unique job_id so retries
# don't accidentally duplicate export
job_data = {
    'jobReference': {
        'projectId': projectId,
        'jobId': str(uuid.uuid4())
    },
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': projectId,
                'datasetId': datasetId,
                'tableId': tableId,
            },
            'destinationUris': ['response/file-name-*.csv'],
            'destinationFormat': export_format
        },
        'query': {
            'query': sqlQuery,
        }
    }
}
return service.jobs().insert(
    projectId=projectId,
    body=job_data).execute(num_retries=num_retries)

我希望我可以使用本地路径而不是云存储来存储数据，但我错了。

所以我的问题是：

我可以在本地（或本地数据库）下载查询数据，还是必须使用Google云端存储？

Answer 1

您需要将Google云端存储用于导出作业。从here解释了从BigQuery导出数据，还要检查不同路径语法的变体。

然后，您可以将文件从GCS下载到本地存储。

Gsutil工具可以帮助您进一步将文件从GCS下载到本地计算机。

您无法在本地一次下载，首先需要导出到GCS，而不是转移到本地计算机。

Answer 2

您可以使用分页机制直接下载所有数据（无需通过Google云端存储路由）。基本上，您需要为每个页面生成页面标记，下载页面中的数据并重复此操作，直到所有数据都已下载，即不再有可用的标记。这是Java中的示例代码，希望澄清这个想法：

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.client.util.Data;
import com.google.api.services.bigquery.model.*;

/* your class starts here */

private String projectId = ""; /* fill in the project id here */
private String query = ""; /* enter your query here */
private Bigquery bigQuery;
private Job insert;
private TableDataList tableDataList;
private Iterator<TableRow> rowsIterator;
private List<TableRow> rows;
private long maxResults = 100000L; /* max number of rows in a page */

/* run query */
public void open() throws Exception {
    HttpTransport transport = GoogleNetHttpTransport.newTrustedTransport();
    JsonFactory jsonFactory = new JacksonFactory();
    GoogleCredential credential = GoogleCredential.getApplicationDefault(transport, jsonFactory);
    if (credential.createScopedRequired())
        credential = credential.createScoped(BigqueryScopes.all());
    bigQuery = new Bigquery.Builder(transport, jsonFactory, credential).setApplicationName("my app").build();

    JobConfigurationQuery queryConfig = new JobConfigurationQuery().setQuery(query);
    JobConfiguration jobConfig = new JobConfiguration().setQuery(queryConfig);
    Job job = new Job().setConfiguration(jobConfig);
    insert = bigQuery.jobs().insert(projectId, job).execute();
    JobReference jobReference = insert.getJobReference();

    while (true) {
        Job poll = bigQuery.jobs().get(projectId, jobReference.getJobId()).execute();
        String state = poll.getStatus().getState();
        if ("DONE".equals(state)) {
            ErrorProto errorResult = poll.getStatus().getErrorResult();
            if (errorResult != null)
                throw new Exception("Error running job: " + poll.getStatus().getErrors().get(0));
            break;
        }
        Thread.sleep(10000);
    }

    tableDataList = getPage();
    rows = tableDataList.getRows();
    rowsIterator = rows != null ? rows.iterator() : null;
}

/* read data row by row */
public /* your data object here */ read() throws Exception {
    if (rowsIterator == null) return null;

    if (!rowsIterator.hasNext()) {
        String pageToken = tableDataList.getPageToken();
        if (pageToken == null) return null;
        tableDataList = getPage(pageToken);
        rows = tableDataList.getRows();
        if (rows == null) return null;
        rowsIterator = rows.iterator();
    }

    TableRow row = rowsIterator.next();
    for (TableCell cell : row.getF()) {
        Object value = cell.getV();
        /* extract the data here */
    }

    /* return the data */
}

private TableDataList getPage() throws IOException {
    return getPage(null);
}

private TableDataList getPage(String pageToken) throws IOException {
    TableReference sourceTable = insert
            .getConfiguration()
            .getQuery()
            .getDestinationTable();
    if (sourceTable == null)
        throw new IllegalArgumentException("Source table not available. Please check the query syntax.");
    return bigQuery.tabledata()
            .list(projectId, sourceTable.getDatasetId(), sourceTable.getTableId())
            .setPageToken(pageToken)
            .setMaxResults(maxResults)
            .execute();
}

Answer 3

You can run a tabledata.list() operation on that table and set "alt=csv" which will return the beginning of the table as CSV.

Answer 4

另一种方法是通过用户界面，一旦返回查询结果，您就可以选择“下载为CSV”按钮。

Answer 5

如果您安装了Google BigQuery API和pandas以及pandas.io，您可以在Jupyter笔记本中运行Python，查询BQ表，并将数据导入本地数据帧。从那里，您可以将其写入CSV。

Answer 6

Mikhail Berlyant说，

BigQuery不提供直接导出/下载查询的功能结果保存到GCS或本地文件中。

您仍然可以通过三个步骤使用Web UI导出它

配置查询以将结果保存到BigQuery表中并运行它。
将表格导出到GCS中的存储桶。
从存储桶中下载。

要确保成本低廉，只需确保在将内容导出到GCS后删除表，并在将文件下载到计算机后从存储桶和存储桶中删除内容。

步骤1

在BigQuery屏幕中，运行查询之前，请转到“更多>查询设置”

这将打开以下内容

您想在这里

目标：设置查询结果的目标表
项目名称：选择项目。
数据集名称：选择一个数据集。如果您没有，请创建它并回来。
表名：提供您想要的任何名称（必须仅包含字母，数字或下划线）。
结果大小：允许较大的结果（没有大小限制）。

然后将其保存，并将查询配置为保存在特定表中。现在您可以运行查询。

步骤2

要将其导出到GCP，您必须转到表格，然后点击导出>导出到GCS。

这将打开以下屏幕

在选择GCS位置中，定义存储区，文件夹和文件。

例如，您有一个名为 daria_bucket 的存储桶（仅使用小写字母，数字，连字符（-）和下划线（_）。点（。）可以用于形成一个有效的域名。），并想将文件保存到名称为 test 的存储桶的根目录中，然后编写（在“选择GCS位置”中）

daria_bucket/test.csv

如果文件太大（超过1 GB），则会出现错误。要修复它，您必须使用通配符将其保存在更多文件中。因此，您需要添加*，就像这样

daria_bucket/test*.csv

这将在表daria_bucket的内部存储从表中提取的所有数据到多个文件中，这些文件名为test000000000000，test000000000001，test000000000002，... testX。

步骤3

然后转到存储，您将看到存储桶。

深入其中，您会找到一个（或多个）文件。然后您可以从那里下载。

Answer 7

使用Python熊猫将数据从BigQuery表导出到CSV文件：

import pandas as pd
from google.cloud import bigquery

selectQuery = """SELECT * FROM dataset-name.table-name"""
bigqueryClient = bigquery.Client()
df = bigqueryClient.query(selectQuery).to_dataframe()
df.to_csv("file-name.csv", index=False)

Answer 8

也许您可以使用Google提供的simba odbc驱动程序，也可以使用任何提供odbc连接的工具来创建csv。甚至可以是Microsoft sis，甚至不需要编写代码。

在不使用Google云端存储的情况下将BigQuery数据导出为CSV

8 个答案:

步骤1

步骤2

步骤3