How to read a BigQuery table from Java Spark using the BigQuery connector

Asked: 2019-01-14 13:58:31

Tags: java apache-spark google-bigquery google-cloud-dataproc

I am trying to read a BigQuery table from Spark Java code like the following:

    BigQuerySQLContext bqSqlCtx = new BigQuerySQLContext(sqlContext);
    bqSqlCtx.setGcpJsonKeyFile("sxxxl-gcp-1x4c0xxxxxxx.json");
    bqSqlCtx.setBigQueryProjectId("winged-standard-2xxxx");
    bqSqlCtx.setBigQueryDatasetLocation("asia-east1");
    bqSqlCtx.setBigQueryGcsBucket("dataproc-9cxxxxx39-exxdc-4e73-xx07-2258xxxx4-asia-east1");
    Dataset<Row> testds = bqSqlCtx.bigQuerySelect("select * from bqtestdata.customer_visits limit 100");

But I am running into the following error:

19/01/14 10:52:01 WARN org.apache.spark.sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
19/01/14 10:52:01 INFO com.samelamin.spark.bigquery.BigQueryClient: Executing query select * from bqtestdata.customer_visits limit 100
19/01/14 10:52:02 INFO com.samelamin.spark.bigquery.BigQueryClient: Creating staging dataset winged-standard-2xxxxx:spark_bigquery_staging_asia-east1

Exception in thread "main" java.util.concurrent.ExecutionException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 

400 Bad Request
{
  "code" : 400,
  "errors" : 
[ {
    "domain" : "global",
    "message" : "Invalid dataset ID \"spark_bigquery_staging_asia-east1\". Dataset IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long.",
    "reason" : "invalid"
  } ],
  "message" : "Invalid dataset ID \"spark_bigquery_staging_asia-east1\". Dataset IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long.",
  "status" : "INVALID_ARGUMENT"
}

2 Answers:

Answer 0 (score: 1)

The message in the response

Dataset IDs must be alphanumeric (plus underscores)...

indicates that the dataset ID "spark_bigquery_staging_asia-east1" is invalid because it contains hyphens, which come from the location "asia-east1".
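To illustrate the rule the error cites, here is a standalone sketch (not part of the library) that checks a dataset ID against BigQuery's documented pattern of letters, digits, and underscores, up to 1024 characters:

```java
public class DatasetIdCheck {
    // BigQuery dataset IDs may contain only letters, digits, and
    // underscores, and may be at most 1024 characters long.
    static boolean isValidDatasetId(String id) {
        return id.matches("[A-Za-z0-9_]{1,1024}");
    }

    public static void main(String[] args) {
        // The staging dataset name the library generates embeds the
        // dataset location, and "asia-east1" brings a hyphen into the ID.
        System.out.println(isValidDatasetId("spark_bigquery_staging_asia-east1")); // false
        System.out.println(isValidDatasetId("spark_bigquery_staging_us"));         // true
    }
}
```

This is why the same code works for single-word locations such as US or EU but fails for regional locations whose names contain a hyphen.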

Answer 1 (score: 0)

I had a similar problem with samelamin's Scala library. Apparently it stems from the library not handling locations other than the US and EU, so it cannot access datasets in asia-east1.

For now, I am using the BigQuery Spark Connector to load data from and write data to BigQuery.

If you manage to get this library working, please share your solution as well.
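For reference, a minimal sketch of reading the same table with the Google spark-bigquery connector instead. This assumes the connector jar (e.g. spark-bigquery-with-dependencies) is on the classpath and that the cluster has BigQuery credentials; the project, dataset, and table names are taken from the question and are partly redacted:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BigQueryConnectorRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("bigquery-read")
                .getOrCreate();

        // The connector reads the table directly rather than running a
        // query through a staging dataset, so the hyphenated location
        // name never becomes part of a dataset ID.
        Dataset<Row> visits = spark.read()
                .format("bigquery")
                .option("table", "winged-standard-2xxxx.bqtestdata.customer_visits")
                .load()
                .limit(100);

        visits.show();
        spark.stop();
    }
}
```

Note that this reads the whole table and limits it on the Spark side; running an arbitrary SQL query through the connector has additional requirements, so check the connector's documentation if you need query pushdown.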