I tried two approaches with the bigrquery package:
library(bigrquery)
library(DBI)
con <- dbConnect(
  bigrquery::bigquery(),
  project = "YOUR PROJECT ID HERE",
  dataset = "YOUR DATASET"
)
test <- dbGetQuery(con, sql, n = 10000, max_pages = Inf)
and
sql <- "YOUR LARGE QUERY HERE" # long query saved to a view; this is a SELECT from that view
tb <- bigrquery::bq_project_query(project, sql)
bq_table_download(tb, max_results = 1000)
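but both fail with the error "Error: Requested Resource Too Large to Return [responseTooLarge]".
This may be related to the question here, but I am interested in any tool that gets the job done: I have already tried the solutions outlined here, and they failed.
How can I load a large dataset from BigQuery into R?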
Answer 0: (score: 1)
I see that someone has created a way to make this easier. There is some setup involved, but you can download through the BigQuery Storage API like so:
## Auth is done automagically using Application Default Credentials.
## Use the following command once to set it up :
## gcloud auth application-default login --billing-project={project}
library(bigrquerystorage)
# TODO(developer): Set the project_id variable.
# project_id <- 'your-project-id'
#
# The read session is created in this project. This project can be
# different from that which contains the table.
rows <- bqs_table_download(
x = "bigquery-public-data:usa_names.usa_1910_current"
, parent = project_id
# , snapshot_time = Sys.time() # a POSIX time
, selected_fields = c("name", "number", "state")
, row_restriction = 'state = "WA"'
# , as_tibble = TRUE # FALSE : arrow, TRUE : arrow->as.data.frame
)
sprintf("Got %d unique names in states: %s",
length(unique(rows$name)),
paste(unique(rows$state), collapse = " "))
# Replace bigrquery::bq_table_download
library(bigrquery)
rows <- bigrquery::bq_table_download("bigquery-public-data.usa_names.usa_1910_current")
# Downloading 6,122,890 rows in 613 pages.
overload_bq_table_download(project_id)
rows <- bigrquery::bq_table_download("bigquery-public-data.usa_names.usa_1910_current")
# Streamed 6122890 rows in 5980 messages.
Answer 1: (score: 0)
As @hrbrmstr suggested, the documentation specifically mentions:
> #' @param page_size The number of rows returned per page. Make this smaller
> #'   if you have many fields or large records and you are seeing a
> #'   'responseTooLarge' error.
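The advice in that parameter description can be sketched as a retry loop that shrinks page_size until the download succeeds. This is only a minimal sketch: download_with_smaller_pages and its list of page sizes are hypothetical and not part of bigrquery, and it assumes the error message contains the text "responseTooLarge", as it does in the question.

```r
# Hypothetical helper (not part of bigrquery): retry a download with
# progressively smaller page sizes when the response is too large.
download_with_smaller_pages <- function(tb,
                                        page_sizes = c(10000, 5000, 1000, 100),
                                        download = bigrquery::bq_table_download) {
  for (size in page_sizes) {
    result <- tryCatch(
      download(tb, page_size = size),
      error = function(e) {
        # Only swallow the error this thread is about; re-raise anything else.
        if (!grepl("responseTooLarge", conditionMessage(e), fixed = TRUE)) {
          stop(e)
        }
        message("page_size = ", size, " was too large, trying a smaller value")
        NULL
      }
    )
    if (!is.null(result)) {
      return(result)
    }
  }
  stop("All page sizes failed with responseTooLarge: ",
       paste(page_sizes, collapse = ", "))
}

# Usage, assuming `project` and `sql` are defined as in the question:
# tb   <- bigrquery::bq_project_query(project, sql)
# rows <- download_with_smaller_pages(tb)
```

The download argument exists only so the helper can be tried out with a stand-in function; in normal use the default, bigrquery::bq_table_download, is what gets called.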
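In the documentation on r-project.org, the explanation of this function (page 13) gives further advice:
> This retrieves rows in chunks of page_size. It is most suitable for the results of smaller queries (<100 MB, say). For larger queries, it is better to export the results to a CSV file stored on Google Cloud and use the bq command line tool to download it locally.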
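The export route described in that quote might look roughly like the sketch below. It uses bigrquery::bq_table_save() for the export and the gsutil command-line tool for the local copy, which is swapped in here for the bq step the quote mentions. The bucket name, path prefix, local directory, and both helper functions are hypothetical, and the sketch assumes gsutil is installed and authenticated and that the bucket already exists.

```r
# Hypothetical helper: build a wildcard URI so that BigQuery can split a
# large export across several files.
gcs_export_uri <- function(bucket, prefix) {
  paste0("gs://", bucket, "/", prefix, "-*.csv")
}

# Sketch only: bucket, prefix and local_dir are placeholder values.
export_table_to_csv <- function(tb,
                                bucket = "your-bucket",
                                prefix = "bq-export/result",
                                local_dir = "bq-export") {
  uri <- gcs_export_uri(bucket, prefix)

  # Export the query result table to Cloud Storage as CSV shards.
  bigrquery::bq_table_save(tb, destination_uris = uri,
                           destination_format = "CSV")

  # Copy the shards to the local machine (assumes gsutil is on the PATH).
  dir.create(local_dir, showWarnings = FALSE, recursive = TRUE)
  system2("gsutil", c("-m", "cp", uri, local_dir))

  # Read every CSV file in local_dir and bind them into one data frame;
  # local_dir should not contain unrelated CSV files.
  files <- list.files(local_dir, pattern = "\\.csv$", full.names = TRUE)
  do.call(rbind, lapply(files, utils::read.csv))
}

# Usage, assuming `project` and `sql` are defined as in the question:
# tb   <- bigrquery::bq_project_query(project, sql)
# rows <- export_table_to_csv(tb, bucket = "your-bucket")
```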
Answer 2: (score: 0)
I have also just started using BigQuery. I think it should go something like this.
The current bigrquery release can be installed from CRAN:
install.packages("bigrquery")
The newest development version can be installed from GitHub:
install.packages('devtools')
devtools::install_github("r-dbi/bigrquery")
Usage
Low-level API
library(bigrquery)
billing <- bq_test_project() # replace this with your project ID
sql <- "SELECT year, month, day, weight_pounds FROM `publicdata.samples.natality`"
tb <- bq_project_query(billing, sql)
#> Auto-refreshing stale OAuth token.
bq_table_download(tb, max_results = 10)
DBI
library(DBI)
con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = billing
)
con
#> <BigQueryConnection>
#> Dataset: publicdata.samples
#> Billing: bigrquery-examples
dbListTables(con)
#> [1] "github_nested" "github_timeline" "gsod" "natality"
#> [5] "shakespeare" "trigrams" "wikipedia"
dbGetQuery(con, sql, n = 10)
dplyr
library(dplyr)
natality <- tbl(con, "natality")
natality %>%
select(year, month, day, weight_pounds) %>%
head(10) %>%
collect()
Answer 3: (score: 0)
This did the trick for me.
# Make page_size some value greater than the default (10000)
x <- 50000
bq_table_download(tb, page_size=x)
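Beware: if you set page_size to an arbitrarily high value (100000 in my case), you will start to see a lot of empty rows.
I have not yet found a good "rule of thumb" for what the right page_size value should be for a given table size.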
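One way to notice the empty-row problem before it causes trouble downstream is to count the rows in which every column is NA. A minimal sketch follows; count_empty_rows is a hypothetical helper, and the demo data frame is made up for illustration.

```r
# Hypothetical helper: count rows in which every column is NA.
count_empty_rows <- function(df) {
  sum(rowSums(!is.na(df)) == 0)
}

# Small self-contained demonstration with a made-up data frame.
demo <- data.frame(
  name  = c("Anna", NA, "Ben", NA),
  count = c(10L, NA, NA, NA)
)
count_empty_rows(demo)
#> [1] 2
```

Running it on the result of bq_table_download(tb, page_size = x) for a few values of x shows whether a given page size starts producing empty rows. Comparing nrow() of the download with bq_table_nrow(tb) is a further check worth trying.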