Reading data from an S3 bucket with the tidyverse

Asked: 2020-04-07 22:38:37

Tags: r amazon-s3 amazon-sagemaker readr

I'm trying to read a .csv file stored in an S3 bucket, but I'm getting an error. I'm following the instructions here, but either it doesn't work or I'm making a mistake and can't see what I'm doing wrong.

This is what I'm trying to do:

# I'm working on a SageMaker notebook instance
library(reticulate)
library(tidyverse)

sagemaker <- import('sagemaker')
sagemaker.session <- sagemaker$Session()

region <- sagemaker.session$boto_region_name
bucket <- "my-bucket"
prefix <- "data/staging"
bucket.path <- sprintf("https://s3-%s.amazonaws.com/%s", region, bucket)
role <- sagemaker$get_execution_role()

client <- sagemaker.session$boto_session$client('s3')
key <- sprintf("%s/%s", prefix, 'my_file.csv')

my.obj <- client$get_object(Bucket=bucket, Key=key)

my.df <- read_csv(my.obj$Body) # This is where it all breaks down:
## 
## Error: `file` must be a string, raw vector or a connection.
## Traceback:
## 
## 1. read_csv(my.obj$Body)
## 2. read_delimited(file, tokenizer, col_names = col_names, col_types = col_types, 
##  .     locale = locale, skip = skip, skip_empty_rows = skip_empty_rows, 
##  .     comment = comment, n_max = n_max, guess_max = guess_max, 
##  .     progress = progress)
## 3. col_spec_standardise(data, skip = skip, skip_empty_rows = skip_empty_rows, 
##  .     comment = comment, guess_max = guess_max, col_names = col_names, 
##  .     col_types = col_types, tokenizer = tokenizer, locale = locale)
## 4. datasource(file, skip = skip, skip_empty_rows = skip_empty_rows, 
##  .     comment = comment)
## 5. stop("`file` must be a string, raw vector or a connection.", 
##  .     call. = FALSE)
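A note on the traceback: `read_csv()` accepts only a string, a raw vector, or a connection, and the `Body` field returned by boto3's `get_object()` is a botocore `StreamingBody`, which is none of those. Since reticulate converts Python `bytes` to an R raw vector, calling `$read()` on the body should yield something `read_csv()` can parse. A minimal local sketch, with the S3 bytes simulated by `charToRaw()`:

```r
library(readr)

# Simulated bytes: in the code above this would be my.obj$Body$read(),
# which reticulate converts from Python `bytes` to an R raw vector
csv_bytes <- charToRaw("x,y\n1,2\n3,4\n")

# read_csv() accepts a raw vector directly
df <- read_csv(csv_bytes)
df
```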

In Python, I can read a CSV file like this:

import pandas as pd
# ... Lots of boilerplate code
my_data = pd.read_csv(client.get_object(Bucket=bucket, Key=key)['Body'])

This is very similar to what I'm trying to do in R, and it works in Python... so why doesn't it work in R?

Can you point me in the right direction?

Note: although I could use a Python kernel for this, I'd rather stick with R, since I'm more fluent in it than in Python, at least when it comes to crunching data frames.

1 answer:

Answer 0 (score: 1):

I suggest using the aws.s3 package instead:

https://github.com/cloudyr/aws.s3

It's very straightforward: set the environment variables:

Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
           "AWS_SECRET_ACCESS_KEY" = "mysecretkey",
           "AWS_DEFAULT_REGION" = "us-east-1",
           "AWS_SESSION_TOKEN" = "mytoken")

and then, once that's in place:

aws.s3::s3read_using(read.csv, object = "s3://bucket/folder/data.csv")

UPDATE: I see you're already familiar with boto and tried using reticulate, so I'll also leave this lightweight wrapper here: https://github.com/cloudyr/roto.s3

It has a great API, e.g. for the variable layout you're aiming for:

download_file(
  bucket = "is.rud.test", 
  key = "mtcars.csv", 
  filename = "/tmp/mtcars-again.csv", 
  profile_name = "personal"
)

read_csv("/tmp/mtcars-again.csv")
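The download-then-read pattern above can be sketched end to end without touching S3 (the `download_file()` step is simulated here by writing a temp file; in practice that file would come from `roto.s3::download_file()` or `aws.s3::save_object()`):

```r
library(readr)

# Stand-in for the S3 download: write a small CSV to a temp path
path <- tempfile(fileext = ".csv")
write_file("mpg,cyl\n21.0,6\n22.8,4\n", path)

# Once the file is local, reading it is plain readr
df <- read_csv(path)
nrow(df)  # 2
```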