Reading ORC files from S3 in R

Date: 2017-03-22 15:04:15

Tags: r amazon-s3 sparkr orc

We will be running an EMR cluster on AWS (with spot instances) on top of an S3 bucket. The data will be stored in this bucket in ORC format. However, we would also like to read that same data with R, in some kind of sandbox environment.

I already have the aws.s3 (cloudyr) package running correctly: I can read CSV files without any problem, but it does not seem to let me turn ORC files into anything readable.
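For reference, the CSV case works along these lines with aws.s3 (bucket and object names are just placeholders); there is no comparable reader for ORC, which is the problem:

# Rough sketch of the working CSV path with aws.s3 (cloudyr);
# "bucketname" and "filename.csv" are placeholders
library(aws.s3)
Sys.setenv("AWS_ACCESS_KEY_ID" = "myKeyID",
           "AWS_SECRET_ACCESS_KEY" = "myKey",
           "AWS_DEFAULT_REGION" = "myRegion")
csv_df <- s3read_using(read.csv, object = "filename.csv", bucket = "bucketname")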

The two options I found online are:
- SparkR
- dataconnector (vertica)

Since installing dataconnector on a Windows machine was a problem, I installed SparkR instead, and I can now read a local ORC file (local R on my machine, local ORC file on my machine). However, if I use read.orc, it normalizes my path to a local path by default. So, digging into the source code, I ran the following:

# Replicate the internals of SparkR::read.orc, skipping the
# normalizePath() call so the S3 path is passed through unchanged
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", my_path)

But I got the following error:

Error: Error in orc : java.io.IOException: No FileSystem for scheme: https

Can someone help me with this, or point me to an alternative way of loading ORC files from S3?

1 answer:

Answer 0 (score: 2)

Edited answer: you can now read directly from S3, instead of first downloading the file to the local file system and reading it from there.
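For comparison, the earlier download-then-read approach looks roughly like this (a sketch only; the object, bucket, and local file names are placeholders, and it assumes a running Spark session so that read.orc is available):

# Old approach: download the ORC file with aws.s3, then read the local copy
library(aws.s3)
library(SparkR)
sparkR.session(master = "local")   # or reuse an existing session
save_object(object = "filename.orc", bucket = "bucketname", file = "filename.orc")
local_df <- read.orc("filename.orc")
head(local_df)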

As requested by mrjoseph: a possible solution via SparkR (which I did not want to do in the first place).

# Set the System environment variable to where Spark is installed
Sys.setenv(SPARK_HOME="pathToSpark")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "org.apache.hadoop:hadoop-aws:2.7.1" "sparkr-shell"')

# Set the library path to include path to SparkR
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))

# Set system environments to be able to load from S3
Sys.setenv("AWS_ACCESS_KEY_ID" = "myKeyID", "AWS_SECRET_ACCESS_KEY" = "myKey", "AWS_DEFAULT_REGION" = "myRegion")

# load required packages
library(aws.s3)
library(SparkR)

## Create a spark context and a sql context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Set path to file
path <- "s3n://bucketname/filename.orc"

# Set hadoop configuration
hConf <- SparkR:::callJMethod(sc, "hadoopConfiguration")
SparkR:::callJMethod(hConf, "set", "fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsAccessKeyId", "myAccessKey")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsSecretAccessKey", "mySecretKey")

# Slight adaptation to read.orc function
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
# Not required: path <- normalizePath(path)
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", path)
temp <- SparkR:::dataFrame(sdf)

# Read first lines
head(temp)
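On SparkR 2.x the same setup can also be expressed through sparkR.session(), passing the Hadoop S3 settings as spark.hadoop.* options and using the s3a filesystem instead of s3n. This is an untested sketch with placeholder credentials and bucket/file names:

# Sketch: SparkR 2.x session with hadoop-aws on the classpath and s3a credentials
library(SparkR)
sparkR.session(master = "local",
               sparkPackages = "org.apache.hadoop:hadoop-aws:2.7.1",
               sparkConfig = list("spark.hadoop.fs.s3a.access.key" = "myAccessKey",
                                  "spark.hadoop.fs.s3a.secret.key" = "mySecretKey"))

# The internal read path from the answer works unchanged with an s3a URL
path <- "s3a://bucketname/filename.orc"
sparkSession <- SparkR:::getSparkSession()
read <- SparkR:::callJMethod(sparkSession, "read")
sdf <- SparkR:::handledCallJMethod(read, "orc", path)
temp <- SparkR:::dataFrame(sdf)
head(temp)

As before, head() only pulls the first rows back into R; collect(temp) would materialize the whole table as a local data.frame.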