I have a deliberately pruned directory structure in S3 that causes read.parquet() to fail with the following error:
Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths
s3a://leftout/for/security/dashboard/updateddate=20170217
s3a://leftout/for/security/dashboard/updateddate=20170218
The (verbose) error goes on to tell me:
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table.
However, I can't find any documentation on how to do this with SparkR::read.parquet(...). Does anyone know how to do this in R (using SparkR)?
> version
platform x86_64-redhat-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 2.2
year 2015
month 08
day 14
svn rev 69053
language R
version.string R version 3.2.2 (2015-08-14)
nickname Fire Safety
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2016.09
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.6.0 SparkR_2.0.2 DT_0.2 jsonlite_1.2 shinythemes_1.1.1 ggthemes_3.3.0
[7] dplyr_0.5.0 ggplot2_2.2.1 leaflet_1.0.1 shiny_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 magrittr_1.5 munsell_0.4.3 colorspace_1.3-2 xtable_1.8-2 R6_2.2.0
[7] stringr_1.1.0 plyr_1.8.4 tools_3.2.2 grid_3.2.2 gtable_0.2.0 DBI_0.5-1
[13] sourcetools_0.1.5 htmltools_0.3.5 yaml_2.1.14 lazyeval_0.2.0 digest_0.6.12 assertthat_0.1
[19] tibble_1.2 htmlwidgets_0.8 mime_0.5 stringi_1.1.2 scales_0.4.1 httpuv_1.3.3
Answer 0 (score: 2)
In Spark 2.1 or later you can pass basePath as a named argument:
read.parquet(path, basePath="s3a://leftout/for/security/dashboard/")
Arguments captured by the ellipsis (...) are converted with varargsToStrEnv and used as options.
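Since that conversion may not be obvious, here is a rough sketch of what it amounts to, written in Python for illustration only: each named argument is coerced to a string and collected into a key/value map that the underlying reader consumes as options. The function below is a simplified stand-in, not SparkR's actual varargsToStrEnv implementation.

```python
# Simplified stand-in (illustration only) for SparkR's varargsToStrEnv:
# named arguments captured by "..." become a map of string-valued options.
def varargs_to_str_env(**kwargs):
    """Coerce every named argument to a string, since Spark options are strings."""
    return {key: str(value) for key, value in kwargs.items()}

opts = varargs_to_str_env(basePath="s3a://bucket/table/")
print(opts)  # {'basePath': 's3a://bucket/table/'}
```

This is why basePath can be passed directly as a named argument: it simply flows through as one more option.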
Full session example.
Write data (Scala):
Seq(("a", 1), ("b", 2)).toDF("k", "v")
.write.partitionBy("k").mode("overwrite").parquet("/tmp/data")
Read data (SparkR):
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
SparkSession available as 'spark'.
> paths <- dir("/tmp/data/", pattern="*parquet", full.names=TRUE, recursive=TRUE)
> read.parquet(paths, basePath="/tmp/data")
SparkDataFrame[v:int, k:string]
In contrast, without basePath:
> read.parquet(paths)
SparkDataFrame[v:int]
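To see why basePath changes the result, recall how partition discovery works: Spark parses key=value directory segments relative to a base path. The sketch below is a hypothetical Python illustration of that idea (not Spark's actual code): with a base path, the k=a / k=b segments are visible and yield a partition column, but when only leaf directories are passed and no base path is set, each path acts as its own root and the partition column is lost.

```python
# Illustration (not Spark's actual code): how "basePath" affects which
# key=value directory segments are interpreted as partition columns.
import os

def inferred_partitions(paths, base_path=None):
    """Return the partition column names inferred from a list of file paths."""
    columns = set()
    for p in paths:
        if base_path is not None:
            # Only segments below the base path count as partitions.
            rel = os.path.relpath(p, base_path)
        else:
            # Without a base path, each input path is its own root, so the
            # k=... segment of the path itself is never scanned.
            rel = os.path.basename(p)
        for segment in rel.split(os.sep):
            if "=" in segment:
                key, _, _ = segment.partition("=")
                columns.add(key)
    return sorted(columns)

paths = ["/tmp/data/k=a/part-0.parquet", "/tmp/data/k=b/part-0.parquet"]
print(inferred_partitions(paths, base_path="/tmp/data"))  # ['k']
print(inferred_partitions(paths))                         # []
```

This mirrors the session above: with basePath the frame has columns [v, k]; without it, only [v].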
Answer 1 (score: 0)
This is as close as I've come. From the source code:
read.parquet.default <- function(path, ...) {
sparkSession <- getSparkSession()
options <- varargsToStrEnv(...)
  # Allow the user to have a more flexible definition of the Parquet file path
paths <- as.list(suppressWarnings(normalizePath(path)))
read <- callJMethod(sparkSession, "read")
read <- callJMethod(read, "options", options)
sdf <- handledCallJMethod(read, "parquet", paths)
dataFrame(sdf)
}
This approach is also shown here, but it too throws an unused argument error:
read.parquet(..., options=c(basePath="foo"))