Question

I am trying to use spark_read_parquet to read a subset of columns from a "table":

temp <- spark_read_parquet(sc, name='mytable',columns=c("Col1","Col2"),
                                 path="/my/path/to/the/parquet/folder")

But I get the error:

Error: java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (54): .....

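Is my syntax correct? I searched for (real) code examples that use the columns argument, but could not find any.

(Also, my apologies... I really don't know how to give you a reproducible example that involves Spark and the cloud.)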

Answer 1

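TL;DR This is not how columns works. When supplied like this, its values are used to rename the columns, so its length must equal the number of columns in the input (54 here, according to the error message).

The way to do it is to read the table and then select the columns you need (note memory = FALSE, which is crucial for this to work correctly: it keeps the read lazy instead of eagerly caching the whole table):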

spark_read_parquet(
  sc, name = "mytable", path = "/tmp/foo", 
  memory = FALSE
) %>% select(Col1, Col2)

optionally followed by

... %>% 
  sdf_persist()

If you have the column names in a character vector, you can use rlang:

library(rlang)

cols <- c("Col1", "Col2")

# syms() converts the names to symbols; parse_quosure() is deprecated in newer rlang releases
spark_read_parquet(sc, name = "mytable", path = "/tmp/foo", memory = FALSE) %>% 
  select(!!!syms(cols))
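
With a recent dplyr (1.0.0 or later, which is an assumption about your setup), the same selection from a character vector can be written with all_of() instead of unquoting. The sketch below wraps it in a hypothetical helper, select_cols, which is not part of sparklyr, and runs it on a local data frame standing in for the Parquet table so that it works without a Spark connection:

```r
library(dplyr)

# Hypothetical helper (not part of sparklyr): keep only the columns named in
# a character vector. all_of() raises an error if any name is missing.
select_cols <- function(tbl, cols) {
  tbl %>% select(all_of(cols))
}

# Local stand-in for the Parquet-backed table, so the sketch runs without Spark.
local_tbl <- data.frame(
  Col1 = 1:3,
  Col2 = c("a", "b", "c"),
  Col3 = c(TRUE, FALSE, TRUE)
)

cols <- c("Col1", "Col2")
result <- select_cols(local_tbl, cols)
print(names(result))
```

The same helper should also accept the lazy table returned by spark_read_parquet(..., memory = FALSE), assuming your dbplyr version supports tidyselect helpers; I have only shown it on local data here.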
