Question

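I found many answers online that use sparklyr or other Spark packages, but these require spinning up a Spark cluster, which is overhead I would like to avoid. In Python I can do this with "pandas.read_parquet" or with Apache Arrow. I am looking for something similar in R.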

Answer 1

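With reticulate you can use pandas from Python to read Parquet files. This saves you the hassle of running a Spark instance. You may lose some performance to serialization until Apache Arrow releases its own version, as mentioned in the comments above.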

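library(reticulate)
library(dplyr)

# Import the Python pandas module through reticulate
pandas <- import("pandas")

read_parquet <- function(path, columns = NULL) {

  # Expand "~" and resolve the path to an absolute one
  path <- path.expand(path)
  path <- normalizePath(path)

  # pandas expects a Python list of column names
  if (!is.null(columns)) columns <- as.list(columns)

  xdf <- pandas$read_parquet(path, columns = columns)

  xdf <- as.data.frame(xdf, stringsAsFactors = FALSE)

  # tbl_df() is deprecated in current dplyr; as_tibble() replaces it
  dplyr::as_tibble(xdf)

}

read_parquet(PATH_TO_PARQUET_FILE)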

Answer 2

You can simply use the arrow package:

install.packages("arrow")
library(arrow)
read_parquet("myfile.parquet")
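
If only some columns are needed, the arrow package's read_parquet also accepts a col_select argument, much like the columns argument in the pandas approach. The following is a minimal sketch, assuming arrow is installed; the temporary file and the column names (id, value, label) are made up for illustration and are not from the original question.

```r
library(arrow)

# Hypothetical example data, written to a temporary Parquet file
# so that the sketch is self-contained.
tmp <- tempfile(fileext = ".parquet")
example_df <- data.frame(
  id = 1:3,
  value = c(0.5, 1.5, 2.5),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
write_parquet(example_df, tmp)

# Read back only the columns that are needed.
subset_df <- read_parquet(tmp, col_select = c("id", "value"))
print(names(subset_df))
```

Selecting columns at read time avoids loading the rest of the file into memory, which can matter for wide Parquet files.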

How can I read a Parquet file in R without using Spark packages?