Question

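I am trying to convert an R data frame loaded in Databricks into a sparklyr data frame, but I think the copy_to function normally used for this cannot cope with the file size. The files I need to convert range from 780 MB to 4.7 GB.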

The code is:

chloedf<-copy_to(sc,Chloe)

and it returns the error:

Error in writeBin(utfVal, con, endian = "big", useBytes = TRUE) : Error in writeBin(utfVal, con, endian = "big", useBytes = TRUE) : 
  attempting to add too many elements to raw vector
Error in writeBin(utfVal, con, endian = "big", useBytes = TRUE) : 
  attempting to add too many elements to raw vector
In addition: Warning message:
closing unused connection 11 (raw())

Answer 1

It looks like copy_to() wasn't intended for large datasets.

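There are a few options here.

1. Save the original R data frame as a CSV rather than in RDS format. You can then read it directly into Spark with spark_read_csv(sc, "/path/to/mycsv.csv"). This is the simplest approach.
2. Try using SparkR::createDataFrame() instead.
3. Install Apache Arrow on your Databricks cluster and retry the copy_to() command. The answer links to instructions for setting this up.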
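
The CSV route could be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: the helper names export_csv and load_csv_into_spark, the table name "chloe", and the /dbfs/tmp and dbfs:/tmp paths are all hypothetical, and it assumes sc is an existing sparklyr connection and that the cluster exposes DBFS to local R file functions under /dbfs.

```r
# Write an R data frame to CSV so Spark can read the file itself,
# instead of serializing the whole object through copy_to().
# na = "" writes missing values as empty fields.
export_csv <- function(df, local_path) {
  write.csv(df, local_path, row.names = FALSE, na = "")
  invisible(local_path)
}

# Read the CSV straight into Spark and return a sparklyr table reference.
# `table_name` is the name the table is registered under in Spark.
load_csv_into_spark <- function(sc, spark_path, table_name) {
  sparklyr::spark_read_csv(
    sc,
    name = table_name,
    path = spark_path,
    header = TRUE,
    infer_schema = TRUE
  )
}

# Usage on Databricks (hypothetical paths; requires a live connection `sc`):
# export_csv(Chloe, "/dbfs/tmp/chloe.csv")
# chloedf <- load_csv_into_spark(sc, "dbfs:/tmp/chloe.csv", "chloe")

# Arrow route: once the arrow package is installed, attaching it before
# the call lets sparklyr use Arrow for the transfer.
# library(arrow)
# chloedf <- copy_to(sc, Chloe, overwrite = TRUE)
```

For files in the multi-gigabyte range, write.csv itself can be slow; a faster CSV writer may be worth substituting, though that is a design choice outside what the answer covers.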

Writing large RDS files to a sparklyr data frame - Databricks
