Question

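I am trying to copy a large dataset into Spark using spark_read_csv, but I get the following error as output:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 4 times, most recent failure: Lost task 0.3 in stage 16.0 (TID 176, 10.1.2.235): java.lang.IllegalArgumentException: requirement failed: Decimal precision 8 exceeds max precision 7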

data_tbl <- spark_read_csv(sc, "data", "D:/base_csv", delimiter = "|", overwrite = TRUE)

This is a large dataset, about 5.8 million records, and it contains columns of types Int, num, and chr.

Answer 1

I think you have a couple of options, depending on which version of Spark you are using.

Spark >= 1.6.1

From here: https://docs.databricks.com/spark/latest/sparkr/functions/read.df.html it looks like you can explicitly specify your schema to force the column to be read as a double:

csvSchema <- structType(structField("carat", "double"), structField("color", "string"))
diamondsLoadWithSchema<- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                                 source = "csv", header="true", schema = csvSchema)
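Since the question uses sparklyr rather than SparkR, the same idea can be sketched there as well. spark_read_csv accepts a `columns` argument (a named vector of column types), and with `infer_schema = FALSE` the types are taken from that vector instead of being guessed from the data. The column names and types below, and the helper name read_with_schema, are hypothetical placeholders and not taken from the question; treat this as a minimal sketch, not a tested fix for the asker's file.

```r
# Hypothetical column layout: replace the names and types with those of the real CSV.
column_types <- c(id = "integer", label = "character", amount = "double")

# Hypothetical helper: reads a pipe-delimited CSV with an explicit schema,
# so that numeric columns are read as doubles instead of inferred decimals.
read_with_schema <- function(sc, path) {
  sparklyr::spark_read_csv(
    sc,
    name = "data",
    path = path,
    delimiter = "|",
    columns = column_types,   # explicit column types
    infer_schema = FALSE,     # do not let Spark guess the types
    overwrite = TRUE
  )
}

# Usage, assuming `sc` is an open sparklyr connection:
# data_tbl <- read_with_schema(sc, "D:/base_csv")
```

The vector must list every column in file order, so for a wide file it may be easier to build it programmatically from a small sample read in plain R.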

Spark < 1.6.1

Consider test.csv:

1,a,4.1234567890
2,b,9.0987654321

You could easily make this more efficient, but I think you get the gist:

# Split one line of text into its comma-separated fields
linesplit <- function(x) {
  tmp <- strsplit(x, ",")
  return(tmp)
}

# Convert the fields of one line to integer, character, and double
lineconvert <- function(x) {
  arow <- x[[1]]
  converted <- list(as.integer(arow[1]), as.character(arow[2]), as.double(arow[3]))
  return(converted)
}

# Read the file as raw text, then parse each line by hand
rdd <- SparkR:::textFile(sc, '/path/to/test.csv')
lnspl <- SparkR:::map(rdd, linesplit)
ll2 <- SparkR:::map(lnspl, lineconvert)
ddf <- createDataFrame(sqlContext, ll2)
head(ddf)

  _1 _2           _3
1  1  a 4.1234567890
2  2  b 9.0987654321
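The per-line conversion above can be checked in plain R before running it on a cluster. The following is a minimal sketch that applies the same split-and-convert logic to the two sample lines of test.csv using lapply instead of the private SparkR map; the function name parse_line is an assumption made for illustration.

```r
# The two sample lines from test.csv
lines <- c("1,a,4.1234567890", "2,b,9.0987654321")

# Hypothetical helper combining the split and convert steps for one line
parse_line <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  list(
    as.integer(fields[1]),    # first field as integer
    as.character(fields[2]),  # second field as character
    as.double(fields[3])      # third field as double, avoiding a decimal type
  )
}

rows <- lapply(lines, parse_line)

# Inspect the types of the first parsed row
str(rows[[1]])
```

Because the third field is converted with as.double, the value is carried as a double rather than as a fixed-precision decimal, which is the point of the workaround.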

Note: the SparkR::: methods are private for a reason; the docs say "be careful when you use this".

Sparklyr - Decimal precision 8 exceeds max precision 7
