sparklyr exception when reading a csv file with schema inference: double

Time: 2017-03-21 09:17:48

Tags: r apache-spark sparklyr

I am trying to read a csv into Spark using the spark_read_csv function. I get no exception when the schema is not inferred, but I do get one when I set infer_schema=TRUE.

spark_read_csv(sc,"myDf",DatasetUrl)

I get the following exception:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 151, localhost): java.text.ParseException: Unparseable number: "cr1_fd_dttm"
      at java.text.NumberFormat.parse(NumberFormat.java:385)
      at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$4.apply$mcD$sp(CSVInferSchema.scala:259)

However, when I set infer_schema=FALSE, everything is read as chr type, as expected.

This is what the data in the cr1_fd_dttm column looks like:

      cr1_fd_dttm
            <chr>
1             0.0
2   1.45679112E12
3   1.45679166E12
4   1.45679154E12
5   1.45679274E12
6             0.0
7             0.0
8             0.0
9             0.0
10  1.45679118E12

Can anyone help me?

Thanks

1 answer:

Answer 0: (score: 1)

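I read the file without loading it into memory right away, coerce the fields to numeric types, and only then load the results into memory. The key is to set memory to FALSE and infer_schema to FALSE, pass a list of columns, coerce them, and then use compute() to save the result into Spark memory. Here is a lengthy but working example: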

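library(sparklyr)
library(dplyr)

# `sc` is an existing Spark connection. Every column is mapped as a string
# and nothing is cached yet (memory = FALSE).
mapped_flights <- spark_read_csv(sc, "mapped_flights", 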
                      path =  "s3a://flights-data/full", 
                      memory = FALSE, 
                      infer_schema = FALSE,
                      columns = list(
                        Year = "character",
                        Month = "character",
                        DayofMonth = "character",
                        DayOfWeek = "character",
                        DepTime = "character",
                        CRSDepTime = "character",
                        ArrTime = "character",
                        CRSArrTime = "character",
                        UniqueCarrier = "character",
                        FlightNum = "character",
                        TailNum = "character",
                        ActualElapsedTime = "character",
                        CRSElapsedTime = "character",
                        AirTime = "character",
                        ArrDelay = "character",
                        DepDelay = "character",
                        Origin = "character",
                        Dest = "character",
                        Distance = "character",
                        TaxiIn = "character",
                        TaxiOut = "character",
                        Cancelled = "character",
                        CancellationCode = "character",
                        Diverted = "character",
                        CarrierDelay = "character",
                        WeatherDelay = "character",
                        NASDelay = "character",
                        SecurityDelay = "character",
                        LateAircraftDelay = "character")
                      )


# Coerce the string columns to their intended types, then save the
# result into Spark memory as "flights".
flights <- mapped_flights %>%
  mutate(
    Year = as.integer(Year),
    Month = as.integer(Month),
    DayofMonth = as.integer(DayofMonth),
    DayOfWeek = as.integer(DayOfWeek),
    DepTime = as.integer(DepTime),
    CRSDepTime = as.integer(CRSDepTime),
    CRSArrTime = as.integer(CRSArrTime),
    ArrTime = as.integer(ArrTime),
    ActualElapsedTime = as.integer(ActualElapsedTime),
    CRSElapsedTime = as.integer(CRSElapsedTime),
    AirTime = as.integer(AirTime),
    ArrDelay = as.double(ArrDelay),
    DepDelay = as.double(DepDelay),
    Distance = as.integer(Distance),
    TaxiIn = as.integer(TaxiIn),
    TaxiOut = as.integer(TaxiOut),
    Cancelled = as.integer(Cancelled),
    Diverted = as.integer(Diverted),
    CarrierDelay = as.integer(CarrierDelay),
    WeatherDelay = as.integer(WeatherDelay),
    NASDelay = as.integer(NASDelay),
    SecurityDelay = as.integer(SecurityDelay),
    LateAircraftDelay = as.integer(LateAircraftDelay)
  ) %>%
  compute("flights")
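
The same read-as-character-then-coerce pattern can be tried locally in base R, without Spark, on the sample values from the question. This is only a small sketch: the names csv_text, raw and typed are illustrative, and read.csv stands in for spark_read_csv. It suggests that cr1_fd_dttm should be coerced with as.double rather than as.integer, because the non-zero values are far larger than the maximum 32-bit integer.

```r
# Sample values copied from the cr1_fd_dttm column in the question.
csv_text <- paste(
  "cr1_fd_dttm",
  "0.0", "1.45679112E12", "1.45679166E12", "1.45679154E12", "1.45679274E12",
  "0.0", "0.0", "0.0", "0.0", "1.45679118E12",
  sep = "\n"
)

# Step 1: read every column as character, i.e. with no type inference.
raw <- read.csv(text = csv_text, colClasses = "character")

# Step 2: coerce explicitly. Double is used here because the non-zero
# values do not fit into a 32-bit integer.
typed <- transform(raw, cr1_fd_dttm = as.numeric(cr1_fd_dttm))

str(typed)

# Check whether the largest value exceeds the integer range.
max(typed$cr1_fd_dttm) > .Machine$integer.max
```

In the Spark pipeline above, the equivalent step for this column would be cr1_fd_dttm = as.double(cr1_fd_dttm) inside mutate(), in the same way that ArrDelay and DepDelay are handled.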