CSV read into Spark fails if the date column contains NULL values

Asked: 2016-07-08 11:36:02

Tags: r date csv apache-spark-sql sparkr

I am trying to read a CSV file into Spark (using SparkR) that contains this data row:

1|1998-01-01||

With Spark 1.6.2 (Hadoop 2.6) this gives me:

> head(sdf)
  id          d dtwo
1  1 1998-01-01   NA

The Spark 2.0 preview (Hadoop 2.7, rev. 14308) fails with this error:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
      at java.text.DateFormat.parse(DateFormat.java:357)
      at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
      at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
      at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
      at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
      at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
      at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      at scala.collection.Iterator$$anon$12.hasNext(Itera...

The problem really does seem to be the NULL value, because it works when there is a valid date in the third CSV column.

R code:

#Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6') 
Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

sc <-
    sparkR.init(
        master = "local",
        sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
    )
sqlContext <- sparkRSQL.init(sc)


# schema: integer id plus two date columns
st <- structType(structField("id", "integer"), structField("d", "date"), structField("dtwo", "date"))

# read the pipe-delimited file with the explicit schema, yyyy-MM-dd dates,
# and empty strings treated as NULL
sdf <- read.df(
    sqlContext,
    path = "d:/date_test.csv",
    source = "com.databricks.spark.csv",
    schema = st,
    inferSchema = "false",
    delimiter = "|",
    dateFormat = "yyyy-MM-dd",
    nullValue = "",
    mode = "PERMISSIVE"
)

head(sdf)

sparkR.stop()
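If it really is the empty date field that trips up the parser, one workaround I could probably use (an untested sketch on my part, reusing the same sqlContext and d:/date_test.csv file as above) is to read both date columns as plain strings and cast them to dates afterwards:

# Untested sketch of a possible workaround: read the date columns as strings
# so the empty third field simply becomes NA, then cast them to dates.
st_str <- structType(
    structField("id", "integer"),
    structField("d", "string"),
    structField("dtwo", "string")
)

sdf_str <- read.df(
    sqlContext,
    path = "d:/date_test.csv",
    source = "com.databricks.spark.csv",
    schema = st_str,
    delimiter = "|",
    nullValue = "",
    mode = "PERMISSIVE"
)

# cast() converts the string columns to DateType; NA values stay NA
sdf_str$d <- cast(sdf_str$d, "date")
sdf_str$dtwo <- cast(sdf_str$dtwo, "date")

head(sdf_str)

That avoids the date parser entirely, but it does not explain why the Spark 2.0 preview behaves differently from 1.6.2 here.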

Any idea what the problem is? Should a bug report be opened? (I'm fairly inexperienced with Spark, so I assume I may simply be doing something wrong...)

0 Answers:

There are no answers yet.