Trying to read a CSV file into Spark (using SparkR) that contains this data row:
1|1998-01-01||
Spark 1.6.2 (Hadoop 2.6) gives me:
> head(sdf)
id d dtwo
1 1 1998-01-01 NA
Spark 2.0 preview (Hadoop 2.7, rev. 14308) fails with an error:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
    at java.text.DateFormat.parse(DateFormat.java:357)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
    at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
    at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
    at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
    at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Itera...
The problem really does seem to be the NULL value, because everything works if the third CSV column contains a valid date.
R code:
#Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6')
Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <-
  sparkR.init(
    master = "local",
    sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
  )
sqlContext <- sparkRSQL.init(sc)
st <- structType(structField("id", "integer"), structField("d", "date"), structField("dtwo", "date"))
sdf <- read.df(
  sqlContext,
  path = "d:/date_test.csv",
  source = "com.databricks.spark.csv",
  schema = st,
  inferSchema = "false",
  delimiter = "|",
  dateFormat = "yyyy-MM-dd",
  nullValue = "",
  mode = "PERMISSIVE"
)
head(sdf)
sparkR.stop()
Any idea what the problem is? Should I open a bug report? (I'm quite inexperienced with Spark, so I figure I may just be doing something wrong...)
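In case it helps others hitting the same error, one possible workaround (a sketch, untested against the 2.0 preview) is to declare the nullable column as a string in the schema, so the CSV parser never tries to parse the empty string as a date, and then cast it afterwards with SparkR's `cast`:

```r
# Workaround sketch: read "dtwo" as string so "" never reaches the date parser.
st2 <- structType(
  structField("id", "integer"),
  structField("d", "date"),
  structField("dtwo", "string")
)
sdf <- read.df(
  sqlContext,
  path = "d:/date_test.csv",
  source = "com.databricks.spark.csv",
  schema = st2,
  delimiter = "|",
  dateFormat = "yyyy-MM-dd",
  nullValue = "",
  mode = "PERMISSIVE"
)
# Empty strings become NA for the string column; cast then yields a date column.
sdf$dtwo <- cast(sdf$dtwo, "date")
head(sdf)
```

This sidesteps the parse error rather than fixing it, so it doesn't answer whether the `nullValue` handling for date columns is a regression worth reporting.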