SparkSession CSV加载有时不起作用

时间:2019-04-28 18:02:16

标签: csv apache-spark

我有3个CSV文件,以;分隔

我能够使用SparkSession CSV加载程序读取其中两个文件,但是第三个文件会导致ArrayIndexOutOfBoundsException。当我手动检查这三个文件时,它们似乎都遵循完全相同的格式,因此我很难确定导致SparkSession的CSV加载程序失败的原因。看起来Spark还没有为我提供任何调试CSV加载器的方法。

这是适用于3个文件中2个的代码:

spark
    .read
    .format("csv")
    .option("delimiter", ";")
    .load("data/bx/BX-Users.csv")

堆栈跟踪的一部分:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 62
at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

BX-Users.csv的一部分:

"1";"nyc, new york, usa";"NULL"
"2";"stockton, california, usa";"18"
"3";"moscow, yukon territory, russia";"NULL"
"4";"porto, v.n.gaia, portugal";"17"

0 个答案:

没有答案