Question

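When I use the Spark SQL API to read data from HDFS, some tasks throw java.lang.ArrayIndexOutOfBoundsException. I suspect the dataset contains some bad records that cause the tasks to fail. How can I find the bad records? Or, how can I ignore them when loading the data through the Spark API, so that the application succeeds?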

The full error log from a failed task is posted below (it looks like a UTF-8 decoding error):

17/06/17 23:02:19 ERROR Executor: Exception in task 42.0 in stage 0.0 (TID 42)
java.lang.ArrayIndexOutOfBoundsException: 62
    at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:156)
    at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:171)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

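I looked up the UTF-8 encoding and the Spark source code (posted below). According to the UTF-8 encoding, a character is between 1 and 6 bytes long, so the largest valid leading byte is 11111101b. The 'offset' variable in the Spark source code must therefore be no larger than 11111101b - 192 = 61. An offset of 62, as reported in the exception, means the data must contain some bad records that are not legal UTF-8.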

So how can I select them? Or how can I skip the bad records?

private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 4, 4, 4,
    5, 5, 5, 5,
    6, 6};


  private static int numBytesForFirstByte(final byte b) {
    final int offset = (b & 0xFF) - 192;
    return (offset >= 0) ? bytesOfCodePointInUTF8[offset] : 1;
  }
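The arithmetic in the question can be checked in isolation. The following is a minimal sketch, not Spark's actual class: `LeadByteCheck` is a hypothetical name, and the table is a copy of the 62 entries quoted above. It shows that the leading byte 0xFD maps to the last table entry (index 61), while 0xFE yields offset 62 and overflows the table, which matches the index reported in the stack trace.

```java
class LeadByteCheck {
    // Copy of the 62-entry table quoted above; it covers leading bytes 0xC0..0xFD.
    private static final int[] BYTES_OF_CODE_POINT_IN_UTF8 = {
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        4, 4, 4, 4, 4, 4, 4, 4,
        5, 5, 5, 5,
        6, 6};

    // Same logic as the quoted method: no upper bound check on the offset.
    static int numBytesForFirstByte(final byte b) {
        final int offset = (b & 0xFF) - 192;
        return (offset >= 0) ? BYTES_OF_CODE_POINT_IN_UTF8[offset] : 1;
    }

    public static void main(String[] args) {
        // 0xFD = 253, offset 61: the last valid table index.
        System.out.println("0xFD -> " + numBytesForFirstByte((byte) 0xFD));
        try {
            // 0xFE = 254, offset 62: one past the end of the table.
            numBytesForFirstByte((byte) 0xFE);
        } catch (ArrayIndexOutOfBoundsException e) {
            // The message text differs between JDK versions, so it is only echoed here.
            System.out.println("0xFE -> ArrayIndexOutOfBoundsException: " + e.getMessage());
        }
    }
}
```

Under this reading, the "62" in the log points at a byte with value 0xFE appearing where a leading byte was expected.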

Answer 1

It seems (guessing from agg_doAggregateWithKeys) that you are using the typed Dataset API.

I'd recommend using Dataset.rdd to access the underlying RDD[InternalRow] and working with the UnsafeRows directly to see which strings could be causing the issue.

Don't touch any method that could use Encoders to convert the Dataset (that avoids using UTF8String for the conversion).
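Once the raw bytes of a suspicious string are in hand, one possible way to decide whether they are a bad record is a strict UTF-8 decode. The sketch below uses only the JDK; `Utf8Check` and `isValidUtf8` are hypothetical names, and how the raw bytes of each record are obtained from Spark or HDFS is an assumption that is not shown here. The check has to run on the raw bytes, because bytes that have already been decoded leniently into a Java String may have had the malformed sequences replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the whole byte array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = {0x61, (byte) 0xC3, (byte) 0xA9, 0x62}; // "aéb"
        byte[] bad = {0x61, (byte) 0xFE, 0x62};               // 0xFE is never valid in UTF-8
        System.out.println("good record valid: " + isValidUtf8(good));
        System.out.println("bad record valid:  " + isValidUtf8(bad));
    }
}
```

A predicate like this could be used either to collect the offending records for inspection or to filter them out before they reach any code path that relies on UTF8String.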

Answer 2

Try the following; setting the option multiline=true resolved this issue:

  val data = spark.read.option("header", "false").
      option("delimiter", "|").
      option("multiline", "true").
      csv("test.unl")

How to fix Spark unsafe.types.UTF8String.numBytesForFirstByte throwing java.lang.ArrayIndexOutOfBoundsException?
