I am trying to write a Spark DataFrame as an ORC file and it throws the error below. I am getting an IndexOutOfBoundsException.
Log:
Caused by: org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:270)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 116, Size: 116
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at org.apache.hadoop.hive.ql.io.orc.OrcStruct$OrcStructInspector.<init>(OrcStruct.java:196)
at org.apache.hadoop.hive.ql.io.orc.OrcStruct.createObjectInspector(OrcStruct.java:549)
at org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:109)
at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:188)
at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:231)
at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:91)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.org$apache$spark$sql$execution$datasources$FileFormatWriter$DynamicPartitionWriteTask$$newOutputWriter(FileFormatWriter.scala:416)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:449)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:438)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator.foreach(AbstractScalaRowIterator.scala:26)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:438)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
... 8 more
Answer 0 (score: 1)
Can you add more details about how you are trying to write the ORC file?
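For instance, whether it is a plain save or a partitioned write matters here: the DynamicPartitionWriteTask frames in your stack trace suggest a partitioned (or bucketed) write, i.e. something along these lines (the partition column and the path below are placeholders, not taken from your code):
df.write.partitionBy('partition_col').format('orc').save('/tmp/output')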
The general approach is: if you are reading data that already has a schema (for example, a Hive table), you can use the following API directly:
df.write.format('orc').save('/tmp/output')
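As a fuller, self-contained sketch (the Hive table name 'my_db.my_table' and the output path are placeholders, not from the original question):

from pyspark.sql import SparkSession

# Hive-enabled session; the table's schema comes from the metastore,
# so no explicit schema is needed before writing ORC.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table('my_db.my_table')
df.write.format('orc').mode('overwrite').save('/tmp/output')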
If the data has no schema, you are probably reading it directly from HDFS or from a streaming application. In that case you must define a schema yourself and use it to build the DataFrame:
schema = StructType([
    StructField('colName1', StringType(), False)
])
df = spark.read.csv(path, schema)
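Putting that together, a minimal runnable sketch might look like this (the input path, column name, and output path are all illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Explicit schema for the schemaless input; the column name is a placeholder.
schema = StructType([
    StructField('colName1', StringType(), False)
])

# Apply the schema while reading, then write the result out as ORC.
df = spark.read.csv('/tmp/input.csv', schema)
df.write.format('orc').mode('overwrite').save('/tmp/output')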
Alternatively, if you have an RDD, you must convert the RDD[Any] into an RDD[Row], define the schema, and then convert it into a DataFrame:
df = spark.createDataFrame(rdd_of_rows, schema)
df.write.format('orc').save('/tmp/output')
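An end-to-end sketch of that RDD path (the sample data, column names, and paths are made up for illustration):

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Target schema; field names and types are placeholders.
schema = StructType([
    StructField('name', StringType(), False),
    StructField('age', IntegerType(), True)
])

# A raw RDD; here it is built from an in-memory list purely for illustration.
raw_rdd = spark.sparkContext.parallelize([('alice', 30), ('bob', 25)])

# Convert each element into a Row (positionally, so it matches the schema order).
rdd_of_rows = raw_rdd.map(lambda t: Row(t[0], t[1]))

df = spark.createDataFrame(rdd_of_rows, schema)
df.write.format('orc').mode('overwrite').save('/tmp/output')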