I have a 250 MB Parquet file.
The data in one of its cells is bad. I'm assuming there is no schema problem, but there is a length problem. When I skip reading that column, Spark can read the file; when I try to read the column, Spark runs out of memory. I tried giving the executors 100 GB of memory and it still fails.
The file has 58k rows. Is there a way to recover the rest of the data and ignore that one row / one cell?
The column is named meta and has type struct<name:String,schema_version:string>.
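For reference, the column-skipping read that does succeed looks roughly like this (a minimal Scala sketch; the paths and app name are placeholders, and it relies on Parquet column pruning so the meta column is never materialized):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("recover-parquet").getOrCreate()

// Dropping meta before any action lets Spark prune that column at read time,
// so the bad cell is never decoded and the remaining columns can be written out.
val df = spark.read.parquet("/path/to/part-00142-....snappy.parquet")
val recovered = df.drop("meta")
recovered.write.parquet("/path/to/recovered/")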
I did try converting the file to JSON and then skipping the row, but the conversion to JSON also fails.
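The conversion attempt was roughly the following (same placeholder paths); writing JSON has to decode every column, including meta, which is presumably why it fails in the same way:

val full = spark.read.parquet("/path/to/part-00142-....snappy.parquet")
full.write.json("/path/to/parquet_as_json/")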
Stack trace from Spark:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 10.0 failed 4 times, most recent failure: Lost task 7.3 in stage 10.0 (TID 157, ip-10-1-131-191.us-west-2.compute.internal, executor 22): ExecutorLostFailure (executor 22 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 11.6 GB of 11.1 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
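(For context, the overhead setting that message refers to would be configured like this; the value below is only an illustrative guess, not something we actually ran:)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("recover-parquet")
  // spark.yarn.executor.memoryOverhead is in MB; 4096 is illustrative only,
  // and in practice it is usually passed with --conf at submit time.
  .config("spark.yarn.executor.memoryOverhead", "4096")
  .getOrCreate()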
Since we have isolated the problem to a specific file, we tried the following:
parquet-tools cat /Users/gaurav/Downloads/part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet > ~/Downloads/parquue_2.json
java.lang.OutOfMemoryError: Java heap space
Parquet dump:
parquet-tools dump -c meta part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet
row group 0
--------------------------------------------------------------------------------
row group 1
--------------------------------------------------------------------------------