Skipping a very large Parquet column

Date: 2019-09-07 02:06:24

Tags: apache-spark parquet

I have a 250 MB Parquet file.

One of the cells contains bad data. I assume it is not a schema problem but a length problem. When I skip reading that column, Spark can read the file; when I try to read the column, Spark runs out of memory. I tried giving the executors 100 GB of memory, but it still fails.

The file has 58k rows. Is there a way to recover the rest of the data and ignore that 1 row / 1 cell?

The column is named meta and has type struct<name:String,schema_version:string>.
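
For reference, a minimal sketch of reading the file while skipping that column (the input/output paths and app name are assumptions; the meta column name comes from the question). Dropping the column before any action lets Parquet column pruning avoid scanning it at all:

import org.apache.spark.sql.SparkSession

// assumption: a plain SparkSession; adjust master/conf for your cluster
val spark = SparkSession.builder().appName("skip-meta-column").getOrCreate()

// read the problem file, then drop the oversized meta column before doing anything else
val df = spark.read.parquet("/path/to/part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet")
val withoutMeta = df.drop("meta")

// the remaining ~58k rows can be persisted without ever materializing meta
withoutMeta.write.parquet("/path/to/recovered")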

I did try converting to JSON and then skipping the row, but the conversion to JSON also fails.
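
Roughly what was attempted (paths are assumptions); presumably it fails the same way because the oversized meta value still has to be decoded before it can be serialized as JSON:

// sketch of the attempted JSON conversion (paths are assumptions)
val df = spark.read.parquet("/path/to/part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet")
df.write.json("/path/to/parquet_as_json")   // fails: the bad meta cell must still be fully read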

Stack trace from Spark:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 10.0 failed 4 times, most recent failure: Lost task 7.3 in stage 10.0 (TID 157, ip-10-1-131-191.us-west-2.compute.internal, executor 22): ExecutorLostFailure (executor 22 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits.  11.6 GB of 11.1 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
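
For completeness, the setting the YARN message points at is normally raised at submit time; a hedged example (the class name, jar, and values are placeholders, and in this case even 100 GB executors did not help):

spark-submit --conf spark.yarn.executor.memoryOverhead=8192 --conf spark.executor.memory=16g --class com.example.RecoverJob recover-job.jar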

Since we have already isolated the problem to this specific file, we tried the following:

parquet-tools cat  /Users/gaurav/Downloads/part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet > ~/Downloads/parquue_2.json
java.lang.OutOfMemoryError: Java heap space
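
If retrying parquet-tools, its JVM heap can also be raised; a sketch assuming the standalone (with-dependencies) jar is used, where the jar name/version and heap size are assumptions:

java -Xmx16g -jar parquet-tools-1.11.1.jar cat /Users/gaurav/Downloads/part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet > ~/Downloads/parquue_2.json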

Parquet dump:

parquet-tools dump -c meta part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet 
row group 0 
--------------------------------------------------------------------------------

row group 1 
--------------------------------------------------------------------------------

0 answers