I am loading about 4 GB of data from parquet files into a Spark DataFrame. The load takes a few hundred milliseconds. I then register the DF as a table to run SQL queries.
sparkDF = sqlContext.read.parquet("<path>/*.parquet")
sparkDF.registerTempTable("sparkDF")
One of them, a selective query with 60 columns in the select list, throws an out-of-memory exception.
spark.sql("select <60 columns list> from sessions where endtime >= '2019-07-01 00:00:00' and endtime < '2019-07-01 03:00:00' and id = '<uuid>'").show()
[Stage 12:> (0 + 36) / 211]2019-09-16 21:18:45,583 ERROR executor.Executor: Exception in task 25.0 in stage 12.0 (TID 1608)
java.lang.OutOfMemoryError: Java heap space
When I remove some of the columns from the select list, the query executes successfully. I tried increasing spark.executor.memory and spark.driver.memory to about 16g, but that did not solve the problem.
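For reference, this is roughly how those settings can be applied; the values match the ~16g I tried, and passing them on the command line is safer because spark.driver.memory has no effect if it is changed after the driver JVM has already started:

# Launching with the memory settings (preferred, the driver JVM picks them up at startup):
# pyspark --driver-memory 16g --executor-memory 16g

# Setting them when building the session (only effective if no SparkSession/JVM exists yet):
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "16g") \
    .getOrCreate()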
I then updated Spark to the latest version, 2.4.4. The query no longer throws the error.
However, when I write the same DF in delta format with the same updated version, the same out-of-memory error occurs:
sessions.write.format("delta").save("/usr/spark-2.4.4/data/data-delta/")
[Stage 5:> (0 + 36) / 37]2019-09-18 18:58:04,362 ERROR executor.Executor: Exception in task 21.0 in stage 5.0 (TID 109)
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.compress.DecompressorStream.<init>(DecompressorStream.java:64)
at org.apache.hadoop.io.compress.DecompressorStream.<init>(DecompressorStream.java:71)
at org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.<init>(NonBlockedDecompressorStream.java:36)
at org.apache.parquet.hadoop.codec.SnappyCodec.createInputStream(SnappyCodec.java:75)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.decompress(CodecFactory.java:109)
at org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:93)
at org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader$1.visit(ColumnChunkPageReadStore.java:88)
at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:95)
at org.apache.parquet.hadoop.ColumnChunkPageReadStore$ColumnChunkPageReader.readPage(ColumnChunkPageReadStore.java:88)
at org.apache.parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:532)
at org.apache.parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:525)
at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:638)
at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:353)
at org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:80)
at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:75)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:232)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
Any better suggestions or improvements that could help resolve this would be appreciated.
Answer 0: (score: 0)
You can increase the amount of RAM the JVM is allowed to use. The corresponding JVM heap options are -Xms (initial heap size) and -Xmx (maximum heap size).
I am not sure whether this will solve your problem, but it is worth a try.
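As a sketch of how those options map onto Spark (the sizes and the script name your_job.py are illustrative, not recommendations):

# Spark derives the driver/executor -Xmx from its own memory settings,
# so prefer these flags over setting -Xmx directly:
# spark-submit --driver-memory 8g --executor-memory 8g your_job.py

# Other (non-heap) JVM options can be passed through extraJavaOptions,
# e.g. a different garbage collector:
# spark-submit --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" your_job.py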
Answer 1: (score: 0)
With Spark version 2.4.4, simply increasing the driver memory at launch helped resolve the problem:
pyspark --packages io.delta:delta-core_2.11:0.3.0 --driver-memory 5g
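An equivalent sketch for a scripted job rather than the interactive shell (the app name is made up; note that driver memory generally has to be supplied before the driver JVM starts, e.g. via spark-submit --driver-memory 5g):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-write")  # illustrative name
         .config("spark.jars.packages", "io.delta:delta-core_2.11:0.3.0")
         .config("spark.driver.memory", "5g")  # only effective if set before the driver JVM starts
         .getOrCreate())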
Answer 2: (score: 0)
Increasing driver and executor memory is only a temporary workaround. The problem is also related to parallelism; the driver should not need 16 GB of memory.
Instead of spark.sql("select <60 columns list> from sessions where endtime >= '2019-07-01 00:00:00' and endtime < '2019-07-01 03:00:00' and id = '<uuid>'").show(), you should use spark.sql("select * from sessions where endtime >= '2019-07-01 00:00:00' and endtime < '2019-07-01 03:00:00' and id = '<uuid>'").show(60)
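To act on the parallelism point when writing out the data, one option is to split the work across more, smaller tasks; the partition count below is only a guess and depends on your data volume:

# Repartitioning before the write spreads the rows over more tasks,
# so each task holds less decompressed data in memory at once.
sessions = spark.read.parquet("<path>/*.parquet")
sessions.repartition(200).write.format("delta").save("/usr/spark-2.4.4/data/data-delta/")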