I simply call sqlContext.table() in the pyspark shell, and it throws an OutOfMemoryError. Isn't sqlContext.table a transformation rather than an action? Why does it throw an exception, and why does it say "Stage 0 contains a task of very large size"?
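For context, here is a minimal sketch of the lazy behaviour I expected (the database/table name and the filter column below are made up, not my real ones):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="lazy-table-demo")
sqlContext = SQLContext(sc)

# I assumed both of these are lazy transformations, so no Spark job
# should run yet and nothing large should land in driver memory.
df = sqlContext.table("example_db.example_table")        # hypothetical table
filtered = df.where("event_action = 'install_failure'")  # hypothetical column

# Only an action such as count() should actually trigger execution.
print(filtered.count())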
Here are the detailed logs:
pyspark -v --driver-memory 2G --executor-memory 2G --conf spark.yarn.executor.memoryOverhead=2048 --conf spark.kryoserializer.buffer=128m
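For reference, roughly the same settings expressed programmatically (just a sketch; I actually launched the interactive pyspark shell with the flags above):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .set("spark.executor.memory", "2g")
        .set("spark.yarn.executor.memoryOverhead", "2048")
        .set("spark.kryoserializer.buffer", "128m"))
# Note: spark.driver.memory only takes effect if it is set before the driver
# JVM starts, which is why --driver-memory 2G is passed on the command line.

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)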
Code & error:
bb = sqlContext.table("piwik.ods_piwik").where("e_a='install_failure' and createdate=20160929 and a_rcn='YLZT' and idsite=10005")
[Stage 0:> (0 + 0) / 2]16/10/08 16:35:38 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (405 KB). The maximum recommended task size is 100 KB.
Exception in thread "dispatcher-event-loop-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:87)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/08 16:37:50 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1581561755-10.172.248.168-1427290270505:blk_1122356810_48621761
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:971)
16/10/08 16:37:59 WARN hdfs.DFSClient: Error Recovery for block BP-1581561755-10.172.248.168-1427290270505:blk_1122356810_48621761 in pipeline DatanodeInfoWithStorage[10.173.34.71:50010,DS-e1248474-daf8-47de-9a4b-b0ea4cc92aa3,DISK], DatanodeInfoWithStorage[10.51.117.181:50010,DS-5e209621-3af5-4221-af05-07458ac7827b,DISK]: bad datanode DatanodeInfoWithStorage[10.173.34.71:50010,DS-e1248474-daf8-47de-9a4b-b0ea4cc92aa3,DISK]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/alidata/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/pyspark/sql/context.py", line 593, in table
return DataFrame(self._ssql_ctx.table(tableName), self)
File "/alidata/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/alidata/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/alidata/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o42.table.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$21.apply(interfaces.scala:908)
at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$21.apply(interfaces.scala:906)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.sources.HadoopFsRelation$.listLeafFilesInParallel(interfaces.scala:906)
at org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(interfaces.scala:445)
at org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh(interfaces.scala:477)
at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute(interfaces.scala:489)
at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache(interfaces.scala:487)
at org.apache.spark.sql.sources.HadoopFsRelation.cachedLeafStatuses(interfaces.scala:494)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:398)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache$lzycompute(ParquetRelation.scala:145)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache(ParquetRelation.scala:143)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:202)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.dataSchema(ParquetRelation.scala:202)
at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:481)
at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:480)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToParquetRelation(HiveMetastoreCatalog.scala:480)
at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:542)
at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:522)