亚马逊EMR上的Spark:“超时等待来自池的连接”

时间:2016-08-27 21:36:41

标签: apache-spark amazon-emr

我在小型三台服务器Amazon EMR 5(Spark 2.0)群集上运行Spark作业。我的工作运行了一个小时左右,因以下错误而失败。我可以手动重启并运行,处理更多数据,最终再次失败。

我的Spark代码非常简单,并且没有直接使用任何Amazon或S3 API。我的Spark代码将S3文本字符串路径传递给Spark,Spark在内部使用S3。

My Spark程序在循环中执行以下操作:从S3加载数据 - >过程 - >将数据写入S3上的不同位置。

我的第一个怀疑是,某些内部Amazon或Spark代码未正确处理连接,并且连接池已耗尽。

com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AmazonClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
            at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:618)
            at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
            at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
            at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
            at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
            at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
            at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
            at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
            at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
            at com.sun.proxy.$Proxy44.retrieveMetadata(Unknown Source)
            at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
            at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
            at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
            at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:85)
            at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
            at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
            at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
            at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
            at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
            at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
            at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
            at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
            at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
            at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
            at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
            at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
            at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
            at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
            at sun.reflect.GeneratedMethodAccessor85.invoke(Unknown Source)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
            at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
            at py4j.Gateway.invoke(Gateway.java:280)
            at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
            at py4j.commands.CallCommand.execute(CallCommand.java:79)
            at py4j.GatewayConnection.run(GatewayConnection.java:211)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
            at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
            at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
            at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
            at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy45.getConnection(Unknown Source)
            at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
            at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
            at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
            at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
            at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
            at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
            ... 41 more

2 个答案:

答案 0 :(得分:9)

我在EMR上有一个非常简单的程序遇到了这个问题(从S3读取数据,过滤,写入S3)。

我可以通过使用S3A文件系统实现并将fs.s3a.connection.maximum设置为100来获得更大的连接池。 (默认为15;有关更多配置属性,请参阅Hadoop-AWS module: Integration with Amazon Web Services

这是我设置配置的方式:

// in Scala
val hc = sc.hadoopConfiguration

// in Python (not tested)
hc = sc._jsc.hadoopConfiguration()

// setting the config is the same for both languages
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.setInt("fs.s3a.connection.maximum", 100)

为了使其工作,传递给Spark的S3 URI必须以s3a://...

开头

答案 1 :(得分:1)

也可以通过将fs.s3.maxConnections设置为大于emrfs-site config中的默认值500来解决此问题,同时保留在EMRFS上

https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/