Unable to save dataframe to Redshift

Date: 2016-11-24 10:29:33

Tags: scala apache-spark amazon-s3 amazon-redshift

I am reading a large dataset from an HDFS location and saving the dataframe to Redshift.

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("error")
  .save()

After a while, I get the following error:

s3.amazonaws.com:443 failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:223)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1043)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.copyObjectImpl(RestStorageService.java:2029)
at org.jets3t.service.StorageService.copyObject(StorageService.java:871)
at org.jets3t.service.StorageService.copyObject(StorageService.java:916)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:323)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:707)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:370)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:

I found the same issue reported on GitHub:

s3.amazonaws.com:443 failed to respond

What am I doing wrong? Please help.

1 Answer:

Answer 0 (score: 2)

In my case I hit the same problem, and I was also using AWS EMR.

The Databricks spark-redshift library uses Amazon S3 to efficiently transfer data in and out of Redshift from Spark. The library first writes the data to Amazon S3 as Avro files, and those files are then loaded from S3 into Redshift (on EMR this goes through EMRFS).
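For illustration, a minimal sketch of the same staging mechanism on the read path; the bucket name, table name and sqlContext below are placeholders, not taken from the question:

// Sketch only: spark-redshift first stages Avro files in the S3 "tempdir",
// then runs a Redshift UNLOAD/COPY against that staging location.
// "my-staging-bucket" and "my_table" are hypothetical; sqlContext is assumed to exist.
val staged = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-staging-bucket/temp/")
  .load()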

You have to configure your EMRFS settings for this to work properly.


The EMR File System (EMRFS) and the Hadoop Distributed File System (HDFS) are both installed on your EMR cluster. EMRFS is an implementation of HDFS that allows EMR clusters to store data on Amazon S3.


EMRFS will try to verify list consistency for objects tracked in its metadata for a specific number of retries (emrfs-retry-logic). The default is 5. If the number of retries is exceeded, the originating job returns a failure. To get around this issue, you can override the default emrfs configuration with the following steps:

Step 1: Log in to your EMR master instance

Step 2: Add the following properties to /usr/share/aws/emr/emrfs/conf/emrfs-site.xml

sudo vi /usr/share/aws/emr/emrfs/conf/emrfs-site.xml

<property>
    <name>fs.s3.consistent.throwExceptionOnInconsistency</name>
    <value>false</value>
</property>

<property>
    <name>fs.s3.consistent.retryPolicyType</name>
    <value>fixed</value>
</property>
<property>
    <name>fs.s3.consistent.retryPeriodSeconds</name>
    <value>10</value>
</property>
<property>
    <name>fs.s3.consistent</name>
    <value>false</value>
</property>

Restart your EMR cluster

and configure your hadoopConfiguration with hadoopConf.set("fs.s3a.attempts.maximum", "30")

val hadoopConf = SparkDriver.getContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3a.attempts.maximum", "30")
hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
hadoopConf.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)