I am trying to read data from a Redshift table into a Spark 2.0 dataframe. My call looks like this:
df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://hostname:5439/dbname?user=myuser&password=pwd&ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory") \
    .option("dbtable", "myschema.mytable") \
    .option("forward_spark_s3_credentials", "true") \
    .option("tempdir", "s3a://mybucket/tmp2") \
    .option("region", "us-east-1") \
    .load()
This returns fine, with no errors. However, when I run

df.collect()

I get the following error:
17/02/07 17:37:36 WARN Utils$: An error occurred while trying to read
the S3 bucket lifecycle configuration
java.lang.IllegalArgumentException: Invalid S3 URI: hostname does not
appear to be a valid S3 endpoint: s3://mybucket/tmp2
at com.amazonaws.services.s3.AmazonS3URI.<init>(AmazonS3URI.java:65)
at com.amazonaws.services.s3.AmazonS3URI.<init>(AmazonS3URI.java:42)
at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:72)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:76)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:384)
at ...
followed by the data being returned...
Out[2]: [Row(col1=1, col2=u'yyyyy', col3=datetime.date(2015, 1, 6), col4=datetime.date(2017, 1, 6), col5=Decimal('21'), col6=u'ABCDEF',...)]
Notes:
Versions: Spark is 2.1, and the jars directory includes these relevant files:
RedshiftJDBC4-1.2.1.1001.jar
aws-java-sdk-1.7.4.jar
spark-redshift_2.11-0.5.0.jar
hadoop-aws-2.7.3.jar
I have tried other combinations of these jars, especially the aws-java-sdk, but in those cases I don't even get a dataframe back; I get an error from the spark.read call itself.
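For completeness, here is a minimal sketch of how a session could be built with these jars supplied explicitly (the /path/to/ locations are hypothetical; in my actual setup the jars simply sit in Spark's jars directory):

from pyspark.sql import SparkSession

# Hypothetical paths -- in practice the jars can also just live in $SPARK_HOME/jars.
jars = ",".join([
    "/path/to/RedshiftJDBC4-1.2.1.1001.jar",
    "/path/to/aws-java-sdk-1.7.4.jar",
    "/path/to/spark-redshift_2.11-0.5.0.jar",
    "/path/to/hadoop-aws-2.7.3.jar",
])

spark = SparkSession.builder \
    .appName("redshift-read") \
    .config("spark.jars", jars) \
    .getOrCreate()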
Any help/suggestions much appreciated.
Answer 0 (score: 0)
Have you checked whether the bucket and the Redshift DB are in the same AWS region?
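If they differ, one thing worth trying (a sketch, not a confirmed fix: fs.s3a.endpoint is a standard Hadoop S3A setting, but whether it resolves this particular warning is an assumption) is to point the S3A connector at the bucket's regional endpoint before reading:

# Assumes the bucket lives in us-east-1; substitute the bucket's actual region.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.s3a.endpoint", "s3.us-east-1.amazonaws.com"
)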