Can you help me read data using Spark + Redshift with the Databricks spark-redshift driver?
I am currently getting an error when calling the read method. Below is the relevant snippet of my code.
df = spark.read.format("com.databricks.spark.redshift") \
    .option("url", redshifturl) \
    .option("dbtable", "PG_TABLE_DEF") \
    .option("tempdir", "s3n://KEY_ID:SECRET_KEY_ID@/S2_BUCKET_NAME/TEMP_FOLDER_UNDER_S3_BUCKET/") \
    .option("aws_iam_role", "AWS_IAM_ROLE") \
    .load()
Below is the error log I receive:
IllegalArgumentException: u"The bucket name parameter must be specified when requesting a bucket's location"
---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<command-3255625043609925> in <module>()
----> 1 df = spark.read .format("com.databricks.spark.redshift") .option("url", redshifturl) .option("dbtable", "pg_table_def") .option("tempdir", "s3n://AKIAJXVW3IESJSQUTCUA:kLHR85WfcieNrd7B7Rm/1FK1JU4NeKTrpe8BkLbx@/weatherpattern/temp/") .option("aws_iam_role", "arn:aws:iam::190137980335:user/user1") .load()
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
163 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
164 else:
--> 165 return self._df(self._jreader.load())
166
167 @since(1.4)
/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: u"The bucket name parameter must be specified when requesting a bucket's location"
I suspect there is some problem with the s3n path, but the way I am passing it through the .option method looks correct to me, and the credentials are my real ones.
Any suggestions would be appreciated.
Thanks
-
Answer 0 (score: 0)
Your tempdir path URL is incorrect: there should be no slash between the "@" and the bucket name.
The format should be:
s3n://ACCESSKEY:SECRETKEY@bucket/path/to/temp/dir
df = spark.read.format("com.databricks.spark.redshift") \
    .option("url", redshifturl) \
    .option("dbtable", "PG_TABLE_DEF") \
    .option("tempdir", "s3n://KEY_ID:SECRET_KEY_ID@S2_BUCKET_NAME/TEMP_FOLDER_UNDER_S3_BUCKET/") \
    .option("aws_iam_role", "AWS_IAM_ROLE") \
    .load()
Documentation:
https://github.com/databricks/spark-redshift
Hope it helps.