I'm trying to connect to my Redshift table using Databricks and PySpark, but I'm finding the documentation hard to follow (https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html). Here is what I have so far; a java.lang.NullPointerException is thrown at the line .option("aws_iam_role", "arn:aws:iam::946575530956:role/MY_IAM_ROLE"):
# I installed the Redshift JDBC driver from Amazon
df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshift-cluster-1.ci9fbdm1ahgn.us-east-1.redshift.amazonaws.com") \
    .option("dbtable", "suppliers") \
    .option("tempdir", "s3a://spark-redshift/temp_data/") \
    .option("password", "MY-PASSWORD") \
    .option("user", "MY-USERNAME") \
    .option("aws_iam_role", "arn:aws:iam::946575530956:role/MY-IAM-ROLE") \
    .load()
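In case the URL itself is part of the problem: as I understand it, the Redshift JDBC driver expects a URL of the shape jdbc:redshift://HOST:PORT/DATABASE (a single "://", a port, and a database name, the last two of which my URL above doesn't have). A quick local sanity check of that shape, where :5439/dev is my assumption for the default port and a placeholder database name:

```python
import re

# Assumed URL shape: jdbc:redshift://HOST:PORT/DATABASE
pattern = r"^jdbc:redshift://[^/:]+:\d+/\w+$"

# My current URL (note the extra colon after "redshift:") does not match:
bad_url = "jdbc:redshift:://redshift-cluster-1.ci9fbdm1ahgn.us-east-1.redshift.amazonaws.com"

# With a single "://", a port, and a (placeholder) database name, it does:
good_url = ("jdbc:redshift://redshift-cluster-1.ci9fbdm1ahgn"
            ".us-east-1.redshift.amazonaws.com:5439/dev")

print(bool(re.match(pattern, bad_url)))   # False
print(bool(re.match(pattern, good_url)))  # True
```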
If I remove the aws_iam_role option, I get this error instead:

IllegalArgumentException: requirement failed: You must specify a method for authenticating Redshift's connection to S3 (aws_iam_role, forward_spark_s3_credentials, or temporary_aws_*. For a discussion of the differences between these options, please see the README.

I assume they mean this README: https://github.com/databricks/spark-redshift/blob/master/README.md#authenticating-to-s3-and-redshift, but it still doesn't help me much.
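From my reading of that README, the forward_spark_s3_credentials alternative (instead of aws_iam_role) would look roughly like the options below; the :5439/dev suffix on the URL is my guess at the port and database name, and I haven't been able to confirm this works:

```python
# Sketch of the forward_spark_s3_credentials route from the README:
# forward Spark's own S3 credentials to Redshift instead of an IAM role.
options = {
    "url": ("jdbc:redshift://redshift-cluster-1.ci9fbdm1ahgn"
            ".us-east-1.redshift.amazonaws.com:5439/dev"),  # port/db assumed
    "dbtable": "suppliers",
    "tempdir": "s3a://spark-redshift/temp_data/",
    "user": "MY-USERNAME",
    "password": "MY-PASSWORD",
    "forward_spark_s3_credentials": "true",  # replaces aws_iam_role
}

# Then, on a cluster:
# df = spark.read.format("com.databricks.spark.redshift") \
#          .options(**options).load()
print("aws_iam_role" in options)  # False
```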
I feel like I have all the information and permissions set up, but I may not be referencing the options correctly or something along those lines.

Any help is greatly appreciated, thanks!