spark - redshift - s3: classpath conflict

Date: 2017-01-19 15:25:41

Tags: amazon-web-services apache-spark amazon-s3 amazon-redshift databricks

I am trying to connect to Redshift from a Spark 2.1.0 standalone cluster on AWS, using Hadoop 2.7.2 and Alluxio, and I get this error:

Exception in thread "main" java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)

As far as I can tell, the problem comes from this note in the spark-redshift documentation:

Note on Amazon SDK dependency: This library declares a provided dependency on components of the AWS Java SDK. In most cases, these libraries will be provided by your deployment environment. However, if you get ClassNotFoundExceptions for Amazon SDK classes then you will need to add explicit dependencies on com.amazonaws.aws-java-sdk-core and com.amazonaws.aws-java-sdk-s3 as part of your build / runtime configuration. See the comments in project/SparkRedshiftBuild.scala for more details.

As described in spark-redshift-databricks, I have tried every possible classpath jar combination, always with the same error. The spark-submit where I include all the jars looks like this:

/usr/local/spark/bin/spark-submit --class com.XX.XX.app.Test \
  --driver-memory 2G --total-executor-cores 40 --verbose \
  --jars /home/ubuntu/aws-java-sdk-s3-1.11.79.jar,/home/ubuntu/aws-java-sdk-core-1.11.79.jar,/home/ubuntu/postgresql-9.4.1207.jar,/home/ubuntu/alluxio-1.3.0-spark-client-jar-with-dependencies.jar,/usr/local/alluxio/core/client/target/alluxio-core-client-1.3.0-jar-with-dependencies.jar \
  --master spark://XXX.eu-west-1.compute.internal:7077 \
  --executor-memory 4G /home/ubuntu/QAe.jar qa XXX.eu-west-1.compute.amazonaws.com 100 \
  --num-executors 10 \
  --conf spark.executor.extraClassPath=/home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar \
  --driver-class-path /home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar:/home/ubuntu/postgresql-9.4.1207.jar \
  --driver-library-path /home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar \
  --driver-library-path com.amazonaws.aws-java-sdk-s3:com.amazonaws.aws-java-sdk-core.jar \
  --packages databricks:spark-redshift_2.11:3.0.0-preview1,com.amazonaws:aws-java-sdk-s3:1.11.79,com.amazonaws:aws-java-sdk-core:1.11.79

My build.sbt:

libraryDependencies += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.4" 
libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.79"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.79"
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.8.1"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-redshift" % "1.11.78"
libraryDependencies += "com.databricks" % "spark-redshift_2.11" % "3.0.0-preview1"
libraryDependencies += "org.alluxio" % "alluxio-core-client" % "1.3.0"
libraryDependencies += "com.taxis99" %% "awsscala" % "0.7.3"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies +=  "org.apache.spark" %% "spark-sql" % sparkVersion
libraryDependencies +=  "org.apache.spark" %% "spark-mllib" % sparkVersion

The code simply reads from PostgreSQL and writes to Redshift:

val df = spark.read.jdbc(url_read, "public.test", prop).as[Schema.Message.Raw]
  .filter("message != ''")
  .filter("from_id >= 0")
  .limit(100)

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://test.XXX.redshift.amazonaws.com:5439/test?user=test&password=testXXXXX")
  .option("dbtable", "table_test")
  .option("tempdir", "s3a://redshift_logs/")
  .option("forward_spark_s3_credentials", "true")
  .option("tempformat", "CSV")
  .option("jdbcdriver", "com.amazon.redshift.jdbc42.Driver")
  .mode(SaveMode.Overwrite)
  .save()

The jars listed above are present under /home/ubuntu/ on all cluster nodes.

Does anyone know how to add explicit dependencies on com.amazonaws.aws-java-sdk-core and com.amazonaws.aws-java-sdk-s3 as part of the build/runtime configuration in Spark? Or is the problem with the jars themselves: am I using the wrong version, 1.11.80 or .79, and so on? Do I need to exclude these libraries from build.sbt? Would switching Hadoop to 2.8 solve the problem?

Here are the useful links my tests were based on: Dependency Management with Spark, Add jars to a Spark Job - spark-submit.

1 Answer:

Answer 0 (score: 3)

Amazon tends to change the APIs of its libraries fast enough that every version of hadoop-aws.jar has to be kept in sync with the AWS SDK; for Hadoop 2.7.x that means SDK version 1.7.4. As things stand, you probably cannot get Redshift and s3a to coexist, but you may be able to keep going with the older s3n URLs.
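
A minimal sketch of what falling back to s3n could look like for the write path in the question (this is an assumption, not part of the original answer: the fs.s3n.* keys belong to the classic s3n connector that ships with Hadoop 2.7.x, the bucket and JDBC values are just the placeholders from the question, and reading the credentials from environment variables is purely for illustration):

// Give the s3n connector its credentials; with forward_spark_s3_credentials=true,
// spark-redshift forwards the same credentials to Redshift for its COPY step.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://test.XXX.redshift.amazonaws.com:5439/test?user=test&password=testXXXXX")
  .option("dbtable", "table_test")
  .option("tempdir", "s3n://redshift_logs/")   // s3n:// instead of s3a://
  .option("forward_spark_s3_credentials", "true")
  .option("tempformat", "CSV")
  .mode(SaveMode.Overwrite)
  .save()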

The move to a newer SDK will only show up in Hadoop > 2.8, when it switches to 1.11.45. Why the delay? Because that forces a Jackson update, which in turn breaks everything else downstream.
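
If the SDK and Spark in your build do end up disagreeing about Jackson, one way to keep a single Jackson version across the transitive graph is sbt's dependencyOverrides; a sketch (the 2.8.4 value simply mirrors the jackson-module-scala line in the build.sbt above and may need adjusting for your actual Spark/SDK combination):

dependencyOverrides += "com.fasterxml.jackson.core"   % "jackson-core"              % "2.8.4"
dependencyOverrides += "com.fasterxml.jackson.core"   % "jackson-databind"          % "2.8.4"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.4"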

Welcome to the JAR hell of transitive dependencies; let's all hope Java 9 sorts this out, though that will need someone (you?) to add all the relevant module declarations.