Spark + Amazon S3 "s3a://" URLs

Date: 2016-09-02 23:55:25

Tags: apache-spark amazon-s3

使用" s3a://"来调用AFAIK,这是Hadoop + Spark的最新,最好的S3实现。网址协议。这适用于预先配置的Amazon EMR。

However, when running on my local development system using the pre-built spark-2.0.0-bin-hadoop2.7.tgz, I get:

Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    ... 99 more

Next, I tried launching the Spark job with the hadoop-aws add-on specified:

$SPARK_HOME/bin/spark-submit --master local \
    --packages org.apache.hadoop:hadoop-aws:2.7.3 \
    my_spark_program.py

and I got:

    ::::::::::::::::::::::::::::::::::::::::::::::
    ::              FAILED DOWNLOADS            ::
    :: ^ see resolution messages for details  ^ ::
    ::::::::::::::::::::::::::::::::::::::::::::::
    :: com.google.code.findbugs#jsr305;3.0.0!jsr305.jar
    :: org.apache.avro#avro;1.7.4!avro.jar
    :: org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.jar(bundle)
    ::::::::::::::::::::::::::::::::::::::::::::::

I created a dummy build.sbt project in a temporary directory with those three dependencies, to see whether a basic sbt build could download them successfully, and I got:

[error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.avro#avro;1.7.4: several problems occurred while resolving dependency: org.apache.avro#avro;1.7.4 {compile=[default(compile)]}:
[error]     org.apache.avro#avro;1.7.4!avro.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/apache/avro/avro/1.7.4/avro-1.7.4.pom
[error]     org.apache.avro#avro;1.7.4!avro.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/apache/avro/avro/1.7.4/avro-1.7.4.pom
[error] 
[error] unresolved dependency: com.google.code.findbugs#jsr305;3.0.0: several problems occurred while resolving dependency: com.google.code.findbugs#jsr305;3.0.0 {compile=[default(compile)]}:
[error]     com.google.code.findbugs#jsr305;3.0.0!jsr305.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.pom
[error]     com.google.code.findbugs#jsr305;3.0.0!jsr305.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.pom
[error] 
[error] unresolved dependency: org.xerial.snappy#snappy-java;1.0.4.1: several problems occurred while resolving dependency: org.xerial.snappy#snappy-java;1.0.4.1 {compile=[default(compile)]}:
[error]     org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.pom
[error]     org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.pom
[error] Total time: 2 s, completed Sep 2, 2016 6:47:17 PM

Any ideas on how to get this working?

2 Answers:

Answer 0 (score: 1)

It looks like you need additional jars in your submit flags. The Maven repository has a number of AWS packages for Java that you can use to fix your current error: https://mvnrepository.com/search?q=aws

I have had constant headaches with S3A filesystem errors, but the aws-java-sdk:1.7.4 jar works with Spark 2.0.

There is further discussion of the matter here, although an actual package does exist in the Maven AWS EC2 repository:

https://sparkour.urizone.net/recipes/using-s3/

Try this:

spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 my_spark_program.py
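Once both packages resolve, the s3a:// read typically also needs AWS credentials in the Hadoop configuration (unless they come from environment variables or an instance profile). A minimal sketch, with placeholder keys and bucket name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-read").getOrCreate()

    # Placeholder credentials; in practice these usually come from the
    # environment or an IAM role rather than being hard-coded.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    df = spark.read.json("s3a://my-bucket/some/prefix/")
    df.printSchema()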

Answer 1 (score: 0)

If you are using Apache Spark (i.e. I'm ignoring the version Amazon builds into EMR), you need to add a dependency on org.apache.hadoop:hadoop-aws for exactly the same Hadoop version that the rest of Spark uses. This adds the S3a filesystem and its transitive dependencies. The version of the AWS SDK must be the same one the hadoop-aws library was built against, as it is a bit of a moving target.

See: Apache Spark and Object Stores
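A quick way to check which Hadoop version your Spark build uses, so the hadoop-aws coordinate can be matched to it exactly; this sketch goes through PySpark's internal JVM gateway, so treat it as a convenience rather than a stable API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hadoop-version-check").getOrCreate()

    # Ask the Hadoop libraries bundled with this Spark build for their
    # version string (e.g. "2.7.3"); use the same version for hadoop-aws.
    jvm = spark.sparkContext._jvm
    print(jvm.org.apache.hadoop.util.VersionInfo.getVersion())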