Hadoop 2.9.2, Spark 2.4.0 accessing an AWS s3a bucket

Date: 2018-12-25 13:14:24

Tags: amazon-web-services apache-spark hadoop

It has been a few days now, but I still cannot download from a public Amazon bucket using Spark :(

Here is the spark-shell command:

spark-shell  --master yarn
              -v
              --jars file:/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar,file:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar
              --driver-class-path=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar

The application starts and the shell waits at the prompt:

   ____              __
  / __/__  ___ _____/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 2.4.0
   /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val data1 = sc.textFile("s3a://my-bucket-name/README.md")

18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 242.1 KB, free 246.7 MB)
18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.2 KB, free 246.6 MB)
18/12/25 13:06:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop-edge01:3545 (size: 24.2 KB, free: 246.9 MB)
18/12/25 13:06:40 INFO SparkContext: Created broadcast 0 from textFile at <console>:24
data1: org.apache.spark.rdd.RDD[String] = s3a://my-bucket-name/README.md MapPartitionsRDD[1] at textFile at <console>:24

scala> data1.count()

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
... 49 elided
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 77 more

scala>
  1. All AWS keys and secret keys are set in hadoop/core-site.xml, as described in "Hadoop-AWS module: Integration with Amazon Web Services" (a minimal sketch of those settings follows after this list)
  2. The bucket is public: anyone can download from it (verified with curl -O)
  3. All the .jars you see above are the ones Hadoop itself ships in the /usr/local/hadoop/share/hadoop/tools/lib/ folder
  4. There are no extra settings in spark-defaults.conf, only what is passed on the command line
  5. Neither jar contains this class:

    jar tf /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar | grep org/apache/hadoop/fs/StorageStatistics
    (no result)
    
    jar tf /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar | grep org/apache/hadoop/fs/StorageStatistics
    (no result)
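
For reference, the s3a credentials from point 1 sit in core-site.xml roughly like this (a minimal sketch with placeholder values, not the actual file):

    <!-- core-site.xml sketch: property names are the standard s3a ones, values are placeholders -->
    <configuration>
      <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_ACCESS_KEY</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_SECRET_KEY</value>
      </property>
    </configuration>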
    

What am I doing wrong? Did I forget to add another jar? What is the exact combination of hadoop-aws and aws-java-sdk-bundle, and which versions?

3 Answers:

Answer 0 (score: 7)

Hmm... I finally found the problem.

The main problem is that my Spark came pre-built for Hadoop: it is "v2.4.0 pre-built for Hadoop 2.7 and later". As my struggle above shows, that label is a bit misleading: Spark actually ships with its own set of Hadoop jars. Listing /usr/local/spark/jars/ shows that it contains:

    hadoop-common-2.7.3.jar
    hadoop-client-2.7.3.jar
    ....
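
(For reference, a quick way to check which Hadoop jars a pre-built Spark ships with, assuming the same install path:)

    ls /usr/local/spark/jars/ | grep '^hadoop-'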

It is only missing hadoop-aws and aws-java-sdk. I did a bit of digging in the Maven repository: hadoop-aws 2.7.3 and its dependency aws-java-sdk 1.7.4, and voila! I downloaded those jars and passed them to Spark as parameters, like this:

  

spark-shell
    --master yarn
    -v
    --jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar
    --driver-class-path=/home/aws-java-sdk-1.7.4.jar:/home/hadoop-aws-2.7.3.jar

That did the job!!!

I am just wondering why none of the Hadoop jars from my own install (I passed all of them via --jars and --driver-class-path) got picked up. Somehow Spark preferred its own bundled jars over the ones I sent.
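
A quick way to confirm which hadoop-common the shell actually loaded is to query Hadoop's VersionInfo from the Scala prompt; with this pre-built Spark 2.4.0 it should report 2.7.3:

    // Reports the version of the hadoop-common class the JVM actually loaded.
    org.apache.hadoop.util.VersionInfo.getVersion()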

Answer 1 (score: 0)

I would suggest not doing what you did. You are running a pre-built Spark that ships Hadoop 2.7.x jars on top of Hadoop 2.9.2, and then you added even more jars to the classpath so that the s3 support from the Hadoop 2.7.3 line works around the problem.

What you should do instead is use the "Hadoop free" Spark build and supply the Hadoop classes through configuration, as described here: https://spark.apache.org/docs/2.4.0/hadoop-provided.html

The main part:

In conf/spark-env.sh:

    # If the 'hadoop' binary is on your PATH
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)

    # With an explicit path to the 'hadoop' binary
    export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

    # Passing a Hadoop configuration directory
    export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
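
Note that "hadoop classpath" does not include the jars under share/hadoop/tools/lib, so hadoop-aws and the AWS SDK bundle still have to be appended. A sketch for the layout from the question (the /usr/local/hadoop paths and jar versions are taken from the question above):

    # conf/spark-env.sh - sketch; paths assume the /usr/local/hadoop install described in the question
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)
    export SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar"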

Answer 2 (score: 0)

I use Spark 2.4.5, and this is what I did and it worked for me. I was able to connect to AWS s3 from Spark running locally.

{{1}}
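
A minimal sketch of such a local Spark-to-s3a setup (not the exact snippet from this answer; the bucket name and keys are placeholders, and matching hadoop-aws and aws-java-sdk-bundle jars are assumed to be on the classpath):

    // Sketch: configure s3a credentials on the running session; bucket name and keys are placeholders.
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    hc.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    val data = spark.read.textFile("s3a://my-bucket-name/README.md")
    println(data.count())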