I am trying to read a CSV file from an S3 bucket using Spark with Scala, running locally. I can read the file over the http protocol, but I intend to use the s3a protocol.
Below is the configuration setup before the call
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "Mykey")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "Mysecretkey")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "eu-west-1.amazonaws.com")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl.disable.cache", "true")
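For context, the read that should go through s3a looks something like this; the bucket name and object path here are placeholders, not from my actual setup:

// Hypothetical bucket and key, shown only to illustrate the s3a:// scheme.
val df = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/data/input.csv")
df.show()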
I am getting the exception below:
Exception in thread "main" java.lang.RuntimeException:
java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
my spark version is: 2.3.1
scala version: 2.11
aws-java-sdk version: 1.11.336
hadoop-aws: 2.8.4
Answer 0 (score: 0)
This exception means an S3 SDK library is missing; more details can be found at https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.html
Basically, when you see a ClassNotFoundException it is caused by binaries missing from the JVM classpath. Either the root class loader loads classes from the Java runtime directory and the application's current directory, or an external class loader loads them from some given path, so double-check both. You may want to read more documentation on ClassLoaders; Google it :)
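One way to get the missing S3A classes onto the classpath is to let spark-submit resolve hadoop-aws (and its transitive AWS SDK dependency) for you. A minimal sketch; the class and jar names (com.example.ReadS3Csv, app.jar) are placeholders:

# --packages downloads hadoop-aws 2.8.4 plus its transitive dependencies
# and puts them on both the driver and executor classpaths.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.8.4 \
  --class com.example.ReadS3Csv \
  app.jar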
Answer 1 (score: 0)
Important: classpath setup
https://cwiki.apache.org/confluence/display/HADOOP2/AmazonS3
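If the project is built with sbt, the same jars can instead be declared as build dependencies. A minimal build.sbt sketch using the versions from the question (whether these exact versions are mutually compatible is not guaranteed; hadoop-aws generally has to match the other Hadoop jars on the classpath, as the wiki page above explains):

// Versions copied from the question; adjust to match your Hadoop build.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"    % "2.3.1",
  "org.apache.hadoop" % "hadoop-aws"   % "2.8.4",
  "com.amazonaws"     % "aws-java-sdk" % "1.11.336"
)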