Spark + s3-错误-java.lang.ClassNotFoundException:找不到类org.apache.hadoop.fs.s3a.S3AFileSystem

时间:2019-10-16 14:38:46

标签: apache-spark amazon-s3 pyspark apache-zeppelin

我有一个Spark ec2集群,我在这里从Zeppelin笔记本电脑提交pyspark程序。我已经加载了hadoop-aws-2.7.3.jar和aws-java-sdk-1.11.179.jar并将它们放置在spark实例的/ opt / spark / jars目录中。我收到java.lang.NoClassDefFoundError:com / amazonaws / AmazonServiceException

为什么火花看不到罐子?我是否必须在所有从站中添加jar并为主站和从站指定spark-defaults.conf? zeppelin中是否需要配置一些内容才能识别新的jar文件?

我已将jar文件/ opt / spark / jars放置在spark master上。我创建了spark-defaults.conf并添加了行

spark.hadoop.fs.s3a.access.key     [ACCESS KEY]
spark.hadoop.fs.s3a.secret.key     [SECRET KEY]
spark.hadoop.fs.s3a.impl           org.apache.hadoop.fs.s3a.S3AFileSystem
spark.driver.extraClassPath        /opt/spark/jars/hadoop-aws-2.7.3.jar:/opt/spark/jars/aws-java-sdk-1.11.179.jar

我有齐柏林飞艇的口译员向火花大师发送火花。

我也将jar放在了奴隶的/ opt / spark / jars中,但是没有创建spark-deafults.conf。

%spark.pyspark

#importing necessary libaries
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
from pyspark import SQLContext
from itertools import islice
from pyspark.sql.functions import col

# add aws credentials
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "[ACCESS KEY]")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "[SECRET KEY]")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

#creating the context
sqlContext = SQLContext(sc)

#reading the first csv file and store it in an RDD
rdd1= sc.textFile("s3a://filepath/baby-names.csv").map(lambda line: line.split(","))

#removing the first row as it contains the header
rdd1 = rdd1.mapPartitionsWithIndex(
lambda idx, it: islice(it, 1, None) if idx == 0 else it
)

#converting the RDD into a dataframe
df1 = rdd1.toDF(['year','name', 'percent', 'sex'])

#print the dataframe
df1.show()

引发错误:


Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 10.11.93.90, executor 1): java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 34 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
    at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 34 more

5 个答案:

答案 0 :(得分:1)

我能够解决上述问题,以确保每个运行的spark hadoop版本都具有正确版本的hadoop aws jar,下载正确版本的aws-java-sdk,最后下载依赖项jets3t库

enter image description here

在/ opt / spark / jars

sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.30/aws-java-sdk-1.11.30.jar
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
sudo wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar

进行测试

scala> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", [ACCESS KEY ID])
scala> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", [SECRET ACCESS KEY] )
scala> val myRDD = sc.textFile("s3n://adp-px/baby-names.csv")
scala> myRDD.count()
res2: Long = 49

答案 1 :(得分:1)

如果 S3 访问是通过来自本地集群的假设角色那么下面对我有用。

import boto3
import pyspark as pyspark
from pyspark import SparkContext

session = boto3.session.Session(profile_name='profile_name')
sts_connection = session.client('sts')
response = sts_connection.assume_role(RoleArn='arn:aws:iam:::role/role_name', RoleSessionName='role_name',DurationSeconds=3600)
credentials = response['Credentials']

conf = pyspark.SparkConf()

conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')  //crosscheck the version. 

sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', credentials['AccessKeyId'])
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', credentials['SecretAccessKey'])
sc._jsc.hadoopConfiguration().set('fs.s3a.session.token', credentials['SessionToken'])
url = str('s3a://data.csv')

l1 = sc.textFile(url).collect()
for each in l1:
    print(str(each))
    break

在 $SPARK_HOME/jars 中保持低于正确版本的类文件

  1. jets3t
  2. aws-java-sdk
  3. hadoop-aws

我更喜欢从 ~/.ivy2/jars 中删除不需要的 jars

答案 2 :(得分:0)

将以下内容添加到此文件hadoop/etc/hadoop/core-site.xml

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>***</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>***</value>
</property>

在Hadoop安装目录中,找到aws jars,因为MAC安装目录为/usr/local/Cellar/hadoop/

find . -type f -name "*aws*"

sudo cp hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar hadoop/share/hadoop/common/lib/
sudo cp hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.5.jar hadoop/share/hadoop/common/lib/

Credit

答案 3 :(得分:0)

以下为我工作

我的系统配置:

Ubuntu 16.04.6 LTS python3.7.7 openjdk版本1.8.0_252 spark-2.4.5-bin-hadoop2.7

  1. 配置PYSPARK_PYTHON路径: 在$ spark_home / conf / spark-env.sh中添加以下行

    export PYSPARK_PYTHON = python_env_path / bin / python

  2. 启动pyspark

    pyspark-包com.amazonaws:aws-java-sdk-pom:1.11.760,org.apache.hadoop:hadoop-aws:2.7.0 --conf spark.hadoop.fs.s3a.endpoint = s3 .us-west-2.amazonaws.com

    com.amazonaws:aws-java-sdk-pom:1.11.760:取决于jdk版本 hadoop:hadoop-aws:2.7.0:取决于您的hadoop版本 s3.us-west-2.amazonaws.com:取决于您的s3位置

3。从s3读取数据

df2=spark.read.parquet("s3a://s3location_file_path")

Credits

答案 4 :(得分:0)

如果以上方法均无效,则对丢失的类进行分类处理。 Jar损坏的可能性很高。 例如,如果未找到类AmazonServiceException,则执行grep,该jar已存在,如下所示。

grep "AmazonServiceException" *.jar