Spark Shell: adding dependencies via the --jars option does not work

Date: 2018-09-20 22:02:27

Tags: java apache-spark classpath spark-shell

Background

We ship Spark version 2.3.1 with our product. Recently we added support for some NLP tasks by including the johnsnowlabs/spark-nlp library. We seem to have gotten ourselves into a regression that apparently has no reasonable explanation!

Ultimately, what we are trying to do is invoke spark-shell with a few command-line arguments. By polling the JVM processes, I can see that the command being executed is as follows:

Bash command

/Library/Java/JavaVirtualMachines/jdk1.8.0_161.jdk/Contents/Home/bin/java 
    -cp /Users/path/to/apps/spark-dist/conf/:/Users/path/to/apps/spark-dist/jars/*:/Users/path/to/var/api/work/spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar 
    -Dscala.usejavacp=true 
    -Xmx520m 
    org.apache.spark.deploy.SparkSubmit 
    --master local[*]  
    --class org.apache.spark.repl.Main 
    --name Spark_shell 
    --total-executor-cores 4 
    --jars /Users/path/to/var/api/work/spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar,/Users/path/to/apps/spark/driver/lib/hadoop-aws-2.7.5.jar,/Users/path/to/apps/spark/driver/lib/aws-java-sdk-1.7.4.jar 
    spark-shell 
    -i /Users/apiltamang/Downloads/test_sparknlp_code_1275.scala

Basically, the above command tries to run the spark-nlp code in the file test_sparknlp_code_1275.scala, with the jar params:

  1. -cp /Users/path/to/apps/spark-dist/conf/:/Users/path/to/apps/spark-dist/jars/*:/Users/path/to/var/api/work/spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar for the java command, and
  2. --jars /Users/path/to/var/api/work/spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar,/Users/path/to/apps/spark/driver/lib/hadoop-aws-2.7.5.jar,/Users/path/to/apps/spark/driver/lib/aws-java-sdk-1.7.4.jar for spark-shell.

A word about spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar. In essence, it is a fat jar that contains all of our product classes as well as the classes from johnsnowlabs/spark-nlp. Likewise, the hadoop and aws-java-sdk jars are included to satisfy the dependencies of the spark-nlp classes.
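The post does not show how the fat jar is built, so purely for illustration, here is a minimal sketch of how such a jar is commonly assembled with sbt-assembly; the build tool, library versions, and merge rules below are assumptions, not the asker's actual setup:

// build.sbt -- hypothetical sketch; assumes the sbt-assembly plugin is enabled in project/plugins.sbt
name := "spark-shaded"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark itself ships with the product, so it is marked Provided and left out of the fat jar
  "org.apache.spark"     %% "spark-sql" % "2.3.1" % Provided,
  // spark-nlp and its transitive dependencies get bundled in (version is an assumption)
  "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.6.3"
)

// When dependency jars collide, drop duplicate META-INF entries and keep the first copy of anything else
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}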

The error

Simple enough, or so it would seem, but the fact is that the above does not work. The script errors out with:

import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.NerConverter
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.util.Benchmark
java.lang.NoClassDefFoundError: com/amazonaws/auth/AnonymousAWSCredentials
  at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.<init>(ResourceDownloader.scala:51)
  at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.<clinit>(ResourceDownloader.scala)
  at com.johnsnowlabs.nlp.annotators.ner.dl.PretrainedNerDL$class.pretrained$default$3(NerDLModel.scala:117)
  at com.johnsnowlabs.nlp.annotator$NerDLModel$.pretrained$default$3(annotator.scala:95)
  ... 82 elided
Caused by: java.lang.ClassNotFoundException: com.amazonaws.auth.AnonymousAWSCredentials
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 86 more

What works?

If I exclude the shaded jar spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar from the -cp param, it works. It also works if I add hadoop-aws-2.7.5.jar and aws-java-sdk-1.7.4.jar to the -cp param. So why is it that having the hadoop-aws and aws-java-sdk dependencies in --jars does not work? That ought to be the preferred way to add jar dependencies for a spark-shell script! And what is it about the -cp param that breaks it?
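To see where class resolution diverges between the two mechanisms, a small diagnostic like the following can be pasted into the spark-shell session. This is only a sketch added for investigation, not something from the original post; it assumes the REPL's context classloader is the one that can see --jars entries, while the system classloader sees only the java -cp entries:

// Try to load a class through a given classloader and report where it came from.
def tryLoad(loader: ClassLoader, name: String): String =
  try {
    val c = Class.forName(name, false, loader) // initialize = false: don't run static initializers
    Option(c.getProtectionDomain.getCodeSource)
      .map(cs => "loaded from " + cs.getLocation)
      .getOrElse("loaded (unknown source)")
  } catch {
    case _: ClassNotFoundException => "NOT FOUND"
  }

// Assumption: the REPL's context classloader sees --jars entries; the system classloader sees only java -cp.
val replLoader = Thread.currentThread.getContextClassLoader
val appLoader  = ClassLoader.getSystemClassLoader

for (name <- Seq(
       "com.johnsnowlabs.nlp.pretrained.ResourceDownloader$", // the object from the stack trace
       "com.amazonaws.auth.AnonymousAWSCredentials")) {       // the class that cannot be found
  println(name)
  println("  context/REPL loader: " + tryLoad(replLoader, name))
  println("  system (-cp) loader: " + tryLoad(appLoader, name))
}

If the shaded jar's copy of ResourceDownloader$ resolves from the system (-cp) loader while AnonymousAWSCredentials resolves only from the REPL loader, that would at least localize where the lookup is failing.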

FOOTNOTE

I tried inspecting the contents of the shaded jar using: jar tf spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar | grep aws-java-sdk

This gives the following list:

META-INF/maven/com.amazonaws/aws-java-sdk-kms/
META-INF/maven/com.amazonaws/aws-java-sdk-kms/pom.xml
META-INF/maven/com.amazonaws/aws-java-sdk-kms/pom.properties
META-INF/maven/com.amazonaws/aws-java-sdk-core/
META-INF/maven/com.amazonaws/aws-java-sdk-core/pom.xml
META-INF/maven/com.amazonaws/aws-java-sdk-core/pom.properties
META-INF/maven/com.amazonaws/aws-java-sdk-s3/
META-INF/maven/com.amazonaws/aws-java-sdk-s3/pom.xml
META-INF/maven/com.amazonaws/aws-java-sdk-s3/pom.properties
META-INF/maven/com.amazonaws/aws-java-sdk-dynamodb/
META-INF/maven/com.amazonaws/aws-java-sdk-dynamodb/pom.xml
META-INF/maven/com.amazonaws/aws-java-sdk-dynamodb/pom.properties

I don't see any actual .class files related to aws-java-sdk in the shaded jar, even though a newer version of that library is referenced in it. Maybe Spark finds these and just stops looking further...!? instead of checking the jars passed with --jars? Not sure at this point! Any insights are much appreciated.
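As a complement to the jar tf check, one can also ask the classloader inside spark-shell which classpath entries (if any) actually provide the missing class file. This is another hypothetical diagnostic added here, not part of the original post:

import scala.collection.JavaConverters._

// List every classpath location that claims to contain the missing class.
val hits = getClass.getClassLoader
  .getResources("com/amazonaws/auth/AnonymousAWSCredentials.class")
  .asScala
  .toList

if (hits.isEmpty) println("no visible classpath entry provides AnonymousAWSCredentials")
else hits.foreach(println)

If the only hit points into aws-java-sdk-1.7.4.jar, then the shaded jar really does ship only the aws-java-sdk pom metadata and not the classes, which would match the jar tf listing above.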

0 answers:

There are no answers yet