Background:
We ship Spark 2.3.1 with our product. Recently we added support for some NLP tasks by including the johnsnowlabs/spark-nlp library, and we seem to have landed ourselves in a regression with no plausible explanation in sight!
Ultimately, what we do is invoke spark-shell with a set of command-line arguments. By polling the JVM process, I can see that the command being executed is the following:
Bash command
/Library/Java/JavaVirtualMachines/jdk1.8.0_161.jdk/Contents/Home/bin/java
-cp /Users/path/to/apps/spark-dist/conf/:/Users/path/to/apps/spark-dist/jars/*:/Users/path/to/var/api/work/spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar
-Dscala.usejavacp=true
-Xmx520m
org.apache.spark.deploy.SparkSubmit
--master local[*]
--class org.apache.spark.repl.Main
--name Spark_shell
--total-executor-cores 4
--jars /Users/path/to/var/api/work/spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar,/Users/path/to/apps/spark/driver/lib/hadoop-aws-2.7.5.jar,/Users/path/to/apps/spark/driver/lib/aws-java-sdk-1.7.4.jar
spark-shell
-i /Users/apiltamang/Downloads/test_sparknlp_code_1275.scala
Basically, the above command tries to run the spark-nlp code in the file test_sparknlp_code_1275.scala, with the jar arguments:

-cp /Users/path/to/apps/spark-dist/conf/:/Users/path/to/apps/spark-dist/jars/*:/Users/path/to/var/api/work/spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar

for the java command, and

--jars /Users/path/to/var/api/work/spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar,/Users/path/to/apps/spark/driver/lib/hadoop-aws-2.7.5.jar,/Users/path/to/apps/spark/driver/lib/aws-java-sdk-1.7.4.jar

for spark-shell. A word about spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar: essentially, it is a fat jar containing all of our product classes as well as the classes from johnsnowlabs/spark-nlp. The hadoop-aws and aws-java-sdk jars are likewise included to satisfy the dependencies of the spark-nlp classes.
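As a sanity check, one way to confirm from inside the shell that the --jars entries were actually registered is SparkContext.listJars (available in Spark 2.x); a minimal diagnostic sketch, not part of the original setup:

// Run inside spark-shell: prints the jars Spark registered via --jars
// (or sc.addJar). The three jars from the command above should show up here.
spark.sparkContext.listJars().foreach(println)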
Error
Simple enough, or so it would seem, but the fact is that the above does not work. The script errors out with:
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.NerConverter
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.util.Benchmark
java.lang.NoClassDefFoundError: com/amazonaws/auth/AnonymousAWSCredentials
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.<init>(ResourceDownloader.scala:51)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.<clinit>(ResourceDownloader.scala)
at com.johnsnowlabs.nlp.annotators.ner.dl.PretrainedNerDL$class.pretrained$default$3(NerDLModel.scala:117)
at com.johnsnowlabs.nlp.annotator$NerDLModel$.pretrained$default$3(annotator.scala:95)
... 82 elided
Caused by: java.lang.ClassNotFoundException: com.amazonaws.auth.AnonymousAWSCredentials
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 86 more
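The NoClassDefFoundError fires while the ResourceDownloader singleton is being statically initialized, so com.amazonaws.auth.AnonymousAWSCredentials has to be visible to whichever classloader defines that singleton. A quick way to check which jar, if any, would supply the class in the shell; a diagnostic sketch, not from the original report:

// Run inside spark-shell: locate the missing class via the REPL's classloader.
val path = "com/amazonaws/auth/AnonymousAWSCredentials.class"
Option(getClass.getClassLoader.getResource(path)) match {
  case Some(url) => println(s"would load from: $url")
  case None      => println("not visible to the REPL classloader")
}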
What works?
It works if I exclude the shaded jar (spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar) from the -cp argument. It also works if I add hadoop-aws-2.7.5.jar and aws-java-sdk-1.7.4.jar to the -cp argument. Why are the hadoop-aws and aws-java-sdk dependencies passed via --jars not picked up? That is supposed to be the preferred way to add jar dependencies for a spark-shell script! And what is it about the -cp argument that causes things to break?
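If the difference really is between the JVM's -cp (the system classpath) and the jars Spark adds afterwards, comparing the two classloaders should make it visible. A diagnostic sketch, assuming Spark has installed its own classloader as the driver thread's context classloader (as it does for the driver in client mode):

// Compare where the class resolves on the system classloader (populated by -cp)
// versus the context classloader (which Spark augments with the --jars entries).
val path = "com/amazonaws/auth/AnonymousAWSCredentials.class"
val sys  = ClassLoader.getSystemClassLoader
val ctx  = Thread.currentThread.getContextClassLoader
println(s"system classloader : ${Option(sys.getResource(path))}")
println(s"context classloader: ${Option(ctx.getResource(path))}")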
FOOTNOTE
I tried inspecting the contents of the shaded jar with:
jar tf spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar | grep aws-java-sdk
This gives the following list:
META-INF/maven/com.amazonaws/aws-java-sdk-kms/
META-INF/maven/com.amazonaws/aws-java-sdk-kms/pom.xml
META-INF/maven/com.amazonaws/aws-java-sdk-kms/pom.properties
META-INF/maven/com.amazonaws/aws-java-sdk-core/
META-INF/maven/com.amazonaws/aws-java-sdk-core/pom.xml
META-INF/maven/com.amazonaws/aws-java-sdk-core/pom.properties
META-INF/maven/com.amazonaws/aws-java-sdk-s3/
META-INF/maven/com.amazonaws/aws-java-sdk-s3/pom.xml
META-INF/maven/com.amazonaws/aws-java-sdk-s3/pom.properties
META-INF/maven/com.amazonaws/aws-java-sdk-dynamodb/
META-INF/maven/com.amazonaws/aws-java-sdk-dynamodb/pom.xml
META-INF/maven/com.amazonaws/aws-java-sdk-dynamodb/pom.properties
I don't see any actual .class files related to aws-java-sdk inside the shaded jar, even though a newer version of the library is referenced in it. Could it be that Spark finds these entries and simply stops looking further, instead of checking the jars passed with --jars? Not sure at this point! Any insights are much appreciated.
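Since grep aws-java-sdk can only match the META-INF/maven paths, a more direct check is to list the actual class entries under com/amazonaws/ itself; a small sketch using the JDK's JarFile API, with the jar path taken from the command above:

import java.util.jar.JarFile
import scala.collection.JavaConverters._

// Print the .class entries under com/amazonaws/ in the shaded jar;
// if this prints nothing, the AWS SDK classes really were not shaded in.
val jar = new JarFile("/Users/path/to/var/api/work/spark-shaded-3d817f55997b81a17a7f7cb2df2411bd.jar")
jar.entries().asScala
  .map(_.getName)
  .filter(n => n.startsWith("com/amazonaws/") && n.endsWith(".class"))
  .foreach(println)
jar.close()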