GCP Dataproc spark.jars.packages issue downloading dependencies

Date: 2017-11-17 11:13:22

Tags: google-cloud-platform google-cloud-dataproc gcp spark-submit

When creating our Dataproc Spark cluster, we pass --properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6 to the gcloud dataproc clusters create command.

This is so that our PySpark script can save to CloudSQL.
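
For reference, the full cluster-creation command looks roughly like this (the cluster name and region are placeholders, not taken from our actual setup):

    # Hypothetical cluster name and region; the --properties flag is the relevant part
    gcloud dataproc clusters create my-cluster \
        --region us-central1 \
        --properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6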

Apparently this does nothing at cluster creation time; instead, the first spark-submit attempts to resolve the dependency.

Technically, it does seem to resolve the dependency and download the necessary jar file, but the first job submitted to the cluster fails because of the error raised by spark-submit:

Exception in thread "main" java.lang.RuntimeException: [download failed: mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1177)
    at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:298)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The full output is:

Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found mysql#mysql-connector-java;6.0.6 in central
downloading https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar ...
:: resolution report :: resolve 527ms :: artifacts dl 214ms
    :: modules in use:
    mysql#mysql-connector-java;6.0.6 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   1   |   1   |   0   ||   1   |   0   |
    ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
        [FAILED     ] mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar: Downloaded file size doesn't match expected Content Length for https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar. Please retry. (212ms)

        [FAILED     ] mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar: Downloaded file size doesn't match expected Content Length for https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar. Please retry. (212ms)

    ==== central: tried

      https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar

        ::::::::::::::::::::::::::::::::::::::::::::::

        ::              FAILED DOWNLOADS            ::

        :: ^ see resolution messages for details  ^ ::

        ::::::::::::::::::::::::::::::::::::::::::::::

        :: mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar

        ::::::::::::::::::::::::::::::::::::::::::::::

But subsequent jobs on the cluster show this output:

Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found mysql#mysql-connector-java;6.0.6 in central
:: resolution report :: resolve 224ms :: artifacts dl 5ms
    :: modules in use:
    mysql#mysql-connector-java;6.0.6 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 1 already retrieved (0kB/7ms)

So my questions are:

  1. What is the cause of this, and can it be fixed by the good people at GCP?
  2. Is there a workaround in the meantime, other than running a dummy job at cluster startup that is allowed to fail (a rough sketch of that idea follows below)?
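
The dummy-job workaround mentioned in question 2 would look roughly like this (the bucket, script, and cluster names are hypothetical); its only purpose is to trigger the first, possibly failing, dependency resolution so that later jobs hit the Ivy cache:

    # Warm-up job that is allowed to fail; "|| true" swallows the error
    gcloud dataproc jobs submit pyspark gs://my-bucket/noop.py \
        --cluster my-cluster || true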

1 Answer:

Answer (score: 1)

How consistently can you reproduce this? After trying to reproduce it with various cluster setups, my best theory is that this may be an overloaded server returning a 5xx error.

As far as workarounds go:

1) Download the jar from Maven Central yourself and pass it explicitly when submitting jobs (e.g. via the --jars flag / spark.jars property). If you create new clusters frequently, it would make sense to stage this file on the cluster via an initialization action.
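
A minimal sketch of workaround 1 (the bucket, script, and cluster names are hypothetical; the download URL is the one from the log output above):

    # Download the connector jar once and stage it in GCS
    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar
    gsutil cp mysql-connector-java-6.0.6.jar gs://my-bucket/jars/

    # Reference it explicitly instead of relying on spark.jars.packages resolution
    gcloud dataproc jobs submit pyspark my_job.py \
        --cluster my-cluster \
        --jars gs://my-bucket/jars/mysql-connector-java-6.0.6.jar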

2) Provide an alternative Ivy settings file that points at the Google-hosted Maven Central mirror, via the spark.jars.ivySettings property (this may reduce or eliminate the chance of 5xx errors).
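
A rough sketch of workaround 2, assuming the settings file is written to the cluster (for example by an initialization action); the file path and the mirror URL are illustrative and should be checked against the post linked below:

    # Write an Ivy settings file that resolves from the Google-hosted Maven Central mirror
    cat > /etc/spark/ivysettings.xml <<'EOF'
    <ivysettings>
      <settings defaultResolver="gcs-maven-central-mirror"/>
      <resolvers>
        <!-- Mirror URL per the linked InfoQ post; verify it is still current -->
        <ibiblio name="gcs-maven-central-mirror"
                 m2compatible="true"
                 root="https://maven-central.storage.googleapis.com/"/>
      </resolvers>
    </ivysettings>
    EOF

    # Point spark-submit at it when submitting jobs
    gcloud dataproc jobs submit pyspark my_job.py \
        --cluster my-cluster \
        --properties spark.jars.ivySettings=/etc/spark/ivysettings.xml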

See this post: https://www.infoq.com/news/2015/11/maven-central-at-google