Hadoop Streaming jar not found when submitting a Google Dataproc Hadoop job?

Posted: 2019-01-02 15:52:31

Tags: hadoop-streaming google-cloud-dataproc

When submitting a Hadoop MapReduce job programmatically (from a Java application using the Dataproc client library), the job fails immediately. Submitting the exact same job through the UI works fine.

I have tried SSHing into the Dataproc cluster to confirm the file exists, checked permissions, and changed the jar reference. Nothing worked.

The error I get:

Exception in thread "main" java.lang.ClassNotFoundException: file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:18)
Job output is complete

When I clone the failed job in the console and look at the equivalent REST request, this is what I see:

POST /v1/projects/project-id/regions/us-east1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "reference": {
      "projectId": "project-id",
      "jobId": "jobDoesNotWork"
    },
    "placement": {
      "clusterName": "cluster-name",
      "clusterUuid": "uuid"
    },
    "submittedBy": "service-account@project.iam.gserviceaccount.com",
    "jobUuid": "uuid",
    "hadoopJob": {
      "args": [
        "-Dmapred.reduce.tasks=20",
        "-Dmapred.output.compress=true",
        "-Dmapred.compress.map.output=true",
        "-Dstream.map.output.field.separator=,",
        "-Dmapred.textoutputformat.separator=,",
        "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
        "-Dmapreduce.input.fileinputformat.split.minsize=268435456",
        "-Dmapreduce.input.fileinputformat.split.maxsize=268435456",
        "-mapper",
        "/bin/cat",
        "-reducer",
        "/bin/cat",
        "-inputformat",
        "org.apache.hadoop.mapred.lib.CombineTextInputFormat",
        "-outputformat",
        "org.apache.hadoop.mapred.TextOutputFormat",
        "-input",
        "gs://input/path/",
        "-output",
        "gs://output/path/"
      ],
      "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
    }
  }
}

When I submit the job through the console, it works. The REST equivalent of that job:

POST /v1/projects/project-id/regions/us-east1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "reference": {
      "projectId": "project-id",
      "jobId": "jobDoesWork"
    },
    "placement": {
      "clusterName": "cluster-name",
      "clusterUuid": ""
    },
    "submittedBy": "user_email_account@email.com",
    "jobUuid": "uuid",
    "hadoopJob": {
      "args": [
        "-Dmapred.reduce.tasks=20",
        "-Dmapred.output.compress=true",
        "-Dmapred.compress.map.output=true",
        "-Dstream.map.output.field.separator=,",
        "-Dmapred.textoutputformat.separator=,",
        "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
        "-Dmapreduce.input.fileinputformat.split.minsize=268435456",
        "-Dmapreduce.input.fileinputformat.split.maxsize=268435456",
        "-mapper",
        "/bin/cat",
        "-reducer",
        "/bin/cat",
        "-inputformat",
        "org.apache.hadoop.mapred.lib.CombineTextInputFormat",
        "-outputformat",
        "org.apache.hadoop.mapred.TextOutputFormat",
        "-input",
        "gs://input/path/",
        "-output",
        "gs://output/path/"
      ],
      "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
    }
  }
}

I shelled into the box and confirmed the file does exist. The only real difference I can see is "submittedBy": one works, one doesn't. I'm guessing it's a permissions issue, but I can't tell where the permissions are being pulled from in each case. In both cases, the Dataproc cluster was created with the same service account.

Looking at the permissions on that jar on the cluster, I see:

-rw-r--r-- 1 root root  133856 Nov 27 20:17 hadoop-streaming-2.8.4.jar
lrwxrwxrwx 1 root root      26 Nov 27 20:17 hadoop-streaming.jar -> hadoop-streaming-2.8.4.jar

I tried changing mainJarFileUri from pointing explicitly at the versioned jar to the symlink (since it has open permissions), without really expecting it to work. It didn't.

Does anyone with Dataproc experience know what's going on here and how to fix it?

1 answer:

Answer 0 (score: 3):

One common mistake that's easy to make in code is to call setMainClass when you intended to call setMainJarFileUri, or vice versa. The java.lang.ClassNotFoundException you got indicates that Dataproc tried to submit the jarfile string as a class name rather than as a jarfile, so Dataproc must think you set main_class. You may want to double-check your code to see whether that is the bug you hit.
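To make the failure mode concrete, here is a minimal Python sketch (the function and messages are illustrative stand-ins, not Dataproc's actual code) of how the agent dispatches a hadoopJob: a main_class value is handed to Class.forName, so a jar URI placed in that field can only produce a ClassNotFoundException like the one in the stack trace above:

```python
def submit(hadoop_job: dict) -> str:
    """Illustrative stand-in for the Dataproc agent's dispatch logic."""
    if "mainClass" in hadoop_job:
        class_name = hadoop_job["mainClass"]
        # The agent's shim effectively calls Class.forName(class_name).
        # A file URI is not a loadable Java class name, so this path
        # fails the same way the stack trace in the question does.
        if class_name.startswith("file://") or class_name.endswith(".jar"):
            raise RuntimeError(
                f"java.lang.ClassNotFoundException: {class_name}")
        return f"run class {class_name}"
    # With mainJarFileUri set, the jar itself is executed.
    return f"run jar {hadoop_job['mainJarFileUri']}"

STREAMING_JAR = "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
broken = {"mainClass": STREAMING_JAR}      # reproduces the failure
fixed = {"mainJarFileUri": STREAMING_JAR}  # what the console submits
```

The fix on the client side is the corresponding one-line change: put the jar URI in the jar field (setMainJarFileUri in the Java builder) rather than the class field.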

The reason using "clone job" in the GUI hid this problem is that the GUI tries to be more user-friendly by offering a single text box for setting either main_class or main_jar_file_uri, and it infers which one you mean by looking at the file extension to see whether the value is a jarfile. So when you submitted a job with a jarfile URI in the main_class field and it failed, then clicked clone and submitted the new job, the GUI tried to be smart, recognized that the new job actually specified a jarfile name, and therefore correctly set the main_jar_file_uri field in the JSON request instead of main_class.
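The console's single-textbox inference described above can be sketched like this (the exact rule is an assumption; the console's real heuristic may differ):

```python
def infer_main_field(value: str) -> str:
    """Guess which job field a 'Main class or jar' textbox value belongs in.

    Sketch of the console behavior described above: anything that looks
    like a jarfile is routed to main_jar_file_uri, everything else to
    main_class.
    """
    return "main_jar_file_uri" if value.endswith(".jar") else "main_class"

# A jar URI lands in main_jar_file_uri, which is why the cloned job works:
assert infer_main_field(
    "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
) == "main_jar_file_uri"
# A fully qualified class name would land in main_class:
assert infer_main_field("org.apache.hadoop.examples.WordCount") == "main_class"
```

A raw API client gets no such safety net, which is why the same jar string succeeded from the console but failed from the Java application.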