Question

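I have code that writes several outputs using org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.

The reducer writes its results to locations created beforehand, so I don't need the default output directory (the one containing the _history and _SUCCESS directories).

I have to delete them before every rerun.

So I removed the line TextOutputFormat.setOutputPath(job1,new Path(outputPath));. However, that gives me the (expected) error org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set

Driver class: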

MultipleOutputs.addNamedOutput(job1, "path1", TextOutputFormat.class, Text.class, LongWritable.class);
MultipleOutputs.addNamedOutput(job1, "path2", TextOutputFormat.class, Text.class, LongWritable.class);
LazyOutputFormat.setOutputFormatClass(job1, TextOutputFormat.class);

Reducer class:

if(condition1)
    mos.write("path1", key, new LongWritable(value), path_list[0]);
else
    mos.write("path2", key, new LongWritable(value), path_list[1]);

Is there a workaround that avoids specifying the default output directory?

Answer 1

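I don't think _SUCCESS is a directory, and the history directory lives inside the _logs directory.

First of all, TextOutputFormat.setOutputPath(job1,new Path(outputPath)); matters, because while the job runs Hadoop uses this path as the working directory in which it creates temporary files, for example for the individual tasks (the _temporary directory). The _temporary directory and its files are deleted when the job ends. The _SUCCESS file and the history directory are what actually remains under the working directory after the job completes successfully. The _SUCCESS file is a kind of flag saying that the job really ran successfully. Please have a look at this link.

Your _SUCCESS file is created by the TextOutputFormat class you are using, which in turn uses the FileOutputCommitter class. FileOutputCommitter defines members like these: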

public static final String SUCCEEDED_FILE_NAME = "_SUCCESS";

/**
 * Delete the temporary directory, including all of the work directories.
 * This is called for all jobs whose final run state is SUCCEEDED
 * @param context the job's context.
 */
public void commitJob(JobContext context) throws IOException {
  // delete the _temporary folder
  cleanupJob(context);
  // check if the o/p dir should be marked
  if (shouldMarkOutputDir(context.getConfiguration())) {
    // create a _success file in the o/p folder
    markOutputDirSuccessful(context);
  }
}

// Mark the output dir of the job for which the context is passed.
private void markOutputDirSuccessful(JobContext context)
throws IOException {
  if (outputPath != null) {
    FileSystem fileSys = outputPath.getFileSystem(context.getConfiguration());
    if (fileSys.exists(outputPath)) {
      // create a file in the folder to mark it
      Path filePath = new Path(outputPath, SUCCEEDED_FILE_NAME);
      fileSys.create(filePath).close();
    }
  }
}

Since markOutputDirSuccessful() is private, you have to override commitJob() to bypass the creation of SUCCEEDED_FILE_NAME and get the behaviour you want.
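To make the override idea concrete, here is a minimal sketch in plain Java. It is a simplified model and not Hadoop's code: MarkingCommitter, QuietCommitter and markerCreated are hypothetical names invented for this illustration, and java.nio.file stands in for Hadoop's FileSystem. It only shows that a subclass can skip a private marker-writing step by overriding the public commitJob().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CommitterSketch {

    /** Simplified stand-in for the commit logic quoted above. NOT Hadoop's class. */
    static class MarkingCommitter {
        static final String SUCCEEDED_FILE_NAME = "_SUCCESS";
        protected final Path outputPath;

        MarkingCommitter(Path outputPath) {
            this.outputPath = outputPath;
        }

        /** Stand-in for deleting the (here always empty) _temporary directory. */
        protected void cleanupJob() throws IOException {
            Files.deleteIfExists(outputPath.resolve("_temporary"));
        }

        public void commitJob() throws IOException {
            cleanupJob();
            markOutputDirSuccessful();
        }

        // Private, so a subclass cannot override it directly.
        private void markOutputDirSuccessful() throws IOException {
            if (Files.exists(outputPath)) {
                Files.createFile(outputPath.resolve(SUCCEEDED_FILE_NAME));
            }
        }
    }

    /** Overrides commitJob() so that cleanup still runs but no marker is written. */
    static class QuietCommitter extends MarkingCommitter {
        QuietCommitter(Path outputPath) {
            super(outputPath);
        }

        @Override
        public void commitJob() throws IOException {
            cleanupJob();
        }
    }

    /** Commits once into a fresh temp directory and reports whether _SUCCESS appeared. */
    public static boolean markerCreated(boolean quiet) {
        try {
            Path out = Files.createTempDirectory("job-output");
            Files.createDirectory(out.resolve("_temporary"));
            MarkingCommitter committer =
                    quiet ? new QuietCommitter(out) : new MarkingCommitter(out);
            committer.commitJob();
            Path marker = out.resolve(MarkingCommitter.SUCCEEDED_FILE_NAME);
            boolean created = Files.exists(marker);
            // Remove what this demo created.
            Files.deleteIfExists(marker);
            Files.delete(out);
            return created;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("default committer wrote _SUCCESS: " + markerCreated(false));
        System.out.println("quiet committer wrote _SUCCESS: " + markerCreated(true));
    }
}
```

Note that the quoted commitJob() only writes the marker when shouldMarkOutputDir(context.getConfiguration()) returns true. In the Hadoop versions I know of, that check reads the boolean property mapreduce.fileoutputcommitter.marksuccessfuljobs, so setting it to false in the job configuration may be enough without subclassing; check this against your version.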

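The other directory, _logs, matters if you later want to use the hadoop HistoryViewer to get a report of how the job actually ran.

I believe that when you use the same output directory as the input of another job, the _SUCCESS file and the _logs directory are ignored because of a filter set up in Hadoop.

Also, when you define a named output for MultipleOutputs, you can instead write to a subdirectory inside the output path given to TextOutputFormat.setOutputPath(), and then use that path as the input of the next job you run.

I don't really see how _SUCCESS and _logs would bother you?

Thanks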

Answer 2

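The question is quite old, but I'm sharing an answer anyway.

This answer fits the scenario in the question.

Define your OutputFormat to say that you are not expecting any output. You can do it this way: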

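// New-API Job (org.apache.hadoop.mapreduce); the old-API JobConf equivalent is setOutputFormat(...)
job.setOutputFormatClass(NullOutputFormat.class);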

Or

You can also use LazyOutputFormat:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

Credits @charlesmenguy

Answer 3

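What version of Hadoop are you running?

As a quick workaround, you can set the output location programmatically and call FileSystem.delete to remove it once the job has finished.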
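The pattern is "point the job at a throwaway directory, then delete it recursively when the job is done". Below is a minimal local-filesystem analogue of that pattern, assuming java.nio.file in place of Hadoop's FileSystem; ScratchOutputSketch, fakeJob and runWithScratchOutput are hypothetical names made up for this sketch. In a real driver the cleanup step would be the FileSystem.delete call mentioned above, with the recursive flag set, after the job has completed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.function.Consumer;
import java.util.stream.Stream;

public class ScratchOutputSketch {

    /** Deletes a directory tree, visiting children before their parents. */
    public static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical job: leaves a _SUCCESS file and a _logs/history directory behind. */
    public static void fakeJob(Path outputDir) {
        try {
            Files.createFile(outputDir.resolve("_SUCCESS"));
            Files.createDirectories(outputDir.resolve("_logs").resolve("history"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Runs the job against a throwaway output directory and removes that
     * directory afterwards, even if the job throws.
     * Returns true if the directory still exists after cleanup.
     */
    public static boolean runWithScratchOutput(Consumer<Path> job) {
        Path scratch;
        try {
            scratch = Files.createTempDirectory("default-output");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            job.accept(scratch);
        } finally {
            deleteRecursively(scratch);
        }
        return Files.exists(scratch);
    }

    public static void main(String[] args) {
        boolean leftover = runWithScratchOutput(ScratchOutputSketch::fakeJob);
        System.out.println("scratch directory left behind: " + leftover);
    }
}
```

Doing the cleanup in a finally block means the scratch directory is also removed when the job fails, which may or may not be what you want if you need the logs for debugging.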

Getting rid of the default output directory entirely - MapReduce
