Question

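I'm new to Hadoop, but it has been my learning project for the past month.

To keep this general enough to be useful to others, let me set out the basic goal first. Assume:

You have a large dataset (obviously) of millions of basic ASCII text files.
- Each file is a "record".
The records are stored in a directory structure that identifies customer & date
- e.g. /user/hduser/data/customer1/YYYY-MM-DD, /user/hduser/data/customer2/YYYY-MM-DD
You want the output structure to mimic the input structure
- e.g. /user/hduser/out/customer1/YYYY-MM-DD, /user/hduser/out/customer2/YYYY-MM-DD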

I have looked at multiple threads:

Multiple output path java hadoop mapreduce
MultipleTextOutputFormat alternative in new api
Separate Output files in Hadoop mapreduce
Speculative Task Execution - in an attempt to sort out the -m-part-#### issue

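And many more... I've also been reading Tom White's Hadoop book, and I've been eager to learn this. I've also swapped back and forth between the new API and the old API, which adds to the confusion.

Many people have pointed to MultipleOutputs (or the old API versions), but I can't seem to produce the output I want - for instance, MultipleOutputs doesn't seem to accept a "/" in the name passed to write() to create a directory structure.

What steps are needed to create files with the desired output structure? Currently I have a WholeFileInputFormat class and a related RecordReader that produces a (NullWritable K, BytesWritable V) pair (which can change if needed).

My map setup: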

public class MapClass extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
    private Text filenameKey;
    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        InputSplit split = context.getInputSplit();
        Path path = ((FileSplit) split).getPath();
        filenameKey = new Text(path.toString().substring(38)); // bad hackjob, until i figure out a better way.. removes hdfs://master:port/user/hduser/path/
        mos = new MultipleOutputs(context);
    }
}

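There is also a cleanup() function that calls mos.close(), and the map() function is currently an unknown (this is where I need help).

Is this enough information to point a newbie in the direction of an answer? My next thought was to create a MultipleOutputs() object in every map() task, each with a new baseoutput string, but I'm not sure whether that is efficient or even the right approach.

Advice would be appreciated. Anything in the program can change at this point except the input - I'm just trying to learn the framework - but I'd like to get as close to this result as possible (later on I'll probably look at combining records into larger files, but they are already 20MB per record, and I want to make sure it works before I make it impossible to read in Notepad).

Edit: Could this problem be solved by modifying/extending TextOutputFormat.class? It seems to have some methods that might work, but I'm not sure which ones I would need to override...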

Answer 1

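If you turn off speculative execution, there is nothing stopping you from manually creating the output folder structure/files in your mapper and writing the records to them (ignoring the output context/collector).

For example, extending the snippet (setup method), you could do something like this (which is basically what MultipleOutputs does, but assuming speculative execution is turned off to avoid file collisions where two map tasks try to write to the same output file):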

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MultiOutputsMapper extends
        Mapper<LongWritable, Text, NullWritable, NullWritable> {
    protected String filenameKey;
    private RecordWriter<Text, Text> writer;
    private Text outputValue;
    private Text outputKey;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // operate on the input record
        // ...

        // write to output file using writer rather than context
        writer.write(outputKey, outputValue);
    }

    @Override
    protected void setup(Context context) throws IOException,
            InterruptedException {
        InputSplit split = context.getInputSplit();
        Path path = ((FileSplit) split).getPath();

        // extract parent folder and filename
        filenameKey = path.getParent().getName() + "/" + path.getName();

        // base output folder
        final Path baseOutputPath = FileOutputFormat.getOutputPath(context);
        // output file name
        final Path outputFilePath = new Path(baseOutputPath, filenameKey);

        // We need to override the getDefaultWorkFile path to stop the file being created in the _temporary/taskid folder
        TextOutputFormat<Text, Text> tof = new TextOutputFormat<Text, Text>() {
            @Override
            public Path getDefaultWorkFile(TaskAttemptContext context,
                    String extension) throws IOException {
                return outputFilePath;
            }
        };

        // create a record writer that will write to the desired output subfolder
        writer = tof.getRecordWriter(context);
    }

    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        writer.close(context);
    }
}
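
The setup() method above keeps only the last two path components (parent folder plus file name), which fits a customer/date-file layout. If the layout were deeper, one option might be to strip a known input root instead of counting characters or components. The following is a minimal sketch in plain Java using java.net.URI; the class name OutputPathMapper, the method mirror, the port 9000 and the sample date are all hypothetical names chosen for illustration and are not part of Hadoop or of the code above.

```java
import java.net.URI;

public class OutputPathMapper {

    /**
     * Re-roots inputFile from inputRoot onto outputRoot, keeping everything
     * below the input root (e.g. customer1/2013-01-15) unchanged.
     * Hypothetical helper, for illustration only.
     */
    public static String mirror(String inputRoot, String outputRoot, String inputFile) {
        // Compare path components only, so a scheme and authority such as
        // hdfs://master:9000 are ignored rather than skipped by a fixed offset.
        String rootPath = URI.create(inputRoot).getPath();
        String filePath = URI.create(inputFile).getPath();
        if (!rootPath.endsWith("/")) {
            rootPath += "/";
        }
        if (!filePath.startsWith(rootPath)) {
            throw new IllegalArgumentException(inputFile + " is not under " + inputRoot);
        }
        String relative = filePath.substring(rootPath.length());
        String outPrefix = outputRoot.endsWith("/") ? outputRoot : outputRoot + "/";
        return outPrefix + relative;
    }

    public static void main(String[] args) {
        // prints /user/hduser/out/customer1/2013-01-15
        System.out.println(mirror(
                "/user/hduser/data",
                "/user/hduser/out",
                "hdfs://master:9000/user/hduser/data/customer1/2013-01-15"));
    }
}
```

In a mapper, the relative part computed this way could presumably take the place of filenameKey when building the output Path, with the input root passed in through the job configuration.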

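Some points to consider:

Are the customerx/yyyy-MM-dd paths files or folders of files? (If they are folders, you will need to amend the code accordingly - this implementation assumes there is one file per date and that the file name is yyyy-MM-dd.)
You may wish to look into LazyOutputFormat to prevent empty map output files from being created.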
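
For reference, a driver for the mapper above might be configured roughly as follows. This is only a sketch, assuming Hadoop 2.x or later and the new (org.apache.hadoop.mapreduce) API; the class name MultiOutputsDriver, the job name and the use of args[0]/args[1] for the input and output locations are assumptions made for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver class; only MultiOutputsMapper comes from the answer above.
public class MultiOutputsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "mirror input structure");
        job.setJarByClass(MultiOutputsDriver.class);
        job.setMapperClass(MultiOutputsMapper.class);

        // Map-only job: each mapper writes its own file directly.
        job.setNumReduceTasks(0);

        // Avoid two attempts of the same map task writing the same file.
        job.setMapSpeculativeExecution(false);

        // The mapper never writes to the context, so wrapping the output
        // format in LazyOutputFormat should keep empty part-m-* files
        // from being created.
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(NullWritable.class);
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        // args[0] is assumed to be an input path or glob, for example
        // /user/hduser/data/*/*, and args[1] the base output folder.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

On older releases the same settings are probably available through JobConf or the corresponding configuration properties, but that is not verified here.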
