Question

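In one of my MapReduce jobs, I extended BytesWritable as KeyBytesWritable and ByteWritable as ValueBytesWritable. I then write the results with SequenceFileOutputFormat.

My question is: when I start the next MapReduce job, I want to use this SequenceFile as its input. How should I configure the job class, and how does the Mapper class recognize the key and value types in the SequenceFile that I overrode earlier?

I know that I can use SequenceFile.Reader to read the keys and values.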

Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
// Instantiate the key and value types recorded in the file header
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
    // process key and value here
}
reader.close();

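But I don't know how to use this Reader to pass the keys and values to the Mapper class as parameters. How can I set conf.setInputFormat to SequenceFileInputFormat and then have the Mapper receive the keys and values?

Thanks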

Answer 1

You don't need to read the sequence file manually. Just set the input format class to the sequence file input format:

job.setInputFormatClass(SequenceFileInputFormat.class);

and set the input path to the directory containing your sequence files.

FileInputFormat.setInputPaths(job, "<path to the dir containing your sequence files>");

You need to make sure that the input (key, value) types in your Mapper class's type parameters match the (key, value) pairs stored in the sequence file.
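To make the pieces above concrete, here is a minimal sketch of a second job that reads the sequence file, using the new org.apache.hadoop.mapreduce API as in this answer. The names SecondJob and SeqMapper, the empty stand-in definitions of KeyBytesWritable and ValueBytesWritable, and the use of args[0] and args[1] as input and output paths are all assumptions for illustration. In a real job you would import the same key and value classes the first job wrote with, since they have to be on the classpath for the reader to instantiate them.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver class name, chosen for this sketch.
public class SecondJob {

    // Stand-ins for the asker's classes (assumed shape). Replace these with
    // the actual classes that the first job used when writing the file.
    public static class KeyBytesWritable extends BytesWritable {}
    public static class ValueBytesWritable extends ByteWritable {}

    // The first two type parameters are the input key and value types.
    // They must match the key and value classes stored in the sequence file.
    public static class SeqMapper
            extends Mapper<KeyBytesWritable, ValueBytesWritable, Text, Text> {
        @Override
        protected void map(KeyBytesWritable key, ValueBytesWritable value, Context context)
                throws IOException, InterruptedException {
            // Example processing only: emit the key length and the value byte as text.
            context.write(new Text("keyLength=" + key.getLength()),
                          new Text("value=" + value.get()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "read-sequence-file");
        job.setJarByClass(SecondJob.class);

        // Let the framework read the sequence files and feed records to the mapper.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        job.setMapperClass(SeqMapper.class);
        job.setNumReduceTasks(0); // map-only, to keep the sketch short

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If you are on the older org.apache.hadoop.mapred API mentioned in the question, the equivalent call is conf.setInputFormat(SequenceFileInputFormat.class) on a JobConf, using the SequenceFileInputFormat from the mapred package instead.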
