Question

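I want to write a job with three Mappers: two of them will process ".csv" files and the third will process ".xml" files. I have already written an XmlInputFormat for the .xml files here.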

Now I would like to know what I should pass to

job.setInputFormatClass(...);

and which calls I should add to provide the file paths.

 TextInputFormat.addInputPath(...)
 TextOutputFormat.setInputPath(...)

OR

TextInputFormat.addInputPath(...)
TextOutputFormat.setInputPath(...)

Answer 1

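You should consider writing two mappers, one that processes the .csv files and another for the .xml files. However, both mappers must emit key-value pairs of the same types so that a single reducer can process them.

Here is an example of doing this with org.apache.hadoop.mapred.lib.MultipleInputs: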

MultipleInputs.addInputPath(jobConf, 
                     new Path(csvFilePath),       
                     SequenceFileInputFormat.class, 
                     CSVProcessingMapper.class);
MultipleInputs.addInputPath(jobConf, 
                     new Path(xmlFilePath), 
                     XmlInputFormat.class, 
                     XMLProcessingMapper.class);

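Here CSVProcessingMapper.class and XMLProcessingMapper.class are the mappers that process the CSV and XML input, respectively. You can register several mappers this way, each with a different input type. Likewise, SequenceFileInputFormat.class and XmlInputFormat.class are the corresponding input format classes (if the CSV files are plain text rather than Hadoop SequenceFiles, TextInputFormat is the usual choice). Because each path is registered together with its own input format and mapper, no single input format needs to be set on the job for these inputs.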
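The requirement that both mappers emit the same key-value types can be illustrated without Hadoop at all. The following is only a minimal plain-Java sketch, not Hadoop code: the class SharedOutputTypes, the methods mapCsv, mapXml and reduce, and the record layouts ("name,count" lines and <rec><name>..</name><count>..</count></rec> elements) are all hypothetical names and formats chosen for illustration. In a real job the two map functions would be the map() methods of the two mapper classes, and the pair types would be Writable types such as Text and IntWritable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.AbstractMap.SimpleEntry;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SharedOutputTypes {

    // Hypothetical CSV "mapper": turns a line such as "alice,3" into (name, count).
    static Map.Entry<String, Integer> mapCsv(String line) {
        String[] fields = line.split(",");
        return new SimpleEntry<>(fields[0].trim(), Integer.parseInt(fields[1].trim()));
    }

    // Assumed XML record layout, for illustration only.
    private static final Pattern NAME = Pattern.compile("<name>(.*?)</name>");
    private static final Pattern COUNT = Pattern.compile("<count>(\\d+)</count>");

    // Hypothetical XML "mapper": emits the SAME (String, Integer) types as mapCsv.
    static Map.Entry<String, Integer> mapXml(String record) {
        Matcher name = NAME.matcher(record);
        Matcher count = COUNT.matcher(record);
        if (!name.find() || !count.find()) {
            throw new IllegalArgumentException("Unexpected record: " + record);
        }
        return new SimpleEntry<>(name.group(1).trim(), Integer.parseInt(count.group(1)));
    }

    // Single "reducer": sums the counts per key, regardless of which mapper emitted them.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(mapCsv("alice,3"));
        pairs.add(mapCsv("bob,1"));
        pairs.add(mapXml("<rec><name>alice</name><count>4</count></rec>"));
        System.out.println(reduce(pairs)); // prints {alice=7, bob=1}
    }
}
```

The reducer never needs to know whether a pair came from a CSV line or an XML record, which is why the two mappers can differ in input format as long as their output types agree.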

Calling multiple Mappers with different InputFormat classes in MapReduce
