Question

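I have a reduce method that selects a file based on the timestamp in each record.

The timestamps in the data can belong to N different dates (say N = 5). Based on the date, a file is selected, along with the MapFile writer for the corresponding path. There are N writers for the N paths.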

 Example: to write the record 15-02-2016,Key1,value1,
 the MapFile writer object that writes to basePath/15-02-2016 is selected,
 and Key1,value1 is written using that writer.

Here is the reduce method:

    @Override
    protected void reduce(CompositeKey key, Iterable<SomeDataWritable> dataList,
            Reducer<CompositeKey, SomeDataWritable, Text, OutputWritable>.Context context)
            throws IOException, InterruptedException {
        for (SomeDataWritable data : dataList) {
            // Written straight to the MapFile for this day; context.write() is never called
            MyMapFileWriter.write(key.getTimeStamp(), key.getId(), new OutputWritable(data));
        }
    }

    // Inside MyMapFileWriter
    static void write(long timestamp, Text key, OutputWritable value) throws IOException {
        MapFile.Writer writer = selectWriter(timestamp); // select the writer based on the timestamp
        writer.append(key, value);
    }
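The selectWriter step is not shown above. A minimal sketch of how it might work, assuming epoch-millisecond timestamps interpreted in UTC and one cached writer per day: the class name `DailyWriterSelector`, the `writerFactory` parameter and the `openWriterCount` method are hypothetical, and the writer type is left generic so the sketch runs without Hadoop.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: keeps one writer per calendar day (UTC is an assumption).
class DailyWriterSelector<W> {
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("dd-MM-yyyy").withZone(ZoneOffset.UTC);

    private final String basePath;
    // Creates a writer for a path; in a real job this would open a MapFile writer.
    private final Function<String, W> writerFactory;
    private final Map<String, W> writersByPath = new HashMap<>();

    DailyWriterSelector(String basePath, Function<String, W> writerFactory) {
        this.basePath = basePath;
        this.writerFactory = writerFactory;
    }

    // Maps an epoch-millisecond timestamp to basePath/dd-MM-yyyy.
    String pathFor(long timestampMillis) {
        return basePath + "/" + DAY.format(Instant.ofEpochMilli(timestampMillis));
    }

    // Returns the writer for the timestamp's day, creating it on first use.
    W selectWriter(long timestampMillis) {
        return writersByPath.computeIfAbsent(pathFor(timestampMillis), writerFactory);
    }

    // Number of distinct days seen so far (at most N).
    int openWriterCount() {
        return writersByPath.size();
    }
}
```

With real MapFile writers, every cached writer would also have to be closed once the reducer finishes, for example in the reducer's cleanup method.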

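The keys are sorted by (Day, id). The partitioner is based on Day and the GroupingComparator on (Day, id), so the calls to reduce should receive all records for a day, sorted by id. Is it OK to write directly to a file from reduce?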

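Keys written to a MapFile must be in ascending order. Can multiple parallel invocations of the reduce method (on the same reducer node) lead to out-of-order keys?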

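Even without any context.write in reduce, the job output path contains some output (I am running in local mode in Eclipse). This may be the mapper output being written by the reduce() of Hadoop's Reducer. How can I avoid this?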

Answer 1

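I don't think writing files directly through your own writer is a good idea, because it conflicts with Hadoop's approach to fault tolerance: you run your job, a node fails, and Hadoop tries to reschedule the work, but since you are not writing the files through Hadoop's standard mechanisms, it cannot do anything about the partially written results (you have to handle that yourself).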
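One possible way to handle partial results yourself, as a rough sketch only: write to a temporary file named after the task attempt and rename it into place once the write has succeeded, so a failed attempt never leaves a half-written file under the final name. The example uses the local filesystem through java.nio rather than HDFS, and the class name `AttemptScopedOutput`, the method `writeAndCommit` and the `_tmp_` prefix are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper illustrating "write to a temp file, then rename on success".
class AttemptScopedOutput {
    // Writes the lines to a temp file tagged with the attempt id, then moves it to its final name.
    static Path writeAndCommit(Path dir, String fileName, String attemptId, List<String> lines)
            throws IOException {
        Files.createDirectories(dir);
        Path tmp = dir.resolve("_tmp_" + attemptId + "_" + fileName);
        Files.write(tmp, lines, StandardCharsets.UTF_8);
        // Reached only if the write succeeded; a retried attempt replaces earlier output.
        return Files.move(tmp, dir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
    }
}
```

A failed attempt leaves only a `_tmp_` file behind, which a later cleanup step could delete.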

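Regarding the "out-of-order keys": I am not sure I understand your question, but a reducer processes the data for a key. For example, one reducer may process the data for key <2016-02-02, id1> while another processes the records for key <2016-02-01, id2>, and so on.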
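To make the ordering concern concrete: as far as I know, a MapFile writer rejects an appended key that is smaller than the previous one, while a single reduce task calls reduce() one key group at a time in sorted order, so appends made from within one task should arrive in order. The class below is a hypothetical stand-in that mimics only that ordering check with plain strings; it is not Hadoop's MapFile.Writer.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a sorted-file writer; only the ordering check is modelled.
class AscendingKeyWriter {
    private final List<String> keys = new ArrayList<>();
    private String lastKey; // null until the first append

    // Appends a key, rejecting any key that sorts before the previous one.
    void append(String key) throws IOException {
        if (lastKey != null && key.compareTo(lastKey) < 0) {
            throw new IOException("key out of order: " + key + " after " + lastKey);
        }
        keys.add(key);
        lastKey = key;
    }

    // Keys accepted so far, in append order.
    List<String> keys() {
        return keys;
    }
}
```

Appending id1, id2, id2 succeeds, whereas appending id1 afterwards throws. That is the failure to expect if two tasks, or two attempts of the same task, were to share one writer or one path.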

If I understood you correctly, you should specify the reduce output path in the configuration so that the input and output paths are different. In that case you will get the reducer-related files in OUTPUT_PATH.

Writing directly to a Hadoop MapFile from reduce
