Question

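I'm currently using Hadoop to run a counting task over a large dataset (about 3 GB). I need to count the records that fall into the same time slot, for example the number of records timestamped between 3 am and 4 am. The output needs to be continuous: if a slot has no records, I still want it to appear in the output, e.g. [3am~4am, 0 records].

To achieve this, my idea was to put every time slot into the map output with a value of 0 before the map task starts. I tried Googling, but I couldn't find a solution.

So is there any way to output something before the map task starts? I'd also appreciate any other ideas for achieving this. Thanks.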

Answer 1

The solution is to emit the zero values in the same pass as the counting task, which saves time.
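As a rough illustration of the principle, outside Hadoop: if every slot is seeded with 0 and each record adds 1 to its own slot, the summed result contains all slots, including the empty ones. The sketch below is plain Java, not Hadoop code. The class name `SlotCountSketch`, the `slotLabel` format, and the record hours in `main` are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SlotCountSketch {

    // Hypothetical label format for a one-hour slot, e.g. "03:00-04:00".
    static String slotLabel(int hour) {
        return String.format("%02d:00-%02d:00", hour, (hour + 1) % 24);
    }

    // Seeds all 24 slots with 0, then adds 1 for each record's hour (0-23).
    static Map<String, Integer> countPerSlot(List<Integer> recordHours) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int hour = 0; hour < 24; hour++) {
            counts.put(slotLabel(hour), 0);
        }
        for (int hour : recordHours) {
            counts.merge(slotLabel(hour), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative record hours only; real input would come from the dataset.
        Map<String, Integer> counts = countPerSlot(Arrays.asList(14, 15, 15, 18));
        counts.forEach((slot, n) -> System.out.println("[" + slot + ", " + n + " records]"));
    }
}
```

In a MapReduce job the seeding and the summing happen in different places (the zeros and ones are emitted by the mapper and added up by the reducer), but the arithmetic is the same.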

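Suppose you have the following file/table:

Date, Time, Product, Value
  2016, 14:00, Samsung, 100
  2016, 15:30, LG, 130
  2016, 15:59, Nexus, 50
  2016, 18:10, LG, 15

You want to group by product and find the total for each product, but at the same time you want to count the records timestamped between 3 am and 4 am.

Just define a custom key in the mapper class and write it to the context, depending on your condition: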

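import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Mapper_WordsCount extends Mapper<LongWritable, Text, Text, IntWritable> {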

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {


        String line = value.toString();
        String[] items = line.split(",");


        //define a custom key
        String myCustomKey = "3am-4am";

        // emit 0 by default; only needed if you want a result like [3am-4am, 0 records]
        context.write(new Text(myCustomKey), new IntWritable(0));

        // check your condition
        if ( isBetween_3am_and_4am( items[1] ) ) {    // write this function yourself
            //count the record like you want
            context.write(new Text(myCustomKey),one);
        }

        // ... the rest of your Java code ...

    }
}
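The mapper above leaves `isBetween_3am_and_4am` for you to write. One possible version is sketched below, assuming the time field is a 24-hour string such as `03:30` (two-digit hour, as `LocalTime.parse` expects by default). The wrapper class `TimeSlotCheck` is hypothetical and only there to make the snippet runnable on its own. Unparseable values are treated as "not in the slot", which is a design choice rather than a requirement.

```java
import java.time.LocalTime;
import java.time.format.DateTimeParseException;

public class TimeSlotCheck {

    // Assumed helper: true when the time falls in [03:00, 04:00).
    static boolean isBetween_3am_and_4am(String timeField) {
        if (timeField == null) {
            return false;
        }
        try {
            // trim() handles the space left after splitting on ","
            LocalTime t = LocalTime.parse(timeField.trim());
            return t.getHour() == 3;
        } catch (DateTimeParseException e) {
            // Malformed time: skip it instead of failing the whole task.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isBetween_3am_and_4am("03:30"));
        System.out.println(isBetween_3am_and_4am(" 15:59"));
    }
}
```

To use it in the mapper, the method can be copied into `Mapper_WordsCount` as a private method. If your timestamps use a different format, pass a matching `DateTimeFormatter` to `LocalTime.parse`.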
