Question

I wrote a MapReduce program to analyze a dataset of users in the following form:

UserID::Gender::Age::MoviesRated::Zip Code
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117

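I want to:

Find the top 10 zip codes based on the average age of the users belonging to each zip code, in descending order of average age. Top 10 means the 10 zip codes whose users have the youngest average age.

I have a MapClass, a CombinerClass, and a ReducerClass.

My code is as follows: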

public class TopTenYoungestAverageAgeRaters extends Configured implements Tool {
    private static TreeSet<AverageAge> top10 = new TreeSet<AverageAge>();

    public static class MapClass extends Mapper<LongWritable, Text, Text, AverageAge>
    {

        public boolean isNumeric(String value) // Checks if record is valid
        {
            try
            {
                Integer.parseInt(value);
                return true;
            }
            catch(NumberFormatException e)
            {
                return false;
            }
        }

        public AverageAge toCustomWritable(String[] line)
        {
            AverageAge record = new AverageAge(new IntWritable(Integer.parseInt(line[0])), new IntWritable(Integer.parseInt(line[2])), new Text(line[1]), new IntWritable(Integer.parseInt(line[3])), new Text(line[4]));
            return record;
        }

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
        {
            String line = value.toString();
            String[] values = line.split("::");
            if(isNumeric(values[0]))
            {
                AverageAge customTuple = toCustomWritable(values);
                context.write(new Text(values[4]), customTuple);
            }

        }
    }

    public static class CombinerClass extends Reducer<Text, AverageAge, Text, AverageAge>
    {
        public void reduce(Text key, Iterable<AverageAge> values, Context context) throws IOException, InterruptedException
        {
            AverageAge newRecord = new AverageAge();
            long age = 0;
            int count = 0;
            for(AverageAge value:values)
            {
                age += value.getUserAge();
                count += 1;
            }
            newRecord.setZipCode(key.toString());
            newRecord.setAverageAge((double)(age/count));
            context.write(key, newRecord);
        }
    }


    public static class ReducerClass extends Reducer<Text, AverageAge, NullWritable, AverageAge>
    {

        public void reduce(Text key, Iterable<AverageAge> values, Context context) throws IOException, InterruptedException
        {

            for(AverageAge value:values)
            {
                top10.add(value);
                if(top10.size() > 10)
                    top10.remove(top10.last());
            }
        }

        protected void cleanup(Context context) throws IOException, InterruptedException
        {
            for(AverageAge avg: top10)
            {
                context.write(NullWritable.get(), avg);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // TODO Auto-generated method stub
        int res = ToolRunner.run(new Configuration(), new TopTenYoungestAverageAgeRaters(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        // TODO Auto-generated method stub
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setMapperClass(MapClass.class);
        job.setCombinerClass(CombinerClass.class);
        job.setReducerClass(ReducerClass.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(AverageAge.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(AverageAge.class);

        FileInputFormat.addInputPath(job, new Path(arg0[0]));
        FileOutputFormat.setOutputPath(job, new Path(arg0[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

}

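MapClass writes its output with the zip code as the key and an AverageAge (a custom Writable class) as the value.

CombinerClass computes the average age of the users belonging to that zip code and writes the zip code as the key and an AverageAge as the value.

ReducerClass gives (or rather, is supposed to give) the top 10 zip codes with their average user age, but I get only one record as output.

I also tried calling System.out.println() in the reducer class to see the values passed to ReducerClass, but nothing is printed on the console (I am running the program locally in the Eclipse environment).

I am new to MapReduce and cannot figure out the mistake in this program.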

Dataset Source

Answer 1

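The problem statement seems contradictory: the top ten by descending average age would be the 10 oldest, not the 10 youngest. It would be best to get some clarification there.

In any case, there are many, many things wrong here.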

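There is no guarantee that the combiner will be used.
If you have more than one reducer task, you will get up to 10 outputs in each of several different files.
As written, the "top 10" you get will be the 10 lowest zip codes (in lexicographic order).
By cleanup() time you would normally no longer be writing records.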

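What you want is to use the shuffle to bring records with the same zip code together, and to use the aggregation classes (Combiner and Reducer) to compute the average. The "top 10" requirement cannot be evaluated until you have the average age for every zip code. The key point is that, to compute an average in a distributed way, you must never lose the denominator until the final reduce. Combiners across your fleet may each receive records with the same key, so no single combiner is guaranteed to see all of a zip code's records.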

The Mapper takes a record and produces a triple:

k::g::a::m::z |=> z |-> ( 1, a )
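The parsing half of that mapper step can be sketched in plain Java, without any Hadoop types. `RecordParser` and `zipAndAge` are hypothetical names, not part of the question's code. The sketch assumes the five-field layout shown in the question (age in the third field, zip code in the fifth). A real mapper would then emit ( 1, age ) under the returned zip code.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: extracts the fields the mapper needs from one line. */
final class RecordParser {
    /** Parses one input line into zip -> age; empty if the line is malformed. */
    static Optional<Map.Entry<String, Integer>> zipAndAge(String line) {
        String[] fields = line.split("::");
        if (fields.length != 5) {
            return Optional.empty(); // wrong number of fields
        }
        try {
            int age = Integer.parseInt(fields[2].trim());
            return Optional.of(new SimpleImmutableEntry<>(fields[4].trim(), age));
        } catch (NumberFormatException e) {
            return Optional.empty(); // e.g. the header line, where the age is not a number
        }
    }
}
```

Validating the age field rather than the user ID means a line is skipped exactly when the value that feeds the average cannot be used.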

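The Combiner takes a collection of triples with the same key and merges them into one average (summing the denominators). Each ai may already be an average over di records, so it is weighted by di:

z |-> [ ( d1, a1 ), ..., ( dn, an ) ] |=> z |-> ( sum( di ), sum( di * ai ) / sum( di ) )

The Reducer takes a collection of triples with the same key and averages them the same way, throwing away the denominator:

z |-> [ ( d1, a1 ), ..., ( dn, an ) ] |=> z |-> sum( di * ai ) / sum( di )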
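The merge step shared by the combiner and the reducer can be sketched in plain Java to check the arithmetic. `PartialAverage`, `ofAge` and `merge` are hypothetical names; in a real job the same two fields would presumably live in a custom Writable such as AverageAge. The sketch weights each partial average by its denominator, which matters as soon as an input value is itself the average of several records.

```java
import java.util.List;

/** Hypothetical value type: a partial average together with its denominator. */
final class PartialAverage {
    final long count;      // d: number of user records folded into this value
    final double average;  // a: average age over those records

    PartialAverage(long count, double average) {
        this.count = count;
        this.average = average;
    }

    /** What the mapper would emit for one user: ( 1, age ). */
    static PartialAverage ofAge(int age) {
        return new PartialAverage(1, age);
    }

    /**
     * Merge step: ( sum(di), sum(di * ai) / sum(di) ).
     * Assumes a non-empty list, as a reduce call always has at least one value.
     */
    static PartialAverage merge(List<PartialAverage> parts) {
        long totalCount = 0;
        double weightedSum = 0.0;
        for (PartialAverage p : parts) {
            totalCount += p.count;
            weightedSum += p.count * p.average;
        }
        return new PartialAverage(totalCount, weightedSum / totalCount);
    }
}
```

Because the denominator travels with the value, merging ( 2, 28.5 ) with ( 1, 25 ) gives the same result as averaging the three ages 1, 56 and 25 directly, so it does not matter whether a combiner ran in between.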

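Your algorithm should work whether or not you supply a combiner; a combiner is an optimization that applies only to some map-reduce cases.

To limit the output to the top ten, you now need to re-sort the results by average age.

That means another mapper: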

z |-> avg |=> avg |-> z

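And a reducer that outputs only the first 10 results (left as an exercise for the reader). Also, use only one reduce task, or you will get 10x results, where x is the number of reduce tasks.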
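What this second pass computes can be sketched in plain Java. In a real job the shuffle would do the sorting on the avg key and a single reducer would emit the first 10 values it sees. `TopZipCodes` and `youngest` are hypothetical names. The sketch assumes "top 10" means the 10 lowest averages, which the question leaves ambiguous. Breaking ties by zip code is an extra assumption made only to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: selects the n zip codes with the lowest average age. */
final class TopZipCodes {
    /** Returns up to n zip codes, youngest average first. */
    static List<String> youngest(Map<String, Double> averageByZip, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(averageByZip.entrySet());

        // Equivalent of re-keying by average: order by avg, then by zip for ties.
        Comparator<Map.Entry<String, Double>> byAverageThenZip =
                Comparator.comparingDouble((Map.Entry<String, Double> e) -> e.getValue())
                          .thenComparing(e -> e.getKey());
        entries.sort(byAverageThenZip);

        // Keep only the first n entries, as the single reducer would.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries.subList(0, Math.min(n, entries.size()))) {
            result.add(e.getKey());
        }
        return result;
    }
}
```

For the 10 oldest instead, reversing the comparator with `byAverageThenZip.reversed()` would be enough.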
