Using Text as the map output key does not work, but parsing it to IntWritable works fine

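Date: 2016-11-28 05:48:46

Tags: hadoop mapreduce

I am a Hadoop beginner, and as a learning exercise I started implementing an outer join on two tables. One table holds details about movies and the other holds ratings.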

Sample data from the movies table

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller

Sample data from the ratings table

userId,movieId,rating,timestamp
1,122,2.0,945544824
1,172,1.0,945544871
1,1221,5.0,945544788
1,1441,4.0,945544871
1,1609,3.0,945544824
1,1961,3.0,945544871
1,1972,1.0,945544871
2,441,2.0,1008942733
2,494,2.0,1008942733
2,1193,4.0,1008942667
2,1597,3.0,1008942773
2,1608,3.0,1008942733
2,1641,4.0,1008942733

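movieId is the primary key in the movies table and a foreign key in the ratings table, so I use movieId as the key in the mapper classes. I used two mappers, one for the movies table and one for the ratings table.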
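The keying scheme described above can be sketched without Hadoop. The plain-Java class below is only an illustration: `ReduceSideJoinSketch`, `Group`, and the method names are hypothetical and not part of the posted job. It groups rows from both tables by movieId (column 0 of a movies row, column 1 of a ratings row) and then pairs each rating with its movie, emitting only matched pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReduceSideJoinSketch {
    /** All rows that share one movieId, separated by source table. */
    static final class Group {
        String movie;                                   // the movies row, if any
        final List<String> ratings = new ArrayList<>(); // zero or more ratings rows
    }

    // Stand-in for the map and shuffle phases: key every row by its movieId.
    static Map<String, Group> groupByMovieId(List<String> movies, List<String> ratings) {
        Map<String, Group> groups = new TreeMap<>();
        for (String row : movies) {
            String movieId = row.split(",")[0];   // movieId,title,genres
            groups.computeIfAbsent(movieId, k -> new Group()).movie = row;
        }
        for (String row : ratings) {
            String movieId = row.split(",")[1];   // userId,movieId,rating,timestamp
            groups.computeIfAbsent(movieId, k -> new Group()).ratings.add(row);
        }
        return groups;
    }

    // Stand-in for the reduce phase: one "rating<TAB>movie" line per matched pair.
    static List<String> join(List<String> movies, List<String> ratings) {
        List<String> out = new ArrayList<>();
        for (Group g : groupByMovieId(movies, ratings).values()) {
            if (g.movie == null) {
                continue; // ratings without a movies row are skipped
            }
            for (String r : g.ratings) {
                out.add(r + "\t" + g.movie);
            }
        }
        return out;
    }
}
```

Because unmatched rows are skipped, this sketch behaves like an inner join; a true outer join would also emit the unmatched side.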

The code I wrote

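import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Join {

// Movies file: emits (movieId, whole line); movieId is column 0.
public static class MovMapper
extends Mapper<Object, Text, Text, Text>{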
    private Text word = new Text();
    public void map(Object key, Text value, Context context
            ) throws IOException, InterruptedException {
        String[] arr= value.toString().split(",");
        word.set(arr[0]);
        //System.out.println(word.toString()+ " mov");
        context.write(word, value);
    }

}

// Ratings file: emits (movieId, whole line); movieId is column 1.
public static class RatMapper
extends Mapper<Object, Text, Text, Text>{
    private Text word = new Text();

    public void map(Object key, Text value, Context context
            ) throws IOException, InterruptedException {
        String[] arr= value.toString().split(",");
        word.set(arr[1]);
        //System.out.println(word.toString() + " rat");
        context.write(word, value);
    }

}

public static class JoinReducer
extends Reducer<Text,Text,Text,Text> {
    public void reduce(Text key, Iterable<Text> values,
            Context context
            ) throws IOException, InterruptedException {

        List <Text> rat=new ArrayList<Text>();
        Text mov= null;
        System.out.println("#######################################################################################");
        for(Text item:values){
            if(item.toString().split(",").length == 3){
                mov= new Text(item);
            }
            else {
                rat.add(new Text(item));
            }
            // Debug output: runs for every value, not only in the else branch
            System.out.println("---->" + item);
        }
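        // Print the key's actual bytes; getBytes() returns a backing array that is only valid up to getLength()
        System.out.println("item cnt: "+rat.size()+" mov"+mov+" key"+key+" byte: "+java.util.Arrays.toString(java.util.Arrays.copyOf(key.getBytes(), key.getLength())));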
        for(Text item:rat){
            if(mov != null) {
                context.write(item,mov);
            }
        }
        System.out.println("#######################################################################################");
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "join");
    job.setJarByClass(Join.class);
    job.setCombinerClass(JoinReducer.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path(args[0]),TextInputFormat.class,MovMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),TextInputFormat.class,RatMapper.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
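One detail of the driver that may be worth checking: `JoinReducer` is registered as the combiner as well as the reducer. A combiner is typically applied to the output of a single map task, and with `MultipleInputs` each map task reads only one of the two files, so join logic running as a combiner would see either movies rows or ratings rows for a key, not both. Whether this explains the behaviour described here is not confirmed. The plain-Java sketch below (hypothetical class name and rating rows, no Hadoop dependency) only illustrates what the same decision rule returns when it is given one table's rows at a time.

```java
import java.util.ArrayList;
import java.util.List;

final class CombinerEffectSketch {
    // Same rule as the posted JoinReducer: a value with exactly 3
    // comma-separated fields is treated as the movies row, anything
    // else as a ratings row. Pairs are emitted only if a movies row was seen.
    static List<String> joinValues(List<String> values) {
        String movie = null;
        List<String> ratings = new ArrayList<>();
        for (String v : values) {
            if (v.split(",").length == 3) {
                movie = v;
            } else {
                ratings.add(v);
            }
        }
        List<String> out = new ArrayList<>();
        if (movie != null) {
            for (String r : ratings) {
                out.add(r + "\t" + movie);
            }
        }
        return out;
    }
}
```

Given only a movies row, or only ratings rows, `joinValues` returns an empty list, so nothing would be passed on to the reduce phase for that key. If that is what is happening in the job, removing the `job.setCombinerClass(JoinReducer.class)` line would be the first thing to try.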

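Problem

During mapping, records from the movies table and the ratings table are mapped to different tasks even though the movieId is the same. Surprisingly, when I convert movieId to IntWritable, the records from both tables that match the key are mapped to the same task.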
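For reference, Hadoop's default `HashPartitioner` picks the reduce task from the key's hash code, so two equal keys are expected to reach the same reduce task no matter which mapper emitted them. The sketch below reproduces that arithmetic in plain Java. It uses `String` as a stand-in for `Text`, which is an assumption made to avoid a Hadoop dependency: `Text` derives its hash from its bytes, so the actual partition numbers in a job would differ from what this class computes. `PartitionSketch` is a hypothetical name.

```java
final class PartitionSketch {
    // Same arithmetic as the default HashPartitioner: clear the sign bit
    // of the hash code, then take it modulo the number of reduce tasks.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Under this rule, keys that print the same but differ in their bytes (for example because of trailing whitespace) are different keys and may be assigned to different partitions.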

0 answers:

No answers