Question

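Hi everyone: I'm new to Hadoop and have recently been trying to implement an algorithm.

The algorithm needs to compute a matrix that holds the rating difference between every pair of songs. I have already done this; the output is a 600000 × 600000 sparse matrix stored in HDFS. Let's call this dataset A (size = 160 GB).

Now I need to read users' profiles to predict their ratings for specific songs. So I first need to read the user profiles (5 GB in size), which I'll call dataset B, and then do the computation using dataset A.

But I don't know how to read two datasets from a single Hadoop program. Or can I read dataset B into RAM and then do the computation? (I guess I can't, because HDFS is a distributed system and I can't read dataset B into a single machine's memory.)

Any suggestions?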

Answer 1

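Hadoop allows you to use different map input formats for different folders. So you can read from several data sources and then cast to a specific type in the Map function, i.e. in one case you get (String, User), in the other (String, SongSongRating), and your Map signature is (String, Object). The second step is to pick a recommendation algorithm and join these data in some way, so that the aggregator has enough information to compute the recommendation.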
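To make the idea above concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of two map functions that emit differently typed values under a shared key, followed by a grouping step and a reducer that dispatches on the value type. Everything named here is an assumption for illustration: the `UserRating` and `SongSongRating` classes, the comma-separated line formats, and the choice of the song id as the join key are not given in the question. The reducer only counts what it receives, since the actual recommendation formula is not specified.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TaggedInputsSketch {
    // Hypothetical value type for dataset B (user profiles).
    static final class UserRating {
        final String userId;
        final double rating;
        UserRating(String userId, double rating) { this.userId = userId; this.rating = rating; }
    }

    // Hypothetical value type for dataset A (song-to-song matrix entries).
    static final class SongSongRating {
        final String otherSongId;
        final double difference;
        SongSongRating(String otherSongId, double difference) {
            this.otherSongId = otherSongId;
            this.difference = difference;
        }
    }

    // Stands in for the mapper bound to dataset B. Assumed line format: userId,songId,rating
    static Map.Entry<String, Object> mapUserLine(String line) {
        String[] f = line.split(",");
        return new SimpleEntry<String, Object>(
                f[1].trim(), new UserRating(f[0].trim(), Double.parseDouble(f[2].trim())));
    }

    // Stands in for the mapper bound to dataset A. Assumed line format: songId,otherSongId,difference
    static Map.Entry<String, Object> mapMatrixLine(String line) {
        String[] f = line.split(",");
        return new SimpleEntry<String, Object>(
                f[0].trim(), new SongSongRating(f[1].trim(), Double.parseDouble(f[2].trim())));
    }

    // Stands in for the shuffle phase: collect all values that share a key.
    static Map<String, List<Object>> groupByKey(List<Map.Entry<String, Object>> pairs) {
        Map<String, List<Object>> grouped = new TreeMap<>();
        for (Map.Entry<String, Object> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Stands in for the reducer: it sees both kinds of record for one song id
    // and tells them apart by type. A real job would compute a prediction here.
    static String reduce(String songId, List<Object> values) {
        int users = 0;
        int entries = 0;
        for (Object v : values) {
            if (v instanceof UserRating) {
                users++;
            } else if (v instanceof SongSongRating) {
                entries++;
            }
        }
        return songId + ": " + users + " user ratings, " + entries + " matrix entries";
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Object>> pairs = new ArrayList<>();
        pairs.add(mapUserLine("u1,s1,4.0"));
        pairs.add(mapUserLine("u2,s1,2.5"));
        pairs.add(mapMatrixLine("s1,s2,0.5"));
        for (Map.Entry<String, List<Object>> e : groupByKey(pairs).entrySet()) {
            System.out.println(reduce(e.getKey(), e.getValue()));
        }
    }
}
```

In an actual Hadoop job the values would have to be `Writable` types rather than plain `Object`, so a tagged `Text` value or a custom wrapper class would likely take the place of the `instanceof` check.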

Answer 2

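You can use two Map functions, one per dataset, if each dataset needs different processing. You need to register your mappers with the job configuration. For example: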

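    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class FullOuterJoin {

        // Mapper for the first input: lines of the form "person_name,book_title".
        // Emits (book_title, "person_book#" + person_name).
        public static class FullOuterJoinStdDetMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            private static final String FILE_TAG = "person_book#";

            public void map(LongWritable key, Text values,
                            OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                String[] personDetail = values.toString().split(",");
                if (personDetail.length < 2) {
                    // No book title means no join key, so the record is skipped.
                    return;
                }
                String personName = personDetail[0].trim();
                if (personName.isEmpty()) {
                    personName = "student name missing";
                }
                String bookTitle = personDetail[1].trim();
                output.collect(new Text(bookTitle), new Text(FILE_TAG + personName));
            }
        }

        // Mapper for the second input: lines of the form "book_title,author_name".
        // Emits (book_title, "auth_book#" + author_name).
        public static class FullOuterJoinResultDetMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            private static final String FILE_TAG = "auth_book#";

            public void map(LongWritable key, Text values,
                            OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                String[] authorDetail = values.toString().split(",");
                if (authorDetail.length == 0 || authorDetail[0].trim().isEmpty()) {
                    // No book title means no join key, so the record is skipped.
                    return;
                }
                String bookTitle = authorDetail[0].trim();
                // Fall back to a placeholder when the author column is absent.
                String authorName = authorDetail.length > 1
                        ? authorDetail[1].trim()
                        : "Not appeared in exam";
                output.collect(new Text(bookTitle), new Text(FILE_TAG + authorName));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Strip generic Hadoop options before counting the remaining arguments.
            String[] argum = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (argum.length != 3) {
                System.err.println("Input/output file missing");
                System.exit(-1);
            }
            conf.set("mapred.textoutputformat.separator", ",");

            // Build the job from conf so the separator setting actually takes effect.
            JobConf mrjob = new JobConf(conf);
            mrjob.setJobName("Inner_Join");
            mrjob.setJarByClass(FullOuterJoin.class);

            // Each input path gets its own input format and its own mapper.
            MultipleInputs.addInputPath(mrjob, new Path(argum[0]),
                    TextInputFormat.class, FullOuterJoinStdDetMapper.class);
            MultipleInputs.addInputPath(mrjob, new Path(argum[1]),
                    TextInputFormat.class, FullOuterJoinResultDetMapper.class);

            FileOutputFormat.setOutputPath(mrjob, new Path(argum[2]));
            // FullOuterJoinReducer is not shown in this answer.
            mrjob.setReducerClass(FullOuterJoinReducer.class);

            mrjob.setOutputKeyClass(Text.class);
            mrjob.setOutputValueClass(Text.class);

            JobClient.runJob(mrjob);
        }
    }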
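The answer registers a `FullOuterJoinReducer` but does not show it. As a rough guess at what such a reducer might do, the sketch below implements the core join step in plain Java (no Hadoop dependency): for one book title it splits the incoming values by their `person_book#` and `auth_book#` tags and emits every person/author combination, filling in a placeholder when one side is empty. The class name, the `NULL` placeholder, and the comma-separated output layout are assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FullOuterJoinReduceSketch {
    static final String PERSON_TAG = "person_book#";
    static final String AUTHOR_TAG = "auth_book#";
    // Hypothetical placeholder for the side that has no matching record.
    static final String MISSING = "NULL";

    // Joins all tagged values that arrived for a single key (the book title).
    static List<String> reduce(String bookTitle, List<String> taggedValues) {
        List<String> persons = new ArrayList<>();
        List<String> authors = new ArrayList<>();
        for (String value : taggedValues) {
            if (value.startsWith(PERSON_TAG)) {
                persons.add(value.substring(PERSON_TAG.length()));
            } else if (value.startsWith(AUTHOR_TAG)) {
                authors.add(value.substring(AUTHOR_TAG.length()));
            }
        }
        // Full outer join: keep the key even when one side has no records.
        if (persons.isEmpty()) {
            persons.add(MISSING);
        }
        if (authors.isEmpty()) {
            authors.add(MISSING);
        }
        List<String> rows = new ArrayList<>();
        for (String person : persons) {
            for (String author : authors) {
                rows.add(bookTitle + "," + person + "," + author);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("person_book#Alice", "auth_book#Orwell");
        for (String row : reduce("1984", values)) {
            System.out.println(row);
        }
    }
}
```

In the real job this logic would sit inside a class implementing the old-API `Reducer<Text, Text, Text, Text>` interface, iterating over the values and writing rows through the `OutputCollector`.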

Any suggestions for reading two different datasets into Hadoop at the same time?
