I have tried a lot but cannot understand why my mapper's record output is 0. I want my mapper to read more than one line at a time, because I am working on big data and need multiple lines together, so first I tried with a small file (graph.txt) containing -
1,2,4,6
2,10,3,7
3,6,5,8
4,7,7,9
5,13,9,9
But since the mapper processes the file line by line and there is no other way around that, for the first (n-1) calls of the map() method I only store the values from the file, and then do the processing in the last map() call. For every line in the file, I store its data in the row array. In the last map() call, the output is emitted through the output.collect() call. I also use the setup() method to count the number of lines in the file, since setup() is called once per mapper. Here, because the input file is small, only 1 mapper is launched.
I have been stuck on this for a while; I am new to this, so please suggest a solution. Thanks in advance. Here is the code.
Driver code -
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class primdriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(primdriver.class);
        conf.setJobName("primdriver");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(primmapper.class);
        //conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(primreducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
Mapper code -
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.Mapper.Context;
public class primmapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, Text> {
    //private final static IntWritable one = new IntWritable(1);
    //private Text word = new Text();
    private int no_line = 0;
    private int i = 0;

    public void setup(Context context) throws IOException {
        Path pt = new Path("hdfs:/myinput/graph.txt"); // location of file in HDFS
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
        String line;
        line = br.readLine();
        while (line != null) {
            no_line = no_line + 1;
            line = br.readLine();
        }
    }

    private String[][] row = new String[no_line][4];

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        if (i < no_line - 1) {
            String[] s = value.toString().split(",");
            for (int j = 0; j < s.length; j++) {
                row[i][j] = (s[j]);
            }
            i = i + 1;
        } else {
            String[] s = value.toString().split(",");
            for (int j = 0; j < s.length; j++) {
                //row[i][j]=Integer.parseInt(s[j]);
            }
            for (int i = 0; i < no_line - 1; i++) {
                String a = row[i][0];
                String b = row[i][1] + "," + row[i][2] + "," + row[i][3];
                output.collect(new Text(a), new Text(b));
            }
        }
    }
}
Reducer code -
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class primreducer extends MapReduceBase implements
        Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        int a = 0, b = 0, c = 0;
        output.collect(new Text("kishan "), new Text("sharma"));
        while (values.hasNext()) {
            String val[] = (values.next().toString()).split(",");
            a = Integer.parseInt(val[0]);
            b = Integer.parseInt(val[1]);
            c = Integer.parseInt(val[2]);
        }
        output.collect(key, new Text(a + "," + b + "," + c));
    }
}
In the console I get this log -
[training@localhost workspace]$ hadoop jar hierarchical.jar primdriver
myinput/graph.txt cluster5
17/04/07 10:21:18 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
17/04/07 10:21:18 WARN snappy.LoadSnappy: Snappy native library is available
17/04/07 10:21:18 INFO snappy.LoadSnappy: Snappy native library loaded
17/04/07 10:21:18 INFO mapred.FileInputFormat: Total input paths to process : 1
17/04/07 10:21:18 INFO mapred.JobClient: Running job: job_201704070816_0007
17/04/07 10:21:19 INFO mapred.JobClient: map 0% reduce 0%
17/04/07 10:22:21 INFO mapred.JobClient: map 100% reduce 0%
17/04/07 10:22:29 INFO mapred.JobClient: map 100% reduce 66%
17/04/07 10:22:53 INFO mapred.JobClient: map 100% reduce 100%
17/04/07 10:23:22 INFO mapred.JobClient: Job complete: job_201704070816_0007
17/04/07 10:23:22 INFO mapred.JobClient: Counters: 33
17/04/07 10:23:22 INFO mapred.JobClient: File System Counters
17/04/07 10:23:22 INFO mapred.JobClient: FILE: Number of bytes read=6
17/04/07 10:23:22 INFO mapred.JobClient: FILE: Number of bytes written=361924
17/04/07 10:23:22 INFO mapred.JobClient: FILE: Number of read operations=0
17/04/07 10:23:22 INFO mapred.JobClient: FILE: Number of large read operations=0
17/04/07 10:23:22 INFO mapred.JobClient: FILE: Number of write operations=0
17/04/07 10:23:22 INFO mapred.JobClient: HDFS: Number of bytes read=146
17/04/07 10:23:22 INFO mapred.JobClient: HDFS: Number of bytes written=0
17/04/07 10:23:22 INFO mapred.JobClient: HDFS: Number of read operations=3
17/04/07 10:23:22 INFO mapred.JobClient: HDFS: Number of large read operations=0
17/04/07 10:23:22 INFO mapred.JobClient: HDFS: Number of write operations=2
17/04/07 10:23:22 INFO mapred.JobClient: Job Counters
17/04/07 10:23:22 INFO mapred.JobClient: Launched map tasks=1
17/04/07 10:23:22 INFO mapred.JobClient: Launched reduce tasks=1
17/04/07 10:23:22 INFO mapred.JobClient: Data-local map tasks=1
17/04/07 10:23:22 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=90240
17/04/07 10:23:22 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=31777
17/04/07 10:23:22 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
17/04/07 10:23:22 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
17/04/07 10:23:22 INFO mapred.JobClient: Map-Reduce Framework
17/04/07 10:23:22 INFO mapred.JobClient: Map input records=5
17/04/07 10:23:22 INFO mapred.JobClient: Map output records=0
17/04/07 10:23:22 INFO mapred.JobClient: Map output bytes=0
17/04/07 10:23:22 INFO mapred.JobClient: Input split bytes=104
17/04/07 10:23:22 INFO mapred.JobClient: Combine input records=0
17/04/07 10:23:22 INFO mapred.JobClient: Combine output records=0
17/04/07 10:23:22 INFO mapred.JobClient: Reduce input groups=0
17/04/07 10:23:22 INFO mapred.JobClient: Reduce shuffle bytes=6
17/04/07 10:23:22 INFO mapred.JobClient: Reduce input records=0
17/04/07 10:23:22 INFO mapred.JobClient: Reduce output records=0
17/04/07 10:23:22 INFO mapred.JobClient: Spilled Records=0
17/04/07 10:23:22 INFO mapred.JobClient: CPU time spent (ms)=1240
17/04/07 10:23:22 INFO mapred.JobClient: Physical memory (bytes) snapshot=196472832
17/04/07 10:23:22 INFO mapred.JobClient: Virtual memory (bytes) snapshot=775897088
17/04/07 10:23:22 INFO mapred.JobClient: Total committed heap usage (bytes)=177016832
17/04/07 10:23:22 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
17/04/07 10:23:22 INFO mapred.JobClient: BYTES_READ=42
Answer 0 (score: 0):
You don't collect anything when i is smaller than no_line-1. In your case that condition always holds, which is why you see no map output records.

By the time you start processing the first record, no_line has already been initialized to its final value (the actual number of lines in the input file "hdfs:/myinput/graph.txt"). At that point i is 0. Then, whenever the if condition is met, i becomes 1 in this particular mapper (not in all mappers).* After that, i is 1 (in this mapper), and it keeps staying smaller than no_line - 1. (With exactly the 5 lines shown above, no_line would be 5, the records with i = 0..3 would only be buffered, and the fifth record would reach the else branch and emit output.) So your file graph.txt appears to have more than 5 lines (I guess).
To sum up, setup() is executed once per mapper, before that mapper executes map().
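For reference, that is the lifecycle the new org.apache.hadoop.mapreduce API provides. A minimal sketch of it (the class name and output are illustrative, not taken from your job; only the file path is the one from your question):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: in the new API the framework itself calls setup() once per
// map task, before the first map() call, so counting here is safe.
public class LineCountMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int noLine = 0;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Path pt = new Path("hdfs:/myinput/graph.txt"); // path from the question
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)))) {
            while (br.readLine() != null) {
                noLine++; // count every line of the side file
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // noLine already holds its final value here, because setup() has run.
        context.write(value, new Text(String.valueOf(noLine)));
    }
}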
I don't know what you are trying to do; that part is quite hard to follow. If you need more help, try to make it clearer and update your question with more details. Also, reusing the variable name i inside the else statement is very confusing, because it is not clear whether you really mean the local i or the shadowed outer i. Doesn't your IDE give you a warning?
* This is a very bad practice, because you cannot know which values i will take in each mapper; that depends on how the data is partitioned.
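Also, if all you need is to emit output only after the last line has been seen, a different pattern avoids the i / no_line bookkeeping entirely: buffer each record in map() and flush everything in cleanup(), which the new mapreduce API calls once after the last map() call of a task. A minimal sketch, assuming the same comma-separated input (the class name is made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: buffer every record in map() and emit once in cleanup(),
// so no line counting or last-record detection is needed at all.
public class BufferingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String[]> rows = new ArrayList<String[]>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // Just remember the parsed line; emit nothing yet.
        rows.add(value.toString().split(","));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once after the last map() call of this task, so every
        // buffered line of this split is available now.
        for (String[] row : rows) {
            if (row.length == 4) {
                context.write(new Text(row[0]),
                        new Text(row[1] + "," + row[2] + "," + row[3]));
            }
        }
    }
}

Keep in mind that this only sees the lines of one mapper's input split, not the whole file; once the input is large enough to be split, each task buffers only its own share.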