Parallelizing MapReduce

Time: 2017-05-01 10:49:39

Tags: java multithreading parallel-processing mapreduce distributed-computing

I am new to parallel programming and Hadoop MapReduce. The following example comes from this tutorial site:

https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm

How can this MapReduce job be parallelized (applying parallel programming) so that the Mapper and Reducer run together, and how would I introduce multithreading?

Is it possible to run the Mapper on one machine and the Reducer on another?

Apologies if I haven't explained this well.

 package hadoop; 

 import java.util.*; 

 import java.io.IOException; 

 import org.apache.hadoop.fs.Path; 
 import org.apache.hadoop.conf.*; 
 import org.apache.hadoop.io.*; 
 import org.apache.hadoop.mapred.*; 
 import org.apache.hadoop.util.*; 

 public class ProcessUnits 
 { 
   //Mapper class 
   public static class E_EMapper extends MapReduceBase implements 
   Mapper<LongWritable ,/*Input key Type */ 
   Text,                /*Input value Type*/ 
   Text,                /*Output key Type*/ 
   IntWritable>        /*Output value Type*/ 
   { 

      //Map function: emits (year, value of the last tab-separated field) 
      public void map(LongWritable key, Text value, 
      OutputCollector<Text, IntWritable> output,   
      Reporter reporter) throws IOException 
      { 
         String line = value.toString(); 
         String lasttoken = null; 
         StringTokenizer s = new StringTokenizer(line,"\t"); 
         String year = s.nextToken(); 

         while(s.hasMoreTokens())
            {
               lasttoken=s.nextToken();
            } 

         int avgprice = Integer.parseInt(lasttoken); 
         output.collect(new Text(year), new IntWritable(avgprice)); 
      } 
   } 


   //Reducer class 
   public static class E_EReduce extends MapReduceBase implements 
   Reducer< Text, IntWritable, Text, IntWritable > 
   {  

      //Reduce function: emits every value that exceeds the maxavg threshold 
      public void reduce( Text key, Iterator<IntWritable> values, 
         OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException 
         { 
            int maxavg=30;   //threshold; values at or below it are dropped 
            int val=Integer.MIN_VALUE; 

            while (values.hasNext()) 
            { 
               if((val=values.next().get())>maxavg) 
               { 
                  output.collect(key, new IntWritable(val)); 
               } 
            } 

         } 
   }  


   //Main function 
   public static void main(String args[])throws Exception 
   { 
      JobConf conf = new JobConf(ProcessUnits.class); 

      conf.setJobName("max_eletricityunits"); 
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class); 
      conf.setMapperClass(E_EMapper.class); 
      conf.setCombinerClass(E_EReduce.class); 
      conf.setReducerClass(E_EReduce.class); 
      conf.setInputFormat(TextInputFormat.class); 
      conf.setOutputFormat(TextOutputFormat.class); 

      FileInputFormat.setInputPaths(conf, new Path(args[0])); 
      FileOutputFormat.setOutputPath(conf, new Path(args[1])); 

      JobClient.runJob(conf); 
   } 
} 

1 Answer:

Answer 0 (score: 1)

Hadoop handles the parallelism for you; you shouldn't have to do anything beyond running `hadoop jar`.
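That said, if you want to influence the degree of parallelism, the old `org.apache.hadoop.mapred` API used in the question exposes hints on `JobConf`. A minimal sketch (a fragment to place in the question's `main()`; actual task counts depend on input splits and cluster configuration):

```java
JobConf conf = new JobConf(ProcessUnits.class);
conf.setNumMapTasks(4);      // hint only; the real map count follows the input splits
conf.setNumReduceTasks(2);   // number of parallel reduce tasks for this job
```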

In general with MapReduce, keep in mind that the map phase and the reduce phase happen sequentially (not in parallel), because reduce depends on the output of map. However, you can run multiple mappers in parallel, and, once they finish, multiple reducers in parallel (depending on the task, of course). Again, Hadoop takes care of launching and coordinating all of this for you.
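To see why this works, the same two-phase pattern can be sketched in plain Java (this is an illustrative analogue, not Hadoop code; `MiniMapReduce` and its methods are hypothetical names). Map tasks run in parallel on an `ExecutorService`, and the reduce step starts only after every map task has finished, mirroring the sequential map-then-reduce ordering the answer describes:

```java
import java.util.*;
import java.util.concurrent.*;

// Plain-Java analogue of the map/reduce phases: mappers run in parallel,
// the reduce waits for all map results, then aggregates per key.
public class MiniMapReduce {

    // "Map": parse one tab-separated line into (year, last field), like E_EMapper.
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split("\t");
        return Map.entry(parts[0], Integer.parseInt(parts[parts.length - 1]));
    }

    // Run the map phase in parallel, then "reduce" by keeping the max per year.
    public static Map<String, Integer> run(List<String> lines) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Map.Entry<String, Integer>>> futures = new ArrayList<>();
        for (String line : lines) {
            futures.add(pool.submit(() -> map(line)));   // map tasks run concurrently
        }
        Map<String, Integer> result = new HashMap<>();
        for (Future<Map.Entry<String, Integer>> f : futures) {
            Map.Entry<String, Integer> e = f.get();      // reduce waits for every map
            result.merge(e.getKey(), e.getValue(), Math::max);
        }
        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = List.of("1979\t23\t45", "1980\t26\t50", "1979\t40\t39");
        System.out.println(run(lines));
    }
}
```

Hadoop does the same thing at cluster scale: the `Future.get()` barrier here corresponds to the shuffle phase waiting for all map tasks before reducers consume their input.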

[diagram: mapreduce phases]