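I currently have two Hadoop jobs, where the second job requires the output of the first to be added to the distributed cache. At the moment I run them manually: after the first job has finished, I pass its output file as an argument to the second job, and that job's driver adds it to the cache.
The first job is just a simple map-only job, and I was hoping I could run a single command that executes the two jobs in sequence.
Can anyone help me with the code to get the output of the first job into the distributed cache so that it can be passed to the second job?
Thanks
Edit: Here is the current driver for job 1: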
public class PlaceDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: PlaceMapper <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Place Mapper");
        job.setJarByClass(PlaceDriver.class);
        job.setMapperClass(PlaceMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Here is the driver for job 2. The output of job 1 is passed to job 2 as the first argument and loaded into the cache:
public class LocalityDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: LocalityDriver <cache> <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Job Name Here");
        DistributedCache.addCacheFile(new Path(otherArgs[0]).toUri(), job.getConfiguration());
        job.setNumReduceTasks(1); // TODO: Will change
        job.setJarByClass(LocalityDriver.class);
        job.setMapperClass(LocalityMapper.class);
        job.setCombinerClass(TopReducer.class);
        job.setReducerClass(TopReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job, new Path(otherArgs[1]));
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Answer 0 (score: 1)
Create two job objects in the same main. Have the first one wait for completion, then run the other.
public class DefaultTest extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // Use the configuration that ToolRunner populated from the generic options.
        Job job = new Job(getConf());
        job.setJobName("DefaultTest-blockx15");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setNumReduceTasks(15);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setJarByClass(DefaultTest.class);
        // Block until the first job finishes, and stop here if it failed.
        if (!job.waitForCompletion(true)) {
            return 1;
        }
        Job job2 = new Job(getConf());
        // Define your second job here, using the output of the previous job
        // (args[1]) as its input path or as a distributed cache file.
        return job2.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic Hadoop options before calling run().
        System.exit(ToolRunner.run(new Configuration(), new DefaultTest(), args));
    }
}
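One detail to watch when wiring the first job's output into the second: the output path is a directory, which normally holds files named like part-m-00000 alongside a _SUCCESS marker, so each part file has to be added to the cache individually. The sketch below illustrates only that selection step. It uses java.nio on a local directory as a stand-in (an assumption made so the example runs without a cluster); on HDFS the listing would go through Hadoop's FileSystem API instead. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PartFiles {
    /** Returns the URIs of the part-* files directly under outputDir, sorted. */
    static List<URI> listPartFiles(Path outputDir) throws IOException {
        List<URI> parts = new ArrayList<>();
        // The glob skips markers such as _SUCCESS and hidden checksum files.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p.toUri());
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Fake a finished job 1 output directory for demonstration purposes.
        Path dir = Files.createTempDirectory("job1-out");
        Files.createFile(dir.resolve("part-m-00000"));
        Files.createFile(dir.resolve("part-m-00001"));
        Files.createFile(dir.resolve("_SUCCESS"));
        for (URI uri : listPartFiles(dir)) {
            // In the real driver, each URI would be handed to
            // DistributedCache.addCacheFile(uri, job2.getConfiguration()).
            System.out.println(uri);
        }
    }
}
```

In the combined driver, this step would sit between the first job's waitForCompletion call and the configuration of the second job.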
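Answer 1 (score: 0)
A simple answer is to extract the code of the two main methods into two separate methods, e.g. boolean job1() and boolean job2(), and then call them in order from the main method, like this: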
public static void main(String[] args) throws Exception {
    // Run job 2 only if job 1 completed successfully.
    if (job1()) {
        job2();
    }
}
The return value of the job1 and job2 calls is the result of calling job.waitForCompletion(true).
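The same idea generalizes to any number of steps. The following is a minimal sketch, not Hadoop-specific: each step is a Callable<Boolean>, chosen because waitForCompletion throws checked exceptions, and the stand-in lambdas below only print a message. The class name JobChain and the method runInOrder are hypothetical; in a real driver each step would return job.waitForCompletion(true).

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;

class JobChain {
    /**
     * Runs the steps in order and stops at the first one that reports failure.
     * Returns 0 if every step succeeded, 1 otherwise.
     */
    static int runInOrder(List<Callable<Boolean>> steps) throws Exception {
        for (Callable<Boolean> step : steps) {
            if (!step.call()) {
                return 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for job1() and job2(); both simply report success.
        Callable<Boolean> job1 = () -> {
            System.out.println("running job 1");
            return true;
        };
        Callable<Boolean> job2 = () -> {
            System.out.println("running job 2");
            return true;
        };
        int status = runInOrder(Arrays.asList(job1, job2));
        System.out.println("exit status: " + status);
    }
}
```

Returning a status code rather than calling System.exit inside each step keeps the steps reusable; the driver's main can pass the final status to System.exit once.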
Answer 2 (score: 0)
Answer 3 (score: 0)
You can also use ChainMapper, JobControl, and ControlledJob to control your job flow.
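Configuration config = getConf();
Job j1 = new Job(config);
Job j2 = new Job(config);
Job j3 = new Job(config);
// Run j1 on its own first and wait for it to finish.
j1.waitForCompletion(true);
// Let JobControl schedule j2, then j3 once j2 has succeeded.
JobControl jobFlow = new JobControl("j2-then-j3");
ControlledJob cj2 = new ControlledJob(j2, null);
jobFlow.addJob(cj2);
// Lists.newArrayList comes from Guava; j3 is declared as depending on cj2.
jobFlow.addJob(new ControlledJob(j3, Lists.newArrayList(cj2)));
// run() keeps polling until stop() is called, so it is commonly started in its own thread.
jobFlow.run();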