How does Hadoop actually accept MR jobs and input data?

Date: 2015-06-25 18:57:11

Tags: java hadoop

All of the introductory tutorials and docs I can find on Hadoop use simple/contrived (word-count-style) examples, each of which is submitted to MR by:

- SSHing into the JobTracker node
- Making sure that a JAR file containing the MR job is on HDFS
- Running a command of the form `bin/hadoop jar share/hadoop/mapreduce/my-map-reduce.jar <someArgs>` that actually runs Hadoop/MR
- Either reading the MR result from the command line or opening a text file containing the result

Although these examples are great for showing total newbies how to work with Hadoop, they don't show me how Java code actually integrates with Hadoop/MR at the API level. I am sort of expecting that:

- Hadoop exposes some kind of client access/API for submitting MR jobs to the cluster
- Once the jobs are complete, some asynchronous mechanism (callback, listener, etc.) reports the result back to the client

So, something like this (Groovy pseudo-code):

```groovy
class Driver {
    static void main(String[] args) {
        new Driver().run(args)
    }

    void run(String[] args) {
        MapReduceJob myBigDataComputation = new SolveTheMeaningOfLifeJob(
            convertToHadoopInputs(args),
            new MapReduceCallback() {
                @Override
                void onResult() {
                    // Now that you know the meaning of life, do nothing.
                }
            })

        HadoopClusterClient hadoopClient = new HadoopClusterClient("http://my-hadoop.example.com/jobtracker")
        hadoopClient.submit(myBigDataComputation)
    }
}
```

So I ask: surely the simple examples in all the introductory tutorials, where you SSH into nodes, run Hadoop from the CLI, and open text files to view the results, can't be the way Big Data companies actually integrate with Hadoop. Surely something along the lines of my pseudo-code snippet above is used to kick off an MR job and fetch its results. What is it?
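(For context: Hadoop's client API is in fact fairly close to this pseudo-code. A minimal sketch of a driver using `org.apache.hadoop.mapreduce.Job` is below; the host name, paths, and the `MyMapper`/`MyReducer` classes are placeholders, and this will only run against a real cluster with the hadoop-client libraries on the classpath. Note there is no built-in callback mechanism: `Job.submit()` returns immediately and you poll for completion, or call `waitForCompletion()` to block.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster; without this, settings are read
        // from *-site.xml files on the classpath. Host name is a placeholder.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        Job job = Job.getInstance(conf, "solve-the-meaning-of-life");
        job.setJarByClass(Driver.class);
        job.setMapperClass(MyMapper.class);    // hypothetical mapper class
        job.setReducerClass(MyReducer.class);  // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.submit();                // returns immediately (asynchronous)
        while (!job.isComplete()) {  // no callback API; poll for completion
            Thread.sleep(5000);
        }
        System.exit(job.isSuccessful() ? 0 : 1);
    }
}
```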

1 个答案:

答案 0 :(得分:1)

In short, MR jobs can be scheduled with the Oozie scheduler. But before that, you write a map-reduce job. It has a driver class, which is the starting point of the job. In the Driver class you provide all the information the job needs to run: the map input, the mapper class, any partitioner, configuration details, and the reducer details.
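To make the "mapper class" and "reducer details" concrete, here is a sketch of a minimal word-count-style mapper/reducer pair that such a driver would reference (the class names `MyMapper`/`MyReducer` are illustrative, not from the original answer; this compiles only with the hadoop-client libraries available):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits one (word, 1) pair per whitespace-separated token.
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the counts emitted for each word.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```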

Once these are packaged in a jar file and you launch the job from the CLI with hadoop jar (which is effectively what Oozie does under the hood), the rest is handled by the Hadoop ecosystem. Hope I answered your question.
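Since the answer mentions Oozie: with Oozie you describe the job in a workflow XML file and Oozie submits it for you, which removes the manual SSH/CLI step the question objects to. A rough sketch of a map-reduce action is below; the class names and paths are placeholders, and it uses old-API (`mapred.*`) property names, which is Oozie's default convention:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="my-mr-workflow">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.example.MyMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>com.example.MyReducer</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/me/input</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/me/output</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>MR job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>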