All of the introductory tutorials and docs I can find on Hadoop use simple/contrived (word count-style) examples, each of which is submitted to MR by:
SSHing into the JobTracker node
Making sure that a JAR file containing the MR job is available on that node (it's the job's input data that lives on HDFS)
Running a command of the form bin/hadoop jar share/hadoop/mapreduce/my-map-reduce.jar <someArgs> that actually submits the job to Hadoop/MR
Either reading the MR result from the command line or opening a text file containing the result
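Concretely, the end-to-end session in those tutorials looks something like this (the hostname, paths, jar name, and main class here are only illustrative; they vary by Hadoop version and distribution):

ssh me@jobtracker.example.com
bin/hadoop fs -put input.txt /user/me/input          # stage the input data on HDFS
bin/hadoop jar share/hadoop/mapreduce/my-map-reduce.jar MyJob /user/me/input /user/me/output
bin/hadoop fs -cat /user/me/output/part-r-00000      # read the result off HDFS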
Although these examples are great for showing total newbies how to work with Hadoop, they don't show me how Java code actually integrates with Hadoop/MR at the API level. I guess I am sort of expecting that:
Hadoop exposes some kind of client access/API for submitting MR jobs to the cluster
Once the jobs are complete, some asynchronous mechanism (callback, listener, etc.) reports the result back to the client
So, something like this (Groovy pseudo-code):
class Driver {
    static void main(String[] args) {
        new Driver().run(args)
    }

    void run(String[] args) {
        // Build the job and hand it a callback to be invoked when the result is ready
        MapReduceJob myBigDataComputation = new SolveTheMeaningOfLifeJob(convertToHadoopInputs(args), new MapReduceCallback() {
            @Override
            void onResult() {
                // Now that you know the meaning of life, do nothing.
            }
        })

        // Point a client at the cluster and submit the job asynchronously
        HadoopClusterClient hadoopClient = new HadoopClusterClient("http://my-hadoop.example.com/jobtracker")
        hadoopClient.submit(myBigDataComputation)
    }
}
So I ask: surely the simple examples in all the introductory tutorials, where you SSH into nodes, run Hadoop from the CLI, and open text files to view the results... surely that can't be the way Big Data companies actually integrate with Hadoop. Surely something along the lines of my pseudo-code snippet above is used to kick off an MR job and fetch its results. What is it?
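For what it's worth, the closest thing I've found on my own is org.apache.hadoop.mapreduce.Job, which does let plain Java code configure and submit a job without SSHing anywhere. Below is a rough sketch of my understanding, using the stock word-count mapper/reducer as filler and assuming the cluster's client config files (core-site.xml, mapred-site.xml, etc.) are on the classpath; note that even here the client polls for completion rather than getting a callback:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {

    // Stock word-count mapper, included only so the sketch is self-contained
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Stock word-count reducer
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // new Configuration() picks up core-site.xml / mapred-site.xml / yarn-site.xml
        // from the classpath, so this can run from any client machine
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "word-count");
        job.setJarByClass(Driver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // submit() returns immediately; waitForCompletion(true) is the blocking
        // alternative. Either way there is no callback -- the client polls.
        job.submit();
        while (!job.isComplete()) {
            Thread.sleep(5000);
        }
        System.exit(job.isSuccessful() ? 0 : 1);
    }
}

But even with that, I'm still polling for completion and then reading part-r-00000 files off HDFS by hand afterwards, which feels much closer to the tutorial workflow than to the callback-style client API I sketched above.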