Calling a MapReduce job from a simple Java program

Oh, please don’t do it with runJar; the Java API is very good.

See how you can start a job from normal code:

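import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// create a configuration
Configuration conf = new Configuration();
// create a new job based on the configuration
// (on Hadoop 2+, Job.getInstance(conf) replaces this deprecated constructor)
Job job = new Job(conf);
// here you have to put your mapper class (MyMapper is a placeholder name)
job.setMapperClass(MyMapper.class);
// here you have to put your reducer class (MyReducer is a placeholder name)
job.setReducerClass(MyReducer.class);
// here you have to set the jar which contains your
// map/reduce classes, so you can use the mapper class
job.setJarByClass(MyMapper.class);
// key/value types of your reducer output (Text is only an example)
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// this sets the format of your input, can also be TextInputFormat
job.setInputFormatClass(SequenceFileInputFormat.class);
// same for the output
job.setOutputFormatClass(TextOutputFormat.class);
// here you can set the path of your input
SequenceFileInputFormat.addInputPath(job, new Path("files/toMap/"));
// this deletes a possibly existing output path to prevent job failures
FileSystem fs = FileSystem.get(conf);
Path out = new Path("files/out/processed/");
fs.delete(out, true);
// finally set the now empty output path
TextOutputFormat.setOutputPath(job, out);

// this waits until the job completes and prints debug output to STDOUT or whatever
// has been configured in your log4j properties.
job.waitForCompletion(true);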

If you are using an external cluster, you have to add the following settings to your configuration:

// this should match the value defined in your mapred-site.xml
conf.set("mapred.job.tracker", "");
// same as defined in hdfs-site.xml
conf.set("", "hdfs://");

This should be no problem as long as the hadoop-core.jar is on your application container's classpath.
But I think you should add some kind of progress indicator to your web page, because a Hadoop job may take minutes to hours to complete 😉
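To make the progress indicator idea a bit more concrete, here is a minimal sketch that uses only the JDK. It runs a long task in a background thread and exposes a percentage that a status endpoint could return to the page. The class `JobProgressSketch` and the interface `LongRunningTask` are hypothetical names invented for this illustration. With Hadoop, the task body would presumably call `job.submit()` instead of `job.waitForCompletion(true)` and then poll `job.mapProgress()`, `job.reduceProgress()` and `job.isComplete()`, but that part is left out so the example stays runnable without a cluster.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: runs a long task in the background and exposes its progress. */
final class JobProgressSketch {

    /** Stand-in for the real job; the task reports its own progress in percent. */
    interface LongRunningTask {
        void run(AtomicInteger percentDone) throws Exception;
    }

    private final AtomicInteger percentDone = new AtomicInteger(0);
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private Future<?> future;

    /** Starts the task without blocking the caller (for example the request thread). */
    void start(LongRunningTask task) {
        future = executor.submit(() -> {
            task.run(percentDone);
            percentDone.set(100); // mark completion once the task returns normally
            return null;
        });
        executor.shutdown(); // accept no further tasks; the worker ends with the task
    }

    /** Value a status endpoint could return to the web page, from 0 to 100. */
    int percentDone() {
        return percentDone.get();
    }

    boolean isDone() {
        return future != null && future.isDone();
    }

    /** Blocks until the task has finished or the timeout expires. */
    boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return executor.awaitTermination(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        JobProgressSketch tracker = new JobProgressSketch();
        // Simulated job: advances in steps of 25 percent.
        tracker.start(progress -> {
            for (int p = 0; p <= 100; p += 25) {
                progress.set(p);
                Thread.sleep(20);
            }
        });
        // Simulated web page polling the status.
        while (!tracker.isDone()) {
            System.out.println("progress: " + tracker.percentDone() + "%");
            Thread.sleep(10);
        }
        System.out.println("finished at " + tracker.percentDone() + "%");
    }
}
```

The point of the design is that the request that starts the job returns immediately, and the page only ever reads a cheap, thread-safe counter.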

For YARN (Hadoop 2 and later)

For YARN, the following configurations need to be set.

// this should match the value defined in your yarn-site.xml
conf.set("yarn.resourcemanager.address", "");

// the framework is now "yarn"; it should be defined like this in mapred-site.xml
conf.set("", "yarn");

// same as defined in hdfs-site.xml
conf.set("", "hdfs://");
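As a rough illustration of the settings above, the following JDK-only sketch collects them in a plain map that can later be copied into the `Configuration`. Several things here are assumptions and not taken from the snippets: the class `ClusterSettingsSketch`, the host names and ports, and the property names `mapreduce.framework.name`, `fs.default.name` (older releases) and `fs.defaultFS` (Hadoop 2 and later, usually set in core-site.xml). Check them against your own cluster's site files before relying on them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that collects cluster settings before they go into a Configuration. */
final class ClusterSettingsSketch {

    /** Settings for a classic cluster that uses a JobTracker. */
    static Map<String, String> classic(String jobTracker, String nameNodeUri) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("mapred.job.tracker", jobTracker);
        // assumption: default file system key on older releases
        settings.put("fs.default.name", nameNodeUri);
        return settings;
    }

    /** Settings for a YARN cluster. */
    static Map<String, String> yarn(String resourceManager, String nameNodeUri) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("yarn.resourcemanager.address", resourceManager);
        // assumption: property that selects the execution framework
        settings.put("mapreduce.framework.name", "yarn");
        // assumption: default file system key on Hadoop 2 and later
        settings.put("fs.defaultFS", nameNodeUri);
        return settings;
    }

    public static void main(String[] args) {
        // Host names and ports are placeholders, not values from a real cluster.
        Map<String, String> settings =
                yarn("rm.example.com:8032", "hdfs://namenode.example.com:8020");
        settings.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

With Hadoop on the classpath, such a map could then be applied in one line with `settings.forEach(conf::set)`.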
