How does Hadoop perform input splits?

The InputFormat is responsible for providing the splits; for file-based input, FileInputFormat computes one split per HDFS block by default.

In general, HDFS splits a file into fixed-size blocks (128 MB by default in Hadoop 2.x and later, 64 MB in 1.x) and distributes those blocks across the cluster's nodes. When you start a job, Hadoop creates one mapper per input split, and by default a split corresponds to one block, so the number of mappers depends on the file size, not on the number of nodes. The scheduler then tries to run each mapper on a node that holds a replica of its block, so the data does not have to travel over the network. This is called data locality; rack awareness is the related replica-placement policy that spreads block copies across racks.
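A minimal sketch of the split-sizing rule described above, assuming the clamp formula that FileInputFormat uses (the `computeSplitSize` method mirrors Hadoop's; the standalone class, the 300 MB file, and the simple ceiling-division split count are illustrative assumptions, since the real implementation also allows a slightly oversized tail split):

```java
// Illustration of how FileInputFormat-style code sizes input splits.
public class SplitSizeDemo {
    // Mirrors FileInputFormat.computeSplitSize(blockSize, minSize, maxSize):
    // the split size is the HDFS block size, clamped between the configured
    // minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB HDFS block
        long minSize = 1L;                   // mapreduce.input.fileinputformat.split.minsize
        long maxSize = Long.MAX_VALUE;       // mapreduce.input.fileinputformat.split.maxsize

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        // Hypothetical 300 MB input file: one split per splitSize-sized
        // chunk, plus a tail split for the remainder (simplified).
        long fileLength = 300L * 1024 * 1024;
        long splits = (fileLength + splitSize - 1) / splitSize;

        System.out.println(splitSize + " " + splits); // 134217728 3
    }
}
```

With the defaults the split size equals the block size, so a 300 MB file yields three mappers; shrinking `maxSize` below the block size is the usual way to force more, smaller splits.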

So, to make a long story short: upload the data to HDFS and start an MR job. Hadoop takes care of the optimised execution.
