Merging multiple files into one within Hadoop

In order to keep everything on the grid use hadoop streaming with a single reducer and cat as the mapper and reducer (basically a noop) – add compression using MR flags.

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \<br>
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat

If you want compression add
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

Leave a Comment