Merging multiple files into one within Hadoop

To keep everything on the grid, use Hadoop streaming with a single reducer and cat as both the mapper and the reducer (essentially a no-op), adding compression through MapReduce flags:

```sh
hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \
  -Dmapred.job.queue.name=$QUEUE \
  -input "$INPUT" \
  -output "$OUTPUT" \
  -mapper cat \
  -reducer cat
```

If you want compression … Read more
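As a sketch of where those compression flags would go (the flag names below follow the classic mapred.* API used above, and GzipCodec is just one possible codec):

```sh
# Same merge job, with compressed output (flag names assume the old MRv1 API)
hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \
  -Dmapred.job.queue.name=$QUEUE \
  -Dmapred.output.compress=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -input "$INPUT" \
  -output "$OUTPUT" \
  -mapper cat \
  -reducer cat
```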

Behavior of the parameter “mapred.min.split.size” in HDFS

The split size is calculated by the formula:

split size = max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))

In your case (values in MB):

split size = max(128, min(Long.MAX_VALUE (default), 64)) = 128

So, for the inferences above:

- "each map will process 2 HDFS blocks (assuming each block is 64 MB)": True
- "my input file (already stored in HDFS) will be re-divided to occupy 128 MB blocks in HDFS": False, but making … Read more
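A quick way to sanity-check that arithmetic (illustrative only; the values are the ones from this question, in MB, and Long.MAX_VALUE stands in for the unset mapred.max.split.size):

```sh
min_split=128                  # mapred.min.split.size (MB)
max_split=9223372036854775807  # mapred.max.split.size default (Long.MAX_VALUE)
block=64                       # dfs.block.size (MB)

# split size = max(min_split, min(max_split, block))
inner=$(( block < max_split ? block : max_split ))
split=$(( min_split > inner ? min_split : inner ))
echo "$split"                  # prints 128, so each split spans two 64 MB blocks
```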

Datanode process not running in Hadoop

You need to do something like this:

```sh
bin/stop-all.sh                 # or stop-dfs.sh and stop-yarn.sh in the 2.x series
rm -Rf /app/tmp/hadoop-your-username/*
bin/hadoop namenode -format     # or bin/hdfs namenode -format in the 2.x series
```

The solution was taken from http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-troubleshooting/. Basically it consists of restarting from scratch, so make sure you won't lose data by formatting the HDFS.
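After reformatting, you would restart HDFS and confirm the DataNode actually came up; a minimal sketch (script locations assume a 2.x-style layout, so adjust for your install):

```sh
sbin/start-dfs.sh               # bin/start-all.sh on the 1.x series
jps | grep -i datanode          # the DataNode JVM should now be listed
bin/hdfs dfsadmin -report       # live datanodes should be greater than 0
```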