Behavior of the parameter “mapred.min.split.size” in HDFS

The split size is calculated by the formula:

    splitSize = max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))

In your case that works out to:

    splitSize = max(128MB, min(Long.MAX_VALUE (the default), 64MB)) = 128MB

So, for the inferences above:

- Each map will process 2 HDFS blocks (assuming each block is 64MB): True
- There will be a new division of my input file (already stored in HDFS) into 128MB blocks in HDFS: False, but making …
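To make the arithmetic concrete, here is a minimal sketch of the same formula in Java, plus the new-API way of setting the minimum split size on a job. The sizes and the class name are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        // Mirrors FileInputFormat's rule: max(minSize, min(maxSize, blockSize))
        static long computeSplitSize(long minSize, long maxSize, long blockSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) throws Exception {
            long mb = 1024L * 1024L;
            // 128MB min split, default max, 64MB block -> 128MB splits
            System.out.println(
                computeSplitSize(128 * mb, Long.MAX_VALUE, 64 * mb) / mb); // prints 128

            // New-API equivalent of setting mapred.min.split.size on a job
            Job job = Job.getInstance(new Configuration());
            FileInputFormat.setMinInputSplitSize(job, 128 * mb);
        }
    }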

Store images/videos into Hadoop HDFS

It is absolutely possible without doing anything extra. Hadoop gives us the ability to read and write binary files, so practically anything that can be converted into bytes can be stored in HDFS (images, videos, etc.). For this, Hadoop provides something called a SequenceFile. A SequenceFile is a flat file consisting of binary key/value pairs. The SequenceFile provides …
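As a rough sketch of what that looks like in practice, the snippet below packs one local image into a SequenceFile on HDFS, keyed by file name. It assumes a Hadoop 2.x client, and both paths are hypothetical:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ImageToSequenceFile {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Hypothetical local image file
            byte[] bytes = Files.readAllBytes(Paths.get("photo.jpg"));

            // Key = file name, value = raw image bytes
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/data/images.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                writer.append(new Text("photo.jpg"), new BytesWritable(bytes));
            }
        }
    }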

Difference between HBase and Hadoop/HDFS

Hadoop is basically three things: a file system (the Hadoop Distributed File System), a computation framework (MapReduce), and a management layer (YARN, Yet Another Resource Negotiator). HDFS lets you store huge amounts of data in a distributed manner (providing faster read/write access) and redundantly (providing better availability). And MapReduce lets you process this huge data in a …
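As a tiny illustration of the storage side, the sketch below writes a file through the FileSystem API; HDFS splits it into blocks and replicates each one behind the scenes. The path is made up, and the replication factor of 3 is just the usual dfs.replication default:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // HDFS chunks the file into blocks and replicates each block
            // (dfs.replication, typically 3) across datanodes.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
                out.writeUTF("stored redundantly across datanodes");
            }
        }
    }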

Default number of reducers

How Many Reduces? (from the official documentation) The right number of reduces seems to be 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node). With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish …
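A small sketch of applying that rule when configuring a job; the node and container counts here are invented for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCount {
        public static void main(String[] args) throws Exception {
            // Hypothetical cluster: 10 nodes, 8 max containers per node
            int nodes = 10;
            int containersPerNode = 8;

            // 0.95 factor: all reduces can launch in a single wave
            int reduces = (int) (0.95 * nodes * containersPerNode); // 76

            Job job = Job.getInstance(new Configuration());
            job.setNumReduceTasks(reduces);
        }
    }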

How to list all files in a directory and its subdirectories in Hadoop HDFS

If you are using the Hadoop 2.* API there are more elegant solutions:

    Configuration conf = getConf();
    Job job = Job.getInstance(conf);
    FileSystem fs = FileSystem.get(conf);

    // the second boolean parameter here sets the recursion to true
    RemoteIterator<LocatedFileStatus> fileStatusListIterator = fs.listFiles(
            new Path("path/to/lib"), true);
    while (fileStatusListIterator.hasNext()) {
        LocatedFileStatus fileStatus = fileStatusListIterator.next();
        // do stuff with the file like ...
        job.addFileToClassPath(fileStatus.getPath());
    }

Hadoop: …be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

This error is raised by the block replication system of HDFS when it cannot manage to place any replica of a specific block of the file being written. Common reasons:

- Only a NameNode instance is running, and it is not in safe mode
- There are no DataNode instances up and running, or some are dead …
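When debugging this, it can help to ask the cluster directly whether the NameNode is in safe mode and which DataNodes are live. A rough sketch using the Hadoop 2.x client API, assuming fs.defaultFS in the configuration points at the cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
    import org.apache.hadoop.hdfs.protocol.HdfsConstants;

    public class ClusterHealthCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;

                // Is the NameNode in safe mode? Writes fail while it is.
                boolean safe = dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_GET);
                System.out.println("Safe mode: " + safe);

                // Which DataNodes are registered, and do they have space left?
                for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                    System.out.println(dn.getHostName()
                        + " remainingBytes=" + dn.getRemaining());
                }
            }
        }
    }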