Behavior of the parameter “mapred.min.split.size” in HDFS

The split size is calculated by the formula:

    splitSize = max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))

In your case that works out to:

    splitSize = max(128MB, min(Long.MAX_VALUE (the default), 64MB)) = 128MB

So, for the inferences above:

- Each map will process 2 HDFS blocks (assuming each block is 64MB): True
- There will be a new division of my input file (already stored in HDFS) into 128MB blocks in HDFS: False, but making …
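To make the arithmetic concrete, here is a minimal sketch of the same formula in Java, plus the new-API way of setting the minimum split size on a job. The sizes and the class name are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        // Mirrors FileInputFormat's rule: max(minSize, min(maxSize, blockSize))
        static long computeSplitSize(long minSize, long maxSize, long blockSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) throws Exception {
            long mb = 1024L * 1024L;
            // 128MB min split, default max, 64MB block -> 128MB splits
            System.out.println(
                computeSplitSize(128 * mb, Long.MAX_VALUE, 64 * mb) / mb); // prints 128

            // New-API equivalent of setting mapred.min.split.size on a job
            Job job = Job.getInstance(new Configuration());
            FileInputFormat.setMinInputSplitSize(job, 128 * mb);
        }
    }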

Store images/videos into Hadoop HDFS

It is absolutely possible without doing anything extra. Hadoop gives us the ability to read and write binary files, so practically anything that can be converted into bytes can be stored in HDFS (images, videos, etc.). For this, Hadoop provides something called a SequenceFile. A SequenceFile is a flat file consisting of binary key/value pairs. The SequenceFile provides …
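As a rough sketch of what that looks like in practice, the snippet below packs one local image into a SequenceFile on HDFS, keyed by file name. It assumes a Hadoop 2.x client, and both paths are hypothetical:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ImageToSequenceFile {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Hypothetical local image file
            byte[] bytes = Files.readAllBytes(Paths.get("photo.jpg"));

            // Key = file name, value = raw image bytes
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/data/images.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                writer.append(new Text("photo.jpg"), new BytesWritable(bytes));
            }
        }
    }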

Difference between HBase and Hadoop/HDFS

Hadoop is basically three things: a file system (the Hadoop Distributed File System), a computation framework (MapReduce), and a management layer (YARN, Yet Another Resource Negotiator). HDFS lets you store huge amounts of data in a distributed manner (providing faster read/write access) and redundantly (providing better availability). And MapReduce lets you process this huge data in a …
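As a tiny illustration of the storage side, the sketch below writes a file through the FileSystem API; HDFS splits it into blocks and replicates each one behind the scenes. The path is made up, and the replication factor of 3 is just the usual dfs.replication default:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // HDFS chunks the file into blocks and replicates each block
            // (dfs.replication, typically 3) across datanodes.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
                out.writeUTF("stored redundantly across datanodes");
            }
        }
    }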

Default number of reducers

How Many Reduces? (from the official documentation) The right number of reduces seems to be 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node). With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish …
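A small sketch of applying that rule when configuring a job; the node and container counts here are invented for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCount {
        public static void main(String[] args) throws Exception {
            // Hypothetical cluster: 10 nodes, 8 max containers per node
            int nodes = 10;
            int containersPerNode = 8;

            // 0.95 factor: all reduces can launch in a single wave
            int reduces = (int) (0.95 * nodes * containersPerNode); // 76

            Job job = Job.getInstance(new Configuration());
            job.setNumReduceTasks(reduces);
        }
    }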

How to list all files in a directory and its subdirectories in Hadoop HDFS

If you are using the Hadoop 2.* API there are more elegant solutions:

    Configuration conf = getConf();
    Job job = Job.getInstance(conf);
    FileSystem fs = FileSystem.get(conf);

    // the second boolean parameter here sets the recursion to true
    RemoteIterator<LocatedFileStatus> fileStatusListIterator = fs.listFiles(
            new Path("path/to/lib"), true);
    while (fileStatusListIterator.hasNext()) {
        LocatedFileStatus fileStatus = fileStatusListIterator.next();
        // do stuff with the file like ...
        job.addFileToClassPath(fileStatus.getPath());
    }

Hadoop: …be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

This error is raised by the block replication system of HDFS when it cannot manage to place any replica of a specific block of the file being written. Common reasons:

- Only a NameNode instance is running, and it is not in safe mode
- There are no DataNode instances up and running, or some are dead …
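When debugging this, it can help to ask the cluster directly whether the NameNode is in safe mode and which DataNodes are live. A rough sketch using the Hadoop 2.x client API, assuming fs.defaultFS in the configuration points at the cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
    import org.apache.hadoop.hdfs.protocol.HdfsConstants;

    public class ClusterHealthCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;

                // Is the NameNode in safe mode? Writes fail while it is.
                boolean safe = dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_GET);
                System.out.println("Safe mode: " + safe);

                // Which DataNodes are registered, and do they have space left?
                for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                    System.out.println(dn.getHostName()
                        + " remainingBytes=" + dn.getRemaining());
                }
            }
        }
    }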