Hadoop input split size vs block size

April 25, 2023 by Tarik Billa

The answer by @user1668782 is a great explanation for the question and I’ll try to give a graphical depiction of it.
Assume we have a 400MB file which consists of 4 records (e.g. a 400MB CSV file with 4 rows of 100MB each).

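If the HDFS block size is configured as 128MB, the 4 records will not be distributed evenly among the blocks. The file is stored as three 128MB blocks plus a final 16MB block, so the record boundaries (every 100MB) do not line up with the block boundaries.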

Block 1 contains the entire first record and a 28MB chunk of the second record.
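The layout can be checked with a little arithmetic. The following is only a sketch in plain Python, not Hadoop code; the function name `block_layout` is made up for illustration, and it assumes fixed-size records of exactly 100MB as in the example.

```python
MB = 1024 * 1024

def block_layout(file_size, block_size, record_size):
    """For each block, list (record_number, bytes of that record stored in the block)."""
    layout = []
    for start in range(0, file_size, block_size):
        end = min(start + block_size, file_size)
        first = start // record_size        # first record touching this block
        last = (end - 1) // record_size     # last record touching this block
        pieces = []
        for rec in range(first, last + 1):
            rec_start = rec * record_size
            rec_end = rec_start + record_size
            overlap = min(end, rec_end) - max(start, rec_start)
            pieces.append((rec + 1, overlap))  # records are numbered from 1
        layout.append(pieces)
    return layout

layout = block_layout(400 * MB, 128 * MB, 100 * MB)
for number, pieces in enumerate(layout, start=1):
    desc = ", ".join(f"{size // MB}MB of record {rec}" for rec, size in pieces)
    print(f"Block {number}: {desc}")
# Block 1: 100MB of record 1, 28MB of record 2
# Block 2: 72MB of record 2, 56MB of record 3
# Block 3: 44MB of record 3, 84MB of record 4
# Block 4: 16MB of record 4
```

Every block except the first starts in the middle of a record, which is why blocks alone are not a usable unit of work for a mapper.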
If a mapper were run on Block 1 alone, it could not process the second record, since the block holds only the first 28MB of it.
This is the exact problem that input splits solve. Input splits respect logical record boundaries.
Let's assume the input split size is 200MB.
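As a side note on where a 200MB split size could come from: as far as I know, `FileInputFormat` derives the split size from the block size and a configurable minimum and maximum (`mapreduce.input.fileinputformat.split.minsize` and `split.maxsize`), roughly as `max(minSize, min(maxSize, blockSize))`. The sketch below only mirrors that formula in plain Python; the default values used here (1 byte and the largest signed 64-bit integer) are my assumption about the defaults.

```python
MB = 1024 * 1024
LONG_MAX = 2**63 - 1  # assumed default for the maximum split size

def compute_split_size(block_size, min_size=1, max_size=LONG_MAX):
    """Mirror of the formula max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))

# With the assumed defaults, the split size equals the block size.
default_split = compute_split_size(128 * MB)
print(default_split // MB)  # 128

# Raising the minimum above the block size gives a larger split.
bigger_split = compute_split_size(128 * MB, min_size=200 * MB)
print(bigger_split // MB)  # 200
```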

Therefore input split 1 should contain both record 1 and record 2. Input split 2 will not start with record 2, since record 2 has been assigned to input split 1; it will start with record 3.
This is why an input split is only a logical chunk of data: it points to start and end locations within blocks.
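The assignment of records to splits can be sketched as follows. This is a simplified model in plain Python, not Hadoop's record reader: it assumes a record belongs to the split that contains its first byte, and that the reader of that split keeps reading past the split's end until the record is complete. The name `assign_records` is made up for illustration, and a real record reader may treat a record that starts exactly on a split boundary differently.

```python
MB = 1024 * 1024

def assign_records(file_size, split_size, record_starts):
    """Return, per split, the numbers of the records whose first byte lies in it."""
    splits = []
    for start in range(0, file_size, split_size):
        end = min(start + split_size, file_size)
        owned = [number for number, offset in enumerate(record_starts, start=1)
                 if start <= offset < end]
        splits.append(owned)
    return splits

record_starts = [0, 100 * MB, 200 * MB, 300 * MB]  # four 100MB records

by_200 = assign_records(400 * MB, 200 * MB, record_starts)
print(by_200)  # [[1, 2], [3, 4]]

# With a split size equal to the 128MB block size, record 2 still goes
# entirely to split 1, even though its last 72MB are stored in Block 2.
by_128 = assign_records(400 * MB, 128 * MB, record_starts)
print(by_128)  # [[1, 2], [3], [4], []]
```

In both cases every record is processed exactly once and by a single mapper, regardless of how many physical blocks it spans.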

Hope this helps.
