Where are logs in Spark on YARN?

You can access logs through the command:

    yarn logs -applicationId <application ID> [OPTIONS]

General options are:

-appOwner <Application Owner> – AppOwner (assumed to be current user if not specified)
-containerId <Container ID> – ContainerId (must be specified if node address is specified)
-nodeAddress <Node Address> – NodeAddress in the format nodename:port (must be specified if … Read more
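For example (the application and container IDs below are placeholders; substitute your own):

    # All logs for a finished application, written to a local file
    yarn logs -applicationId application_1473860344791_0001 > app.log

    # Logs for a single container on a specific node
    yarn logs -applicationId application_1473860344791_0001 \
        -containerId container_1473860344791_0001_01_000002 \
        -nodeAddress worker01:45454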

Apache Hadoop Yarn – Underutilization of cores

The problem lies not with yarn-site.xml or spark-defaults.conf but with the resource calculator that assigns cores to the executors or, in the case of MapReduce jobs, to the Mappers/Reducers. The default resource calculator, org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, uses only memory information when allocating containers; CPU scheduling is not enabled by default. To use both … Read more
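The excerpt is cut off, but assuming the CapacityScheduler is in use, the usual way to make YARN account for CPU as well as memory is to switch the calculator in capacity-scheduler.xml, along these lines:

    <!-- capacity-scheduler.xml: use a calculator that considers vcores too -->
    <property>
      <name>yarn.scheduler.capacity.resource-calculator</name>
      <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    </property>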

Apache Spark: The number of cores vs. the number of executors

To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set … Read more
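The excerpt stops before the final numbers, but a commonly cited outcome of this sizing exercise looks something like the spark-submit invocation below (illustrative only, not taken from the truncated text; my-app.jar is a placeholder):

    # 6 nodes x (16 cores, 64 GB); leave ~1 core and ~1 GB per node for
    # OS/Hadoop daemons, and one executor slot for the ApplicationMaster.
    spark-submit \
      --master yarn \
      --num-executors 17 \
      --executor-cores 5 \
      --executor-memory 19G \
      my-app.jar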

How many mappers and reducers will get created for a partitioned table in Hive?

Mappers: The number of mappers depends on various factors such as how the data is distributed among nodes, the input format, the execution engine and configuration parameters. See also here: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

MR uses CombineInputFormat, while Tez uses grouped splits.

Tez:

    set tez.grouping.min-size=16777216;   -- 16 MB min split
    set tez.grouping.max-size=1073741824; -- 1 GB max split

MapReduce:

    set mapreduce.input.fileinputformat.split.minsize=16777216; -- … Read more
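As a rough sketch of how these bounds play out in practice (the table name and data sizes are hypothetical):

    -- Force larger grouped splits on Tez to cut the mapper count
    set tez.grouping.min-size=268435456;   -- 256 MB
    set tez.grouping.max-size=1073741824;  -- 1 GB
    select count(*) from sales_partitioned where dt = '2016-01-01';
    -- If this partition holds ~10 GB, expect on the order of 10-40 mappers,
    -- since each grouped split falls between 256 MB and 1 GB.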

How to update table in Hive 0.13?

You can use row_number or a full join. Here is an example using row_number:

    insert overwrite table table_1
    select customer_id, items, price, updated_date
    from
    (
        select customer_id, items, price, updated_date,
               row_number() over (partition by customer_id order by new_flag desc) rn
        from
        (
            select customer_id, items, price, updated_date, 0 as new_flag
            from table_1
            union all
            select customer_id, items, price, updated_date, … Read more
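The excerpt is truncated, but a minimal self-contained sketch of the same pattern might look like this, assuming a staging table new_data holding the replacement rows (both table names are illustrative):

    -- Rows from new_data get new_flag = 1, so they sort first and win.
    insert overwrite table table_1
    select customer_id, items, price, updated_date
    from (
        select customer_id, items, price, updated_date,
               row_number() over (partition by customer_id
                                  order by new_flag desc) rn
        from (
            select customer_id, items, price, updated_date, 0 as new_flag
            from table_1
            union all
            select customer_id, items, price, updated_date, 1 as new_flag
            from new_data
        ) unioned
    ) ranked
    where rn = 1;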

Is it better to have one large parquet file or lots of smaller parquet files?

Aim for around 1 GB per file (Spark partition) (1). Ideally, you would use snappy compression (the default), since snappy-compressed Parquet files are splittable (2). Using snappy instead of gzip will significantly increase the file size, so if storage space is an issue, that needs to be considered. .option("compression", "gzip") is the option to override … Read more
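A minimal sketch in Spark (Scala), assuming a DataFrame df and a made-up output path; pick a repartition count that yields roughly 1 GB per written file:

    // e.g. ~100 partitions for ~100 GB of data => ~1 GB per Parquet file
    df.repartition(100)
      .write
      .option("compression", "snappy") // the default; splittable with Parquet
      .parquet("/data/output/my_table")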

Hadoop speculative task execution

One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect since the tasks still … Read more
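Speculative execution is typically enabled by default; assuming MR2 property names, it can be toggled per task type in mapred-site.xml (or per job) along these lines:

    <property>
      <name>mapreduce.map.speculative</name>
      <value>true</value>
    </property>
    <property>
      <name>mapreduce.reduce.speculative</name>
      <value>true</value>
    </property>

(The Spark-side equivalent, if relevant here, is the spark.speculation setting.)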