Where are logs in Spark on YARN?

You can access logs through the command:

    yarn logs -applicationId <application ID> [OPTIONS]

General options are:

-appOwner <Application Owner> – AppOwner (assumed to be current user if not specified)
-containerId <Container ID> – ContainerId (must be specified if node address is specified)
-nodeAddress <Node Address> – NodeAddress in the format nodename:port (must be specified if … Read more
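For example (the application and container IDs below are placeholders; substitute your own):

    # All logs for a finished application, written to a local file
    yarn logs -applicationId application_1473860344791_0001 > app.log

    # Logs for a single container on a specific node
    yarn logs -applicationId application_1473860344791_0001 \
        -containerId container_1473860344791_0001_01_000002 \
        -nodeAddress worker01:45454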

Apache Hadoop Yarn – Underutilization of cores

The problem lies not with yarn-site.xml or spark-defaults.conf but with the resource calculator that assigns cores to the executors or, in the case of MapReduce jobs, to the Mappers/Reducers. The default resource calculator, org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, uses only memory information when allocating containers; CPU scheduling is not enabled by default. To use both … Read more
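The excerpt is cut off, but assuming the CapacityScheduler is in use, the usual way to make YARN account for CPU as well as memory is to switch the calculator in capacity-scheduler.xml, along these lines:

    <!-- capacity-scheduler.xml: use a calculator that considers vcores too -->
    <property>
      <name>yarn.scheduler.capacity.resource-calculator</name>
      <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    </property>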

Apache Spark: The number of cores vs. the number of executors

To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set … Read more
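The excerpt stops before the final numbers, but a commonly cited outcome of this sizing exercise looks something like the spark-submit invocation below (illustrative only, not taken from the truncated text; my-app.jar is a placeholder):

    # 6 nodes x (16 cores, 64 GB); leave ~1 core and ~1 GB per node for
    # OS/Hadoop daemons, and one executor slot for the ApplicationMaster.
    spark-submit \
      --master yarn \
      --num-executors 17 \
      --executor-cores 5 \
      --executor-memory 19G \
      my-app.jar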

How many mappers and reducers will get created for a partitioned table in Hive?

Mappers: The number of mappers depends on various factors such as how the data is distributed among nodes, the input format, the execution engine and configuration parameters. See also here: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

MR uses CombineInputFormat, while Tez uses grouped splits.

Tez:

    set tez.grouping.min-size=16777216;   -- 16 MB min split
    set tez.grouping.max-size=1073741824; -- 1 GB max split

MapReduce:

    set mapreduce.input.fileinputformat.split.minsize=16777216; -- … Read more
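As a rough sketch of how these bounds play out in practice (the table name and data sizes are hypothetical):

    -- Force larger grouped splits on Tez to cut the mapper count
    set tez.grouping.min-size=268435456;   -- 256 MB
    set tez.grouping.max-size=1073741824;  -- 1 GB
    select count(*) from sales_partitioned where dt = '2016-01-01';
    -- If this partition holds ~10 GB, expect on the order of 10-40 mappers,
    -- since each grouped split falls between 256 MB and 1 GB.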

How to update table in Hive 0.13?

You can use row_number or a full join. Here is an example using row_number:

    insert overwrite table table_1
    select customer_id, items, price, updated_date
    from
    (
        select customer_id, items, price, updated_date,
               row_number() over (partition by customer_id order by new_flag desc) rn
        from
        (
            select customer_id, items, price, updated_date, 0 as new_flag
            from table_1
            union all
            select customer_id, items, price, updated_date, … Read more
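The excerpt is truncated, but a minimal self-contained sketch of the same pattern might look like this, assuming a staging table new_data holding the replacement rows (both table names are illustrative):

    -- Rows from new_data get new_flag = 1, so they sort first and win.
    insert overwrite table table_1
    select customer_id, items, price, updated_date
    from (
        select customer_id, items, price, updated_date,
               row_number() over (partition by customer_id
                                  order by new_flag desc) rn
        from (
            select customer_id, items, price, updated_date, 0 as new_flag
            from table_1
            union all
            select customer_id, items, price, updated_date, 1 as new_flag
            from new_data
        ) unioned
    ) ranked
    where rn = 1;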

Is it better to have one large parquet file or lots of smaller parquet files?

Aim for around 1 GB per file (Spark partition) (1). Ideally, you would use snappy compression (the default), since snappy-compressed Parquet files are splittable (2). Using snappy instead of gzip will significantly increase the file size, so if storage space is an issue, that needs to be considered. .option("compression", "gzip") is the option to override … Read more
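A minimal sketch in Spark (Scala), assuming a DataFrame df and a made-up output path; pick a repartition count that yields roughly 1 GB per written file:

    // e.g. ~100 partitions for ~100 GB of data => ~1 GB per Parquet file
    df.repartition(100)
      .write
      .option("compression", "snappy") // the default; splittable with Parquet
      .parquet("/data/output/my_table")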

Hadoop speculative task execution

One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect since the tasks still … Read more
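Speculative execution is typically enabled by default; assuming MR2 property names, it can be toggled per task type in mapred-site.xml (or per job) along these lines:

    <property>
      <name>mapreduce.map.speculative</name>
      <value>true</value>
    </property>
    <property>
      <name>mapreduce.reduce.speculative</name>
      <value>true</value>
    </property>

(The Spark-side equivalent, if relevant here, is the spark.speculation setting.)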