mapreduce - w3toppers.com

Reduce a key-value pair into a key-list pair with Apache Spark

Map and ReduceByKey: The input type and output type of reduce must be the same, so if you want to aggregate a list, you first have to map the input to lists. Afterwards, you combine the lists into one list. Combining lists: You’ll need a method to combine lists into one list. Python provides some methods to …
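The map-then-reduce idea can be sketched in plain Python, without a Spark cluster. This only imitates the pattern and is not Spark's API: `reduce_by_key` is a hypothetical helper written for this sketch, and the sample pairs are made up.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter


def reduce_by_key(pairs, func):
    """Hypothetical stand-in for a reduce-by-key step: fold all values of a key with func."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort brings equal keys together
    return [
        (key, reduce(func, (value for _, value in group)))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


# Made-up input: plain (key, value) pairs.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]

# Step 1 (map): wrap every value in a one-element list, so the reduce
# function receives lists and returns a list (same input and output type).
as_lists = [(key, [value]) for key, value in pairs]

# Step 2 (reduce): combine two lists into one by concatenation.
result = reduce_by_key(as_lists, lambda left, right: left + right)
print(result)
```

List concatenation is just one way to combine two lists; any function that takes two lists and returns a list would fit the same slot.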

hadoop map reduce secondary sorting

I find it easy to understand certain concepts with the help of diagrams, and this is certainly one of them. Let’s assume that our secondary sorting is on a composite key made out of Last Name and First Name. With the composite key out of the way, let’s now look at the secondary sorting mechanism. The …
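As a rough illustration of the idea, the sketch below imitates secondary sorting in plain Python: records are ordered by the full composite key (last name, then first name) but grouped by last name only. It is not Hadoop code; the partitioner and comparator classes a real job needs are not shown, and the sample names are made up.

```python
from itertools import groupby

# Made-up records: (last_name, first_name).
records = [
    ("Smith", "John"),
    ("Doe", "Jane"),
    ("Smith", "Alice"),
    ("Doe", "Adam"),
]

# Sort on the full composite key: last name first, then first name.
sorted_records = sorted(records, key=lambda record: (record[0], record[1]))

# Group on the natural key (last name) only. Because the input is already
# sorted by the composite key, the first names in each group arrive in order.
grouped = {
    last_name: [first_name for _, first_name in group]
    for last_name, group in groupby(sorted_records, key=lambda record: record[0])
}
print(grouped)
```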

how many mappers and reducers will get created for a partitioned table in hive

Mappers: The number of mappers depends on various factors, such as how the data is distributed among nodes, the input format, the execution engine, and configuration parameters. See also: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

MR uses CombineInputFormat, while Tez uses grouped splits.

Tez:
set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split

MapReduce:
set mapreduce.input.fileinputformat.split.minsize=16777216; -- …

Hive unable to manually set number of reducers

Writing a query in Hive like this: SELECT COUNT(DISTINCT id) … will always result in using only one reducer. You should (1) set the desired number of reducers with this command: set mapred.reduce.tasks=50 and (2) rewrite the query as follows: SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM … ) t; This will result in 2 map+reduce jobs instead of …
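To see why the rewritten query can use several reducers, here is a plain-Python sketch of the two stages. It only imitates the data flow, not Hive itself; `NUM_REDUCERS` and the sample ids are assumptions made for the illustration.

```python
from collections import defaultdict

NUM_REDUCERS = 4  # hypothetical reducer count for this sketch

# Made-up values of the id column.
ids = [3, 7, 3, 1, 7, 9, 1, 3, 12]

# Job 1 (SELECT DISTINCT id): rows are routed to reducers by key, so equal
# ids always meet on the same reducer and each reducer deduplicates its share.
partitions = defaultdict(set)
for value in ids:
    partitions[hash(value) % NUM_REDUCERS].add(value)

# Job 2 (SELECT COUNT(*)): only the already-deduplicated rows are counted.
distinct_count = sum(len(part) for part in partitions.values())
print(distinct_count)
```

Since equal ids always land in the same partition, no id is counted twice, and the deduplication work is spread across the reducers of the first job.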

Hadoop speculative task execution

One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect since the tasks still …

MongoDB Stored Procedure Equivalent

The closest thing to an equivalent of a stored procedure in MongoDB is stored JavaScript. A good introduction to stored JavaScript is available in this article on Mike Dirolf’s blog.

How to get the input file name in the mapper in a Hadoop program?

First, you need to get the input split. Using the newer mapreduce API, this is done as follows: context.getInputSplit(); To get the file path and the file name, however, you first need to cast the result to FileSplit. So, in order to get the input file path you may do the …

Specify minimum number of generated files from Hive insert

The number of files generated during INSERT … SELECT depends on the number of processes running on the final reducer (the final reducer vertex if you are running on Tez) plus the bytes per reducer configured. If the table is partitioned and there is no DISTRIBUTE BY specified, then in the worst case each reducer creates files in …

data block size in HDFS, why 64MB?

What does 64MB block size mean? The block size is the smallest data unit that a file system can store. If you store a file that’s 1 KB or 60 MB, it’ll take up one block. Once you cross the 64 MB boundary, you need a second block. If yes, what is the advantage of doing that? HDFS …
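The block arithmetic described above can be checked with a short sketch. `blocks_needed` is a hypothetical helper written for this example, not part of any HDFS API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB block size, as in the text


def blocks_needed(file_size_bytes, block_size=BLOCK_SIZE):
    """Hypothetical helper: number of blocks a file of the given size spans."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: round up to the next whole block.
    return (file_size_bytes + block_size - 1) // block_size


one_kb = blocks_needed(1024)                     # 1 KB file
sixty_mb = blocks_needed(60 * 1024 * 1024)       # 60 MB file
exactly_64_mb = blocks_needed(BLOCK_SIZE)        # sits exactly on the boundary
just_over = blocks_needed(BLOCK_SIZE + 1)        # one byte past the boundary
print(one_kb, sixty_mb, exactly_64_mb, just_over)
```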

Find all duplicate documents in a MongoDB collection by a key field

The accepted answer is terribly slow on large collections, and doesn’t return the _ids of the duplicate records. Aggregation is much faster and can return the _ids:

db.collection.aggregate([
  { $group: {
      _id: { name: "$name" }, // replace `name` here twice
      uniqueIds: { $addToSet: "$_id" },
      count: { $sum: 1 }
  } },
  { $match: …
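The grouping step can be imitated in plain Python to show what it computes. This is a sketch, not MongoDB's API: `find_duplicates` is a hypothetical helper, the sample documents are made up, and the final filter assumes that a duplicate means a key value that occurs more than once, since the excerpt is cut off before the `$match` stage.

```python
from collections import defaultdict

# Made-up documents standing in for a collection.
docs = [
    {"_id": 1, "name": "alice"},
    {"_id": 2, "name": "bob"},
    {"_id": 3, "name": "alice"},
    {"_id": 4, "name": "carol"},
    {"_id": 5, "name": "bob"},
    {"_id": 6, "name": "alice"},
]


def find_duplicates(documents, key):
    """Hypothetical helper: group documents by key, keeping their _ids and a count."""
    unique_ids = defaultdict(set)  # plays the role of $addToSet
    counts = defaultdict(int)      # plays the role of $sum: 1
    for doc in documents:
        unique_ids[doc[key]].add(doc["_id"])
        counts[doc[key]] += 1
    # Assumption: keep only key values seen more than once.
    return {
        value: {"uniqueIds": sorted(ids), "count": counts[value]}
        for value, ids in unique_ids.items()
        if counts[value] > 1
    }


duplicates = find_duplicates(docs, "name")
print(duplicates)
```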