Difference between HBase and Hadoop/HDFS

Hadoop is basically three things: a file system (the Hadoop Distributed File System, HDFS), a computation framework (MapReduce), and a management bridge (YARN, Yet Another Resource Negotiator). HDFS allows you to store huge amounts of data in a distributed (faster read/write access) and redundant (better availability) manner. And MapReduce allows you to process this huge data in a … Read more
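As a rough illustration of the MapReduce model mentioned above, the classic word-count job sketched below shows what a mapper and reducer look like in Java; the class names are purely illustrative and not part of the original answer.

```java
// Minimal word-count sketch: maps emit (word, 1), reduces sum the counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  // Map phase: runs in parallel over HDFS blocks, emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```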

Job queue for Hive action in Oozie

A. Oozie specifics: Oozie propagates the “regular” Hadoop properties to a “regular” MapReduce action. But for other action types (Shell, Hive, Java, etc.), where Oozie runs a single mapper task in YARN, it does not treat the job as a real MapReduce job. Hence it uses a different set of undocumented properties, always prefixed with … Read more

Create Hive table with a multi-character delimiter

FIELDS TERMINATED BY does not support multi-character delimiters. The easiest way to do this is to use RegexSerDe: CREATE EXTERNAL TABLE tableex (id INT, name STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "^(\\d+)~\\*(.*)$") STORED AS TEXTFILE LOCATION '/user/myusername';
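If it helps to sanity-check that pattern outside Hive, the same regex literal can be tested in plain Java; the sample row `123~*John Doe` below is made up purely for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSerDeCheck {
  public static void main(String[] args) {
    // Same pattern that input.regex resolves to: digits, then the literal
    // delimiter "~*", then the remainder of the line.
    Pattern p = Pattern.compile("^(\\d+)~\\*(.*)$");

    // Hypothetical sample row using the ~* delimiter (illustration only).
    String row = "123~*John Doe";

    Matcher m = p.matcher(row);
    if (m.matches()) {
      System.out.println("id   = " + m.group(1));  // -> 123
      System.out.println("name = " + m.group(2));  // -> John Doe
    } else {
      System.out.println("Row does not match input.regex");
    }
  }
}
```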

Default number of reducers

How Many Reduces? (from the official documentation) The right number of reduces seems to be 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node). With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish … Read more
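As a quick worked example of that formula, assuming a hypothetical cluster of 10 nodes with 8 reduce-capable containers each (numbers chosen only for illustration):

```java
public class ReducerCountEstimate {
  public static void main(String[] args) {
    // Hypothetical cluster: values chosen only to illustrate the formula.
    int nodes = 10;
    int maxContainersPerNode = 8;

    // 0.95 factor: all reduces can launch immediately as the maps finish.
    long lowerEstimate = Math.round(0.95 * nodes * maxContainersPerNode);
    // 1.75 factor: faster nodes run a second wave of reduces.
    long upperEstimate = Math.round(1.75 * nodes * maxContainersPerNode);

    System.out.println("0.95 * 10 * 8 = " + lowerEstimate); // 76
    System.out.println("1.75 * 10 * 8 = " + upperEstimate); // 140
  }
}
```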

What is the use of a grouping comparator in Hadoop MapReduce

In support of the chosen answer I add the following, building on that explanation.

**Input**:

| symbol | time | price |
|--------|------|-------|
| a      | 1    | 10    |
| a      | 2    | 20    |
| b      | 3    | 30    |

**Map output**: create composite key/value pairs like so:

> symbol-time         time-price
>
> **a-1**         1-10
>
> **a-2**         2-20
>
> **b-3**         3-30

The Partitioner will route the a-1 and a-2 keys to the same reducer … Read more
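A grouping comparator along these lines might look like the sketch below; `SymbolTimeKey`, its field names, and the symbol/time layout are hypothetical stand-ins for whatever composite key the job actually uses.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical composite key holding symbol + time (illustration only).
public class SymbolTimeKey implements WritableComparable<SymbolTimeKey> {
  private String symbol;
  private long time;

  public SymbolTimeKey() { }

  public SymbolTimeKey(String symbol, long time) {
    this.symbol = symbol;
    this.time = time;
  }

  public String getSymbol() { return symbol; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(symbol);
    out.writeLong(time);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    symbol = in.readUTF();
    time = in.readLong();
  }

  // Sort order: by symbol, then time, so values reach the reducer in
  // time order (the "secondary sort" part).
  @Override
  public int compareTo(SymbolTimeKey other) {
    int cmp = symbol.compareTo(other.symbol);
    return cmp != 0 ? cmp : Long.compare(time, other.time);
  }

  // Grouping comparator: compares ONLY the symbol, so a-1 and a-2 land in
  // the same reduce() call even though they are distinct composite keys.
  public static class SymbolGroupingComparator extends WritableComparator {
    protected SymbolGroupingComparator() {
      super(SymbolTimeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return ((SymbolTimeKey) a).getSymbol()
          .compareTo(((SymbolTimeKey) b).getSymbol());
    }
  }
}
```

In the driver this would be registered with `job.setGroupingComparatorClass(SymbolTimeKey.SymbolGroupingComparator.class)`, alongside a partitioner that also keys on the symbol alone, which is what sends a-1 and a-2 into one reduce() call with their values already sorted by time.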

Hadoop: …be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

This error is caused by the block replication system of HDFS when it cannot manage to make any copies of a specific block of the file in question. Common reasons for this:

- Only a NameNode instance is running and it's not in safe mode
- There are no DataNode instances up and running, or some are dead. … Read more
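The first two conditions can be checked programmatically with the HDFS client API; the sketch below assumes core-site.xml/hdfs-site.xml are on the classpath and only prints what the NameNode reports (`hdfs dfsadmin -report` shows much the same information from the shell).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class HdfsWriteDiagnostics {
  public static void main(String[] args) throws Exception {
    // Assumes the cluster configuration is on the classpath so the client
    // can reach the NameNode (a sketch, not a full diagnostic tool).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;

      // Is the NameNode still in safe mode? (writes are rejected while it is)
      boolean safeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
      System.out.println("NameNode in safe mode: " + safeMode);

      // How many DataNodes does the NameNode actually see as live?
      DatanodeInfo[] dataNodes = dfs.getDataNodeStats();
      System.out.println("DataNodes reported: " + dataNodes.length);
      for (DatanodeInfo dn : dataNodes) {
        System.out.println("  " + dn.getHostName()
            + " remaining=" + dn.getRemaining() + " bytes");
      }
    }
  }
}
```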