Difference between HBase and Hadoop/HDFS

Hadoop is basically three things: a file system (the Hadoop Distributed File System, HDFS), a computation framework (MapReduce), and a management bridge (YARN, Yet Another Resource Negotiator). HDFS allows you to store huge amounts of data in a distributed (faster read/write access) and redundant (better availability) manner. And MapReduce allows you to process this huge data in a … Read more
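As a rough illustration of the MapReduce model mentioned above, the classic word-count job sketched below shows what a mapper and reducer look like in Java; the class names are purely illustrative and not part of the original answer.

```java
// Minimal word-count sketch: maps emit (word, 1), reduces sum the counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  // Map phase: runs in parallel over HDFS blocks, emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```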

Job queue for Hive action in Oozie

A. Oozie specifics: Oozie propagates the “regular” Hadoop properties to a “regular” MapReduce action. But for other action types (Shell, Hive, Java, etc.), where Oozie runs a single mapper task in YARN, it does not treat the job as a real MapReduce job. Hence it uses a different set of undocumented properties, always prefixed with … Read more

Create Hive table with a multi-character delimiter

FIELDS TERMINATED BY does not support multi-character delimiters. The easiest way to do this is to use RegexSerDe: CREATE EXTERNAL TABLE tableex (id INT, name STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "^(\\d+)~\\*(.*)$") STORED AS TEXTFILE LOCATION '/user/myusername';
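If it helps to sanity-check that pattern outside Hive, the same regex literal can be tested in plain Java; the sample row `123~*John Doe` below is made up purely for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSerDeCheck {
  public static void main(String[] args) {
    // Same pattern that input.regex resolves to: digits, then the literal
    // delimiter "~*", then the remainder of the line.
    Pattern p = Pattern.compile("^(\\d+)~\\*(.*)$");

    // Hypothetical sample row using the ~* delimiter (illustration only).
    String row = "123~*John Doe";

    Matcher m = p.matcher(row);
    if (m.matches()) {
      System.out.println("id   = " + m.group(1));  // -> 123
      System.out.println("name = " + m.group(2));  // -> John Doe
    } else {
      System.out.println("Row does not match input.regex");
    }
  }
}
```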

Default number of reducers

How Many Reduces? (from the official documentation) The right number of reduces seems to be 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node). With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish … Read more
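As a quick worked example of that formula, assuming a hypothetical cluster of 10 nodes with 8 reduce-capable containers each (numbers chosen only for illustration):

```java
public class ReducerCountEstimate {
  public static void main(String[] args) {
    // Hypothetical cluster: values chosen only to illustrate the formula.
    int nodes = 10;
    int maxContainersPerNode = 8;

    // 0.95 factor: all reduces can launch immediately as the maps finish.
    long lowerEstimate = Math.round(0.95 * nodes * maxContainersPerNode);
    // 1.75 factor: faster nodes run a second wave of reduces.
    long upperEstimate = Math.round(1.75 * nodes * maxContainersPerNode);

    System.out.println("0.95 * 10 * 8 = " + lowerEstimate); // 76
    System.out.println("1.75 * 10 * 8 = " + upperEstimate); // 140
  }
}
```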

What is the use of a grouping comparator in Hadoop MapReduce

In support of the chosen answer I add the following, building on that explanation.

**Input**:

| symbol | time | price |
|--------|------|-------|
| a      | 1    | 10    |
| a      | 2    | 20    |
| b      | 3    | 30    |

**Map output**: create composite key/value pairs like so:

> symbol-time         time-price
>
> **a-1**         1-10
>
> **a-2**         2-20
>
> **b-3**         3-30

The Partitioner will route the a-1 and a-2 keys to the same reducer … Read more
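A grouping comparator along these lines might look like the sketch below; `SymbolTimeKey`, its field names, and the symbol/time layout are hypothetical stand-ins for whatever composite key the job actually uses.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical composite key holding symbol + time (illustration only).
public class SymbolTimeKey implements WritableComparable<SymbolTimeKey> {
  private String symbol;
  private long time;

  public SymbolTimeKey() { }

  public SymbolTimeKey(String symbol, long time) {
    this.symbol = symbol;
    this.time = time;
  }

  public String getSymbol() { return symbol; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(symbol);
    out.writeLong(time);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    symbol = in.readUTF();
    time = in.readLong();
  }

  // Sort order: by symbol, then time, so values reach the reducer in
  // time order (the "secondary sort" part).
  @Override
  public int compareTo(SymbolTimeKey other) {
    int cmp = symbol.compareTo(other.symbol);
    return cmp != 0 ? cmp : Long.compare(time, other.time);
  }

  // Grouping comparator: compares ONLY the symbol, so a-1 and a-2 land in
  // the same reduce() call even though they are distinct composite keys.
  public static class SymbolGroupingComparator extends WritableComparator {
    protected SymbolGroupingComparator() {
      super(SymbolTimeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return ((SymbolTimeKey) a).getSymbol()
          .compareTo(((SymbolTimeKey) b).getSymbol());
    }
  }
}
```

In the driver this would be registered with `job.setGroupingComparatorClass(SymbolTimeKey.SymbolGroupingComparator.class)`, alongside a partitioner that also keys on the symbol alone, which is what sends a-1 and a-2 into one reduce() call with their values already sorted by time.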

Hadoop: …be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

This error is caused by the block replication system of HDFS when it cannot manage to make any copies of a specific block of the file in question. Common reasons for this:

- Only a NameNode instance is running and it's not in safe mode
- There are no DataNode instances up and running, or some are dead. … Read more
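The first two conditions can be checked programmatically with the HDFS client API; the sketch below assumes core-site.xml/hdfs-site.xml are on the classpath and only prints what the NameNode reports (`hdfs dfsadmin -report` shows much the same information from the shell).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class HdfsWriteDiagnostics {
  public static void main(String[] args) throws Exception {
    // Assumes the cluster configuration is on the classpath so the client
    // can reach the NameNode (a sketch, not a full diagnostic tool).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;

      // Is the NameNode still in safe mode? (writes are rejected while it is)
      boolean safeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
      System.out.println("NameNode in safe mode: " + safeMode);

      // How many DataNodes does the NameNode actually see as live?
      DatanodeInfo[] dataNodes = dfs.getDataNodeStats();
      System.out.println("DataNodes reported: " + dataNodes.length);
      for (DatanodeInfo dn : dataNodes) {
        System.out.println("  " + dn.getHostName()
            + " remaining=" + dn.getRemaining() + " bytes");
      }
    }
  }
}
```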