Cannot read a file from HDFS using Spark
Here is the solution: sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt"). How did I find out nn1home:8020? Just search for the file core-site.xml and look for the XML element fs.defaultFS.
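Putting the two together, a minimal sketch in the Spark shell (assuming your core-site.xml lists fs.defaultFS as hdfs://nn1home:8020 and that the file actually exists at that path):

  // Read a text file from HDFS using the full URI.
  // Replace nn1home:8020 with the value of fs.defaultFS from your core-site.xml.
  val lines = sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")
  println(lines.count()) // forces the read, so a bad URI or path fails here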
You need to do something like this:

  bin/stop-all.sh (or stop-dfs.sh and stop-yarn.sh in the 2.x series)
  rm -Rf /app/tmp/hadoop-your-username/*
  bin/hadoop namenode -format (or hdfs namenode -format in the 2.x series)

The solution was taken from: http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-troubleshooting/. Basically it consists of restarting from scratch, so make sure you won't lose data by formatting the HDFS.
-copyFromLocal is similar to the -put command, except that the source is restricted to a local file reference. So basically, everything you can do with -copyFromLocal you can do with -put, but not vice versa. Similarly, -copyToLocal is similar to the -get command, except that the destination is restricted to a local file reference. Hence, you can use … Read more
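The same local-only restriction is visible in the Hadoop FileSystem API; here is a minimal sketch from Scala (the paths are hypothetical):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // copyFromLocalFile mirrors `hdfs dfs -copyFromLocal`: the source must be local.
  val fs = FileSystem.get(new Configuration())
  fs.copyFromLocalFile(new Path("/tmp/data.txt"), new Path("/user/me/data.txt"))
  // copyToLocalFile mirrors `hdfs dfs -copyToLocal`: the destination must be local.
  fs.copyToLocalFile(new Path("/user/me/data.txt"), new Path("/tmp/data-copy.txt"))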
That's not the real error; here's how to find it: go to the Hadoop JobTracker web dashboard, find the Hive MapReduce jobs that failed, and look at the logs of the failed tasks. That will show you the real error. The console output errors are useless, largely because it doesn't have a view of the individual … Read more
Hadoop is basically three things: a file system (HDFS, the Hadoop Distributed File System), a computation framework (MapReduce), and a management bridge (YARN, Yet Another Resource Negotiator). HDFS allows you to store huge amounts of data in a distributed (provides faster read/write access) and redundant (provides better availability) manner. And MapReduce allows you to process this huge data in a … Read more
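To make the map/reduce idea concrete, here is a toy word count written against the Spark API used earlier on this page rather than raw Hadoop MapReduce (the HDFS path is the hypothetical one from the first answer):

  // "Map" side: split each line into words and emit (word, 1) pairs.
  // "Reduce" side: sum the counts for each word across the cluster.
  val counts = sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.take(5).foreach(println)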
I think the main difference I can describe relates to record-oriented vs. column-oriented formats. Record-oriented formats are what we're all used to: text files and delimited formats like CSV or TSV. Avro is slightly cooler than those because it can change schema over time, e.g. adding or removing columns from a record. Other … Read more
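As a sketch of what "changing schema over time" looks like in Avro (the record and field names here are made up), a field added to a record schema needs a default so that data written before the change can still be decoded with the new schema:

  { "type": "record", "name": "User", "fields": [
      { "name": "id",    "type": "long"   },
      { "name": "name",  "type": "string" },
      { "name": "email", "type": ["null", "string"], "default": null }
  ] }

Here email is the column added later; the "default": null entry is what keeps the old records readable.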
Check out this post from Alan Gates, Pig architect at Yahoo!, that compares when you would use a SQL-like language such as Hive rather than Pig. He makes a very convincing case as to the usefulness of a procedural language like Pig (vs. declarative SQL) and its utility to dataflow designers.
A. Oozie specifics: Oozie propagates the "regular" Hadoop properties to a "regular" MapReduce action. But for other types of action (Shell, Hive, Java, etc.), where Oozie runs a single mapper task in YARN, it does not consider that to be a real MapReduce job. Hence it uses a different set of undocumented properties, always prefixed with … Read more
FIELDS TERMINATED BY does not support multi-character delimiters. The easiest way to do this is to use RegexSerDe:

  CREATE EXTERNAL TABLE tableex (id INT, name STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
    "input.regex" = "^(\\d+)~\\*(.*)$"
  )
  STORED AS TEXTFILE
  LOCATION '/user/myusername';
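For example (with made-up data), a line like 123~*Alice in /user/myusername parses into id = 123 and name = Alice: the first capture group (\\d+) takes the digits before the ~* delimiter, and the second group (.*) takes everything after it.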
The exception is a bit misleading; there's no real relative path being parsed. The issue here is that Hadoop's Path doesn't support ':' in filenames. In your case, "rsrc:hbase-common-0.98.1-hadoop2.jar" is being interpreted as "rsrc" being the "scheme", whereas I suspect you really intended to add the resource "file:///path/to/your/jarfile/rsrc:hbase-common-0.98.1-hadoop2.jar". Here's an old JIRA discussing the illegal … Read more
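A minimal sketch of the failure mode in Scala, assuming the standard Hadoop Path parsing:

  import org.apache.hadoop.fs.Path

  // Anything before the first ':' (with no '/' ahead of it) is parsed as a URI
  // scheme, leaving "hbase-common-0.98.1-hadoop2.jar" as a relative path inside
  // an absolute URI, which is exactly what the exception complains about.
  val p = new Path("rsrc:hbase-common-0.98.1-hadoop2.jar") // throws IllegalArgumentException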