Cannot read a file from HDFS using Spark
Here is the solution: sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt"). How did I find out nn1home:8020? Just search for the file core-site.xml and look for the XML element fs.defaultFS.
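Putting the two together, a minimal sketch in the Spark shell (assuming your core-site.xml lists fs.defaultFS as hdfs://nn1home:8020 and that the file actually exists at that path):

  // Read a text file from HDFS using the full URI.
  // Replace nn1home:8020 with the value of fs.defaultFS from your core-site.xml.
  val lines = sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")
  println(lines.count()) // forces the read, so a bad URI or path fails here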
You need to do something like this:

  bin/stop-all.sh (or stop-dfs.sh and stop-yarn.sh in the 2.x series)
  rm -Rf /app/tmp/hadoop-your-username/*
  bin/hadoop namenode -format (or hdfs namenode -format in the 2.x series)

The solution was taken from: http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-troubleshooting/. Basically it consists of restarting from scratch, so make sure you won't lose data by formatting the HDFS.
-copyFromLocal is similar to the -put command, except that the source is restricted to a local file reference. So basically, everything you can do with -copyFromLocal you can do with -put, but not vice versa. Similarly, -copyToLocal is similar to the -get command, except that the destination is restricted to a local file reference. Hence, you can use … Read more
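The same local-only restriction is visible in the Hadoop FileSystem API; here is a minimal sketch from Scala (the paths are hypothetical):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // copyFromLocalFile mirrors `hdfs dfs -copyFromLocal`: the source must be local.
  val fs = FileSystem.get(new Configuration())
  fs.copyFromLocalFile(new Path("/tmp/data.txt"), new Path("/user/me/data.txt"))
  // copyToLocalFile mirrors `hdfs dfs -copyToLocal`: the destination must be local.
  fs.copyToLocalFile(new Path("/user/me/data.txt"), new Path("/tmp/data-copy.txt"))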
That's not the real error; here's how to find it: go to the Hadoop JobTracker web dashboard, find the Hive MapReduce jobs that failed, and look at the logs of the failed tasks. That will show you the real error. The console output errors are useless, largely because it doesn't have a view of the individual … Read more
Hadoop is basically three things: a file system (HDFS, the Hadoop Distributed File System), a computation framework (MapReduce), and a management bridge (YARN, Yet Another Resource Negotiator). HDFS allows you to store huge amounts of data in a distributed (provides faster read/write access) and redundant (provides better availability) manner. And MapReduce allows you to process this huge data in a … Read more
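To make the map/reduce idea concrete, here is a toy word count written against the Spark API used earlier on this page rather than raw Hadoop MapReduce (the HDFS path is the hypothetical one from the first answer):

  // "Map" side: split each line into words and emit (word, 1) pairs.
  // "Reduce" side: sum the counts for each word across the cluster.
  val counts = sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.take(5).foreach(println)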
I think the main difference I can describe relates to record-oriented vs. column-oriented formats. Record-oriented formats are what we're all used to: text files and delimited formats like CSV or TSV. Avro is slightly cooler than those because it can change schema over time, e.g. adding or removing columns from a record. Other … Read more
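As a sketch of what "changing schema over time" looks like in Avro (the record and field names here are made up), a field added to a record schema needs a default so that data written before the change can still be decoded with the new schema:

  { "type": "record", "name": "User", "fields": [
      { "name": "id",    "type": "long"   },
      { "name": "name",  "type": "string" },
      { "name": "email", "type": ["null", "string"], "default": null }
  ] }

Here email is the column added later; the "default": null entry is what keeps the old records readable.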
Check out this post from Alan Gates, Pig architect at Yahoo!, that compares when you would use a SQL-like language such as Hive rather than Pig. He makes a very convincing case as to the usefulness of a procedural language like Pig (vs. declarative SQL) and its utility to dataflow designers.
A. Oozie specifics: Oozie propagates the "regular" Hadoop properties to a "regular" MapReduce action. But for other types of action (Shell, Hive, Java, etc.), where Oozie runs a single mapper task in YARN, it does not consider that to be a real MapReduce job. Hence it uses a different set of undocumented properties, always prefixed with … Read more
FIELDS TERMINATED BY does not support multi-character delimiters. The easiest way to do this is to use RegexSerDe:

  CREATE EXTERNAL TABLE tableex (id INT, name STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
    "input.regex" = "^(\\d+)~\\*(.*)$"
  )
  STORED AS TEXTFILE
  LOCATION '/user/myusername';
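For example (with made-up data), a line like 123~*Alice in /user/myusername parses into id = 123 and name = Alice: the first capture group (\\d+) takes the digits before the ~* delimiter, and the second group (.*) takes everything after it.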
The exception is a bit misleading; there's no real relative path being parsed. The issue here is that Hadoop's Path doesn't support ':' in filenames. In your case, "rsrc:hbase-common-0.98.1-hadoop2.jar" is being interpreted as "rsrc" being the "scheme", whereas I suspect you really intended to add the resource "file:///path/to/your/jarfile/rsrc:hbase-common-0.98.1-hadoop2.jar". Here's an old JIRA discussing the illegal … Read more
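A minimal sketch of the failure mode in Scala, assuming the standard Hadoop Path parsing:

  import org.apache.hadoop.fs.Path

  // Anything before the first ':' (with no '/' ahead of it) is parsed as a URI
  // scheme, leaving "hbase-common-0.98.1-hadoop2.jar" as a relative path inside
  // an absolute URI, which is exactly what the exception complains about.
  val p = new Path("rsrc:hbase-common-0.98.1-hadoop2.jar") // throws IllegalArgumentException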