OOZIE: properties defined in file referenced in global job-xml not visible in workflow.xml

OK, you are making two big mistakes.

1. Let’s start with a quick exegesis of some parts of the Oozie documentation (V4.2).

Workflow Functional Specification
- has a section 19 about Global Configuration
- has sections 3.2.x about core Action types, i.e. MapReduce, Pig, Java, etc.
- the XML schema specification clearly shows the <global> element

Sqoop action …

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics

You can’t mix bits of Hadoop and expect things to work. It’s not just the close coupling between internal classes in hadoop-common and hadoop-aws, it’s also things like the specific version of the amazon-aws SDK the hadoop-aws module was built against. If you get ClassNotFoundException or MethodNotFoundException stack traces when trying to work with s3a:// URLs, …

Pyspark: get list of files/directories on HDFS path

Using the JVM gateway may not be so elegant, but in some cases the code below could be helpful:

    URI = sc._gateway.jvm.java.net.URI
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

    fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())

    status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))

    for fileStatus in status:
        print(fileStatus.getPath())

How can I access S3/S3n from a local Hadoop 2.6 installation?

For some reason, the jar hadoop-aws-[version].jar, which contains the implementation of NativeS3FileSystem, is not present in the classpath of Hadoop by default in versions 2.6 and 2.7. So, try adding it to the classpath by adding the following line to hadoop-env.sh, which is located at $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*

Assuming you are using Apache …

Set hadoop system user for client embedded in Java webapp

Finally I stumbled on the constant static final String HADOOP_USER_NAME = "HADOOP_USER_NAME"; in the UserGroupInformation class. Setting this either as an environment variable, as a Java system property on startup (using -D), or programmatically with System.setProperty("HADOOP_USER_NAME", "hduser"); makes Hadoop use whatever username you want for connecting to the remote Hadoop cluster.
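A minimal sketch of the programmatic variant, using only the JDK. The class and method names (HadoopUserSetup, useHadoopUser) are hypothetical and "hduser" is just the example name from the answer. It assumes the property is set before any Hadoop client object is created, since Hadoop resolves the login user once and then reuses it, and that the cluster uses simple (non-Kerberos) authentication.

```java
// Hypothetical helper, not part of Hadoop: sets the user name that the
// Hadoop client picks up via the HADOOP_USER_NAME system property.
final class HadoopUserSetup {

    // Same string value as the HADOOP_USER_NAME constant in UserGroupInformation.
    static final String HADOOP_USER_NAME = "HADOOP_USER_NAME";

    private HadoopUserSetup() {
    }

    // Sets the system property and returns the previous value (null if unset).
    static String useHadoopUser(String user) {
        return System.setProperty(HADOOP_USER_NAME, user);
    }

    public static void main(String[] args) {
        // Call this early, e.g. at webapp startup, before the first
        // FileSystem or other Hadoop client is created.
        useHadoopUser("hduser");
        System.out.println(System.getProperty(HADOOP_USER_NAME));
    }
}
```

In a webapp, a call like this would typically go in a startup hook so it runs before the first request touches HDFS. Passing -DHADOOP_USER_NAME=hduser on the JVM command line sets the same system property without any code.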

Hadoop java.io.IOException: Mkdirs failed to create /some/path

Just ran into this problem running mahout from CDH4 in standalone mode on my MacBook Air. The issue is that a /tmp/hadoop-xxx/xxx/LICENSE file and a /tmp/hadoop-xxx/xxx/license directory are being created on a case-insensitive file system when unjarring the mahout jobs. I was able to work around this by deleting META-INF/LICENSE from the jar file like this: …

How to delete and update a record in Hive

As of Hive version 0.14.0: INSERT…VALUES, UPDATE, and DELETE are now available with full ACID support.

INSERT … VALUES Syntax:

    INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

Where values_row is:

    ( value [, value ...] )

where a value is either null or any valid SQL literal.

UPDATE Syntax: …