Migrating 50TB data from local Hadoop cluster to Google Cloud Storage

Question 1: The recommended way of moving data from a local Hadoop cluster to GCS is to use the Google Cloud Storage connector for Hadoop. The instructions on that site are mostly for running Hadoop on Google Compute Engine VMs, but you can also download the GCS connector directly, either gcs-connector-1.2.8-hadoop1.jar if you’re using Hadoop 1.x or Hadoop 0.20.x, or gcs-connector-1.2.8-hadoop2.jar for Hadoop 2.x or Hadoop 0.23.x.

Simply copy the jarfile into your hadoop/lib dir or $HADOOP_COMMON_LIB_JARS_DIR in the case of Hadoop 2:

cp ~/Downloads/gcs-connector-1.2.8-hadoop1.jar /your/hadoop/dir/lib/

You may need to also add the following to your hadoop/conf/hadoop-env.sh file if youre running 0.20.x:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/your/hadoop/dir/lib/gcs-connector-1.2.8-hadoop1.jar

Then, you’ll likely want to use service-account “keyfile” authentication since you’re on an on-premise Hadoop cluster. Visit your cloud.google.com/console, find APIs & auth on the left-hand-side, click Credentials, if you don’t already have one click Create new Client ID, select Service account before clicking Create client id, and then for now, the connector requires a “.p12” type of keypair, so click Generate new P12 key and keep track of the .p12 file that gets downloaded. It may be convenient to rename it before placing it in a directory more easily accessible from Hadoop, e.g:

cp ~/Downloads/*.p12 /path/to/hadoop/conf/gcskey.p12

Add the following entries to your core-site.xml file in your Hadoop conf dir:

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>your-ascii-google-project-id</value>
</property>
<property>
  <name>fs.gs.system.bucket</name>
  <value>some-bucket-your-project-owns</value>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
</property>
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>[email protected]</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/hadoop/conf/gcskey.p12</value>
</property>

The fs.gs.system.bucket generally won’t be used except in some cases for mapred temp files, you may want to just create a new one-off bucket for that purpose. With those settings on your master node, you should already be able to test hadoop fs -ls gs://the-bucket-you-want to-list. At this point, you can already try to funnel all the data out of the master node with a simple hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket.

If you want to speed it up using Hadoop’s distcp, sync the lib/gcs-connector-1.2.8-hadoop1.jar and conf/core-site.xml to all your Hadoop nodes, and it should all work as expected. Note that there’s no need to restart datanodes or namenodes.

Question 2: While the GCS connector for Hadoop is able to copy direct from HDFS without ever needing an extra disk buffer, GSUtil cannot since it has no way of interpreting the HDFS protocol; it only knows how to deal with actual local filesystem files or as you said, GCS/S3 files.

Question 3: The benefit of using the Java API is flexibility; you can choose how to handle errors, retries, buffer sizes, etc, but it takes more work and planning. Using gsutil is good for quick use cases, and you inherit a lot of error-handling and testing from the Google teams. The GCS connector for Hadoop is actually built directly on top of the Java API, and since it’s all open-source, you can see what kinds of things it takes to make it work smoothly here in its source code on GitHub : https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java

Leave a Comment