Sqoop import: composite primary key and textual primary key

Specify the split column manually. The split column does not have to be the primary key: you can have a composite PK and split on a separate integer column. You can specify any integer column, or even a simple expression (a scalar function such as substring or cast, not an aggregate or analytic function). Preferably the split column should be an evenly distributed integer. For example if your …
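To see why an evenly distributed split column matters, here is a minimal sketch of range splitting. It assumes the usual behaviour of taking the minimum and maximum of the split column and dividing that range evenly between mappers; it is a simplified illustration, not Sqoop's actual splitter code, and the function names and sample values are hypothetical.

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide the inclusive range [min_val, max_val] into contiguous integer ranges."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_val < min_val:
        raise ValueError("max_val must not be smaller than min_val")
    total = max_val - min_val + 1
    ranges = []
    start = min_val
    for i in range(num_mappers):
        # Spread any remainder over the first few ranges.
        size = total // num_mappers + (1 if i < total % num_mappers else 0)
        if size == 0:
            continue  # more mappers than distinct values
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def rows_per_range(values, ranges):
    """Count how many split-column values fall into each range."""
    return [sum(1 for v in values if lo <= v <= hi) for lo, hi in ranges]


# Hypothetical skewed column: ten small ids and one outlier.
skewed = list(range(1, 11)) + [100]
ranges = split_ranges(min(skewed), max(skewed), 4)
print(ranges)
print(rows_per_range(skewed, ranges))
```

With the skewed sample, almost all rows land in the first range, so one mapper would do nearly all of the work while the others sit idle.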

How to fix corrupt HDFS files

You can use hdfs fsck / to determine which files have problems. Look through the output for missing or corrupt blocks (ignore under-replicated blocks for now). This command is very verbose, especially on a large HDFS filesystem, so I normally get down to the meaningful output with hdfs fsck / | egrep -v '^\.+$' …
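The same filtering can be done in Python when the fsck output has already been captured as text. This is a minimal sketch that mirrors the egrep -v '^\.+$' pattern above; the function name and the sample output are hypothetical and only loosely resemble real fsck output.

```python
import re

# Matches progress lines that consist only of dots, like egrep '^\.+$'.
DOTS_ONLY = re.compile(r"^\.+$")


def filter_fsck_output(text):
    """Return the lines of fsck output that are not pure progress dots."""
    return [line for line in text.splitlines() if not DOTS_ONLY.match(line)]


# Hypothetical captured output, for illustration only.
sample = "\n".join([
    "..........",
    "/data/example.csv: CORRUPT block",
    "....",
    "Status: CORRUPT",
])

for line in filter_fsck_output(sample):
    print(line)
```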

Name node is in safe mode. Not able to leave

To force the namenode to leave safe mode, run the following command: bin/hadoop dfsadmin -safemode leave. You are getting an Unknown command error because -safemode is not a sub-command of hadoop fs; it belongs to hadoop dfsadmin. After running the command above, I would also suggest that you run hadoop fsck …
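If the command is issued from a script, a small wrapper can guard against the mistake described above by always routing -safemode through dfsadmin. This is a sketch only: the function names are hypothetical, and run_safemode assumes a hadoop executable is reachable at the given path.

```python
import subprocess

# Actions accepted by "dfsadmin -safemode".
VALID_ACTIONS = ("get", "enter", "leave", "wait")


def safemode_command(action, hadoop_bin="hadoop"):
    """Build the argument list for 'hadoop dfsadmin -safemode <action>'."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported safemode action: {action!r}")
    return [hadoop_bin, "dfsadmin", "-safemode", action]


def run_safemode(action, hadoop_bin="hadoop"):
    """Run the command and return its standard output (requires a Hadoop install)."""
    result = subprocess.run(
        safemode_command(action, hadoop_bin),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(" ".join(safemode_command("leave", "bin/hadoop")))
```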

How to open/stream .zip files through Spark?

There was no solution with Python code, and I recently had to read zip files in PySpark. While searching for how to do that I came across this question, so hopefully this will help others.

    import zipfile
    import io

    def zip_extract(x):
        in_memory_data = io.BytesIO(x[1])
        file_obj = zipfile.ZipFile(in_memory_data, "r")
        files = [i for i in file_obj.namelist()]
        return dict(zip(files, …
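The excerpt above is cut off, so the following is a self-contained sketch of the same idea rather than the author's exact code. It assumes each record is a (path, bytes) pair, which is what SparkContext.binaryFiles yields, and it returns the member contents as raw bytes. The helper make_zip and the sample file names are hypothetical and exist only so the example runs without Spark.

```python
import io
import zipfile


def zip_extract(pair):
    """Map a (path, zip_bytes) pair to a dict of member name -> member bytes."""
    _path, content = pair
    with zipfile.ZipFile(io.BytesIO(content), "r") as archive:
        return {name: archive.read(name) for name in archive.namelist()}


def make_zip(files):
    """Build an in-memory zip archive from a dict of name -> text (test helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


sample = make_zip({"a.txt": "hello", "b.txt": "world"})
print(zip_extract(("sample.zip", sample)))
```

In a Spark job the function would typically be applied as sc.binaryFiles(path).map(zip_extract), assuming sc is an existing SparkContext and each archive fits in executor memory.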

How to transpose/pivot data in hive?

Here is the approach I used to solve this problem, using Hive's built-in "map" function:

    select b.id, b.code,
           concat_ws('',b.p) as p, concat_ws('',b.q) as q,
           concat_ws('',b.r) as r, concat_ws('',b.t) as t
    from (
      select id, code,
             collect_list(a.group_map['p']) as p,
             collect_list(a.group_map['q']) as q,
             collect_list(a.group_map['r']) as r,
             collect_list(a.group_map['t']) as t
      from (
        select id, code, map(proc1,proc2) as …
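The query is truncated, but its visible part suggests the following logic: build a one-entry map from proc1 to proc2 per row, collect the values for each key per (id, code) group, and concatenate them. The sketch below mimics that in plain Python to show the intended result shape. The column names come from the excerpt; the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def pivot(rows, keys=("p", "q", "r", "t")):
    """Pivot (id, code, proc1, proc2) rows into one row per (id, code).

    Each key in `keys` becomes a column holding the concatenated proc2
    values whose proc1 equals that key (empty string if there are none).
    """
    groups = defaultdict(lambda: {k: [] for k in keys})
    for id_, code, proc1, proc2 in rows:
        columns = groups[(id_, code)]  # create the group even if proc1 is unknown
        if proc1 in columns and proc2 is not None:
            columns[proc1].append(str(proc2))
    return [
        (id_, code) + tuple("".join(columns[k]) for k in keys)
        for (id_, code), columns in groups.items()
    ]


# Hypothetical input rows: (id, code, proc1, proc2).
rows = [
    (1, "A", "p", "1"),
    (1, "A", "q", "2"),
    (2, "B", "t", "5"),
    (1, "A", "r", "3"),
]

for row in pivot(rows):
    print(row)
```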