Hive Explode / Lateral View multiple arrays

I found a very good solution to this problem without using any UDF; posexplode works well: SELECT COOKIE, ePRODUCT_ID, eCAT_ID, eQTY FROM TABLE LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_ID AS seqp, ePRODUCT_ID LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID LATERAL VIEW posexplode(QTY) eQTY AS seqq, eQTY WHERE seqp = seqc AND seqc = … Read more
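
For illustration, here is a minimal sketch of the same posexplode pattern against a hypothetical orders table with parallel array columns product_id and qty (all names here are assumptions, not taken from the original answer):

-- Hypothetical table `orders` with parallel arrays product_id and qty.
-- posexplode emits (position, element) pairs; equating the positions keeps
-- the elements of the two arrays aligned row by row.
SELECT cookie,
       p.prod     AS product_id,
       q.quantity AS qty
FROM orders
LATERAL VIEW posexplode(product_id) p AS pos_p, prod
LATERAL VIEW posexplode(qty) q AS pos_q, quantity
WHERE pos_p = pos_q;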

How to convert ISO Date to UTC date in Hive

Hive understands this format: 'yyyy-MM-dd HH:mm:ss.SSS'. Use unix_timestamp() to convert to seconds passed since 1970-01-01, then use from_unixtime() to convert to the proper format: select from_unixtime(UNIX_TIMESTAMP("2017-01-01T05:01:10Z", "yyyy-MM-dd'T'HH:mm:ss'Z'"), "yyyy-MM-dd HH:mm:ss"); Result: 2017-01-01 05:01:10 Update: another method is to remove the Z and replace the T with a space using regexp_replace, and convert to timestamp if necessary, without using unix_timestamp(); this will … Read more
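
The regexp_replace variant mentioned in the update is cut off above; a hedged sketch of how it might look (the literal value is only an example):

-- Drop the trailing 'Z' and turn the 'T' into a space so the string matches
-- Hive's native 'yyyy-MM-dd HH:mm:ss' layout, then cast if a real timestamp is needed.
SELECT CAST(regexp_replace('2017-01-01T05:01:10Z', '(.+)T(.+)Z', '$1 $2') AS timestamp);
-- returns the timestamp 2017-01-01 05:01:10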

How to convert .txt file to Hadoop’s sequence file format

So the simplest answer is just an "identity" job that has SequenceFile output. It looks like this in Java: public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException { Configuration conf = new Configuration(); Job job = new Job(conf); job.setJobName("Convert Text"); job.setJarByClass(Mapper.class); job.setMapperClass(Mapper.class); job.setReducerClass(Reducer.class); // increase if you need sorting or a special … Read more
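
If Hive is already available, a Hive-only alternative (a different route than the Java identity job above) can perform the same conversion; the path and table names below are hypothetical:

-- External table over the existing .txt files (assumed to hold one line per row).
CREATE EXTERNAL TABLE raw_text (line STRING)
STORED AS TEXTFILE
LOCATION '/data/raw_text';

-- Target table stored as SequenceFile; the INSERT rewrites the data in that format.
CREATE TABLE text_as_seq (line STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE text_as_seq
SELECT line FROM raw_text;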

java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Starting the Hive metastore service worked for me. First, set up the database for the Hive metastore and start the service: $ hive --service metastore (see https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/validate_installation.html). Second, run the following commands to initialize and verify the metastore schema: $ schematool -dbType mysql -initSchema $ schematool -dbType mysql -info (see https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool)
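
Once the schema is initialized and the metastore service is running, a quick smoke test from the Hive CLI (the table name here is made up) confirms the SessionHiveMetaStoreClient error is gone:

-- Any statement that touches the metastore will fail fast if the client still cannot be instantiated.
SHOW DATABASES;
CREATE TABLE IF NOT EXISTS metastore_smoke_test (id INT);
DROP TABLE IF EXISTS metastore_smoke_test;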

Hive query performance for high cardinality field

Use ORC with bloom filters: CREATE TABLE employee ( employee_id bigint, name STRING ) STORED AS ORC TBLPROPERTIES ("orc.bloom.filter.columns"="employee_id"); Enable PPD with vectorization, and use CBO and Tez: SET hive.optimize.ppd=true; SET hive.optimize.ppd.storage=true; SET hive.vectorized.execution.enabled=true; SET hive.vectorized.execution.reduce.enabled=true; SET hive.cbo.enable=true; SET hive.stats.autogather=true; SET hive.compute.query.using.stats=true; SET hive.stats.fetch.partition.stats=true; SET hive.execution.engine=tez; SET hive.stats.fetch.column.stats=true; SET hive.map.aggr=true; SET hive.tez.auto.reducer.parallelism=true; Ref: … Read more
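
As a usage sketch (not part of the original answer), the bloom filter only pays off when statistics are gathered and queries actually filter on the indexed column; the lookup value is arbitrary:

-- Gather table and column statistics so the CBO/stats settings above have data to work with.
ANALYZE TABLE employee COMPUTE STATISTICS;
ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS;

-- Point lookup on the high-cardinality column: with PPD enabled, ORC stripes whose
-- bloom filter rules out this employee_id can be skipped.
SELECT name
FROM employee
WHERE employee_id = 123456789;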

Use collect_list and collect_set in Spark SQL

Spark 2.0+: SPARK-10605 introduced a native collect_list and collect_set implementation. SparkSession with Hive support or HiveContext is no longer required. Spark 2.0-SNAPSHOT (before 2016-05-03): You have to enable Hive support for a given SparkSession: In Scala: val spark = SparkSession.builder .master("local") .appName("testing") .enableHiveSupport() // <- enable Hive support. .getOrCreate() In Python: spark = (SparkSession.builder .enableHiveSupport() .getOrCreate()) … Read more
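
For reference, the same aggregations can be written directly in Spark SQL; the purchases table and its columns are assumptions made for this sketch:

-- collect_list keeps duplicates (one entry per input row); collect_set de-duplicates.
SELECT user_id,
       collect_list(item) AS all_items,
       collect_set(item)  AS distinct_items
FROM purchases
GROUP BY user_id;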

Create Table in Hive with one file

There are many possible solutions: 1) Add DISTRIBUTE BY partition key at the end of your query. There may be many partitions per reducer, with each reducer creating files for each partition; distributing by the partition key may reduce the number of files and memory consumption as well. The hive.exec.reducers.bytes.per.reducer setting defines how much data each reducer will process. … Read more
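
A sketch of suggestion 1, using a hypothetical partitioned target table and an illustrative reducer size (tune both to your data volumes):

-- Allow dynamic partitioning and aim for roughly 64 MB of input per reducer;
-- larger values mean fewer reducers and therefore fewer output files.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.reducers.bytes.per.reducer=67108864;

-- DISTRIBUTE BY the partition key sends each partition's rows to a single reducer,
-- so every reducer writes files only for the partitions it owns.
INSERT OVERWRITE TABLE sales_by_day PARTITION (sale_date)
SELECT item, amount, sale_date
FROM staging_sales
DISTRIBUTE BY sale_date;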