Hive Explode / Lateral View multiple arrays

I found a very good solution to this problem without using any UDF; posexplode works well: SELECT COOKIE, ePRODUCT_ID, eCAT_ID, eQTY FROM TABLE LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_ID AS seqp, ePRODUCT_ID LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID LATERAL VIEW posexplode(QTY) eQTY AS seqq, eQTY WHERE seqp = seqc AND seqc = … Read more
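
For illustration, here is a minimal sketch of the same posexplode pattern against a hypothetical orders table with parallel array columns product_id and qty (all names here are assumptions, not taken from the original answer):

-- Hypothetical table `orders` with parallel arrays product_id and qty.
-- posexplode emits (position, element) pairs; equating the positions keeps
-- the elements of the two arrays aligned row by row.
SELECT cookie,
       p.prod     AS product_id,
       q.quantity AS qty
FROM orders
LATERAL VIEW posexplode(product_id) p AS pos_p, prod
LATERAL VIEW posexplode(qty) q AS pos_q, quantity
WHERE pos_p = pos_q;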

How to convert ISO Date to UTC date in Hive

Hive understands this format: 'yyyy-MM-dd HH:mm:ss.SSS'. Use unix_timestamp() to convert to seconds passed since 1970-01-01, then use from_unixtime() to convert to the proper format: select from_unixtime(UNIX_TIMESTAMP("2017-01-01T05:01:10Z", "yyyy-MM-dd'T'HH:mm:ss'Z'"), "yyyy-MM-dd HH:mm:ss"); Result: 2017-01-01 05:01:10 Update: another method is to remove the Z and replace the T with a space using regexp_replace, and convert to timestamp if necessary, without using unix_timestamp(); this will … Read more
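
The regexp_replace variant mentioned in the update is cut off above; a hedged sketch of how it might look (the literal value is only an example):

-- Drop the trailing 'Z' and turn the 'T' into a space so the string matches
-- Hive's native 'yyyy-MM-dd HH:mm:ss' layout, then cast if a real timestamp is needed.
SELECT CAST(regexp_replace('2017-01-01T05:01:10Z', '(.+)T(.+)Z', '$1 $2') AS timestamp);
-- returns the timestamp 2017-01-01 05:01:10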

How to convert .txt file to Hadoop’s sequence file format

So the simplest answer is just an "identity" job that has SequenceFile output. It looks like this in Java: public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException { Configuration conf = new Configuration(); Job job = new Job(conf); job.setJobName("Convert Text"); job.setJarByClass(Mapper.class); job.setMapperClass(Mapper.class); job.setReducerClass(Reducer.class); // increase if you need sorting or a special … Read more
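
If Hive is already available, a Hive-only alternative (a different route than the Java identity job above) can perform the same conversion; the path and table names below are hypothetical:

-- External table over the existing .txt files (assumed to hold one line per row).
CREATE EXTERNAL TABLE raw_text (line STRING)
STORED AS TEXTFILE
LOCATION '/data/raw_text';

-- Target table stored as SequenceFile; the INSERT rewrites the data in that format.
CREATE TABLE text_as_seq (line STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE text_as_seq
SELECT line FROM raw_text;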

java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Starting the Hive metastore service worked for me. First, set up the database for the Hive metastore and start the service: $ hive --service metastore (see https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/validate_installation.html). Second, run the following commands to initialize and verify the metastore schema: $ schematool -dbType mysql -initSchema $ schematool -dbType mysql -info (see https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool)
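
Once the schema is initialized and the metastore service is running, a quick smoke test from the Hive CLI (the table name here is made up) confirms the SessionHiveMetaStoreClient error is gone:

-- Any statement that touches the metastore will fail fast if the client still cannot be instantiated.
SHOW DATABASES;
CREATE TABLE IF NOT EXISTS metastore_smoke_test (id INT);
DROP TABLE IF EXISTS metastore_smoke_test;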

Hive query performance for high cardinality field

Use ORC with bloom filters: CREATE TABLE employee ( employee_id bigint, name STRING ) STORED AS ORC TBLPROPERTIES ("orc.bloom.filter.columns"="employee_id"); Enable PPD with vectorization, and use CBO and Tez: SET hive.optimize.ppd=true; SET hive.optimize.ppd.storage=true; SET hive.vectorized.execution.enabled=true; SET hive.vectorized.execution.reduce.enabled=true; SET hive.cbo.enable=true; SET hive.stats.autogather=true; SET hive.compute.query.using.stats=true; SET hive.stats.fetch.partition.stats=true; SET hive.execution.engine=tez; SET hive.stats.fetch.column.stats=true; SET hive.map.aggr=true; SET hive.tez.auto.reducer.parallelism=true; Ref: … Read more
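
As a usage sketch (not part of the original answer), the bloom filter only pays off when statistics are gathered and queries actually filter on the indexed column; the lookup value is arbitrary:

-- Gather table and column statistics so the CBO/stats settings above have data to work with.
ANALYZE TABLE employee COMPUTE STATISTICS;
ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS;

-- Point lookup on the high-cardinality column: with PPD enabled, ORC stripes whose
-- bloom filter rules out this employee_id can be skipped.
SELECT name
FROM employee
WHERE employee_id = 123456789;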

Use collect_list and collect_set in Spark SQL

Spark 2.0+: SPARK-10605 introduced a native collect_list and collect_set implementation. SparkSession with Hive support or HiveContext is no longer required. Spark 2.0-SNAPSHOT (before 2016-05-03): You have to enable Hive support for a given SparkSession: In Scala: val spark = SparkSession.builder .master("local") .appName("testing") .enableHiveSupport() // <- enable Hive support. .getOrCreate() In Python: spark = (SparkSession.builder .enableHiveSupport() .getOrCreate()) … Read more
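
For reference, the same aggregations can be written directly in Spark SQL; the purchases table and its columns are assumptions made for this sketch:

-- collect_list keeps duplicates (one entry per input row); collect_set de-duplicates.
SELECT user_id,
       collect_list(item) AS all_items,
       collect_set(item)  AS distinct_items
FROM purchases
GROUP BY user_id;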

Create Table in Hive with one file

There are many possible solutions: 1) Add DISTRIBUTE BY partition key at the end of your query. There may be many partitions per reducer, with each reducer creating files for each partition; distributing by the partition key may reduce the number of files and memory consumption as well. The hive.exec.reducers.bytes.per.reducer setting defines how much data each reducer will process. … Read more
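
A sketch of suggestion 1, using a hypothetical partitioned target table and an illustrative reducer size (tune both to your data volumes):

-- Allow dynamic partitioning and aim for roughly 64 MB of input per reducer;
-- larger values mean fewer reducers and therefore fewer output files.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.reducers.bytes.per.reducer=67108864;

-- DISTRIBUTE BY the partition key sends each partition's rows to a single reducer,
-- so every reducer writes files only for the partitions it owns.
INSERT OVERWRITE TABLE sales_by_day PARTITION (sale_date)
SELECT item, amount, sale_date
FROM staging_sales
DISTRIBUTE BY sale_date;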