How to calculate the median in Spark sqlContext for a column of data type double

For non-integral values you should use the percentile_approx UDF:

```scala
import org.apache.spark.mllib.random.RandomRDDs

val df = RandomRDDs.normalRDD(sc, 1000, 10, 1).map(Tuple1(_)).toDF("x")
df.registerTempTable("df")

sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df").show
// +--------------------+
// |                 _c0|
// +--------------------+
// |0.035379710486199915|
// +--------------------+
```

On a side note, you should use GROUP BY, not PARTITION BY. The latter is used for window functions and … Read more
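percentile_approx also takes an optional third argument controlling the accuracy of the estimate, and it combines naturally with GROUP BY for per-group medians. A short sketch (x and df come from the answer above; the grouping column g is hypothetical):

```sql
-- percentile_approx(col, percentile [, B]): larger B trades memory for
-- accuracy (B defaults to 10000); 0.5 asks for the median
SELECT percentile_approx(x, 0.5, 10000) FROM df;

-- Per-group median: use GROUP BY (not PARTITION BY) with the aggregate,
-- assuming a hypothetical grouping column g
SELECT g, percentile_approx(x, 0.5) FROM df GROUP BY g;
```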

How do I limit the number of rows per field value in SQL?

Unfortunately, MySQL doesn't have analytic (window) functions, so you have to play with variables. Supposing you have an auto-increment field:

```sql
mysql> create table mytab (
    ->   id int not null auto_increment primary key,
    ->   first_column int,
    ->   second_column int
    -> ) engine = myisam;
Query OK, 0 rows affected (0.05 sec)

mysql> insert into mytab (first_column, second_column)
```

… Read more
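For reference, this is the general shape of the variable trick the answer builds toward: emulating ROW_NUMBER() with session variables and filtering on the computed rank. A sketch only, reusing the mytab columns from the answer; the cutoff of 2 rows per value is arbitrary:

```sql
-- Emulate ROW_NUMBER() OVER (PARTITION BY first_column ORDER BY id)
SET @rn := 0, @prev := NULL;

SELECT id, first_column, second_column
FROM (
  SELECT id, first_column, second_column,
         @rn := IF(@prev = first_column, @rn + 1, 1) AS rn,
         @prev := first_column AS prev
  FROM mytab
  ORDER BY first_column, id
) ranked
WHERE rn <= 2;  -- keep at most 2 rows per first_column value
```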

Hive loading in partitioned table

Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables. The quick context is that LOAD DATA simply copies files; it doesn't read them, so it cannot figure out what to partition on. He suggests that you load the data into an intermediate table first (or use an external table pointing to all the files) and … Read more
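A minimal sketch of that intermediate-table approach, using Hive's dynamic partitioning (the table names staging_events and events and the partition column dt are hypothetical, not from the answer):

```sql
-- Dynamic partitioning must be enabled first
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- LOAD DATA only moves files, so target a non-partitioned staging table
LOAD DATA INPATH '/tmp/raw_data' INTO TABLE staging_events;

-- The INSERT actually reads the rows, so Hive can route each one
-- to its partition based on the trailing dt column
INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT col1, col2, dt
FROM staging_events;
```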

Update, SET option in Hive

```sql
INSERT OVERWRITE TABLE _tableName_ PARTITION (_partitionColumn_ = _partitionValue_)
SELECT [other Things],
       CASE WHEN id = 206 THEN 'florida' ELSE location END AS location,
       [other Other Things]
FROM _tableName_
WHERE [_whereClause_];
```

You can have multiple partition columns listed by separating them with commas: … PARTITION (_partitionColumn1_ = _partitionValue1_, _partitionColumn2_ = _partitionValue2_, …). I haven't done this with multiple partitions, just one at … Read more
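For concreteness, here is a hypothetical instance of that pattern: a customers table partitioned by country, where the goal is to emulate UPDATE customers SET location = 'florida' WHERE id = 206. This assumes the only non-partition columns are id, name, and location (all names here are illustrative, not from the answer):

```sql
-- Rewrite the affected partition in full; every non-partition column
-- must appear in the SELECT, changed or not
INSERT OVERWRITE TABLE customers PARTITION (country = 'US')
SELECT id,
       name,
       CASE WHEN id = 206 THEN 'florida' ELSE location END AS location
FROM customers
WHERE country = 'US';
```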

How can I convert an array to a string in Hive SQL?

Use the concat_ws(string delimiter, array<string>) function to concatenate the array:

```sql
select actor, concat_ws(',', collect_set(date)) as grpdate
from actor_table
group by actor;
```

If the date field is not a string, convert it to string first: concat_ws(',', collect_set(cast(date as string))). Read also this answer about alternative ways if you already have an array (of int) and do not want to explode it … Read more
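When the column is already an array<string>, concat_ws applies directly with no collect_set or GROUP BY. A sketch with a hypothetical table (articles and tags are not from the answer):

```sql
-- Hypothetical schema: articles(id INT, tags ARRAY<STRING>)
SELECT id, concat_ws('|', tags) AS tag_list
FROM articles;
```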

How to update partition metadata in Hive, when partition data is manually deleted from HDFS

EDIT: Starting with Hive 3.0.0, MSCK can discover new partitions or remove missing partitions (or both) using the following syntax:

```sql
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]
```

This was implemented in HIVE-17824. As correctly stated by HakkiBuyukcengiz, MSCK REPAIR doesn't remove partitions if the corresponding folder on HDFS was manually deleted; it only … Read more
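A quick sketch of how that reads in practice (mytable and the partition column dt are hypothetical names):

```sql
-- Hive 3.0.0+: drop partition metadata whose HDFS directories are gone
MSCK REPAIR TABLE mytable DROP PARTITIONS;
-- or reconcile in both directions (add new, drop missing) at once
MSCK REPAIR TABLE mytable SYNC PARTITIONS;

-- On older Hive versions, drop the stale partition explicitly instead
ALTER TABLE mytable DROP IF EXISTS PARTITION (dt = '2018-01-01');
```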