Hive – Unpivot functionality in hive

Whenever I want to pivot a table in Hive, I collect key:value pairs to a map and then reference each key in the next level, creating new columns. This is the opposite of that. Query:

    select a.userid, y.new_id from ( select new_id, fruit_name, fruit_code from ( select new_id, map("apple_id", apple_id, "mango_id", mango_id, "grape_id", …

How to Access Hive via Python?

I believe the easiest way is to use PyHive. To install it, you’ll need these libraries:

    pip install sasl
    pip install thrift
    pip install thrift-sasl
    pip install PyHive

Please note that although you install the library as PyHive, you import the module as pyhive, all lower-case. If you’re on Linux, you may need to install SASL …
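As a minimal sketch of what a query through PyHive could look like once the packages are installed: the host, port, and username below are placeholders, and the two helper names are hypothetical, not part of PyHive. The PyHive import is done lazily so that the generic helper can be tried against any DB-API connection.

```python
from contextlib import closing


def open_hive_connection(host="localhost", port=10000, username="hive"):
    """Open a HiveServer2 connection. All three defaults are placeholders."""
    # Installed as PyHive, imported as pyhive (all lower-case).
    # Imported here so run_query() stays usable without PyHive installed.
    from pyhive import hive

    return hive.Connection(host=host, port=port, username=username)


def run_query(connection, sql):
    """Run one statement on a DB-API 2.0 connection and return all rows."""
    with closing(connection.cursor()) as cursor:
        cursor.execute(sql)
        return cursor.fetchall()
```

Against a reachable HiveServer2, something like `run_query(open_hive_connection(), "SHOW TABLES")` should return the table names as a list of tuples.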

What is the difference between Apache Spark SQLContext vs HiveContext?

Spark 2.0+

Spark 2.0 provides native window functions (SPARK-8641) and features some additional improvements in parsing and much better SQL 2003 compliance, so it is significantly less dependent on Hive to achieve core functionality. Because of that, HiveContext (SparkSession with Hive support) seems to be slightly less important.

Spark < 2.0

Obviously if you …

How many mappers and reducers will be created for a partitioned table in Hive?

Mappers: The number of mappers depends on various factors such as how the data is distributed among nodes, the input format, the execution engine and configuration parameters. See also: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

MR uses CombineInputFormat, while Tez uses grouped splits.

Tez:

    set tez.grouping.min-size=16777216; -- 16 MB min split
    set tez.grouping.max-size=1073741824; -- 1 GB max split

MapReduce:

    set mapreduce.input.fileinputformat.split.minsize=16777216; -- …
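The split sizes in these settings are given in bytes, which makes them easy to mistype. The sketch below only does the arithmetic for the values quoted above and prints the matching `set` statements; the `settings` dictionary is a hypothetical helper, not a Hive or Tez API.

```python
MIB = 1024 ** 2  # bytes in one mebibyte
GIB = 1024 ** 3  # bytes in one gibibyte

# Property names and sizes are the ones quoted in the answer above.
settings = {
    "tez.grouping.min-size": 16 * MIB,
    "tez.grouping.max-size": 1 * GIB,
    "mapreduce.input.fileinputformat.split.minsize": 16 * MIB,
}

# Print the statements in the form Hive expects.
for name, size_in_bytes in settings.items():
    print(f"set {name}={size_in_bytes};")
```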

How to update table in Hive 0.13?

You can use row_number or a full join. Here is an example using row_number:

    insert overwrite table table_1 select customer_id, items, price, updated_date from ( select customer_id, items, price, updated_date, row_number() over(partition by customer_id order by new_flag desc) rn from ( select customer_id, items, price, updated_date, 0 as new_flag from table_1 union all select customer_id, items, price, updated_date, …

Find last day of a month in Hive

As of Hive 1.1.0, the last_day(string date) function is available. It returns the last day of the month to which the date belongs. date is a string in the format ‘yyyy-MM-dd HH:mm:ss’ or ‘yyyy-MM-dd’. The time part of date is ignored.
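To sanity-check results on the client side, the behaviour described above can be mirrored in a few lines of Python. This is only an illustrative sketch of the semantics (accept either format, ignore the time part, return a 'yyyy-MM-dd' string), not Hive's implementation.

```python
import calendar
from datetime import datetime


def last_day(date_str):
    """Return the last day of the month of date_str as 'YYYY-MM-DD'."""
    # Only the first 10 characters (the date part) matter; any time part is ignored.
    day = datetime.strptime(date_str[:10], "%Y-%m-%d").date()
    # monthrange() returns (weekday of the 1st, number of days in the month).
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    return day.replace(day=days_in_month).isoformat()
```

For example, `last_day("2016-02-10 12:34:56")` gives `"2016-02-29"`, because 2016 is a leap year.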

hive regexp_extract weirdness

From the docs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) it appears that regexp_extract() works per record/line on the data you wish to extract. It seems to return the first match found and then stop, as opposed to matching globally. Therefore the index references the capture group: 0 = the entire match, 1 = capture group 1, 2 = capture group 2, …
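The indexing described above can be tried out locally with Python's `re` module, whose group numbering works the same way. This is an analogy rather than Hive code: Hive uses Java regular expressions, and returning `None` when nothing matches is a choice made for this sketch only.

```python
import re


def regexp_extract(subject, pattern, index):
    """Return capture group `index` of the first match, or None if no match."""
    # re.search() stops at the first match, like the first-found behaviour above.
    match = re.search(pattern, subject)
    if match is None:
        return None
    # Index 0 is the entire match; 1, 2, ... are the capture groups.
    return match.group(index)


print(regexp_extract("foothebar", r"foo(.*?)(bar)", 0))  # entire match
print(regexp_extract("foothebar", r"foo(.*?)(bar)", 1))  # capture group 1
print(regexp_extract("foothebar", r"foo(.*?)(bar)", 2))  # capture group 2
```

With the input string `foothebar`, the three calls select the whole match `foothebar`, then `the`, then `bar`.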