hive - w3toppers.com

How to load data to hive from HDFS without removing the source file?

From your question I assume that you already have your data in HDFS, so you don't need LOAD DATA, which moves the files to the default Hive location /user/hive/warehouse. You can simply define the table using the external keyword, which leaves the files in place but creates the table definition in the Hive metastore. See …

Hive: Add partitions for existing folder structure

Use the msck repair table command:

MSCK [REPAIR] TABLE tablename;

or, if you are running Hive on EMR:

ALTER TABLE tablename RECOVER PARTITIONS;

Read more details about both commands here: RECOVER PARTITIONS

Hive – Unpivot functionality in hive

Whenever I want to pivot a table in Hive, I collect key:value pairs into a map and then reference each key at the next level, creating new columns. This is the opposite of that.

Query:

select a.userid, y.new_id
from (
  select new_id, fruit_name, fruit_code
  from (
    select new_id, map("apple_id", apple_id, "mango_id", mango_id, "grape_id", …
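The map-then-explode idea described above can be illustrated outside Hive. The following is a plain Python sketch, not the truncated query itself: `unpivot` is a hypothetical helper, and the sample row is made up; only the column names `userid`, `apple_id`, `mango_id`, and `grape_id` come from the excerpt.

```python
def unpivot(rows, id_column, value_columns):
    """Turn each wide row into one (id, column_name, value) row per value column."""
    result = []
    for row in rows:
        # Step 1: collect the columns into a map of column name -> value.
        pairs = {name: row[name] for name in value_columns}
        # Step 2: "explode" the map into one output row per entry.
        for key, value in pairs.items():
            result.append((row[id_column], key, value))
    return result


# Hypothetical sample data, for illustration only.
wide_rows = [{"userid": 1, "apple_id": 10, "mango_id": 20, "grape_id": 30}]
long_rows = unpivot(wide_rows, "userid", ["apple_id", "mango_id", "grape_id"])
```

Each input row yields as many output rows as there are value columns, which is the shape an unpivot is expected to produce.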

How to Access Hive via Python?

I believe the easiest way is to use PyHive. To install it, you'll need these libraries:

pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive

Please note that although you install the library as PyHive, you import the module as pyhive, all lower-case. If you're on Linux, you may need to install SASL …
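Once the packages are installed, a query might look roughly like the sketch below. The host `localhost`, port `10000`, username `hive_user`, and the `SHOW TABLES` query are placeholder assumptions, not values from the answer, and `connect_to_hive` and `run_query` are hypothetical helper names.

```python
def connect_to_hive(host="localhost", port=10000, username="hive_user"):
    """Open a HiveServer2 connection via PyHive (all defaults are placeholders)."""
    # Installed as PyHive, imported as pyhive (all lower-case).
    from pyhive import hive

    return hive.Connection(host=host, port=port, username=username)


def run_query(conn, query):
    """Run a query on a DB-API style connection and return all rows."""
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        cursor.close()


# Usage, assuming a reachable HiveServer2 instance:
#   conn = connect_to_hive()
#   print(run_query(conn, "SHOW TABLES"))
```

Keeping the import inside `connect_to_hive` means `run_query` can be exercised against any DB-API connection without a Hive server.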

What is the difference between Apache Spark SQLContext vs HiveContext?

Spark 2.0+

Spark 2.0 provides native window functions (SPARK-8641) and features some additional improvements in parsing and much better SQL 2003 compliance, so it is significantly less dependent on Hive to achieve core functionality. Because of that, HiveContext (SparkSession with Hive support) seems to be slightly less important.

Spark < 2.0

Obviously if you …

How many mappers and reducers will get created for a partitioned table in Hive?

Mappers: The number of mappers depends on various factors, such as how the data is distributed among nodes, the input format, the execution engine, and configuration parameters. See also here: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

MR uses CombineInputFormat, while Tez uses grouped splits.

Tez:

set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split

MapReduce:

set mapreduce.input.fileinputformat.split.minsize=16777216; -- …
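The byte values in the Tez settings above are just megabytes multiplied out. A small Python sketch of that arithmetic follows; `tez_grouping_settings` is a hypothetical helper that only formats the two Tez statements quoted above and does not talk to Hive.

```python
BYTES_PER_MB = 1024 * 1024


def tez_grouping_settings(min_mb, max_mb):
    """Format the two Tez grouping settings for sizes given in megabytes."""
    return [
        f"set tez.grouping.min-size={min_mb * BYTES_PER_MB};",
        f"set tez.grouping.max-size={max_mb * BYTES_PER_MB};",
    ]


# 16 MB minimum and 1024 MB (1 GB) maximum, as in the answer.
settings = tez_grouping_settings(16, 1024)
```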

How to update table in Hive 0.13?

You can use row_number or a full join. Here is an example using row_number:

insert overwrite table table_1
select customer_id, items, price, updated_date
from (
  select customer_id, items, price, updated_date,
         row_number() over(partition by customer_id order by new_flag desc) rn
  from (
    select customer_id, items, price, updated_date, 0 as new_flag from table_1
    union all
    select customer_id, items, price, updated_date, …
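The intent of the row_number approach can be sketched in plain Python. Since the query above is cut off, this rests on assumptions: that the second branch of the union tags incoming rows with a new_flag of 1, and that only the top-ranked row per customer_id is kept. `merge_latest` and the sample rows are hypothetical.

```python
def merge_latest(existing_rows, new_rows, key="customer_id"):
    """Keep one row per key, preferring rows from new_rows over existing_rows."""
    # Tag rows like the union does: 0 for existing rows, 1 for new rows
    # (the value 1 is an assumption; the original query is truncated).
    tagged = [(0, row) for row in existing_rows] + [(1, row) for row in new_rows]
    best = {}
    for flag, row in tagged:
        k = row[key]
        # Mirrors ranking by new_flag descending and keeping the first row per key.
        if k not in best or flag > best[k][0]:
            best[k] = (flag, row)
    return [row for _, row in best.values()]


# Hypothetical sample data, for illustration only.
existing = [{"customer_id": 1, "price": 10}, {"customer_id": 2, "price": 20}]
incoming = [{"customer_id": 2, "price": 25}, {"customer_id": 3, "price": 30}]
merged = merge_latest(existing, incoming)
```

Customer 1 is kept unchanged, customer 2 is replaced by the incoming row, and customer 3 is added, which is the effect an update by overwrite is after.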

Find last day of a month in Hive

As of Hive 1.1.0, the last_day(string date) function is available.

last_day(string date): Returns the last day of the month to which the date belongs. date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time part of date is ignored.
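To make the described behavior concrete, here is a Python sketch that mirrors it. This is not Hive's implementation, only an illustration of the documented semantics; the function name is reused for readability.

```python
import calendar
from datetime import datetime


def last_day(date_string):
    """Return the last day of the month for 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss' input."""
    # Only the date part matters; any time part is ignored.
    date_part = date_string.strip().split(" ")[0]
    parsed = datetime.strptime(date_part, "%Y-%m-%d")
    # monthrange returns (weekday of the first day, number of days in the month).
    days_in_month = calendar.monthrange(parsed.year, parsed.month)[1]
    return parsed.replace(day=days_in_month).strftime("%Y-%m-%d")
```

For example, `last_day("2016-02-10 12:30:00")` gives `"2016-02-29"`, because 2016 is a leap year and the time part is dropped.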

Hive unable to manually set number of reducers

Writing a query in Hive like this:

SELECT COUNT(DISTINCT id) …

will always result in using only one reducer. You should:

1. use this command to set the desired number of reducers: set mapred.reduce.tasks=50
2. rewrite the query as follows: SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM … ) t;

This will result in 2 map+reduce jobs instead of …
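The rewrite only changes how the work is staged, not the answer. As a quick sanity check, the sketch below runs both query shapes against an in-memory SQLite database from Python. SQLite is a stand-in here, so this says nothing about reducers; the table name `events` and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany(
    "INSERT INTO events (id) VALUES (?)",
    [(1,), (2,), (2,), (3,), (3,), (3,)],
)

# Original shape: a single aggregate over distinct values.
single_stage = conn.execute("SELECT COUNT(DISTINCT id) FROM events").fetchone()[0]

# Rewritten shape: deduplicate in a subquery, then count the rows.
two_stage = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT id FROM events) t"
).fetchone()[0]
```

Both variables hold the same count, so the rewritten query can replace the original wherever only the result matters.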

hive regexp_extract weirdness

From the docs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF), it appears that regexp_extract() is a record/line extraction of the data you wish to extract. It seems to stop at the first match rather than matching globally. The index therefore references the capture group:

0 = the entire match
1 = capture group 1
2 = capture group 2, …
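The index semantics described above can be tried out with Python's `re` module. This is a sketch only: Hive uses Java regular expressions, so pattern syntax can differ in places, and returning `None` when nothing matches is a choice made here, not a statement about Hive.

```python
import re


def regexp_extract(subject, pattern, index):
    """Return capture group `index` of the first match, or None if nothing matches."""
    # re.search stops at the first match; it does not match globally.
    match = re.search(pattern, subject)
    if match is None:
        return None
    # Index 0 is the entire match, 1 is the first capture group, and so on.
    return match.group(index)
```

With the subject `"foothebar"` and the pattern `"foo(.*?)(bar)"`, index 0 yields the whole match, index 1 yields `"the"`, and index 2 yields `"bar"`.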