hive - w3toppers.com

HiveQL: Using query results as variables

Hive substitutes variables as is and does not execute them. Use a shell wrapper script to capture the query result in a variable and pass it to your Hive script:
    maximo=$(hive -e "set hive.cli.print.header=false; select max(var) from table;")
    hive -hiveconf "maximo"="$maximo" -f your_hive_script.hql
After this, inside your script you can use:
    select '${hiveconf:maximo}'
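The same wrapper can also be sketched in Python rather than shell. This is a minimal sketch, assuming the `hive` CLI is on the PATH; the helper names `build_hive_command` and `run_with_max` are hypothetical, while the query, the variable name `maximo`, and the script name come from the snippet above.

```python
import subprocess


def build_hive_command(script_path, variables):
    """Build the argument list for `hive -hiveconf name=value ... -f script`."""
    cmd = ["hive"]
    for name, value in variables.items():
        cmd += ["-hiveconf", f"{name}={value}"]
    cmd += ["-f", script_path]
    return cmd


def run_with_max(script_path="your_hive_script.hql"):
    """Run the first query, capture its output, and pass it to the script.

    Assumes the `hive` CLI is installed; nothing runs until this is called.
    """
    query = "set hive.cli.print.header=false; select max(var) from table;"
    result = subprocess.run(
        ["hive", "-e", query], capture_output=True, text=True, check=True
    )
    maximo = result.stdout.strip()  # the query result as plain text
    command = build_hive_command(script_path, {"maximo": maximo})
    return subprocess.run(command, check=True)
```

Passing the arguments as a list avoids a second round of shell quoting, so the captured value reaches Hive exactly as the first query printed it.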

What is the difference between partitioning and bucketing a table in Hive ?

Partitioning data is often used for distributing load horizontally; this has a performance benefit and helps organize data in a logical fashion. Example: suppose we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department. For a faster query response …
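To make the layout concrete: Hive stores each partition in its own directory named `column=value`, so a filter on a partition column only needs to read the matching directories. The sketch below imitates that layout in plain Python. The warehouse path, column names, and helper functions are hypothetical, and the bucket helper assumes the classic scheme in which an integer key's hash is the value itself (other types and newer bucketing versions hash differently).

```python
def partition_path(table_root, **partition_values):
    """Return the directory a row with these partition values would land in."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([table_root] + parts)


def prune(paths, column, value):
    """Keep only partition directories matching column=value (partition pruning)."""
    wanted = f"{column}={value}"
    return [p for p in paths if wanted in p.split("/")]


def bucket_for(key, num_buckets):
    """Pick a bucket for a non-negative integer key (assumed hash(key) == key)."""
    return key % num_buckets


paths = [
    partition_path("/warehouse/employees", country="US", dept="HR"),
    partition_path("/warehouse/employees", country="US", dept="IT"),
    partition_path("/warehouse/employees", country="DE", dept="HR"),
]
us_only = prune(paths, "country", "US")
```

Partitioning decides which directory a row goes to based on the column value, while bucketing spreads rows over a fixed number of files based on a hash of the key.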

How to control partition size in Spark SQL

Spark < 2.0: You can use the Hadoop configuration options mapred.min.split.size and mapred.max.split.size, as well as the HDFS block size, to control partition size for filesystem-based formats*.
    val minSplit: Int = ???
    val maxSplit: Int = ???
    sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
    sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
Spark 2.0+: You can use the spark.sql.files.maxPartitionBytes configuration:
    spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
In both cases these values may not …
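As a rough guide to how the minimum, maximum, and block size interact, Hadoop's newer `FileInputFormat` chooses a split size of `max(minSize, min(maxSize, blockSize))`. The sketch below applies that formula. The function names and sample sizes are illustrative assumptions, and the real partition count also depends on the input format and on whether the files are splittable.

```python
import math

MB = 1024 * 1024


def split_size(min_split, max_split, block_size):
    """Split size per the formula max(minSize, min(maxSize, blockSize))."""
    return max(min_split, min(max_split, block_size))


def estimated_splits(file_size, min_split, max_split, block_size):
    """Rough estimate of the split count for one splittable file."""
    return math.ceil(file_size / split_size(min_split, max_split, block_size))


# A 1 GiB file on 128 MiB blocks, capped at 64 MiB per split.
smaller = estimated_splits(1024 * MB, 1, 64 * MB, 128 * MB)
# The same file with the minimum raised to 256 MiB.
larger = estimated_splits(1024 * MB, 256 * MB, 512 * MB, 128 * MB)
```

Lowering the maximum below the block size yields more, smaller partitions; raising the minimum above the block size yields fewer, larger ones.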

Execute Hive Query with IN clause parameters in parallel

There is no need to read the same data many times in separate queries to achieve better parallelism. Tune mapper and reducer parallelism properly to achieve that instead. First of all, enable predicate pushdown (PPD) with vectorization, and use the cost-based optimizer (CBO) and Tez:
    SET hive.optimize.ppd=true;
    SET hive.optimize.ppd.storage=true;
    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled=true;
    SET hive.cbo.enable=true;
    set hive.stats.autogather=true;
    set hive.compute.query.using.stats=true;
    set …

How to create SparkSession with Hive support (fails with “Hive classes are not found”)?

Add the following dependency to your Maven project:
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.0.0</version>
    </dependency>

HIVE select count() non null returns higher value than select count()

Most probably your query without WHERE is using statistics because this parameter is set:
    set hive.compute.query.using.stats=true;
Try setting it to false and executing the query again. Alternatively, you can compute statistics on the table; see the ANALYZE TABLE syntax. It is also possible to gather statistics automatically during INSERT OVERWRITE:
    set hive.stats.autogather=true;

How to set variables in HIVE scripts

You need to use the special hiveconf namespace for variable substitution, e.g.:
    hive> set CURRENT_DATE='2012-09-16';
    hive> select * from foo where day >= ${hiveconf:CURRENT_DATE}
Similarly, you could pass it on the command line:
    % hive -hiveconf CURRENT_DATE='2012-09-16' -f test.hql
Note that there are env and system variables as well, so you can reference ${env:USER}, for example. To see …
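Because the substitution is purely textual, the quotes must be part of the variable's value (or of the query itself) for the result to be a string literal. The following sketch imitates that behavior with a regular expression. It is not Hive's implementation, and `substitute_hiveconf` is a hypothetical helper; leaving unknown variables untouched is an assumption of this sketch.

```python
import re

# Matches references such as ${hiveconf:CURRENT_DATE}.
_HIVECONF_REF = re.compile(r"\$\{hiveconf:([^}]+)\}")


def substitute_hiveconf(query, variables):
    """Replace each ${hiveconf:name} with its value, verbatim."""

    def replace(match):
        name = match.group(1)
        # Unknown names are left as they are (an assumption of this sketch).
        return variables.get(name, match.group(0))

    return _HIVECONF_REF.sub(replace, query)


query = "select * from foo where day >= ${hiveconf:CURRENT_DATE}"

# The value carries its own quotes, so the result contains a string literal.
quoted = substitute_hiveconf(query, {"CURRENT_DATE": "'2012-09-16'"})

# Without quotes in the value, the result is no longer a string literal.
unquoted = substitute_hiveconf(query, {"CURRENT_DATE": "2012-09-16"})
```

Comparing `quoted` and `unquoted` shows why it matters whether the quotes survive the shell when the value is passed with -hiveconf.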

SQL split comma separated row [duplicate]

You can do it with pure SQL like this:
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.values, ',', n.n), ',', -1) value
      FROM table1 t CROSS JOIN
    (
      SELECT a.N + b.N * 10 + 1 n
        FROM
        (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT …
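The nested SUBSTRING_INDEX calls can be easier to follow outside SQL. This sketch re-implements the documented behavior of MySQL's `SUBSTRING_INDEX` in Python and uses it to extract the n-th item, with `n` playing the role of the generated numbers table. The function names and the sample string are illustrative, and limiting `n` to the number of items is an assumption about how the full query bounds the join.

```python
def substring_index(text, delim, count):
    """Emulate MySQL SUBSTRING_INDEX(text, delim, count)."""
    parts = text.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter.
        return delim.join(parts[:count])
    if count < 0:
        # Everything to the right of the |count|-th delimiter from the end.
        return delim.join(parts[count:])
    return ""


def nth_item(text, n, delim=","):
    """Inner call keeps the first n items; outer call keeps the last of those."""
    return substring_index(substring_index(text, delim, n), delim, -1)


def split_row(text, delim=","):
    """Produce one value per item, like joining against numbers 1..item_count."""
    item_count = text.count(delim) + 1
    return [nth_item(text, n, delim) for n in range(1, item_count + 1)]


values = split_row("somethingA,somethingB,somethingC")
```

Without the bound on `n`, every number beyond the item count would simply repeat the last item, which is why the join needs a limit.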

Hive: Best way to do incremental updates on a main table

If MERGE in ACID mode is not applicable, it is possible to update using FULL OUTER JOIN or using UNION ALL + row_number. To find all entries that will be updated, you can join the increment data with the old data:
    insert overwrite target_data [partition() if applicable]
    SELECT
      --select new if exists, old if not exists
      case …
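The FULL OUTER JOIN approach boils down to one rule per key: take the new row if one exists, otherwise keep the old row. Here is a minimal Python sketch of that rule, using dictionaries keyed by a hypothetical `id` in place of the old and increment tables.

```python
def merge_increment(old_rows, new_rows):
    """Merge two {key: row} mappings, preferring the increment on conflict."""
    merged = {}
    # The union of keys plays the role of the FULL OUTER JOIN.
    for key in old_rows.keys() | new_rows.keys():
        # Select new if it exists, old if it does not.
        merged[key] = new_rows[key] if key in new_rows else old_rows[key]
    return merged


old_data = {1: "alice", 2: "bob"}
increment = {2: "bob (updated)", 3: "carol"}
target = merge_increment(old_data, increment)
```

Key 1 is carried over unchanged, key 2 is replaced by the increment, and key 3 is newly inserted, which mirrors rewriting the target with the joined result.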

How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?

For Spark 1.x, you can set it with:
    System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");
    final SparkConf conf = new SparkConf();
    SparkContext sc = new SparkContext(conf);
    HiveContext hiveContext = new HiveContext(sc);
Or:
    final SparkConf conf = new SparkConf();
    SparkContext sc = new SparkContext(conf);
    HiveContext hiveContext = new HiveContext(sc);
    hiveContext.setConf("hive.metastore.uris", "thrift://METASTORE:9083");
Update: if your Hive is Kerberized, try setting these …