HiveQL: Using query results as variables

Hive substitutes variables as-is and does not execute them. Use a shell wrapper script to capture the query result in a variable and pass it to your Hive script:

    maximo=$(hive -e "set hive.cli.print.header=false; select max(var) from table;")
    hive -hiveconf "maximo"="$maximo" -f your_hive_script.hql

After this, inside your script you can use:

    select '${hiveconf:maximo}'
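
The wrapper approach can be made a little more defensive. The following is a minimal sketch, not a definitive implementation: `HIVE_BIN`, `fetch_scalar` and `run_with_max` are hypothetical names introduced here, and the check for an empty or `NULL` result is an assumption about what you would want to guard against (for example, an empty table).

```shell
# Sketch of a wrapper: capture a single-value query result, validate it,
# then hand it to a Hive script via -hiveconf.
# HIVE_BIN, fetch_scalar and run_with_max are hypothetical names.

HIVE_BIN="${HIVE_BIN:-hive}"

# Run a query that is expected to return exactly one value.
# -S (silent mode) keeps log chatter out of the captured output.
fetch_scalar() {
    "$HIVE_BIN" -S -e "set hive.cli.print.header=false; $1"
}

# Fetch the maximum and pass it to the Hive script given as $1.
run_with_max() {
    maximo=$(fetch_scalar "select max(var) from table;")
    # Stop early if the query returned nothing usable.
    if [ -z "$maximo" ] || [ "$maximo" = "NULL" ]; then
        echo "no value returned for maximo" >&2
        return 1
    fi
    "$HIVE_BIN" -hiveconf maximo="$maximo" -f "$1"
}
```

After sourcing these functions, calling `run_with_max your_hive_script.hql` would run the script with `${hiveconf:maximo}` set to the captured value.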

How to control partition size in Spark SQL

Spark < 2.0: You can use the Hadoop configuration options mapred.min.split.size and mapred.max.split.size, as well as the HDFS block size, to control partition size for filesystem-based formats*.

    val minSplit: Int = ???
    val maxSplit: Int = ???
    sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
    sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)

Spark 2.0+: You can use the spark.sql.files.maxPartitionBytes configuration:

    spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)

In both cases these values may not …

Execute Hive Query with IN clause parameters in parallel

There is no need to read the same data many times in separate queries to achieve better parallelism. Instead, tune mapper and reducer parallelism for a single query. First of all, enable PPD with vectorization, and use CBO and Tez:

    SET hive.optimize.ppd=true;
    SET hive.optimize.ppd.storage=true;
    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled=true;
    SET hive.cbo.enable=true;
    SET hive.stats.autogather=true;
    SET hive.compute.query.using.stats=true;
    set …

HIVE select count(*) non null returns higher value than select count(*)

Most probably your query without the WHERE clause is answered from table statistics, because this parameter is set:

    set hive.compute.query.using.stats=true;

Try setting it to false and executing the query again. Alternatively, you can recompute statistics on the table; see the ANALYZE TABLE syntax. It is also possible to gather statistics automatically during INSERT OVERWRITE:

    set hive.stats.autogather=true;

How to set variables in HIVE scripts

You need to use the special hiveconf namespace for variable substitution, e.g.

    hive> set CURRENT_DATE='2012-09-16';
    hive> select * from foo where day >= ${hiveconf:CURRENT_DATE}

Similarly, you could pass it on the command line:

    % hive -hiveconf CURRENT_DATE='2012-09-16' -f test.hql

Note that there are env and system variables as well, so you can reference ${env:USER}, for example. To see …
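
One detail worth checking when passing a value on the command line is shell quote removal: the shell consumes the single quotes before Hive sees the argument, so the substituted text is unquoted. The sketch below illustrates this without running Hive; `show_arg` is a hypothetical helper that prints an argument exactly as a program would receive it.

```shell
# show_arg is a hypothetical helper: it prints its first argument exactly
# as a program such as hive would receive it after shell quote removal.
show_arg() { printf '%s\n' "$1"; }

# The shell strips the single quotes, so the value arrives bare.
bare=$(show_arg CURRENT_DATE='2012-09-16')

# Wrapping the quoted value in double quotes preserves the single quotes.
quoted=$(show_arg CURRENT_DATE="'2012-09-16'")

echo "$bare"     # prints CURRENT_DATE=2012-09-16
echo "$quoted"   # prints CURRENT_DATE='2012-09-16'
```

Because substitution is textual, a bare value would make the comparison read `day >= 2012-09-16`, which is arithmetic on three numbers and not a date string. Passing `-hiveconf CURRENT_DATE="'2012-09-16'"`, or writing `'${hiveconf:CURRENT_DATE}'` inside the script, keeps it a string literal.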

How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?

For Spark 1.x, you can set the metastore URI as a system property before creating the context:

    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkContext;
    import org.apache.spark.sql.hive.HiveContext;

    System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");
    final SparkConf conf = new SparkConf();
    SparkContext sc = new SparkContext(conf);
    HiveContext hiveContext = new HiveContext(sc);

Or set it on the HiveContext directly:

    final SparkConf conf = new SparkConf();
    SparkContext sc = new SparkContext(conf);
    HiveContext hiveContext = new HiveContext(sc);
    hiveContext.setConf("hive.metastore.uris", "thrift://METASTORE:9083");

Update: If your Hive is Kerberized, try setting these …