How to calculate the median in Spark sqlContext for a column of data type double

For non-integral values you should use the percentile_approx UDF:

```scala
import org.apache.spark.mllib.random.RandomRDDs

val df = RandomRDDs.normalRDD(sc, 1000, 10, 1).map(Tuple1(_)).toDF("x")
df.registerTempTable("df")

sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df").show
// +--------------------+
// |                 _c0|
// +--------------------+
// |0.035379710486199915|
// +--------------------+
```

On a side note, you should use GROUP BY, not PARTITION BY. The latter is used for window functions and … Read more
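percentile_approx also takes an optional third argument controlling the accuracy of the estimate, and it combines naturally with GROUP BY for per-group medians. A short sketch (x and df come from the answer above; the grouping column g is hypothetical):

```sql
-- percentile_approx(col, percentile [, B]): larger B trades memory for
-- accuracy (B defaults to 10000); 0.5 asks for the median
SELECT percentile_approx(x, 0.5, 10000) FROM df;

-- Per-group median: use GROUP BY (not PARTITION BY) with the aggregate,
-- assuming a hypothetical grouping column g
SELECT g, percentile_approx(x, 0.5) FROM df GROUP BY g;
```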

How do I limit the number of rows per field value in SQL?

Unfortunately, MySQL doesn't have analytic (window) functions, so you have to play with variables. Supposing you have an auto-increment field:

```sql
mysql> create table mytab (
    ->   id int not null auto_increment primary key,
    ->   first_column int,
    ->   second_column int
    -> ) engine = myisam;
Query OK, 0 rows affected (0.05 sec)

mysql> insert into mytab (first_column, second_column)
```

… Read more
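For reference, this is the general shape of the variable trick the answer builds toward: emulating ROW_NUMBER() with session variables and filtering on the computed rank. A sketch only, reusing the mytab columns from the answer; the cutoff of 2 rows per value is arbitrary:

```sql
-- Emulate ROW_NUMBER() OVER (PARTITION BY first_column ORDER BY id)
SET @rn := 0, @prev := NULL;

SELECT id, first_column, second_column
FROM (
  SELECT id, first_column, second_column,
         @rn := IF(@prev = first_column, @rn + 1, 1) AS rn,
         @prev := first_column AS prev
  FROM mytab
  ORDER BY first_column, id
) ranked
WHERE rn <= 2;  -- keep at most 2 rows per first_column value
```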

Hive loading in partitioned table

Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables. The quick context is that LOAD DATA simply copies files; it doesn't read them, so it cannot figure out what to partition on. He suggests that you load the data into an intermediate table first (or use an external table pointing to all the files) and … Read more
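A minimal sketch of that intermediate-table approach, using Hive's dynamic partitioning (the table names staging_events and events and the partition column dt are hypothetical, not from the answer):

```sql
-- Dynamic partitioning must be enabled first
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- LOAD DATA only moves files, so target a non-partitioned staging table
LOAD DATA INPATH '/tmp/raw_data' INTO TABLE staging_events;

-- The INSERT actually reads the rows, so Hive can route each one
-- to its partition based on the trailing dt column
INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT col1, col2, dt
FROM staging_events;
```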

Update, SET option in Hive

```sql
INSERT OVERWRITE TABLE _tableName_ PARTITION (_partitionColumn_ = _partitionValue_)
SELECT [other Things],
       CASE WHEN id = 206 THEN 'florida' ELSE location END AS location,
       [other Other Things]
FROM _tableName_
WHERE [_whereClause_];
```

You can have multiple partition columns listed by separating them with commas: … PARTITION (_partitionColumn1_ = _partitionValue1_, _partitionColumn2_ = _partitionValue2_, …). I haven't done this with multiple partitions, just one at … Read more
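For concreteness, here is a hypothetical instance of that pattern: a customers table partitioned by country, where the goal is to emulate UPDATE customers SET location = 'florida' WHERE id = 206. This assumes the only non-partition columns are id, name, and location (all names here are illustrative, not from the answer):

```sql
-- Rewrite the affected partition in full; every non-partition column
-- must appear in the SELECT, changed or not
INSERT OVERWRITE TABLE customers PARTITION (country = 'US')
SELECT id,
       name,
       CASE WHEN id = 206 THEN 'florida' ELSE location END AS location
FROM customers
WHERE country = 'US';
```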

How can I convert an array to a string in Hive SQL?

Use the concat_ws(string delimiter, array<string>) function to concatenate the array:

```sql
select actor, concat_ws(',', collect_set(date)) as grpdate
from actor_table
group by actor;
```

If the date field is not a string, convert it to string first: concat_ws(',', collect_set(cast(date as string))). Read also this answer about alternative ways if you already have an array (of int) and do not want to explode it … Read more
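When the column is already an array<string>, concat_ws applies directly with no collect_set or GROUP BY. A sketch with a hypothetical table (articles and tags are not from the answer):

```sql
-- Hypothetical schema: articles(id INT, tags ARRAY<STRING>)
SELECT id, concat_ws('|', tags) AS tag_list
FROM articles;
```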

How to update partition metadata in Hive, when partition data is manually deleted from HDFS

EDIT: Starting with Hive 3.0.0, MSCK can discover new partitions or remove missing partitions (or both) using the following syntax:

```sql
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]
```

This was implemented in HIVE-17824. As correctly stated by HakkiBuyukcengiz, MSCK REPAIR doesn't remove partitions if the corresponding folder on HDFS was manually deleted; it only … Read more
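A quick sketch of how that reads in practice (mytable and the partition column dt are hypothetical names):

```sql
-- Hive 3.0.0+: drop partition metadata whose HDFS directories are gone
MSCK REPAIR TABLE mytable DROP PARTITIONS;
-- or reconcile in both directions (add new, drop missing) at once
MSCK REPAIR TABLE mytable SYNC PARTITIONS;

-- On older Hive versions, drop the stale partition explicitly instead
ALTER TABLE mytable DROP IF EXISTS PARTITION (dt = '2018-01-01');
```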