Writing a query in Hive like this:
SELECT COUNT(DISTINCT id) ....
will always use only one reducer.
You should:
- Set the desired number of reducers with this command:
set mapred.reduce.tasks=50;
- Rewrite the query as follows:
SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM … ) t;
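This runs two MapReduce jobs instead of one, but the performance gain is usually substantial: the inner SELECT DISTINCT can be spread across all the reducers, and the outer COUNT(*) only has to count the already de-duplicated rows.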
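To see why the rewritten query parallelizes, here is a rough Python simulation of the two stages. It is only a sketch of the idea, not how Hive is implemented internally; the function name `count_distinct_two_stage` and the `num_reducers` parameter are made up for illustration.

```python
from collections import defaultdict


def count_distinct_two_stage(ids, num_reducers=4):
    """Simulate SELECT COUNT(*) FROM (SELECT DISTINCT id FROM ...) t."""
    # Job 1, shuffle: route every id to a reducer by its hash, so all
    # copies of the same id end up on the same (simulated) reducer.
    partitions = defaultdict(set)
    for value in ids:
        # Job 1, reduce: each reducer keeps only the unique ids it received.
        partitions[hash(value) % num_reducers].add(value)

    # Job 2: count the de-duplicated rows. The partitions are disjoint,
    # so the per-reducer counts can simply be added up.
    return sum(len(unique_ids) for unique_ids in partitions.values())


ids = [1, 2, 2, 3, 3, 3, 10, 10, 42]
print(count_distinct_two_stage(ids))  # 5
```

Because each id always hashes to the same reducer, no id is counted twice, and the expensive de-duplication work is shared by `num_reducers` workers instead of one.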