Hive 处理count distinct 产生数据倾斜处理
问题描述
问题数据倾斜范畴,但是又不可以在Map端join,剔除特殊Key等方法进行处理。
set hive.groupby.skewindata=true; insert overwrite table ad_overall_day partition(part_time='99', part_date='2015-11-99') select account_id, nvl(client_id,-1), nvl(track_id,'total'), sum(if(type=3,1,0)) as imp_cnt, sum(if(type=4,1,0)) as click_cnt, count(distinct if(type=3,zid,NULL)) as imp_uv, count(distinct if(type=4,zid,NULL)) as click_uv from derived_di_v3 where year='2015' and month='11' group by account_id, client_id, track_id grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id))
But error find. FAILED: SemanticException [Error 10022]: DISTINCT on different columns not supported with skew in data
separate group-by and join the results.
set hive.groupby.skewindata=true;
set hive.exec.parallel=true;
insert overwrite table ad_overall_day partition(part_time='99', part_date='2015-11-99') 
SELECT COALESCE(t1.account_id,t2.account_id),COALESCE(t1.client_id,t2.client_id),
    COALESCE(t1.track_id,t2.track_id),t1.imp_cnt,t1.imp_uv,t2.click_cnt,t2.click_uv
FROM
(select account_id, nvl(client_id,-1) as client_id, nvl(track_id,'total') as track_id, 
sum(if(type=3,1,0)) as imp_cnt, count(distinct if(type=3,zid,NULL)) as imp_uv
FROM derived_di_v3 where year='2015' and month='11' 
group by account_id, client_id, track_id 
grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id)) ) t1
FULL OUTER JOIN
(select account_id, nvl(client_id,-1) as client_id, nvl(track_id,'total') as track_id, 
sum(if(type=4,1,0)) as click_cnt, count(distinct if(type=4,zid,NULL)) as click_uv
FROM derived_di_v3 where year='2015' and month='11' 
group by account_id, client_id, track_id 
grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id)) ) t2  
ON t1.account_id = t2.account_id and t1.client_id = t2.client_id and t1.track_id = t2.track_id;Cann’t run two MapReduce Job
Unfortunately, hive doesn’t explain the hql to two mapreduce job. 
The parameter hive.groupby.skewindata seems has no affect.
Change the hql :
insert overwrite table ad_overall_day partition(part_time='99', part_date='2015-11-99')
select account_id, nvl(client_id,-1) as client_id, nvl(track_id,'total') as track_id, 
    sum(imp1) as imp_cnt,count(imp2) as imp_uv,sum(click1) as click_cnt,count(click2) as click_uv
FROM (select account_id, client_id, track_id,
     if(type=3,1,0) as imp1,if(type=3,zid,NULL) as imp2,
     if(type=4,1,0) as click1,if(type=4,zid,NULL) as click2
    FROM dmp.derived_di_v3 where year='2015' and month='11' 
    group by account_id, client_id, track_id,type,zid) t
group by account_id, client_id, track_id 
grouping sets((account_id, client_id, track_id),(account_id, client_id),(account_id))Help Links
Hive 的详细介绍:请点这里
Hive 的下载地址:请点这里
相关推荐
  archive    2020-07-30  
   成长之路    2020-07-28  
   eternityzzy    2020-07-19  
   taisenki    2020-07-05  
   tugangkai    2020-07-05  
   SignalDu    2020-07-05  
   zlsdmx    2020-07-05  
   tomson    2020-07-05  
   tugangkai    2020-07-04  
   tomson    2020-07-05  
   Zhangdragonfly    2020-06-28  
   genshengxiao    2020-06-26  
   成长之路    2020-06-26  
   tomson    2020-06-26  
   蜗牛之窝    2020-06-26  
   成长之路    2020-06-25