hive开窗开窗函数进阶
概述
开窗函数的理解参见: 理解hive中的开窗函数
over()
中除了可以使用partition by
选择分组字段外, 还有以下函数
order by
排序- 指定聚合行的范围, 配合
order by
使用current row
: 当前行n PRECEDING
: 往前 n 行数据n FOLLOWING
: 往后 n 行数据UNBOUNDED PRECEDING
表示从前面的起点UNBOUNDED FOLLOWING
表示到后面的终点
- 生成新的列
LAG(col,n)
: 往后第 n 行数据LEAD(col,n)
: 往前第 n 行数据NTILE(n)
: 把有序分区中的行分发到指定数据的组中, 各个组有编号, 编号从 1 开始,对于每一行, NTILE 返回此行所属的组的编号。 注意: n 必须为 int 类型。
- 生成新的列:
Rank
RANK()
排序相同时会重复, 总数不会变DENSE_RANK()
排序相同时会重复, 总数会减少ROW_NUMBER()
会根据顺序计算
在over()中使用排序
order by
排序
- 未排序
hive (default)> select name,orderdate,cost, sum(cost) over(partition by name) as sample from business; name orderdate cost sample jack 2017-01-01 10 176 jack 2017-02-03 23 176 jack 2017-01-05 46 176 jack 2017-04-06 42 176 jack 2017-01-08 55 176 mart 2017-04-13 94 299 mart 2017-04-08 62 299 mart 2017-04-09 68 299 mart 2017-04-11 75 299
- 排序
hive (default)> select name,orderdate,cost, sum(cost) over(partition by name order by orderdate) as sample3 from business; name orderdate cost sample3 jack 2017-01-01 10 10 jack 2017-01-05 46 56 jack 2017-01-08 55 111 jack 2017-02-03 23 134 jack 2017-04-06 42 176 mart 2017-04-08 62 62 mart 2017-04-09 68 130 mart 2017-04-11 75 205 mart 2017-04-13 94 299
指定开窗聚合的行
rows between UNBOUNDED PRECEDING and CURRENT ROW
从起行到当前行的聚合, 而不是对行组内的所有行进行聚合, 配合order by
使用
hive (default)> select name,orderdate,cost, sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample from business; name orderdate cost sample jack 2017-01-01 10 10 jack 2017-01-05 46 56 jack 2017-01-08 55 111 jack 2017-02-03 23 134 jack 2017-04-06 42 176 mart 2017-04-08 62 62 mart 2017-04-09 68 130 mart 2017-04-11 75 205 mart 2017-04-13 94 299 neil 2017-05-10 12 12 neil 2017-06-12 80 92 tony 2017-01-02 15 15 tony 2017-01-04 29 44 tony 2017-01-07 50 94
rows between 1 PRECEDING and current row
将上一行与本行聚合, 而不是将行组内所有行聚合
hive (default)> select name,orderdate,cost, sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample from business; jack 2017-01-01 10 10 jack 2017-01-05 46 56 jack 2017-01-08 55 101 jack 2017-02-03 23 78 jack 2017-04-06 42 65 mart 2017-04-08 62 62 mart 2017-04-09 68 130 mart 2017-04-11 75 143 mart 2017-04-13 94 169 neil 2017-05-10 12 12 neil 2017-06-12 80 92 tony 2017-01-02 15 15 tony 2017-01-04 29 44 tony 2017-01-07 50 79
rows between 1 PRECEDING AND 1 FOLLOWING
当前行和前一行及后面一行聚合
rows between current row and UNBOUNDED FOLLOWING
当前行及后面所有行
rows between current row and UNBOUNDED FOLLOWING
当前行及后面所有行
生成新的列: LAG, LEAD, NTILE
LAG(col, n)
第三个参数用于代替"NLL"
select name,orderdate,cost, lag(orderdate,1,‘1900-01-01‘) over(partition by name order by orderdate ) as time1, lag(orderdate,2) over (partition by name order by orderdate) as time2 from business;
结果: time1将orderdate列往后移一行, time2往后移两行
name orderdate cost time1 time2 jack 2017-01-01 10 1900-01-01 NULL jack 2017-01-05 46 2017-01-01 NULL jack 2017-01-08 55 2017-01-05 2017-01-01 jack 2017-02-03 23 2017-01-08 2017-01-05 jack 2017-04-06 42 2017-02-03 2017-01-08 mart 2017-04-08 62 1900-01-01 NULL mart 2017-04-09 68 2017-04-08 NULL mart 2017-04-11 75 2017-04-09 2017-04-08 mart 2017-04-13 94 2017-04-11 2017-04-09 neil 2017-05-10 12 1900-01-01 NULL neil 2017-06-12 80 2017-05-10 NULL tony 2017-01-02 15 1900-01-01 NULL tony 2017-01-04 29 2017-01-02 NULL tony 2017-01-07 50 2017-01-04 2017-01-02
LEAD(col, n)
同时, 只是往前移指定的行数
NTILE(n)
需要配合order by
使用
select * from ( select name,orderdate,cost, ntile(5) over() sorted from business ) t;
结果: ntitle(5)
, 增加一列(组标签), 用于分组, 这里为5组
t.name t.orderdate t.cost t.sorted jack 2017-01-01 10 1 tony 2017-01-02 15 1 jack 2017-02-03 23 1 tony 2017-01-04 29 2 jack 2017-01-05 46 2 jack 2017-04-06 42 2 tony 2017-01-07 50 3 jack 2017-01-08 55 3 mart 2017-04-08 62 3 mart 2017-04-09 68 4 neil 2017-05-10 12 4 mart 2017-04-11 75 4 neil 2017-06-12 80 5 mart 2017-04-13 94 5
如果分组是行数数以n有余数, 则从上到下每组增加一行, 直到所以行都有组标签.
生成新的列: Rank
RANK() 排序相同时会重复, 总数不会变
DENSE_RANK() 排序相同时会重复, 总数会减少
ROW_NUMBER() 会根据顺序计算
select name, subject, score, rank() over(partition by subject order by score desc) rp, dense_rank() over(partition by subject order by score desc) drp, row_number() over(partition by subject order by score desc) rmp from score; name subject score rp drp rmp sk 数学 95 1 1 1 zs 数学 86 2 2 2 ls 数学 85 3 3 3 ww 数学 56 4 4 4 zs 英语 84 1 1 1 ww 英语 84 1 1 2 ls 英语 78 3 2 3 sk 英语 68 4 3 4 ww 语文 94 1 1 1 sk 语文 87 2 2 2 ls 语文 65 3 3 3 zs 语文 64 4 4 4
相关推荐
archive 2020-07-30
成长之路 2020-07-28
eternityzzy 2020-07-19
taisenki 2020-07-05
tugangkai 2020-07-05
SignalDu 2020-07-05
zlsdmx 2020-07-05
tomson 2020-07-05
tugangkai 2020-07-04
tomson 2020-07-05
Zhangdragonfly 2020-06-28
genshengxiao 2020-06-26
成长之路 2020-06-26
tomson 2020-06-26
蜗牛之窝 2020-06-26
cyydjt 2020-06-25