Hive Window Functions

mac2024-01-31 63

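Window functions:

A window function computes its result over a window of rows that OVER() defines for each row. If OVER() contains no constraint, the window is the entire table.

over(): specifies the size of the data window that the analytic function works on; this window may change from row to row.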

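current row: the current row

unbounded preceding: the window starts at the first row, before the current row

unbounded following: the window ends at the last row, after the current row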

n preceding: the n rows before the current row

n following: the n rows after the current row

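unbounded: no limit in the given direction, i.e. the starting point with preceding and the end point with following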
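The frame keywords above can be combined in a `rows between ... and ...` clause. The following is a minimal sketch, assuming a table `business(name, orderdate, cost)` like the one created later in this post; the column aliases are made up for illustration.

```sql
-- Assumes a table business(name string, orderdate string, cost int).
select name, orderdate, cost,
       -- previous row + current row + next row, in date order
       sum(cost) over(order by orderdate
                      rows between 1 preceding and 1 following) as sum_3_rows,
       -- current row through the last row, in date order
       sum(cost) over(order by orderdate
                      rows between current row and unbounded following) as sum_to_end
from business;
```

For the first and last rows the `1 preceding` / `1 following` neighbor does not exist, so the frame simply contains fewer rows.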

lag(col, n): the value of col from the n-th row before the current row

lead(col, n): the value of col from the n-th row after the current row
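Both functions return NULL when the requested row does not exist. They also accept an optional third argument that is returned instead of NULL. A small sketch, assuming a table `business(name, orderdate, cost)` like the one created later in this post; the default dates are arbitrary placeholders.

```sql
-- Assumes a table business(name string, orderdate string, cost int).
select name, orderdate, cost,
       -- previous order date of the same customer, or the placeholder if none
       lag(orderdate, 1, '1970-01-01')
           over(partition by name order by orderdate) as prev_order,
       -- next order date of the same customer, or the placeholder if none
       lead(orderdate, 1, '9999-12-31')
           over(partition by name order by orderdate) as next_order
from business;
```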

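ntile(n): distributes the rows of an ordered partition into n groups. The groups are numbered starting from 1, and for each row ntile returns the number of the group that the row belongs to.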

-- Create the table
create table business(
  name string,
  orderdate string,
  cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load the data
load data local inpath "/opt/db/business.txt" into table business;

-- Requirements
-- (1) Find the customers who made a purchase in April 2017 and the total number of such customers
-- (2) Show each customer's purchase details together with the monthly purchase total
-- (3) In the scenario above, accumulate cost by date
-- (4) Find the previous purchase date for each customer's orders
-- (5) Find the orders in the earliest 20% by time

-- Solution (1): customers who made a purchase in April 2017 and the total number of such customers
-- count(*) over() runs after group by, so it counts the grouped rows, i.e. the number of customers
select name, count(*) over()
from business
where substring(orderdate,1,7)="2017-04"
group by name;

-- Result
mart 2
jack 2

-- Solution (2): purchase details and monthly purchase total
-- Partition by month(orderdate) with partition by
select *, sum(cost) over(partition by month(orderdate)) from business;
-- or
select *, sum(cost) over(distribute by month(orderdate)) from business;

-- Result
jack 2017-01-01 10 205
jack 2017-01-08 55 205
tony 2017-01-07 50 205
jack 2017-01-05 46 205
tony 2017-01-04 29 205
tony 2017-01-02 15 205
jack 2017-02-03 23 23
mart 2017-04-13 94 341
jack 2017-04-06 42 341
mart 2017-04-11 75 341
mart 2017-04-09 68 341
mart 2017-04-08 62 341
neil 2017-05-10 12 12
neil 2017-06-12 80 80

-- Solution (3): accumulate cost by date
-- sort by orderdate rows between unbounded preceding and current row means:
-- order the rows by orderdate, then sum from the first row up to the current row
-- (rows means the frame is counted in rows)
select *, sum(cost) over(sort by orderdate rows between unbounded preceding and current row) from business;

-- Result
jack 2017-01-01 10 10
tony 2017-01-02 15 25
tony 2017-01-04 29 54
jack 2017-01-05 46 100
tony 2017-01-07 50 150
jack 2017-01-08 55 205
jack 2017-02-03 23 228
jack 2017-04-06 42 270
mart 2017-04-08 62 332
mart 2017-04-09 68 400
mart 2017-04-11 75 475
mart 2017-04-13 94 569
neil 2017-05-10 12 581
neil 2017-06-12 80 661

-- Solution (4): previous purchase date per customer (lag); lead gives the next purchase date
select *,
  lag(orderdate,1) over(distribute by name sort by orderdate),
  lead(orderdate,1) over(distribute by name sort by orderdate)
from business;

-- Result
jack 2017-01-01 10 NULL 2017-01-05
jack 2017-01-05 46 2017-01-01 2017-01-08
jack 2017-01-08 55 2017-01-05 2017-02-03
jack 2017-02-03 23 2017-01-08 2017-04-06
jack 2017-04-06 42 2017-02-03 NULL
mart 2017-04-08 62 NULL 2017-04-09
mart 2017-04-09 68 2017-04-08 2017-04-11
mart 2017-04-11 75 2017-04-09 2017-04-13
mart 2017-04-13 94 2017-04-11 NULL
neil 2017-05-10 12 NULL 2017-06-12
neil 2017-06-12 80 2017-05-10 NULL
tony 2017-01-02 15 NULL 2017-01-04
tony 2017-01-04 29 2017-01-02 2017-01-07
tony 2017-01-07 50 2017-01-04 NULL

-- Solution (5): orders in the earliest 20% by time
-- ntile(5) splits the ordered rows into 5 groups; group 1 is the earliest 20%
select * from (
  select name, orderdate, cost, ntile(5) over(order by orderdate) sorted
  from business
) t
where sorted = 1;

-- Result
jack 2017-01-01 10 1
tony 2017-01-02 15 1
tony 2017-01-04 29 1

The difference between count() and sum():

Given a table fruit with these rows:

1 apple 1.00
2 pear 2.00

select count(price) from fruit; -- returns 2 (there are 2 records)
select sum(price) from fruit; -- returns 3.00 (the sum of the price field over all records)

count counts rows; sum adds up values.

substring() extracts part of a string. In Hive, substring(str, pos, len) returns len characters starting at position pos, where positions start at 1, so substring(orderdate,1,7) yields the year and month, such as 2017-04.
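The notes on count(), sum(), and substring() can be checked with two short queries. This is only a sketch: it assumes the `fruit` table from the text has a numeric `price` column, and the column aliases are invented for illustration.

```sql
-- Assumes a table fruit with a numeric column price.
select count(*)     as row_cnt,    -- counts every row
       count(price) as price_cnt,  -- counts only rows where price is not NULL
       sum(price)   as price_sum   -- adds up the non-NULL prices
from fruit;

-- substring(str, pos, len): pos is 1-based, len is a number of characters.
select substring('2017-04-06', 1, 7);  -- 2017-04
```

count(*) and count(price) only differ when some rows have a NULL price.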

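Sorting: 4 kinds

Global sort (Order By): the whole job uses a single reducer. Ascending (ASC) is the default; use DESC for descending.

Sort within each reducer (Sort By): each reducer sorts its own output, so the full result set is not globally sorted. (The number of reducers must be set, ideally matching the number of partitions.)

Partitioning (Distribute By): similar to the partitioner in MapReduce. It has to run with multiple reducers, otherwise the effect of the partitioning is not visible. (The number of reducers must be set, ideally matching the number of partitions.)

Cluster By: when distribute by and sort by use the same column, cluster by can be used instead. (Ascending order only.)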
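The four clauses can be compared side by side. The following is a minimal sketch, assuming the `business(name, orderdate, cost)` table from the examples above; the reducer count of 3 is an arbitrary value chosen for illustration.

```sql
-- Number of reducers for the following queries (3 is an arbitrary choice).
set mapreduce.job.reduces=3;

-- Global sort: a single reducer, fully ordered output.
select * from business order by cost desc;

-- Sort inside each reducer only; the overall output is not globally ordered.
select * from business sort by cost desc;

-- Rows with the same name go to the same reducer and are sorted by date there.
-- distribute by must be written before sort by.
select * from business distribute by name sort by orderdate;

-- Shorthand for "distribute by name sort by name"; ascending only.
select * from business cluster by name;
```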
