Hive Optimization (SQL)

mac 2024-10-16

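1. Optimize WHERE clauses

Filter a table before joining it rather than after, so that fewer rows reach the join.

    select m.cid, u.id from `order` m join customer u on (m.cid = u.id) where m.dt = '20180808';

can be rewritten as

    select m.cid, u.id from (select * from `order` where dt = '20180808') m join customer u on (m.cid = u.id);

2. Optimize UNION

Avoid UNION where possible, since it removes duplicate rows. Use UNION ALL instead, then deduplicate with GROUP BY.

3. Optimize COUNT(DISTINCT)

Do not use count(distinct column). Compute the distinct count with a subquery instead:

    select count(1) from (select id from tablename group by id) tmp;

4. Use IN instead of JOIN when one table only constrains another

If a table is needed only to restrict another table by one of its columns, use IN in place of JOIN.

    select a.id, a.name from tb1 a join tb2 b on (a.id = b.id);

can be rewritten as

    select id, name from tb1 where id in (select id from tb2);

The IN form is generally faster than the JOIN. Note that the two queries return the same rows only when tb2.id contains no duplicates, because the JOIN repeats a tb1 row once for every matching tb2 row.

5. Eliminate GROUP BY, COUNT(DISTINCT), MAX, and MIN inside subqueries

This reduces the number of jobs.

6. Optimize joins with a map-side join

    set hive.auto.convert.join = true;
    (default: true)

    set hive.mapjoin.smalltable.filesize=25000000;
    (size threshold for the small table, in bytes)

7. Local mode

When a Hive query processes only a small amount of data, there is no need to run it in distributed mode. Distributed execution involves network transfer and coordination across nodes, and it consumes cluster resources. In that case the MapReduce job can run in local mode on a single machine, which is much faster.

    set hive.exec.mode.local.auto=true

turns on Hive's automatic decision about whether to use local mode. Setting this parameter alone does not guarantee local mode. Local mode is used only when the number of map tasks does not exceed hive.exec.mode.local.auto.input.files.max and the map input size does not exceed hive.exec.mode.local.auto.inputbytes.max.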
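Item 2 gives no query, so here is a minimal sketch of the UNION ALL plus GROUP BY pattern. The tables t1 and t2 and their id column are hypothetical names chosen for illustration; they do not come from the original post.

```sql
-- Hypothetical tables t1 and t2, each with an id column.
-- Instead of:
--   select id from t1 union select id from t2;
-- concatenate both inputs with UNION ALL, then deduplicate once with GROUP BY.
select id
from (
  select id from t1
  union all
  select id from t2
) tmp
group by id;
```

Whether this is actually faster than a plain UNION depends on the Hive version and the data, so it is worth comparing the two plans with EXPLAIN on a real workload.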
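The local-mode conditions in item 7 could be set up for a session roughly as follows. The two threshold values are illustrative assumptions, not recommendations from the original post, and tablename stands for any small table.

```sql
-- Let Hive decide automatically whether a query can run in local mode.
set hive.exec.mode.local.auto=true;

-- Upper bound on total input size for local mode, in bytes.
-- 134217728 (128 MB) is an illustrative value.
set hive.exec.mode.local.auto.inputbytes.max=134217728;

-- Upper bound on the number of map tasks / input files for local mode.
-- 4 is an illustrative value.
set hive.exec.mode.local.auto.input.files.max=4;

-- A query over a small table that stays under both limits
-- is then eligible to run locally.
select count(1) from tablename;
```

If either limit is exceeded, Hive falls back to normal distributed execution, so raising the limits too far can push large jobs onto a single machine.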
