Hive Basics


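A top-level Apache project, written in Java.

Open-sourced by Facebook in 2008 and donated to the Apache Software Foundation.

Official site: http://hive.apache.org/

Hive translates SQL into MapReduce programs and submits them to a YARN cluster to run. It does not automatically write a result file.

It reads data directly from HDFS and processes it.

Queries are written in SQL.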

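Concept (not finalized):

Hive is an open-source data warehouse tool built on Hadoop.

It maps structured data onto database tables (two-dimensional tables).

It relies on HDFS to store the data. In essence, Hive converts HQL statements into MR programs and submits them to Hadoop to run.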

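Differences from a traditional database:

Hive looks like a SQL database on the surface, but its use case is entirely different: it is suited only to batch analysis, that is, statistical analysis of massive offline data sets.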

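Hive core components

Parser: converts an HQL statement into an abstract syntax tree.

Compiler: converts the abstract syntax tree into a series of MR programs.

Under the hood Hive has a set of MR templates (operators such as GroupByOperator and JoinOperator).

Optimizer: optimizes the execution of that series of MR programs.

Executor: gathers the required resources and submits the job to the Hadoop cluster.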
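To see what the compiler and optimizer produce for a particular query, Hive's explain statement prints the plan without running the query. The following is a minimal sketch; it assumes the test10 table used elsewhere in these notes exists and has id and age columns.

```sql
-- Print the execution plan (stage dependencies and the operator tree)
-- instead of running the query.
explain
select id, count(id) as cnt
from test10
group by id;
```

The output lists the stages and, inside the map and reduce phases, the operators chosen for the query. The exact text depends on the Hive version and the execution engine.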

----------------------------------------

Hive installation

Upload, extract, and rename the package.

Edit the configuration file hive-env.sh (set HADOOP_HOME).

Create the data warehouse: Hive stores its data on HDFS, so the paths Hive uses must exist on HDFS with the right permissions. Create the directories and grant the permissions:

bin/hdfs dfs -mkdir -p /tmp
bin/hdfs dfs -mkdir -p /user/hive/warehouse
bin/hdfs dfs -chmod g+w /tmp
bin/hdfs dfs -chmod g+w /user/hive/warehouse

Official Hive setup guide: https://cwiki.apache.org/confluence/display/Hive/GettingStarted

----------------------------------------

Differences between Hive and MySQL

Similarities:

database: a namespace that groups related tables so they are easier to manage.

table: a table whose columns each have a name and a data type.

Differences:

Hive has no true ...

Running Hive commands from outside Hive

Create a partitioned table:

create table test01 (id int, name string) partitioned by (dt int) row format delimited fields terminated by ',';

Contents of the script a.hql:

insert into table test02 partition (dt=${hiveconf:dt}) select id, name from test01 where dt=${hiveconf:dt};

Run the script and pass in the variable:

hive -f a.hql -hiveconf dt=20190101

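A database is an HDFS directory (database name + .db), located under /user/hive/warehouse by default.
A table is a directory under its database (named after the table).
A partition is a directory under its table (partition column name + = + partition value, for example dt=20180101).

Start the Hive CLI: hive
Start the Hive Thrift service in the background: nohup hiveserver2 > /dev/null 2>&1 &
Start Beeline: beeline
Connect to Hive from Beeline: !connect jdbc:hive2://test-hadoop-2-21:10000
List all databases: show databases;
List all tables: show tables;
Show a table's structure: desc student;
List all partitions of a table: show partitions student;

Hive is case-insensitive for table names, column names, and keywords.
When data is loaded from HDFS, the files are moved into the table directory.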

1. Create a regular table

create table student01 (id int, name string, age int) row format delimited fields terminated by ',';

2. Load local data (from the Linux file system)

Append:
load data local inpath '/home/hadoop/tmp/a.txt' into table student01;

Overwrite (using the overwrite keyword):
load data local inpath '/home/hadoop/tmp/a.txt' overwrite into table student01;

3. Create a partitioned table

create table Student_partition (id int, name string) partitioned by (dt int) row format delimited fields terminated by ',';

4. Load data into a partitioned table

load data inpath '/a.txt' into table student_partition partition (dt=20180101);

5. Add a partition

alter table student_partition add partition (dt = 20180102);

6. Drop a partition

alter table student_partition drop partition (dt = 20180102);

7. Rename a partition

alter table student_partition partition (dt = 20180101) rename to partition (dt = 20180102);

8. Multi-level partitions

create table student (id int, name string) partitioned by (country string, city string) row format delimited fields terminated by ',';

This example has two partition levels: the city directory sits under the country directory. To rename a partition, either specify both partition columns, or change the HDFS directory directly and then repair the partitions:

msck repair table student;

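9. Dynamic partitions

Set the dynamic partition mode to nonstrict:

set hive.exec.dynamic.partition.mode=nonstrict;
insert into table student_partition partition (dt) select id, name, dt from test01;

Static + dynamic (the static partition column must come first):

insert into table test04 partition (dt = '20180102', sid) select id, name, id from test01;

If an out-of-memory error is reported:

set hive.exec.dynamic.partition.mode=nonstrict;
-- maximum number of dynamic partitions each node may create, default 100
set hive.exec.max.dynamic.partitions.pernode=10000;
-- maximum number of dynamic partitions one DML statement may create, default 1000
set hive.exec.max.dynamic.partitions=100000;

Sometimes an out-of-memory error still occurs. In that case reduce mapred.max.split.size (which increases the number of map tasks) and increase the map memory:

set mapred.max.split.size=25600000;
set mapreduce.map.memory.mb=8192;
set mapreduce.reduce.memory.mb=8192;
set mapred.child.java.opts=-Xmx6144m;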

10. Change the data type of a partition column

alter table test05 partition column (dt int);

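11. External tables

create external table test06 (id int, name string) row format delimited fields terminated by ',' location '/aaa';

Differences between external and managed (internal) tables:

Dropping a managed table deletes its data; dropping an external table does not.
Creating a managed table creates a directory for it on HDFS; an external table instead points at the path given in location.
An external table's data path is usually specified when the table is created. A managed table usually has to be loaded explicitly, and the loaded data is moved into the table directory.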
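One way to confirm which kind of table you are dealing with is describe formatted, which reports the table type and location. A short sketch, assuming the test06 table from the example above has been created:

```sql
-- The "Table Type" row reads EXTERNAL_TABLE for test06;
-- a managed table shows MANAGED_TABLE instead.
describe formatted test06;

-- Dropping an external table removes only the metadata;
-- the files under /aaa stay on HDFS.
drop table test06;
```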

12. Rename a table

alter table test01 rename to test02;

13. Add a column

alter table test01 add columns (age int);

14. Change a column's name and type

alter table test01 change age new_age string;

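15. ORC

create table test07 (id int, name string) stored as orc;

zlib compression is used by default. To use snappy compression instead:

create table test07 (id int, name string) stored as orc tblproperties ("orc.compress"="SNAPPY");

zlib: high compression ratio, slow to compress and decompress.
snappy: reasonable compression ratio, fast to compress and decompress.

Out-of-memory error:
Current usage: 166.4 MB of 1 GB physical memory used; 2.5 GB of 2.1 GB virtual memory used. Killing container
Fix this by setting the relevant memory parameters.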

16. Multi-insert

As two separate statements:

insert into table test12 select max(age) from test10;
insert into table test13 select min(age) from test10;

As one statement that reads test10 only once:

from test10 insert into table test12 select max(age) insert into table test13 select min(age);

17. overwrite

In a load statement, add overwrite before into. In an insert statement, replace into with overwrite.
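The insert form of the rule can be sketched as follows, assuming the test10 and test12 tables from the multi-insert example exist:

```sql
-- Appends one row to whatever test12 already holds.
insert into table test12 select max(age) from test10;

-- Replaces the existing contents of test12 with the query result.
insert overwrite table test12 select max(age) from test10;
```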

18. Running Hive from the command line

hive -e "select * from test.test10" > res.txt
hive -f test.hql -hiveconf age=9

Contents of test.hql:
select * from test.test10 where age = ${hiveconf:age};

19. distinct (deduplication)

select distinct id from test01; -- deduplicate on id
select distinct id, name from test01; -- rows count as duplicates only when both id and name match

20. case when

select case when id = 1001 then 10000 else 0 end from test10;

21. if

select if(id=1001, 10000, 0) from test10;

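22. group by (grouping)

select id, count(id) from test10 group by id;

The select list may contain only grouping columns or aggregate functions.
A where clause cannot follow group by; use having to filter groups.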
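The where/having distinction can be illustrated with a short query. This is a sketch that assumes test10 has id and age columns, as the other examples in these notes suggest:

```sql
-- where filters individual rows before grouping;
-- having filters whole groups after aggregation.
select id, count(id) as cnt
from test10
where age > 0
group by id
having count(id) > 1;
```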

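23. order by

Global sort; it runs with only one reduce task.

24. sort by

Local sort; it only guarantees that the output of each reduce task is sorted. The number of reduce tasks is set with:

set mapreduce.job.reduces=n

25. distribute by

Distributes rows to different reduce tasks.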
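The three clauses are easiest to compare side by side. A minimal sketch, assuming test10 has id and age columns; the reducer count of 3 is an arbitrary value chosen for illustration:

```sql
set mapreduce.job.reduces=3;

-- Rows with the same id go to the same reducer,
-- and each reducer sorts its own rows by age, descending.
select id, age
from test10
distribute by id
sort by age desc;

-- Global ordering: a single reducer sorts all rows.
select id, age
from test10
order by age desc;
```

When the distribute by and sort by columns are the same and the order is ascending, cluster by is a shorthand for the pair.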

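26. Joins

Inner join (inner join): returns only the rows from both tables that satisfy the join condition.
Left outer join (left join): returns every row of the left table; right-table columns are filled with null where no row matches.
Right outer join (right join): the mirror image of the left join.
Left semi join (left semi join): can be replaced with in.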
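The relationship between left semi join and in can be sketched as follows. It assumes test01 and test02 both exist and both have an id column; the in-subquery form needs a Hive version that supports subqueries in the where clause (0.13 or later).

```sql
-- Rows of test01 whose id also appears in test02.
-- Only columns of the left table may appear in the select list.
select a.id, a.name
from test01 a
left semi join test02 b on a.id = b.id;

-- The same result written with an in subquery.
select id, name
from test01
where id in (select id from test02);
```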

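Parameters:

Run independent stages in parallel:
set hive.exec.parallel=true;

Compress intermediate (map output) results:
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

If the data is skewed, try this setting:
set hive.groupby.skewindata=true;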

27. in

It cannot be nested.
Multiple in clauses cannot be used together.
It cannot be combined with the on condition of a join.

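28. UDF

(1) Extend the UDF class.
(2) Implement the evaluate method.
(3) Package the class as a jar and upload it.
(4) In Hive, add the jar to the session: add jar path;
(5) Create a temporary function: create temporary function function_name as 'package+Class';
(6) Call it directly. It is valid only for the current session.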
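Steps (4) to (6) might look like this inside a Hive session. The jar path, the class name com.example.hive.MyUpper, and the function name my_upper are hypothetical placeholders, not names from the original notes; test01 is assumed to have a name column.

```sql
-- Hypothetical jar path and class name; substitute your own.
add jar /home/hadoop/tmp/my-udf.jar;
create temporary function my_upper as 'com.example.hive.MyUpper';

-- Call the function like any built-in.
select my_upper(name) from test01;

-- Optional: remove it before the session ends.
drop temporary function if exists my_upper;
```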

29. UDTF
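A UDTF (user-defined table-generating function) turns one input row into several output rows. The notes give no example, so here is a minimal sketch using the built-in explode function with lateral view. The table student_tags (id int, tags string holding comma-separated values) is hypothetical and used only for illustration.

```sql
-- One output row per (id, tag) pair:
-- split() turns the string into an array, explode() emits one row per element.
select id, tag
from student_tags
lateral view explode(split(tags, ',')) t as tag;
```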

30. Type cast

cast(id as string)

31. Show how a built-in function is used

desc function extended function_name;

Reposted from: https://www.cnblogs.com/Vowzhou/p/10514779.html