(大数据静态处理demo)flume+hadoop+hive+sqoop+mysql

mac2026-05-12  9

(大数据静态处理demo)flume+hadoop+hive+sqoop+mysql


文章目录

(大数据静态处理demo)flume+hadoop+hive+sqoop+mysql一、日志文件source.log二、配置flume(三台)三、启动zookeeper、hadoop、flume四、flume写入文件,自动上传到hdfs五、hive处理数据六、使用sqoop导出数据到mysql

思路:分析日志文件,flume将日志文件上传到hadoop。hive获取日志信息到表一。表一清洗一遍数据到表二,表二处理结果到表三。计算出当日访问数量和当日用户访问数。表三通过sqoop导出表到mysql。完成数据处理。

安装教程: 大数据集群搭建—《flume篇》 大数据集群搭建—《高可用hadoop篇》 大数据集群搭建—《hive篇》 大数据集群搭建—《sqoop篇》


一、日志文件source.log

192.168.123.12 mid_1 tel 100 20181111 192.168.123.123 mid_2 tel 100 20181111 192.168.123.12 mid_1 tel 100 20181111 192.168.123.12 mid_1 tel 100 20181111


二、配置flume(三台)

转移配置文件 //1.将hadoop的hdfs-site.xml和core-site.xml 放到flume/conf下 cp /opt/hadoop/hadoop-2.7.3/etc/hadoop/core-site.xml /opt/flume/apache-flume-1.8.0-bin/conf/ cp /opt/hadoop/hadoop-2.7.3/etc/hadoop/hdfs-site.xml /opt/flume/apache-flume-1.8.0-bin/conf/ //2.将hadoop的jar包拷贝到flume的lib目录下 cp /opt/hadoop/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar /opt/flume/apache-flume-1.8.0-bin/lib/ cp /opt/hadoop/hadoop-2.7.3/share/hadoop/common/lib/hadoop-auth-2.7.3.jar /opt/flume/apache-flume-1.8.0-bin/lib/ cp /opt/hadoop/hadoop-2.7.3/share/hadoop/common/lib/commons-configuration-1.6.jar /opt/flume/apache-flume-1.8.0-bin/lib/ 增加配置flume-hdfs.conf(主机) cd /opt/flume/apache-flume-1.8.0-bin/conf/ vi flume_hdfs.conf //添加如下 #定义agent名, source、channel、sink的名称 a4.sources = r1 a4.channels = c1 a4.sinks = k1 #具体定义source a4.sources.r1.type = spooldir #先创建此目录,保证里面空的 a4.sources.r1.spoolDir = /logs #具体定义channel a4.channels.c1.type = memory a4.channels.c1.capacity = 10000 a4.channels.c1.transactionCapacity = 100 #定义拦截器,为消息添加时间戳 a4.sources.r1.interceptors = i1 a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder #具体定义sink a4.sinks.k1.type = hdfs #集群的nameservers名字 #单节点的直接写:hdfs://zhiyou01:9000/xxx a4.sinks.k1.hdfs.path = hdfs://ns/flume/%Y%m%d a4.sinks.k1.hdfs.filePrefix = events- a4.sinks.k1.hdfs.fileType = DataStream #防止超时 a1.sinks.k1.hdfs.callTimeout=3600000 #不按照条数生成文件 a4.sinks.k1.hdfs.rollCount = 0 #HDFS上的文件达到128M时生成一个文件 a4.sinks.k1.hdfs.rollSize = 134217728 #HDFS上的文件达到60秒生成一个文件 a4.sinks.k1.hdfs.rollInterval = 60 #组装source、channel、sink a4.sources.r1.channels = c1 a4.sinks.k1.channel = c1

三、启动zookeeper、hadoop、flume

//1.zookeeper(三台) cd /opt/zookeeper/zookeeper-3.4.12/bin/ ./zkServer.sh start //2.hadoop cd /opt/hadoop/hadoop-2.7.3/sbin/ ./start-all.sh //3.flume cd /opt/flume/apache-flume-1.8.0-bin bin/flume-ng agent -n a4 -c conf -f conf/flume_hdfs.conf -Dflume.root.logger=INFO,console

四、flume写入文件,自动上传到hdfs

//在flume指定文件夹添加日志文件 cd /logs/ vi source.log 192.168.123.12 mid_1 tel 100 20181111 192.168.123.123 mid_2 tel 100 20181111 192.168.123.12 mid_1 tel 100 20181111 192.168.123.12 mid_1 tel 100 20181111

五、hive处理数据

导入数据到表一 cd /opt/hive/apache-hive-2.3.3-bin/bin ./hive //创建数据库 create database zhiyou_log; use zhiyou_log; //创建表一 create table source_log(ip string,mid string,type string,price double,time string) row format delimited fields terminated by " "; //导入数据 load data inpath "/flume/20191102/events-.1572661918682" into table source_log;

清洗数据到表二(拿出ip,mid) create table clear_log(ips String,mids String) row format delimited fields terminated by " "; insert into clear_log select ip,mid from source_log;

计算结果到表三 create table result_log(pv bigint,nv bigint,ips String) row format delimited fields terminated by " "; //计算数据并插入 insert into result_log select count(mids) pv ,count(distinct(mids)) nv ,count(distinct(ips)) ips from clear_log;

六、使用sqoop导出数据到mysql

建立mysql数据库 sqoop导出数据 cd /opt/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/bin/ ./sqoop export --connect jdbc:mysql://192.168.80.128:3306/sqloop --username root --password HBQ521521cf* --export-dir '/user/hive/warehouse/zhiyou_log.db/result_log' --table tongji --fields-terminated-by ' ';

最新回复(0)