Step 1: Requirements Analysis

Fields needed (time: each day and each hour; plus id, url, guid, trackTime)
Partition the data by day and hour
PV (total record count)
UV (count of distinct guid values; both metrics are sketched below)
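
Both metrics reduce to a single aggregation; a minimal sketch of the target query, assuming the hour-partitioned table yhd_part1 that Step 3 builds:

  select date, hour, count(url) pv, count(distinct guid) uv from yhd_part1 group by date, hour;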

Step 2: Implementation Plan

Create the Hive table; keep the table's column delimiter consistent with the data file
Load the data into the Hive table
Write HiveQL to compute the statistics and put the result into another Hive table (data cleaning)
Export the data from that result table to MySQL using Sqoop
The web application reads this table from MySQL

Expected result

Date    Hour    PV    UV

Step 3: Implementation

# Create the source table (note: when connecting through Beeline, the username and password are those of the Linux account)
  create database if not exists track_log;
  use track_log;
  create table if not exists yhd_source(
  id string,
  url string,
  referer string,
  keyword string,
  type string,
  guid string,
  pageId string,
  moduleId string,
  linkId string,
  attachedInfo string,
  sessionId string,
  trackerU string,
  trackerType string,
  ip string,
  trackerSrc string,
  cookie string,
  orderCode string,
  trackTime string,
  endUserId string,
  firstLink string,
  sessionViewNo string,
  productId string,
  curMerchantId string,
  provinceId string,
  cityId string,
  fee string,
  edmActivity string,
  edmEmail string,
  edmJobId string,
  ieVersion string,
  platform string,
  internalKeyword string,
  resultSum string,
  currentPage string,
  linkPosition string,
  buttonPosition string
  )row format delimited fields terminated by '\t'
  stored as textfile;
  load data local inpath '/home/liuwl/opt/datas/2015082818' into table yhd_source;
  load data local inpath '/home/liuwl/opt/datas/2015082819' into table yhd_source;
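# Sanity check: confirm the '\t' delimiter matched the files and both loads landed (a sketch; exact counts depend on the data)
  select count(*) from yhd_source;
  select id, guid, trackTime from yhd_source limit 3;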
# Create the cleaned table
  create table if not exists yhd_clean(
  id string,
  url string,
  guid string,
  date string,
  hour string)
  row format delimited fields terminated by '\t';
  insert into table yhd_clean select id,url,guid,substring(trackTime,9,2) date,substring(trackTime,12,2) hour from yhd_source;
  select id,date,hour from yhd_clean limit 5;
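# Note: the substring offsets assume trackTime looks like '2015-08-28 18:52:25' (format assumed from the file names):
# characters 9-10 hold the day and 12-13 the hour. A quick spot-check against the source table:
  select trackTime, substring(trackTime,9,2) day_part, substring(trackTime,12,2) hour_part from yhd_source limit 3;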
# Rebuild as a partitioned table (static partitioning)
  create table if not exists yhd_part1(
  id string,
  url string,
  guid string
  ) partitioned by (date string,hour string)
  row format delimited fields terminated by '\t';
  insert into table yhd_part1 partition (date='28',hour='18') select id,url,guid from yhd_clean where date='28' and hour='18';
  insert into table yhd_part1 partition (date='28',hour='19') select id,url,guid from yhd_clean where date='28' and hour='19';
  select id,date ,hour from yhd_part1 where date ='28' and hour='18' limit 10;
# Using dynamic partitions requires changing two parameters first
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  create table if not exists yhd_part2(
  id string,
  url string,
  guid string
  ) partitioned by (date string,hour string)
  row format delimited fields terminated by '\t';
# Dynamic partition values are taken from the partition columns, matched positionally as the last columns of the select
  insert into table yhd_part2 partition (date,hour) select * from yhd_clean;
  select id,date ,hour from yhd_part2 where date ='28' and hour='18' limit 10;
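# Confirm the dynamic insert created one partition per day/hour combination (expected: date=28/hour=18 and date=28/hour=19)
  show partitions yhd_part2;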
# Implement the requirements (PV and UV per day/hour)
  PV: select date,hour,count(url) PV from yhd_part1 group by date,hour;
  0: jdbc:hive2://hadoop09-linux-01.ibeifeng.co> select date,hour,count(url) PV from yhd_part1 group by date,hour;
  +-------+-------+--------+
  | date  | hour  | pv     |
  +-------+-------+--------+
  | 28    | 18    | 64972  |
  | 28    | 19    | 61162  |
  +-------+-------+--------+
  UV: select date,hour,count(distinct(guid)) UV from yhd_part1 group by date,hour;
  0: jdbc:hive2://hadoop09-linux-01.ibeifeng.co> select date,hour,count(distinct(guid)) UV from yhd_part1 group by date,hour;
  +-------+-------+--------+
  | date  | hour  | uv     |
  +-------+-------+--------+
  | 28    | 18    | 23938  |
  | 28    | 19    | 22330  |
  +-------+-------+--------+
# Combine both metrics and store them in a log_result table
  create table if not exists log_result as select date,hour,count(url) PV,count(distinct(guid)) UV from yhd_part1 group by date,hour;
  select date,hour,pv,uv from log_result;
  0: jdbc:hive2://hadoop09-linux-01.ibeifeng.co> select date,hour,pv,uv from log_result;
  +-------+-------+--------+--------+
  | date  | hour  | pv     | uv     |
  +-------+-------+--------+--------+
  | 28    | 18    | 64972  | 23938  |
  | 28    | 19    | 61162  | 22330  |
  +-------+-------+--------+--------+
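# Before exporting, the HDFS location Sqoop will read from can be double-checked; it should match --export-dir below
  desc formatted log_result;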
# Export the result table to MySQL using Sqoop
# First create the database and table in MySQL
  create database if not exists track_result;
  use track_result;
  create table if not exists log_track_result(
  date varchar(10) not null,
  hour varchar(10) not null,
  pv varchar(10) not null,
  uv varchar(10) not null,
  primary key(date,hour)
  );
# Use sqoop export to load the log_track_result table; log_result was created with CTAS, so its HDFS files use Hive's default field delimiter '\001'
  bin/sqoop export \
  --connect jdbc:mysql://hadoop09-linux-01.ibeifeng.com:3306/track_result \
  --username root \
  --password root \
  --table log_track_result \
  --export-dir /user/hive/warehouse/track_log.db/log_result \
  --num-mappers 1 \
  --input-fields-terminated-by '\001'
# Verify with a test query in MySQL
  select * from log_track_result;
  mysql> select * from log_track_result;
  +------+------+-------+-------+
  | date | hour | pv    | uv    |
  +------+------+-------+-------+
  | 28   | 18   | 64972 | 23938 |
  | 28   | 19   | 61162 | 22330 |
  +------+------+-------+-------+
  2 rows in set (0.00 sec)  
