hive SQL优化之distribute by和sort by
2024-08-26 02:50:00
近期在优化hiveSQL。
以下是一段排序,分组后取每组第一行记录的SQL
- INSERT OVERWRITE TABLE t_wa_funnel_distinct_temp PARTITION (pt='${SRCTIME}')
- SELECT
- bussiness_id,
- cookie_id,
- session_id,
- funnel_id,
- group_first(funnel_name) funnel_name,
- step_id,
- group_first(step_name) step_name,
- group_first(log_type) log_type,
- group_first(url_pattern) url_pattern,
- group_first(url) url,
- group_first(refer) refer,
- group_first(log_time) log_time,
- group_first(is_new_visitor) is_new_visitor,
- group_first(is_mobile_traffic) is_mobile_traffic,
- group_first(is_bounce) is_bounce,
- group_first(campaign_name) campaign_name,
- group_first(group_name) group_name,
- group_first(slot_name) slot_name,
- group_first(source_type) source_type,
- group_first(next_page) next_page,
- group_first(continent) continent,
- group_first(sub_continent_region) sub_continent_region,
- group_first(country) country,
- group_first(region) region,
- group_first(city) city,
- group_first(language) language,
- group_first(browser) browser,
- group_first(os) os,
- group_first(screen_color) screen_color,
- group_first(screen_resolution) screen_resolution,
- group_first(flash_version) flash_version,
- group_first(java) java,
- group_first(host) host
- FROM
- ( SELECT *
- FROM r_wa_funnel
- WHERE pt='${SRCTIME}'
- ORDER BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC
- ) t1
- GROUP BY pt, bussiness_id, cookie_id, session_id, funnel_id, step_id;
group_first: 自己定义函数。用户取每组第一个字段
${SRCTIME}:
由外部oozie调度传入, 作为时间分区,精确到小时.eg: 2011.11.01.21
以下在hive上以SRCTIME = 2011.11.01.21
运行以上SQL. 2011.11.01.21小时分区记录数有10435486
运行时间:
从上面能够看出,reduce阶段仅仅有一个reduce, 这是由于ORDER BY是全局排序,hive仅仅能通过一个reduce进行排序
从业务需求来看, 仅仅要按bussiness_id, cookie_id, session_id, funnel_id, step_id分组,组内按
log_time升序排序就可以.
OK, 这样能够採用hive提供的distribute by 和 sort by,这样能够充分利用hadoop资源, 在多个
reduce中局部按log_time 排序
优化有的hive代码:
- INSERT OVERWRITE TABLE t_wa_funnel_distinct PARTITION (pt='2011.11.01.21')
- SELECT
- bussiness_id,
- cookie_id,
- session_id,
- funnel_id,
- group_first(funnel_name) funnel_name,
- step_id,
- group_first(step_name) step_name,
- group_first(log_type) log_type,
- group_first(url_pattern) url_pattern,
- group_first(url) url,
- group_first(refer) refer,
- group_first(log_time) log_time,
- group_first(is_new_visitor) is_new_visitor,
- group_first(is_mobile_traffic) is_mobile_traffic,
- group_first(is_bounce) is_bounce,
- group_first(campaign_name) campaign_name,
- group_first(group_name) group_name,
- group_first(slot_name) slot_name,
- group_first(source_type) source_type,
- group_first(next_page) next_page,
- group_first(continent) continent,
- group_first(sub_continent_region) sub_continent_region,
- group_first(country) country,
- group_first(region) region,
- group_first(city) city,
- group_first(language) language,
- group_first(browser) browser,
- group_first(os) os,
- group_first(screen_color) screen_color,
- group_first(screen_resolution) screen_resolution,
- group_first(flash_version) flash_version,
- group_first(java) java,
- group_first(host) host
- FROM
- ( SELECT *
- FROM r_wa_funnel
- WHERE pt='2011.11.01.21'
- distribute by bussiness_id, cookie_id, session_id, funnel_id, step_id sort by log_time ASC
- ) t1
- GROUP BY bussiness_id, cookie_id, session_id, funnel_id, step_id;
运行时间:
第一个须要运行6:43, 而优化有仅仅要运行0:35秒。性能得到大幅提升
最新文章
- TextInputLayout setError() setErrorEnable()
- android 开发中fragment 遇到的问题
- eclipse 启动报错\workspace\.metadata\.log
- python中%和format
- 发现一php木马代码
- Coder-Strike 2014 - Round 1 D. Giving Awards
- nginx反向代理编译异常
- 保存知乎收藏夹功能的NodeJS版本
- Selenium2.0介绍
- Windows下Sublime Text 默认打开方式问题解决办法
- CoreGraphics之CGContext绘图
- centos 7 安装 mariadb数据库
- 13-C语言字符串函数库
- 转:HTML与URL两种录制模式分析
- JavaScript中国象棋程序(0) - 前言
- 《Office 365 开发入门指南》公开邀请试读,欢迎反馈
- sha1 算法源码
- CSS基础一
- 利用Makefile安装helloworld模块(速成)
- NS3 MyApp Class Reference