HIVE JOIN

Overview

Hive provides several join implementations:

  • Common (Reduce-side) Join
  • Broadcast (Map-side) Join
  • Bucket Map Join
  • Sort Merge Bucket Join
  • Skew Join

This post covers the first two.

The first is the common join. As the name suggests, it is the most common join implementation, but it is neither flexible nor particularly fast.

A common join consists of a map stage, a shuffle stage, and a reduce stage. The map stage generates the join key and join value according to the join condition and writes them to intermediate files. The shuffle stage sorts these files by join key and routes records with the same key to the same reducer. The reduce stage performs the final merge and produces the result.
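The three stages above can be sketched in plain Python. This is a toy simulation, not Hive code; the table data mirrors the sample tables used later in this post, and the tag values are an assumption standing in for Hive's table aliases.

```python
# Toy simulation of Hive's common (reduce-side) join on a single key.
from itertools import groupby
from operator import itemgetter

users = [(1, 1, 1), (2, 2, 2)]               # (id, uid, city_id)
cities = [(1, "beijing"), (2, "shanghai")]   # (id, name)

# Map stage: emit (join_key, (tag, row)); the tag records which table
# the row came from.
mapped = [(row[2], (0, row)) for row in users] + \
         [(row[0], (1, row)) for row in cities]

# Shuffle stage: sort by join key so rows with the same key are adjacent.
mapped.sort(key=itemgetter(0))

# Reduce stage: for each key, combine the rows from both tables.
result = []
for key, group in groupby(mapped, key=itemgetter(0)):
    left, right = [], []
    for _, (tag, row) in group:
        (left if tag == 0 else right).append(row)
    for l in left:
        for r in right:
            result.append((l[0], l[1], r[1]))

print(result)  # [(1, 1, 'beijing'), (2, 2, 'shanghai')]
```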

The second is the broadcast join, which eliminates the shuffle and reduce stages and completes the join entirely in the map stage: the small side of the join is loaded into memory, and every mapper joins its input against the in-memory table directly.

Getting the small table into memory takes some care. The small table is first loaded into memory and serialized into a hashtable file. When the map stage starts, this hashtable file is pushed to the distributed cache, which distributes it to the local disk of each mapper; each mapper then loads the hashtable file into memory and performs the join. With this optimization the small table is read only once, and if several mappers run on the same machine, only one copy of the hashtable file is needed there.
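A minimal sketch of the map-side idea, again as toy Python rather than Hive internals: the dict below stands in for the in-memory hashtable built from the small table, and the unmatched row models the LEFT JOIN case.

```python
# Toy simulation of a broadcast (map-side) join; the small table is
# assumed to fit in memory.
cities = {1: "beijing", 2: "shanghai", 3: "hangzhou"}  # small table: id -> name

def mapper(row):
    """Each mapper probes the in-memory hash table; no shuffle or reduce."""
    row_id, uid, city_id = row
    name = cities.get(city_id)   # None models an unmatched key (LEFT JOIN)
    return (row_id, uid, name)

big_table = [(1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 99)]
joined = [mapper(r) for r in big_table]
print(joined)
# [(1, 1, 'beijing'), (2, 2, 'shanghai'), (3, 3, 'hangzhou'), (4, 4, None)]
```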

Inspecting the plans with EXPLAIN

Two tables were prepared: test_a and test_city.

  • Contents of test_a:

    test_a.id  test_a.uid  test_a.city_id
    1          1           1
    2          2           2
    3          3           3

  • Contents of test_city:

    test_city.id  test_city.name
    1             beijing
    2             shanghai
    3             hangzhou

LEFT JOIN

The SQL is as follows:

explain
select a.id, a.uid, b.name
from
temp.test_a as a
left join
temp.test_city as b
on a.city_id = b.id;

Because the tables are tiny, Hive chooses a map-side join. The plan looks like this:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:    // local task that reads the small table from file
        $hdt$_1:b
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:b
          TableScan                 // scan test_city, reading it row by row
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator         // project the needed columns
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator  // my understanding: the keys used when building the hashtable placed in the distributed cache, though I'm not entirely sure
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator     // note: this is the map-side join
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)
                outputColumnNames: _col0, _col1, _col4
                Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

If you set

set hive.auto.convert.join=false;

the plan falls back to a reduce-side (common) join, the most widely used join implementation. (Even with auto-conversion enabled, Hive only picks a map join when the small table is below the hive.mapjoin.smalltable.filesize threshold.) The whole job now consists of two stages:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:            // map phase
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator  // map-side shuffle output: key, sort order, and partition columns feed the reduce stage
                key expressions: _col2 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col2 (type: bigint)
                Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: bigint), _col1 (type: bigint)
          TableScan
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col0 (type: bigint)
                Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 _col2 (type: bigint)
            1 _col0 (type: bigint)
          outputColumnNames: _col0, _col1, _col4
          Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
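Both plans compute the same LEFT JOIN. For reference, the expected output on the sample tables can be checked with a few lines of plain Python (no Hive required):

```python
# Expected result of the LEFT JOIN on the sample tables.
test_a = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]                 # (id, uid, city_id)
test_city = {1: "beijing", 2: "shanghai", 3: "hangzhou"}   # id -> name

rows = [(id_, uid, test_city.get(city_id)) for id_, uid, city_id in test_a]
print(rows)  # [(1, 1, 'beijing'), (2, 2, 'shanghai'), (3, 3, 'hangzhou')]
```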

