HIVE JOIN

Overview

Hive provides several join implementations:

  • Common (Reduce-side) Join
  • Broadcast (Map-side) Join
  • Bucket Map Join
  • Sort Merge Bucket Join
  • Skew Join

This post covers the first two.

The first is the common join. As the name suggests, it is the most common join implementation, but it is neither flexible nor particularly fast.

A common join consists of a map stage, a shuffle stage, and a reduce stage. The map stage generates the join key and join value according to the join condition and writes them to intermediate files. The shuffle stage sorts these files by join key and routes records with the same key to the same reducer. The reduce stage performs the final merge and produces the result.
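The three stages above can be sketched in plain Python. This is a toy simulation, not Hive code; the table data mirrors the sample tables used later in this post, and the tag values are an assumption standing in for Hive's table aliases.

```python
# Toy simulation of Hive's common (reduce-side) join on a single key.
from itertools import groupby
from operator import itemgetter

users = [(1, 1, 1), (2, 2, 2)]               # (id, uid, city_id)
cities = [(1, "beijing"), (2, "shanghai")]   # (id, name)

# Map stage: emit (join_key, (tag, row)); the tag records which table
# the row came from.
mapped = [(row[2], (0, row)) for row in users] + \
         [(row[0], (1, row)) for row in cities]

# Shuffle stage: sort by join key so rows with the same key are adjacent.
mapped.sort(key=itemgetter(0))

# Reduce stage: for each key, combine the rows from both tables.
result = []
for key, group in groupby(mapped, key=itemgetter(0)):
    left, right = [], []
    for _, (tag, row) in group:
        (left if tag == 0 else right).append(row)
    for l in left:
        for r in right:
            result.append((l[0], l[1], r[1]))

print(result)  # [(1, 1, 'beijing'), (2, 2, 'shanghai')]
```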

The second is the broadcast join, which eliminates the shuffle and reduce stages and completes the join entirely in the map stage: the small side of the join is loaded into memory, and every mapper joins its input against the in-memory table directly.

Getting the small table into memory takes some care. The small table is first loaded into memory and serialized into a hashtable file. When the map stage starts, this hashtable file is pushed to the distributed cache, which distributes it to the local disk of each mapper; each mapper then loads the hashtable file into memory and performs the join. With this optimization the small table is read only once, and if several mappers run on the same machine, only one copy of the hashtable file is needed there.
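A minimal sketch of the map-side idea, again as toy Python rather than Hive internals: the dict below stands in for the in-memory hashtable built from the small table, and the unmatched row models the LEFT JOIN case.

```python
# Toy simulation of a broadcast (map-side) join; the small table is
# assumed to fit in memory.
cities = {1: "beijing", 2: "shanghai", 3: "hangzhou"}  # small table: id -> name

def mapper(row):
    """Each mapper probes the in-memory hash table; no shuffle or reduce."""
    row_id, uid, city_id = row
    name = cities.get(city_id)   # None models an unmatched key (LEFT JOIN)
    return (row_id, uid, name)

big_table = [(1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 99)]
joined = [mapper(r) for r in big_table]
print(joined)
# [(1, 1, 'beijing'), (2, 2, 'shanghai'), (3, 3, 'hangzhou'), (4, 4, None)]
```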

Inspecting the plans with EXPLAIN

Two tables were prepared: test_a and test_city.

  • Contents of test_a:

    test_a.id  test_a.uid  test_a.city_id
    1          1           1
    2          2           2
    3          3           3

  • Contents of test_city:

    test_city.id  test_city.name
    1             beijing
    2             shanghai
    3             hangzhou

LEFT JOIN

The SQL is as follows:

explain
select a.id, a.uid, b.name
from
temp.test_a as a
left join
temp.test_city as b
on a.city_id = b.id;

Because the tables are tiny, Hive chooses a map-side join. The plan looks like this:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:    // local task that reads the small table from file
        $hdt$_1:b
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:b
          TableScan                 // scan test_city, reading it row by row
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator         // project the needed columns
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator  // my understanding: the keys used when building the hashtable placed in the distributed cache, though I'm not entirely sure
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator     // note: this is the map-side join
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)
                outputColumnNames: _col0, _col1, _col4
                Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

If you set

set hive.auto.convert.join=false;

the plan falls back to a reduce-side (common) join, the most widely used join implementation. (Even with auto-conversion enabled, Hive only picks a map join when the small table is below the hive.mapjoin.smalltable.filesize threshold.) The whole job now consists of two stages:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:            // map phase
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator  // map-side shuffle output: key, sort order, and partition columns feed the reduce stage
                key expressions: _col2 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col2 (type: bigint)
                Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: bigint), _col1 (type: bigint)
          TableScan
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col0 (type: bigint)
                Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 _col2 (type: bigint)
            1 _col0 (type: bigint)
          outputColumnNames: _col0, _col1, _col4
          Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
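Both plans compute the same LEFT JOIN. For reference, the expected output on the sample tables can be checked with a few lines of plain Python (no Hive required):

```python
# Expected result of the LEFT JOIN on the sample tables.
test_a = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]                 # (id, uid, city_id)
test_city = {1: "beijing", 2: "shanghai", 3: "hangzhou"}   # id -> name

rows = [(id_, uid, test_city.get(city_id)) for id_, uid, city_id in test_a]
print(rows)  # [(1, 1, 'beijing'), (2, 2, 'shanghai'), (3, 3, 'hangzhou')]
```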

