Hive in Depth (06) - Hive Tuning in Practice

Execution Plans (EXPLAIN)

Basic Syntax

EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query

Hands-on Example

(1) View the execution plans of the following statements

A query that does not generate an MR job:

hive (default)> explain select * from emp;

Explain

STAGE DEPENDENCIES:

Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-0

Fetch Operator

limit: -1

Processor Tree:

TableScan

alias: emp

Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE

Select Operator

expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7

Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE

ListSink

A query that generates an MR job:

hive (default)> explain select deptno, avg(sal) avg_sal from emp group by deptno;

Explain

STAGE DEPENDENCIES:

Stage-1 is a root stage

Stage-0 depends on stages: Stage-1

STAGE PLANS:

Stage: Stage-1

Map Reduce

Map Operator Tree:

TableScan

alias: emp

Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE

Select Operator

expressions: sal (type: double), deptno (type: int)

outputColumnNames: sal, deptno

Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE

Group By Operator

aggregations: sum(sal), count(sal)

keys: deptno (type: int)

mode: hash

outputColumnNames: _col0, _col1, _col2

Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE

Reduce Output Operator

key expressions: _col0 (type: int)

sort order: +

Map-reduce partition columns: _col0 (type: int)

Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE

value expressions: _col1 (type: double), _col2 (type: bigint)

Execution mode: vectorized

Reduce Operator Tree:

Group By Operator

aggregations: sum(VALUE._col0), count(VALUE._col1)

keys: KEY._col0 (type: int)

mode: mergepartial

outputColumnNames: _col0, _col1, _col2

Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE

Select Operator

expressions: _col0 (type: int), (_col1 / _col2) (type: double)

outputColumnNames: _col0, _col1

Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE

File Output Operator

compressed: false

Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE

table:

input format: org.apache.hadoop.mapred.SequenceFileInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-0

Fetch Operator

limit: -1

Processor Tree:

ListSink
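
The MR plan above computes avg(sal) in two phases: the map-side Group By Operator (mode: hash) emits a partial sum(sal) and count(sal) per deptno, and the reduce-side Group By Operator (mode: mergepartial) merges those partials before the final Select Operator divides _col1 by _col2. The Python sketch below mimics that flow on in-memory data. The function names and the sample rows are hypothetical; this only illustrates the idea and is not Hive's actual implementation.

```python
from collections import defaultdict


def map_side_partial(rows):
    """Map phase (Group By, mode: hash): partial (sum, count) of sal per deptno."""
    partial = defaultdict(lambda: [0.0, 0])
    for deptno, sal in rows:
        acc = partial[deptno]      # the group exists even if every sal is NULL
        if sal is not None:        # sum() and count(col) skip NULL values
            acc[0] += sal
            acc[1] += 1
    return {deptno: tuple(acc) for deptno, acc in partial.items()}


def reduce_side_merge(partials):
    """Reduce phase (Group By, mode: mergepartial), then the division _col1 / _col2."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for deptno, (part_sum, part_count) in partial.items():
            merged[deptno][0] += part_sum
            merged[deptno][1] += part_count
    return {
        deptno: (total / count if count else None)
        for deptno, (total, count) in merged.items()
    }


# Hypothetical (deptno, sal) rows, split across two map tasks.
split_1 = [(10, 1000.0), (20, 3000.0), (10, 2000.0)]
split_2 = [(20, 1000.0), (30, 500.0)]

avg_sal = reduce_side_merge([map_side_partial(split_1), map_side_partial(split_2)])
print(avg_sal)
```

Because each map task ships only one (sum, count) pair per key instead of every row, the shuffle carries far less data than sending raw salaries to the reducers.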

View the detailed execution plan

hive (default)> explain extended select * from emp;

hive (default)> explain extended select deptno, avg(sal) avg_sal from emp group by deptno;

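Fetch Task Conversion

Fetch task conversion means that Hive can answer certain queries without running a MapReduce job. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory of the employees table and print the query result to the console.

In hive-default.xml.template, hive.fetch.task.conversion defaults to more (older Hive versions default to minimal). With the value more, full-table selects, selects on specific columns, and LIMIT queries all skip MapReduce.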

<property>

<name>hive.fetch.task.conversion</name>

<value>more</value>

<description>

Expects one of [none, minimal, more].

Some select queries can be converted to single FETCH task minimizing latency.

Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.

0. none : disable hive.fetch.task.conversion

1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only

2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)

</description>

</property>

1) Hands-on example:

(1) Set hive.fetch.task.conversion to none and run the queries; every one of them launches a MapReduce job.

hive (default)> set hive.fetch.task.conversion=none;

hive (default)> select * from emp;

hive (default)> select ename from emp;

hive (default)> select ename from emp limit 3;

(2) Set hive.fetch.task.conversion to more and run the queries; none of the following launches a MapReduce job.

hive (default)> set hive.fetch.task.conversion=more;

hive (default)> select * from emp;

hive (default)> select ename from emp;

hive (default)> select ename from emp limit 3;
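
The three levels can be summarized as a small decision function. The sketch below is a simplified model built only from the property description quoted above; the Query class and its field names are hypothetical, and any further checks that Hive performs internally are ignored. LIMIT is allowed at both minimal and more, so it is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """Hypothetical summary of the query features that matter for conversion."""
    select_star: bool = False
    has_filter: bool = False
    filter_on_partition_columns_only: bool = False
    has_aggregation_or_distinct: bool = False
    has_join_subquery_or_lateral_view: bool = False


def converts_to_fetch(mode, query):
    """Return True if the query would become a single FETCH task (simplified)."""
    if mode not in ("none", "minimal", "more"):
        raise ValueError("Expects one of [none, minimal, more]")
    if mode == "none":
        return False
    # Excluded at every level according to the description.
    if query.has_aggregation_or_distinct or query.has_join_subquery_or_lateral_view:
        return False
    if mode == "more":
        return True  # SELECT, FILTER, LIMIT
    # minimal: SELECT STAR, FILTER on partition columns, LIMIT only
    if not query.select_star:
        return False
    if query.has_filter and not query.filter_on_partition_columns_only:
        return False
    return True


select_all = Query(select_star=True)               # select * from emp
select_column = Query()                            # select ename from emp
group_by = Query(has_aggregation_or_distinct=True)  # select deptno, avg(sal) ... group by deptno

for mode in ("none", "minimal", "more"):
    print(mode, converts_to_fetch(mode, select_all), converts_to_fetch(mode, select_column))
```

Under this model the group by query never converts, which is consistent with the Map Reduce stage in its execution plan.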

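Local Mode

Most Hadoop jobs need the full scalability that Hadoop provides in order to process large data sets. Sometimes, however, the input to a Hive query is very small. In that case, the time spent launching the job can be much longer than the time the job actually runs. For most of these cases, Hive can run all tasks on a single machine in local mode, which shortens execution time noticeably for small data sets.

Setting hive.exec.mode.local.auto to true lets Hive apply this optimization automatically when appropriate.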

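-- Enable local MR

set hive.exec.mode.local.auto=true;

-- Maximum input size for local MR: local MR is used when the input is smaller than this value. Default: 134217728 (128 MB)

set hive.exec.mode.local.auto.inputbytes.max=50000000;

-- Maximum number of input files for local MR: local MR is used when the number of input files is smaller than this value. Default: 4

set hive.exec.mode.local.auto.input.files.max=10;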
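
The way these three settings interact can be sketched as a simple check. This is an illustration under stated assumptions, not Hive's code: the function name is hypothetical, the strict "smaller than" comparison follows the wording above (Hive may treat the boundary value differently), and other conditions Hive may apply are left out.

```python
DEFAULT_MAX_INPUT_BYTES = 134217728  # 128 MB, hive.exec.mode.local.auto.inputbytes.max
DEFAULT_MAX_INPUT_FILES = 4          # hive.exec.mode.local.auto.input.files.max


def runs_in_local_mode(auto_enabled, input_file_sizes,
                       max_input_bytes=DEFAULT_MAX_INPUT_BYTES,
                       max_input_files=DEFAULT_MAX_INPUT_FILES):
    """Decide whether a query is small enough for local MR (simplified model)."""
    if not auto_enabled:                 # hive.exec.mode.local.auto=false
        return False
    total_bytes = sum(input_file_sizes)
    return (total_bytes < max_input_bytes
            and len(input_file_sizes) < max_input_files)


# Hypothetical inputs: one tiny file, one 200 MB file, and twelve small files.
print(runs_in_local_mode(True, [7020]))
print(runs_in_local_mode(True, [200 * 1024 * 1024]))
print(runs_in_local_mode(True, [1000] * 12, max_input_files=10))
```

Both limits must hold at the same time: a query over many tiny files can stay on the cluster even though its total size is small.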

1) Hands-on example:

(1) Enable local mode and run the query

hive (default)> set hive.exec.mode.local.auto=true;

hive (default)> select * from emp cluster by deptno;

Time taken: 1.328 seconds, Fetched: 14 row(s)

(2) Disable local mode and run the query

hive (default)> set hive.exec.mode.local.auto=false;

hive (default)> select * from emp cluster by deptno;

Time taken: 20.09 seconds, Fetched: 14 row(s)

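Table Optimization

Joining a Small Table with a Large Table (MapJoin)

Put the table whose keys are relatively scattered and whose data volume is small on the left side of the join; this effectively lowers the chance of out-of-memory errors. Going one step further, use a map join to load a small dimension table (fewer than 1000 records) into memory first and complete the join on the map side.

In practice, tests show that newer Hive versions already optimize both small-table JOIN large-table and large-table JOIN small-table, so placing the small table on the left or on the right no longer makes a noticeable difference.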

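Hands-on Example

1) Requirement

Compare the efficiency of large-table JOIN small-table and small-table JOIN large-table

2) Enable the MapJoin settings

(1) Let Hive choose MapJoin automatically

-- Default is true

set hive.auto.convert.join = true;

(2) Threshold that separates large and small tables (by default, tables under 25 MB are treated as small):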

set hive.mapjoin.smalltable.filesize = 25000000;

3) How MapJoin works
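
The core idea of a map-side join can be sketched in a few lines: the small table is loaded into an in-memory hash table once, and every map task probes that table while scanning its own split of the large table, so no shuffle or reduce phase is needed for the join. The Python sketch below is a minimal illustration under that assumption; the function names and sample rows are hypothetical and do not reflect Hive's internal classes.

```python
from collections import defaultdict


def build_hash_table(small_rows, key):
    """Load the small table into memory, indexed by the join key."""
    table = defaultdict(list)
    for row in small_rows:
        if row[key] is not None:      # NULL never matches in an equi-join
            table[row[key]].append(row)
    return table


def map_side_join(big_split, hash_table, key):
    """Run inside each map task: probe the hash table for every large-table row."""
    for big_row in big_split:
        for small_row in hash_table.get(big_row[key], ()):
            yield big_row, small_row


# Hypothetical rows; only a few columns are shown.
smalltable = [{"id": 1, "keyword": "a"}, {"id": 2, "keyword": "b"}]
big_split_1 = [{"id": 1, "click_num": 5}, {"id": 3, "click_num": 1}]
big_split_2 = [{"id": 2, "click_num": 7}, {"id": 1, "click_num": 2},
               {"id": None, "click_num": 9}]

hash_table = build_hash_table(smalltable, "id")
joined = [
    (b["id"], b["click_num"], s["keyword"])
    for split in (big_split_1, big_split_2)
    for b, s in map_side_join(split, hash_table, "id")
]
print(joined)
```

The hash table has to fit in each task's memory, which is why the conversion only applies to tables below the small-table threshold.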

4) Statements that create the large table, the small table, and the join result table

-- Create the large table

create table bigtable(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

-- Create the small table

create table smalltable(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

-- Create the table that stores the join result

create table jointable(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

5) Load data into the large table and the small table

hive (default)> load data local inpath '/opt/module/hive/datas/bigtable' into table bigtable;

hive (default)> load data local inpath '/opt/module/hive/datas/smalltable' into table smalltable;

6) Small table JOIN large table

insert overwrite table jointable

select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url

from smalltable s

join bigtable b

on b.id = s.id;

Time taken: 35.921 seconds

No rows affected (44.456 seconds)

7) Large table JOIN small table

insert overwrite table jointable

select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url

from bigtable b

join smalltable s

on s.id = b.id;

Time taken: 34.196 seconds

No rows affected (26.287 seconds)

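Joining Two Large Tables

1) Filtering null keys

Sometimes a join times out because certain keys carry too many rows. Rows with the same key are all sent to the same reducer, which then runs out of memory. In that case, analyze the abnormal keys carefully: in many cases the rows behind them are bad data and should be filtered out in the SQL statement, for example rows whose key column is null. Proceed as follows:

Hands-on Example

(1) Configure the history server

Configure mapred-site.xml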

<property>

<name>mapreduce.jobhistory.address</name>

<value>hadoop102:10020</value>

</property>

<property>

<name>mapreduce.jobhistory.webapp.address</name>

<value>hadoop102:19888</value>

</property>

Start the history server

sbin/mr-jobhistory-daemon.sh start historyserver

View jobhistory

http://hadoop102:19888/jobhistory

(2) Create the raw data table, the null-id table, and the merged data table

-- Create the null-id table

create table nullidtable(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

(3) Load the raw data and the null-id data into their tables

hive (default)> load data local inpath '/opt/module/hive/datas/nullid' into table nullidtable;

(4) Test without filtering null ids

hive (default)> insert overwrite table jointable select n.* from nullidtable n

left join bigtable o on n.id = o.id;

(5) Test with null ids filtered out

hive (default)> insert overwrite table jointable select n.* from (select * from nullidtable where id is not null ) n left join bigtable o on n.id = o.id;
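
The effect of the filter can be illustrated with a small simulation of how join keys are spread over reducers. This is a sketch under stated assumptions: the key counts are invented, the function name is hypothetical, and the partitioner (NULL keys to reducer 0, other keys by key modulo the reducer count) is a stand-in for Hive's real hash partitioning. Note that the filtered query also drops the null-id rows from the left join result, so it only fits cases where those rows really are bad data.

```python
from collections import Counter


def rows_per_reducer(keys, num_reducers):
    """Count rows per reducer; rows with the same key always share a reducer."""
    load = Counter()
    for key in keys:
        # Assumption: every NULL key lands on reducer 0.
        reducer = 0 if key is None else key % num_reducers
        load[reducer] += 1
    return [load[r] for r in range(num_reducers)]


# Hypothetical nullidtable: 100 valid ids and 900 rows whose id is NULL.
ids = list(range(1, 101)) + [None] * 900

unfiltered = rows_per_reducer(ids, 5)
filtered_ids = [i for i in ids if i is not None]   # where id is not null
filtered = rows_per_reducer(filtered_ids, 5)

print("unfiltered:", unfiltered)
print("filtered:  ", filtered)
print("rows dropped by the filter:", len(ids) - len(filtered_ids))
```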

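2) Converting null keys

Sometimes a null key carries many rows, but those rows are not bad data and must appear in the join result. In that case, assign a random value to the null keys in table a so that the rows are distributed randomly and evenly across different reducers. For example:

Hands-on example:

Without randomly distributing the null values:

(1) Set the number of reducers to 5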

set mapreduce.job.reduces = 5;

(2) JOIN the two tables

insert overwrite table jointable

select n.* from nullidtable n left join bigtable b on n.id = b.id;

Result: as the figure shows, data skew occurs; some reducers consume far more resources than the others.
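
The skew and the effect of giving null keys a random value can be reproduced in a small simulation. This is a sketch under stated assumptions, not Hive's behavior in detail: the row counts are invented, the partitioner is a simple modulo with NULL keys sent to reducer 0, and the null keys are replaced with random negative integers on the assumption that real ids are positive, so the substituted values can never match a real id.

```python
import random
from collections import Counter


def rows_per_reducer(keys, num_reducers):
    """Count rows per reducer; rows with the same key always share a reducer."""
    load = Counter()
    for key in keys:
        # Assumption: every NULL key lands on reducer 0.
        reducer = 0 if key is None else key % num_reducers
        load[reducer] += 1
    return [load[r] for r in range(num_reducers)]


def randomize_null_keys(keys, rng):
    """Replace each NULL key with a random negative integer (ids assumed positive)."""
    return [-rng.randint(1, 10**9) if key is None else key for key in keys]


# Hypothetical nullidtable: 100 valid ids and 900 rows whose id is NULL.
ids = list(range(1, 101)) + [None] * 900

skewed = rows_per_reducer(ids, 5)                       # mapreduce.job.reduces = 5
balanced = rows_per_reducer(randomize_null_keys(ids, random.Random(42)), 5)

print("null keys kept as NULL:", skewed)
print("null keys randomized:  ", balanced)
```

All rows are still present after the conversion; they are only routed to different reducers, and because the substituted keys match nothing in the other table, a left join still returns them with empty right-hand columns.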