Syntax of Order By

The ORDER BY syntax in Hive QL is similar to the syntax of ORDER BY in SQL language.

colOrder: ( ASC | DESC )
orderBy: ORDER BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy

There are some limitations in the "order by" clause. In the strict mode (i.e., hive.mapred.mode=strict), the order by clause has to be followed by a "limit" clause. (如果hive.mapred.mode=strict,那么order by 必须和limit一起使用)The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish.(排序数据是由一个reducer完成并且最终输出,如果没有limit限制那么,那么整个操作必须相当长的时间才能完成。)

Syntax of Sort By

The SORT BY syntax is similar to the syntax of ORDER BY in SQL language.

colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy

Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.

Difference between Sort By and Order By

Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

Note: It may be confusing as to the difference between SORT BY alone of a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field and SORT BY if there are multiple reducers partitions randomly in order to distribute data (and load) uniformly across the reducers.

Basically, the data in each reducer will be sorted according to the order that the user specified. The following example shows

SELECT key, value FROM src SORT BY key ASC, value DESC

The query had 2 reducers, and the output of each is:

0   5
0   3
3   6
9   1
0   4
0   3
1   1
2   5

Setting Types for Sort By

After a transform, variable types are generally considered to be strings, meaning that numeric data will be sorted lexicographically. To overcome this, a second SELECT statement with casts can be used before using SORT BY.

FROM (FROM (FROM src
            SELECT TRANSFORM(value)
            USING 'mapper'
            AS value, count) mapped
      SELECT cast(value as double) AS value, cast(count as int) AS count
      SORT BY value, count) sorted
SELECT TRANSFORM(value, count)
USING 'reducer'
AS whatever

Syntax of Cluster By and Distribute By

Cluster By and Distribute By are used mainly with the Transform/Map-Reduce Scripts. But, it is sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries.

Cluster By is a short-cut for both Distribute By and Sort By.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However,Distribute By does not guarantee clustering or sorting properties on the distributed keys.

For example, we are Distributing By x on the following 5 rows to 2 reducer:

x1
x2
x4
x3
x1

Reducer 1 got

x1
x2
x1

Reducer 2 got

x4
x3

Note that all rows with the same key x1 is guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.

In contrast, if we use Cluster By x, the two reducers will further sort rows on x:

Reducer 1 got

x1
x1
x2

Reducer 2 got

x3
x4

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.

SELECT col1, col2 FROM t1 CLUSTER BY col1
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1
 
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC
FROM (
  FROM pv_users
  MAP ( pv_users.userid, pv_users.date )
  USING 'map_script'
  AS c1, c2, c3
  DISTRIBUTE BY c2
  SORT BY c2, c1) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE ( map_output.c1, map_output.c2, map_output.c3 )
  USING 'reduce_script'
  AS date, count;
 

最新文章

  1. linux基础知识3_根文件系统详解
  2. Servlet监听器
  3. MMORPG大型游戏设计与开发(客户端架构 part11 of vegine)
  4. .NET Framework各版本汇总以及之间的关系
  5. db2日期和时间常用汇总
  6. Java[1] Java学习书籍汇总(转)
  7. 探索PHP+Nginx(一) 安装Linux操作系统
  8. 【BZOJ3627】【JLOI2014】路径规划 分层图
  9. java socket解析和发送二进制报文工具(附java和C++转化问题)
  10. Swift 入门之简单语法(四)
  11. c++-STL:删除子串
  12. Spark MLlib数据类型
  13. http初探
  14. 2018-2019-2 网络对抗技术 20165308 Exp2 后门原理与实践
  15. android checkbox自定义(文字位置、格式等)
  16. JavaSE基础知识(7)—常用类
  17. codeforces982F
  18. jenkins-ant-jmeter
  19. pyqt5在xp下的配置
  20. VM虚拟机打不开,没有反应,解决方法。

热门文章

  1. 获取腾讯soso地图坐标代码
  2. linux 免交互状态下修改用户密码
  3. 神经网络指南Hacker's guide to Neural Networks
  4. Classpath entry org.eclipse.m2e.MAVEN2_CLASSPATH_CONTAINER will not be exported
  5. Github欢乐多 PHP神级代码引发吐槽热
  6. VS2012 professional和VS2012 Ultimate的区别
  7. bzoj2242: [SDOI2011]计算器 && BSGS 算法
  8. Spring MVC 的请求参数获取的几种方法
  9. BZOJ3856: Monster
  10. Error: "DEVELOPER_DIR" is not defined at ./symbolicatecrash line 53