Hive Data Manipulation Language

There are two primary ways of modifying data in Hive:

Loading files into tables

Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.

Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Synopsis

Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.

  • filepath can be:

    • a relative path, such as project/data1
    • an absolute path, such as /user/hive/project/data1
    • a full URI with scheme and (optionally) an authority, such as hdfs://namenode:9000/user/hive/project/data1
  • The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
  • filepath can refer to a file (in which case Hive will move the file into the table) or it can be a directory (in which case Hive will move all the files within that directory into the table). In either case, filepath addresses a set of files.
  • If the keyword LOCAL is specified, then:
    • the load command will look for filepath in the local file system. If a relative path is specified, it will be interpreted relative to the user's current working directory. The user can specify a full URI for local files as well - for example: file:///user/hive/project/data1
    • the load command will try to copy all the files addressed by filepath to the target filesystem. The target file system is inferred by looking at the location attribute of the table. The copied data files will then be moved to the table.
  • If the keyword LOCAL is not specified, then Hive will either use the full URI of filepath, if one is specified, or will apply the following rules:
    • If scheme or authority are not specified, Hive will use the scheme and authority from the hadoop configuration variable fs.default.name that specifies the Namenode URI.
    • If the path is not absolute, then Hive will interpret it relative to /user/<username>
    • Hive will move the files addressed by filepath into the table (or partition)
  • If the OVERWRITE keyword is used then the contents of the target table (or partition) will be deleted and replaced by the files referred to by filepath; otherwise the files referred by filepath will be added to the table.
    • Note that if the target table (or partition) already has a file whose name collides with any of the filenames contained in filepath, then the existing file will be replaced with the new file.
Notes
  • filepath cannot contain subdirectories.
  • If the keyword LOCAL is not given, filepath must refer to files within the same filesystem as the table's (or partition's) location.
  • Hive does some minimal checks to make sure that the files being loaded match the target table. Currently it checks that if the table is stored in sequencefile format, the files being loaded are also sequencefiles, and vice versa.
  • Please read CompressedStorage if your datafile is compressed.

Inserting data into Hive Tables from queries

Query Results can be inserted into tables by using the insert clause.

Syntax
Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement; Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...; Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
Synopsis
  • INSERT OVERWRITE will overwrite any existing data in the table or partition

    • unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0)
  • INSERT INTO will append to the table or partition keeping the existing data in tact. (Note: INSERT INTO syntax is only available starting in version 0.8)
  • Inserts can be done to a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
  • Multiple insert clauses (also known as Multi Table Insert) can be specified in the same query.
  • The output of each of the select statements is written to the chosen table (or partition). Currently the OVERWRITE keyword is mandatory and implies that the contents of the chosen table or partition are replaced with the output of corresponding select statement.
  • The output format and serialization class is determined by the table's metadata (as specified via DDL commands on the table).
Notes
  • Multi Table Inserts minimize the number of data scans required. Hive can insert data into multiple tables by scanning the input data just once (and applying different query operators) to the input data.
Dynamic Partition Inserts

Version information

Icon

This information reflects the situation in Hive 0.12; dynamic partition inserts were added in Hive 0.6.

In the dynamic partition inserts, users can give partial partition specifications, which means just specifying the list of partition column names in the PARTITION clause. The column values are optional. If a partition column value is given, we call this a static partition, otherwise it is a dynamic partition. Each dynamic partition column has a corresponding input column from the select statement. This means that the dynamic partition creation is determined by the value of the input column. The dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.

Dynamic Partition inserts are disabled by default. These are the relevant configuration properties for dynamic partition inserts:

Configuration property

Default

Note

hive.exec.dynamic.partition

false

Needs to be set to true to enable dynamic partition inserts

hive.exec.dynamic.partition.mode

strict

In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions, in nonstrict mode all partitions are allowed to be dynamic

hive.exec.max.dynamic.partitions.pernode

100

Maximum number of dynamic partitions allowed to be created in each mapper/reducer node

hive.exec.max.dynamic.partitions

1000

Maximum number of dynamic partitions allowed to be created in total

hive.exec.max.created.files

100000

Maximum number of HDFS files created by all mappers/reducers in a MapReduce job

hive.error.on.empty.partition

false

Whether to throw an exception if dynamic partition insert generates empty results

Example
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt

Here the country partition will be dynamically created by the last column from the SELECT clause (i.e. pvs.cnt). Note that the name is not used. In nonstrict mode the dt partition could also be dynamically created.

Additional Documentation

Writing data into the filesystem from queries

Query results can be inserted into filesystem directories by using a slight variation of the syntax above:

Syntax
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ... Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ... row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)
Synopsis
  • Directory can be a full URI. If scheme or authority are not specified, Hive will use the scheme and authority from the hadoop configuration variablefs.default.name that specifies the Namenode URI.
  • If LOCAL keyword is used, Hive will write data to the directory on the local file system.
  • Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type, then those columns are serialized to JSON format.
Notes
  • INSERT OVERWRITE statements to directories, local directories, and tables (or partitions) can all be used together within the same query.
  • INSERT OVERWRITE statements to HDFS filesystem directories are the best way to extract large amounts of data from Hive. Hive can write to HDFS directories in parallel from within a map-reduce job.
  • The directory is, as you would expect, OVERWRITten; in other words, if the specified path exists, it is clobbered and replaced with the output.
  • As of Hive 0.11.0 the separator used can be specified, in earlier versions it was always the ^A character (\001)
 

 

最新文章

  1. 【JAVA面试题系列一】面试题总汇--JAVA基础部分
  2. iOS——学习网址收集+如何提高iOS开发技能
  3. (转载)IO-同步、异步、阻塞、非阻塞
  4. C#引用Interop.SQLDMO.dll后的注意事项(转)
  5. Xfce 快捷键
  6. Case 架构的实际应用-1
  7. firebreath注册接口
  8. Struts2实现国际化
  9. 记录一次无聊的(经历了Nodejs -&gt; Shell -&gt; C)的探索问题过程
  10. Python处理json字符串转化为字典
  11. 【转载】netstat--查看服务器[有效]连接数--统计端口并发数--access.log分析
  12. SpringCloud应用入库后乱码问题
  13. 【图文详解】HDFS基本原理
  14. numpy数组扩展函数repeat和tile用法
  15. Linux下用rm删除的文件的恢复方法
  16. tcpdump抓包常用参数和用法
  17. android studio build.gradle 中的dependencies 的 compile jar文件
  18. 记spring mvc传入List&lt;Object&gt;的一次尝试
  19. 含有ref out 参数 的方法反射 Emit 与 普通
  20. 手机html根据手机分辨率网页文字大小自适应

热门文章

  1. Object类型的创建和访问
  2. pytorch中squeeze()和unsqueeze()函数介绍
  3. js对数组的元素分割
  4. 买房的贷款时间是否是越长越好?https://www.zhihu.com/question/20842791
  5. Vue 动画的钩子函数
  6. electron-vue 窗口拖拽及自定义边框,及关闭缩小放大化方法
  7. H3C RIP协议概述
  8. 2019-1-27-WPF-使用-ItemsPanel-修改方向
  9. linux版本依赖
  10. 纯CSS绘制的图形一览