环境为伪分布。

1、下载相关文件:

lzo-2.10.tar.gz:https://www.oberhumer.com/opensource/lzo/

hadoop-lzo-master.zip:https://github.com/twitter/hadoop-lzo/archive/master.zip

2、Configure LZO to build a shared library (required) and use a package-specific prefix (optional but recommended):

[root@zgg opt]# tar -zxvf lzo-2.10.tar.gz
....
[root@zgg opt]# cd lzo-2.10
[root@zgg lzo-2.10]# ./configure --enable-shared --prefix /usr/local/lzo-2.10

3、Build and install LZO:

[root@zgg lzo-2.10]# make && sudo make install

如果是集群环境,编译完 lzo 包之后,将 /usr/local/lzo-2.10目录下生成的所有文件打包,并同步到集群其他节点。

4、安装 hadoop-lzo

[root@zgg opt]# unzip hadoop-lzo-master.zip
....
[root@zgg opt]# vi /etc/profile
....
export C_INCLUDE_PATH=/usr/local/lzo-2.10/include
export LIBRARY_PATH=/usr/local/lzo-2.10/lib
....
[root@zgg opt]# source /etc/profile
[root@zgg opt]# cd hadoop-lzo-master [root@zgg hadoop-lzo-master]# mvn clean package
....
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16:46 min
[INFO] Finished at: 2021-01-10T14:08:16+08:00
[INFO] ------------------------------------------------------------------------ [root@zgg hadoop-lzo-master]# cd target/
[root@zgg target]# ls
antrun generated-sources hadoop-lzo-0.4.21-SNAPSHOT-sources.jar native
apidocs hadoop-lzo-0.4.21-SNAPSHOT.jar javadoc-bundle-options test-classes
classes hadoop-lzo-0.4.21-SNAPSHOT-javadoc.jar maven-archiver # 将`hadoop-lzo-0.4.21-SNAPSHOT.jar`复制到 .../common 目录下
[root@zgg hadoop-lzo-master]# cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar /opt/hadoop-3.2.1/share/hadoop/common

如果是集群环境,需要将hadoop-lzo-0.4.21-SNAPSHOT.jar同步到集群其他节点。

5、配置 Hadoop 属性

hadoop-env.sh:

export LD_LIBRARY_PATH=/usr/local/lzo-2.10/lib

core-site.xml

<property>
<!-- 配置支持 LZO 压缩 -->
<name>io.compression.codecs</name>
<value>
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

mapred-site.xml

<property>
<!-- 启用map任务输出的压缩 -->
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<!-- map任务输出的压缩类型 -->
<name>mapred.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzopCodec</value>
</property> <property>
<!-- 启用job输出的压缩 -->
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<property>
<!-- job输出的压缩类型,这里是LzopCodec -->
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
<name>mapred.child.env</name>
<value>LD_LIBRARY_PATH=/usr/local/lzo-2.10/lib</value>
</property>

如果是集群环境,需要将这些配置同步到集群其他节点。

6、测试

# 安装lzop
yum install lzop # 压缩文件
lzop wc.txt # 测试wordcount
[root@zgg target]# hadoop jar /opt/hadoop-3.2.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /in/wc.txt.lzo /out/wc
....
【一个分片,未切片】
2021-01-10 15:54:36,249 INFO mapreduce.JobSubmitter: number of splits:1 [root@zgg target]# hadoop fs -ls /out/wc
Found 2 items
-rw-r--r-- 1 root supergroup 0 2021-01-10 16:36 /out/wc/_SUCCESS
-rw-r--r-- 1 root supergroup 91 2021-01-10 16:36 /out/wc/part-r-00000.lzo

7、LZO 创建索引

LZO 压缩文件的可切片特性依赖于其索引,故我们需要手动为 LZO 压缩文件创建索引。若无索引,则 LZO 文件的切片只有一个。

# 数据文件的目录是hdfs上的目录
# 【com.hadoop.compression.lzo.DistributedLzoIndexer】
[root@zgg target]# hadoop jar /opt/hadoop-lzo-master/target/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /in/wc.txt.lzo
2021-01-10 16:41:23,817 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
2021-01-10 16:41:23,820 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 5dbdddb8cfb544e58b4e0b9664b9d1b66657faf5]
2021-01-10 16:41:24,573 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /in/wc.txt.lzo, size 0.00 GB...
2021-01-10 16:41:24,659 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-01-10 16:41:24,736 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-01-10 16:41:24,802 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.23 seconds (0.00 MB/s). Index size is 0.01 KB. # 查看
[root@zgg data]# hadoop fs -ls /in
Found 2 items
-rw-r--r-- 1 root supergroup 124 2021-01-10 16:13 /in/lzo/wc.txt.lzo
-rw-r--r-- 1 root supergroup 8 2021-01-10 16:13 /in/lzo/wc.txt.lzo.index # 测试
# 【输入路径也必须包含索引文件】
[root@zgg target]# hadoop jar /opt/hadoop-3.2.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /in/lzo /out/wc
....
2021-01-10 16:43:41,854 INFO mapreduce.JobSubmitter: number of splits:2
.... [root@zgg target]# hadoop fs -ls /out/wc
Found 2 items
-rw-r--r-- 1 root supergroup 0 2021-01-10 16:44 /out/wc/_SUCCESS
-rw-r--r-- 1 root supergroup 102 2021-01-10 16:44 /out/wc/part-r-00000.lzo [root@zgg target]# hadoop fs -text /out/wc/part-r-00000.lzo
2021-01-10 16:44:47,453 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-01-10 16:44:47,521 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
2021-01-10 16:44:47,557 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 5dbdddb8cfb544e58b4e0b9664b9d1b66657faf5]
2021-01-10 16:44:47,562 INFO compress.CodecPool: Got brand-new decompressor [.lzo]
, 1
flink 170
hadoop 510
hello 340
spark 170

参考地址:

https://www.cnblogs.com/caoshouling/p/14091113.html

https://github.com/twitter/hadoop-lzo

最新文章

  1. VR、AR、MR的区别
  2. hdu 4898 The Revenge of the Princess’ Knight
  3. SVN更新报错
  4. HTML5的WebGL实现的3D和2D拓扑树
  5. codevs1910 递归函数
  6. c3p0配置
  7. POJ 3225 Help with Intervals
  8. JQuery基础教程:选择元素(中)
  9. LA3902 Network
  10. Redis 性能测试
  11. OneAlert 入门(二)——事件分析
  12. 工作中用到的简单linux命令
  13. PHP中的http协议
  14. Dell3470无法开机或开机黑屏情况下检测屏幕是否正常
  15. 代码管理工具Git的安装及使用
  16. CSS选择器 + Xpath + 正则表达式整理(有空再整理)
  17. VR外包—长年承接虚拟现实项目和AR外包游戏、软件(北京动点飞扬软件)
  18. github 的ssh key
  19. Windows 下最佳的 C++ 开发的 IDE 是什么?
  20. 第 15 章 位操作(dualview)

热门文章

  1. Memcached 缓存系统简介
  2. 用hyper-v创建虚拟机
  3. HTML定位
  4. 6127:Largest Average
  5. Pytest(16)随机执行测试用例pytest-random-order
  6. sqoop使用以及常见问题
  7. Mybatis学习笔记1
  8. Codeforces Round #673 (Div. 2) A. Copy-paste(贪心)
  9. 树链剖分(附带LCA和换根)——基于dfs序的树上优化
  10. Dapr微服务应用开发系列3:服务调用构件块