According to the Wikipedia article on MapReduce, there are two ways to describe MapReduce. One uses three steps: Map, Shuffle, and Reduce. I prefer the other, which uses five steps:

a. Prepare the Map() input.

b. Run the user-provided Map() code.

c. "Shuffle" the Map output to the Reduce processors.

d. Run the user-provided Reduce() code.

e. Produce the final output.
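The five steps above can be sketched as a tiny in-memory word count. This is only an illustration in plain Python, not Hadoop code: the function names (`prepare_input`, `map_fn`, `shuffle`, `reduce_fn`, `run_job`) are hypothetical, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict

def prepare_input(text):
    # Step a: turn raw input into (key, value) records.
    # Here the key is the line number and the value is the line text.
    return list(enumerate(text.splitlines()))

def map_fn(key, line):
    # Step b: user-provided Map(): emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Step c: group all mapped values by key, so that each key
    # reaches exactly one Reduce() call.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, counts):
    # Step d: user-provided Reduce(): sum the counts for one word.
    return word, sum(counts)

def run_job(text):
    records = prepare_input(text)
    mapped = [pair for key, line in records for pair in map_fn(key, line)]
    grouped = shuffle(mapped)
    # Step e: collect the final output, sorted by key.
    return dict(sorted(reduce_fn(w, c) for w, c in grouped.items()))

result = run_job("the quick fox\nthe lazy dog")
```

In a real job, step a is the part handled by the InputFormat, InputSplit, and RecordReader discussed in the rest of this post.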

This blog focuses on how to prepare the Map() input:

1. Block and InputSplit:

As shown in the HDFS posts, a very large dataset is physically stored in HDFS as blocks. Mappers do not process physical blocks directly, however. Instead, InputSplits give the Hadoop Mappers a logical view of the physically stored data.

An InputSplit is the logical representation of data. It describes the unit of work that is processed by a single map task in a MapReduce program, and it is created by the InputFormat. By default, FileInputFormat breaks a file into 128 MB chunks (the same as the default HDFS block size), and the framework assigns one split to each map task. An InputSplit does not contain the input data; it is just a reference to the data (for a file: the path, a start offset, and a length).
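To make "a reference to the data" concrete, here is a rough Python sketch of how a file could be cut into (offset, length) splits. It follows my understanding of FileInputFormat's logic (split size = max(minSize, min(maxSize, blockSize)), plus a 10% slop so the last split is not tiny), but it is a simplified model rather than the actual Hadoop code; the function names are hypothetical and the file is assumed to be splittable.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumption: the last split may be up to 10% larger than split_size

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # With default min/max settings this is simply the block size.
    return max(min_size, min(max_size, block_size))

def compute_splits(file_length, block_size=128 * MB, min_size=1, max_size=2**63 - 1):
    """Return a list of (start_offset, length) pairs; no file data is read."""
    split_size = compute_split_size(block_size, min_size, max_size)
    splits = []
    remaining = file_length
    # Cut full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_length - remaining, split_size))
        remaining -= split_size
    # Whatever is left becomes the final (possibly slightly oversized) split.
    if remaining > 0:
        splits.append((file_length - remaining, remaining))
    return splits

splits_300 = compute_splits(300 * MB)  # two full splits plus a 44 MB tail
splits_130 = compute_splits(130 * MB)  # within the slop, so a single split
```

Note that each split is only a pair of numbers describing where to read; the bytes themselves stay in HDFS until a map task reads them.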

2. RecordReader:

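The RecordReader determines how the data in an InputSplit is passed to the Map function as key-value pairs. The RecordReader instance is defined by the InputFormat. The default InputFormat is TextInputFormat, which uses LineRecordReader to convert each line into a key-value pair: the key is the byte offset of the line in the file, and the value is the content of the line. Other InputFormats supply their own readers; for example, SequenceFileInputFormat uses SequenceFileRecordReader.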
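Because splits are cut at byte offsets, a split boundary can fall in the middle of a line. The sketch below shows one way a line-oriented reader can still deliver every line exactly once, modeled on how I understand LineRecordReader to behave: a split that does not start at offset 0 skips its first (possibly partial) line, and every split keeps reading until it has finished the line that crosses its end. The function `read_records` is hypothetical, works on an in-memory `bytes` object, and only handles `\n` line endings.

```python
def read_records(data: bytes, start: int, length: int):
    """Return (byte_offset, line_text) records for one split of `data`."""
    end = start + length
    pos = start
    if start != 0:
        # The first line belongs to the previous split (which reads past
        # its own end to finish it), so skip to just after the next newline.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # A line that starts at or before `end` is still ours to read.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append((pos, data[pos:line_end].decode("utf-8")))
        pos = line_end + 1
    return records

data = b"alpha\nbeta\ngamma\ndelta\n"
split_1 = read_records(data, 0, 8)    # boundary falls inside "beta"
split_2 = read_records(data, 8, 8)
split_3 = read_records(data, 16, 7)
```

Here the first split ends inside "beta" but still returns the whole line, while the second split skips the tail of "beta" and starts with "gamma".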

References:

https://hadoopabcd.wordpress.com/2015/03/10/hdfs-file-block-and-input-split/

https://en.wikipedia.org/wiki/MapReduce

https://data-flair.training/blogs/shuffling-and-sorting-in-hadoop/

https://zhuanlan.zhihu.com/p/34849261

https://www.edureka.co/blog/mapreduce-tutorial/
