Hadoop中决定map个数的的因素有几个,由于版本的不同,决定因素也不一样,掌握这些因素对了解hadoop分片的划分有很大帮助,

并且对优化hadoop性能也很有大的益处。

旧API中getSplits方法:

 public InputSplit[] getSplits(JobConf job, int numSplits)
throws IOException {
FileStatus[] files = listStatus(job); // Save the number of input files in the job-conf
job.setLong(NUM_INPUT_FILES, files.length);
long totalSize = 0; // compute total size
for (FileStatus file: files) { // check we have valid files
if (file.isDir()) {
throw new IOException("Not a file: "+ file.getPath());
}
totalSize += file.getLen();
} long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong("mapred.min.split.size", 1),
minSplitSize); // generate splits
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();
for (FileStatus file: files) {
Path path = file.getPath();
FileSystem fs = path.getFileSystem(job);
long length = file.getLen();
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
if ((length != 0) && isSplitable(fs, path)) {
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(goalSize, minSize, blockSize); long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
String[] splitHosts = getSplitHosts(blkLocations,
length-bytesRemaining, splitSize, clusterMap);
splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
splitHosts));
bytesRemaining -= splitSize;
} if (bytesRemaining != 0) {
splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
blkLocations[blkLocations.length-1].getHosts()));
}
} else if (length != 0) {
String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
splits.add(new FileSplit(path, 0, length, splitHosts));
} else {
//Create empty hosts array for zero length files
splits.add(new FileSplit(path, 0, length, new String[0]));
}
}
LOG.debug("Total # of splits: " + splits.size());
return splits.toArray(new FileSplit[splits.size()]);
} protected long computeSplitSize(long goalSize, long minSize,
long blockSize) {
return Math.max(minSize, Math.min(goalSize, blockSize));
}

新API中getSplits方法:

 public List<InputSplit> getSplits(JobContext job
) throws IOException {
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job); // generate splits
List<InputSplit> splits = new ArrayList<InputSplit>();
List<FileStatus>files = listStatus(job);
for (FileStatus file: files) {
Path path = file.getPath();
FileSystem fs = path.getFileSystem(job.getConfiguration());
long length = file.getLen();
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
if ((length != 0) && isSplitable(job, path)) {
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(blockSize, minSize, maxSize); long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
blkLocations[blkIndex].getHosts()));
bytesRemaining -= splitSize;
} if (bytesRemaining != 0) {
splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
blkLocations[blkLocations.length-1].getHosts()));
}
} else if (length != 0) {
splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
} else {
//Create empty hosts array for zero length files
splits.add(new FileSplit(path, 0, length, new String[0]));
}
} // Save the number of input files in the job-conf
job.getConfiguration().setLong(NUM_INPUT_FILES, files.size()); LOG.debug("Total # of splits: " + splits.size());
return splits;
} protected long computeSplitSize(long blockSize, long minSize,
long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}

测试一个输入文件大小为:0.52 KB 日志如下:

new :
blockSize:67108864 minSize:1 maxSize:9223372036854775807
splitSize:67108864

决定因素为 blockSize的大小.这个很容易理解

old:
blockSize:67108864 totalSize:529 numSplits:2 goalSize:264 minSplitSize:1 minSize:1
splitSize:264

numSplits为2,这个是在调用getSplits中传入的,这个地方要注意,经过查找发现这个参数为job.getNumMapTasks()的值如下

JobConf: public int getNumMapTasks() { return getInt("mapred.map.tasks", 1); }

mapred-default.xml中:

<property>
<name>mapred.map.tasks</name>
<value>2</value>
<description>The default number of map tasks per job.
Ignored when mapred.job.tracker is "local".
</description>
</property>

所以使用旧的API编写的MP程序,会产生2个map,而使用新的API则会产生1个map.

最新文章

  1. new(C# 参考)
  2. 百度地图JavaScript API覆盖物旋转时出现偏移
  3. 一次性搞清楚equals和hashCode
  4. 分析和解析PHP代码的7大工具
  5. [目录][总结] C++和Java 中的主要操作对比
  6. Linux下C程序的存储空间布局
  7. Ubuntu 17.04 安装
  8. C++ 默认参数(转载)
  9. Docker 生态概览
  10. Python二级-----------程序冲刺3
  11. 自托管websocket和webapi部署云服务器域名及远程访问
  12. nowcoder16450 托米的简单表示法
  13. java核心技术-(总结自杨晓峰-java核心技术36讲)
  14. Go etcd初探
  15. zombodb 聚合函数
  16. P4316 绿豆蛙的归宿(期望)
  17. appium-Could not obtain screenshot: [object Object]
  18. MyBatis连接SQL Server的关键点
  19. vscode 中使用 csscomb
  20. 【单调队列】【P3957】 跳房子

热门文章

  1. cookies封装
  2. Docker系列02: 容器生命周期管理 镜像&amp;容器
  3. 蚂蚁社招Java-第四轮电话面试【技术终面】
  4. urllib2 Handler处理器和自定义opener(六)
  5. Rhythmk 学习 Hibernate 09 - Hibernate HQL
  6. 创建标签的两种方法insertAdjacentHTML 和 createElement 创建标签 setAttribute 赋予标签类型 appendChild 插入标签
  7. CHEMISTS DISCOVER A SAFE, GREEN METHOD TO PROCESS RED PHOSPHORUS
  8. leetcode 7 reverse integer 反转整数
  9. Qt Read and Write Csv File
  10. JavaScript Math.floor() 方法