Request received to kill task 'attempt_201411191723_2827635_r_000009_0' by user
-------
Task has been KILLED_UNCLEAN by the user. The possible causes are:
1. An impatient user (armed with the "mapred job -kill-task" command)
2. JobTracker (to kill a speculative duplicate, or when a whole job fails)
3. Fair Scheduler (but diplomatically, it calls it “preemption”)

An English-language article tells the story in more detail:

This is one of the most bloodcurdling stories (and one of my favorites) that we have recently seen in our 190-square-meter Hadoopland. In a nutshell, some jobs were running surprisingly long, because thousands of their tasks were constantly being killed for some unknown reason by someone (or something). For example, a photo taken by our detectives shows a job that had been running for 12hrs:20min and had spawned around 13,000 tasks by that moment. However, only 4,118 map tasks had finished successfully, while 8,708 had been killed (!) and, surprisingly, only 1 task had failed (?), obviously spreading panic in the Hadoopland. When murdering, the killer left the same message each time: "KILLED_UNCLEAN by the user" (however, even our uncle competitor Google does not know much about what exactly it means ;)). Who is “the user”? Does the killer want to impersonate someone?

More Traces Of Crime

The detectives started looking for more traces of crime. They noticed that the killed tasks belonged to ad-hoc Hive queries, which are quite resource-intensive. Looking at timestamps in the log files from JobTracker, TaskTracker and the map tasks, they figured out that JobTracker had received a request to murder the tasks… They also noticed that tasks were usually killed young, soon after they started (within 6-16 minutes), while the surviving tasks ran fine for long hours. The killer is unscrupulous!

Killer’s Identity

Who can actually send a kill request to JobTracker to murder thousands of tasks? The detectives quickly selected three main candidates:
  • An impatient user (armed with "mapred job -kill-task" command)
  • JobTracker (to kill a speculative duplicate, or when a whole job fails)
  • Fair Scheduler (but diplomatically, it calls it “preemption”)
When looking at log messages saying that a task is "KILLED_UNCLEAN by the user", one could think that some user is the prime candidate to be the serial killer. However, the citizens of our Hadoopland are friendly, patient and respectful of others, so it would be unfair to assume that somebody killed, in cold blood, 8,708 tasks from a single job. JobTracker also seems to have a good alibi, because the job itself had not failed yet and speculative execution was disabled (surprisingly, Hive has its own setting, hive.mapred.reduce.tasks.speculative.execution, for disabling speculative execution of reduce tasks, and it is not overridden by Hadoop’s mapred.reduce.tasks.speculative.execution).
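
For anyone who wants to check a similar alibi, the settings could look roughly like the sketch below. It is only an illustration: the two reduce-side property names come from the text above, mapred.map.tasks.speculative.execution is Hadoop 1's map-side counterpart, and the split between mapred-site.xml and hive-site.xml is an assumption about a typical setup, not a copy of the cluster's actual configuration.

```xml
<!-- mapred-site.xml (assumed location): Hadoop's own speculative execution switches -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>

<!-- hive-site.xml (assumed location): Hive's separate switch for reduce tasks -->
<property>
  <name>hive.mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```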

FairScheduler Accused

For some company-specific reasons, ad-hoc Hive queries run as the hive user in our Hadoopland. Moreover, FairScheduler is configured with the default value of mapred.fairscheduler.poolnameproperty (which is user.name), so pools are created dynamically based on the username of the user submitting the job to the cluster (“hive” in the case of our ad-hoc Hive queries). One of the detectives suddenly remembered, from a presentation about Hadoop browsed two years earlier, that FairScheduler usually preempts the newest tasks in an over-share pool to forcibly make room for starved pools. Eureka! ;) At that moment everything became clear, and a quick look at the FairScheduler web page confirmed it. The “Hive” pool had been running over its minimum and fair shares for a long time, while the other pools were constantly running under their minimum and fair shares. In such a case, Fair Scheduler kills Hive tasks from time to time to reassign slots to tasks from other pools.
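
A JobTracker configuration that produces this behaviour might look like the fragment below. Treat it as a sketch under assumptions: the property names are real Hadoop 1 Fair Scheduler settings, but the file shown is illustrative rather than the cluster's actual mapred-site.xml, and it assumes that preemption (mapred.fairscheduler.preemption) had been switched on explicitly.

```xml
<!-- mapred-site.xml: illustrative values, not the cluster's actual file -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <!-- Default value: one pool per submitting user, hence the "hive" pool -->
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>user.name</value>
</property>
<property>
  <!-- Assumed to be enabled; without it the scheduler does not kill tasks -->
  <name>mapred.fairscheduler.preemption</name>
  <value>true</value>
</property>
```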

Less Violence, More Peace

Having the evidence, we could put Fair Scheduler in prison and use Capacity Scheduler instead. Maybe in the future we will do that! Today, we believe that Fair Scheduler did not commit the crimes intentionally – we feel that we educated it badly and gave it too much power. Today, Fair Scheduler gets a suspended sentence – we want to give it a chance to rehabilitate and become more friendly and less aggressive…

How to dignify the personality of Fair Scheduler? Obviously, tuning settings like minSharePreemptionTimeout, fairSharePreemptionTimeout, minMaps and minReduces based on the current workload could be a good way to control how aggressively Fair Scheduler preempts. Easier said than done, because it requires a deep understanding of your workload (which may or may not change later).

There is a setting called mapred.fairscheduler.preemption that disables or enables preemption. However, disabling preemption (or rather killing, to be precise) would, in our case, solve the problem only partially, because this issue exposed another problem in the Hadoopland: ad-hoc Hive queries are overloading the cluster. In the end, we did not disable preemption, because we were a bit worried about SLAs not being enforced without “any” preemption. Having said that, the two problems to solve are:
  • stop the mass killing of Hive tasks
  • stop ad-hoc Hive queries from overloading the cluster
We simply limited the number of map and reduce tasks that Fair Scheduler can run in the Hive pool (by setting maxMaps and maxReduces for that pool). As a consequence, the Hive pool cannot contain too many tasks, so Fair Scheduler cannot kill too many of them ;) (because the Hive pool will not operate far above its minimum and fair share levels). Limiting the number of tasks also prevents Hive queries from overloading the cluster (additionally, one could set the maximum number of concurrent jobs running in the Hive pool using maxRunningJobs). A nice thing to say is that Fair Scheduler is eager to cooperate, because changing FairScheduler’s allocation file does not require restarting JobTracker. This file is automatically polled for changes every 10 seconds, and if it has changed, it is reloaded and the pool configurations are updated on the fly. Thanks to that, you can easily learn and change the personality of Fair Scheduler. ;)
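
The limits described above live in the Fair Scheduler allocation file. A minimal sketch of such a file follows; every number is a placeholder rather than a value from our cluster, and the "production" pool is a hypothetical name added only to show where the preemption timeouts and minimum shares would go.

```xml
<?xml version="1.0"?>
<!-- Fair Scheduler allocation file: all numbers are placeholders -->
<allocations>
  <pool name="hive">
    <!-- Cap the slots the pool can occupy, so it cannot run far above its share -->
    <maxMaps>200</maxMaps>
    <maxReduces>100</maxReduces>
    <!-- Optional: cap the number of concurrently running jobs in the pool -->
    <maxRunningJobs>10</maxRunningJobs>
  </pool>
  <!-- Hypothetical pool that needs guaranteed capacity -->
  <pool name="production">
    <minMaps>100</minMaps>
    <minReduces>50</minReduces>
    <!-- Seconds spent below the minimum share before the pool may preempt others -->
    <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
  </pool>
  <!-- Seconds spent below half of the fair share before a pool may preempt others -->
  <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
</allocations>
```

Because the file is re-read on the fly, values like these can be adjusted gradually while watching the FairScheduler web page, instead of being guessed once and left alone.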