Short Description:

This article covers the basic health checks to perform when troubleshooting slow ZooKeeper performance.

Article

ZooKeeper is one of the most critical components in an HDP cluster, yet it usually gets the least attention when a cluster is tuned for performance or when slowness is being investigated. Here is a basic health checklist to go through to confirm that ZooKeeper is running well.

Let's keep the zookeeper happy to be able to better manage the occupants of the zoo :)

1. Do all the ZooKeeper servers have dedicated disks for the transaction log directory ('dataLogDir', or 'dataDir' when 'dataLogDir' is not set)?

It is very important to have fast disks so that the 'fsync' of new transactions to the log completes quickly, because ZooKeeper writes each transaction to this log before the update is applied and before a response is sent back to the client. A slow 'fsync' of the transaction log is one of the most common causes of slow ZooKeeper responses. ZooKeeper's disk space requirement is usually modest, so one might wonder whether it is worth dedicating a complete disk to the transaction log directory, but a dedicated disk is needed to keep I/O from other applications and processes from competing for the same device.

Some common symptoms of slow writes to the transaction log are:

  • Services such as the NameNode ZKFC and HBase RegionServers, which use ephemeral znodes to track their liveness, shut down after repeated ZooKeeper connection timeouts.
  • The ZooKeeper server log frequently reports warnings such as:

WARN [SyncThread:2:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:2 took 7050ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
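
To see how often and how badly this happens, the durations can be pulled out of the server log. The following is a minimal sketch that assumes the warning keeps the wording quoted above; the function name `slow_fsyncs` and the `threshold_ms` parameter are illustrative and not part of ZooKeeper.

```python
import re

# Matches the duration in the FileTxnLog warning quoted above.
FSYNC_RE = re.compile(r"fsync-ing the write ahead log in \S+ took (\d+)ms")


def slow_fsyncs(log_lines, threshold_ms=1000):
    """Return fsync durations (ms) at or above threshold_ms found in log_lines.

    threshold_ms is a hypothetical filter for this sketch, not a ZooKeeper setting.
    """
    durations = []
    for line in log_lines:
        match = FSYNC_RE.search(line)
        if match:
            took_ms = int(match.group(1))
            if took_ms >= threshold_ms:
                durations.append(took_ms)
    return durations
```

Feeding it the lines of a ZooKeeper server log (for example via `open(path)`) gives a quick list of the worst stalls, which can then be lined up against the times when clients reported timeouts.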

2. Is the ZooKeeper process given enough heap memory for the number of znodes, clients and watches it has to handle?

To arrive at the right ZooKeeper heap size, one has to run load tests and estimate the required heap from the results. An undersized heap hurts performance because the JVM goes through very frequent GC cycles once heap usage approaches 100% of the allocated heap. The following four letter ZooKeeper commands provide a lot of useful information about the running ZooKeeper instances:

  1. # echo 'stat' | nc <ZK_HOST> 2181
  2. # echo 'mntr' | nc <ZK_HOST> 2181

In the output of the above commands, watch statistics such as the znode count, number of watches, number of client connections and max/avg latency, among others. In most cases a heap size between 2GB and 4GB should be adequate, but as mentioned above, this depends on the kind of load on ZooKeeper. In addition to the 'four letter' commands, it is also recommended to keep an eye on heap growth and GC activity, especially during periods of slowness, using tools such as:
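
For repeated checks, the same four letter commands can be sent from a script instead of 'nc'. The sketch below assumes the usual 'mntr' output of one 'key<TAB>value' pair per line; the helper names `four_letter_word` and `parse_mntr` are made up for this example.

```python
import socket


def four_letter_word(host, command, port=2181, timeout=5.0):
    """Send a four letter command (e.g. 'mntr') and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn 'mntr' output into a dict of {stat name: value as string}."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip lines that are not key<TAB>value pairs
            stats[key.strip()] = value.strip()
    return stats
```

A call such as `parse_mntr(four_letter_word("<ZK_HOST>", "mntr"))` then exposes values like `zk_znode_count`, `zk_watch_count`, `zk_num_alive_connections` and `zk_max_latency` for logging or alerting.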

  1. # sudo -u zookeeper jmap -heap <ZK_PID>
  2. # sudo -u zookeeper jstat -gcutil <ZK_PID> 2000 10
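
When 'jstat -gcutil' output is captured to a file, a small parser makes it easier to spot an old generation that stays nearly full or a full GC count that keeps climbing. This is only a sketch: it assumes the usual layout of one header row followed by one row per sample, and the function names are invented for illustration.

```python
def parse_gcutil(text):
    """Parse 'jstat -gcutil' output into a list of {column name: float} rows."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    samples = []
    for row in rows:
        sample = {}
        for name, value in zip(header, row):
            try:
                sample[name] = float(value)
            except ValueError:
                sample[name] = None  # non-numeric cell
        samples.append(sample)
    return samples


def full_gc_delta(samples):
    """Number of full GCs between the first and last sample (FGC column)."""
    if len(samples) < 2:
        return 0
    return int(samples[-1]["FGC"] - samples[0]["FGC"])
```

Several full GCs within a 20 second sampling window, together with an 'O' (old generation) value that stays close to 100, would suggest that the heap is too small for the current load.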

3. Are there too many zookeepers in the ensemble ?

Three ZooKeeper servers is the minimum recommended size for an ensemble, and in most cases three are enough. A larger ensemble gives more reliability (a 7-node ensemble can withstand the loss of 3 nodes, whereas a 3-node ensemble tolerates the loss of only 1) and better read throughput when a large number of clients are connected concurrently. However, it can slow down write operations, since every update must be acknowledged by a majority (more than half) of the voting nodes in the ensemble before it is committed.
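
The trade-off follows directly from majority quorum arithmetic, sketched below. The function names are illustrative only.

```python
def quorum_size(ensemble_size):
    """Smallest number of voting servers that forms a majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Servers that can be lost while a majority is still available."""
    return (ensemble_size - 1) // 2


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(n, quorum_size(n), tolerated_failures(n))
```

For a 3-node ensemble a write needs 2 acknowledgements and 1 failure is tolerated; for 7 nodes it needs 4 acknowledgements and 3 failures are tolerated. An even-sized ensemble adds write cost without adding tolerance: 4 nodes still tolerate only 1 failure.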

Some ways to avoid the slower writes that come with larger ensembles are:

  1. Use a dedicated ZooKeeper ensemble for certain workloads in the cluster
  2. For larger ensembles, use ZooKeeper observers - Ref. http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html (although configuring observers is not supported in the current Ambari version as of this writing).

4. Are the 'dataDir' / 'dataLogDir' filling up too fast ?

As mentioned above, every transaction sent to ZooKeeper is written to the transaction log file. When a large number of concurrent ZK clients connect continuously and make very frequent updates, possibly due to an error condition at the client, the steadily growing transaction log can get rolled over multiple times a minute, which also produces a large number of snapshot files. This can eventually cause the disks to run out of free space.

For such issues, one has to identify and fix the client application. Review the stats from above, in addition to the ZooKeeper logs and/or the latest transaction log, to find the latest updates on the znodes using the 'LogFormatter' tool:

  1. # java -cp '/usr/hdp/current/zookeeper-server/*:/usr/hdp/current/zookeeper-server/lib/*' org.apache.zookeeper.server.LogFormatter /hadoop/zookeeper/version-2/log.xxxxx

Further, the ZooKeeper properties 'autopurge.snapRetainCount' (the number of most recent snapshots, with their transaction logs, to keep) and 'autopurge.purgeInterval' (the interval between purge runs, in hours) have to be tuned to limit the growing number of transaction log and snapshot files.
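
To check how quickly files are accumulating, the transaction log and snapshot files modified within a recent window can be listed. This sketch assumes the standard 'log.<zxid>' and 'snapshot.<zxid>' file names under the 'version-2' directory; the function name and the `window_seconds` parameter are hypothetical.

```python
import os
import time


def recent_files(data_dir, prefix, window_seconds=60, now=None):
    """Names of files in data_dir that start with prefix and were modified
    within the last window_seconds."""
    now = time.time() if now is None else now
    recent = []
    for entry in os.scandir(data_dir):
        if entry.is_file() and entry.name.startswith(prefix):
            if now - entry.stat().st_mtime <= window_seconds:
                recent.append(entry.name)
    return sorted(recent)
```

For example, `recent_files("/hadoop/zookeeper/version-2", "log.")` returning more than one or two names suggests that the transaction log is being rolled several times a minute, which matches the runaway-client situation described above.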
