folly/ThreadCachedInt.h

High-performance atomic increment using thread caching.

folly/ThreadCachedInt.h introduces a integer class designed for high performance increments from multiple threads simultaneously without loss of precision. It has two read modes, readFast gives a potentially stale value with one load, and readFull gives the exact value, but is much slower, as discussed below.

Performance


Increment performance is up to 10x greater than std::atomic_fetch_add in high contention environments. See folly/test/ThreadCachedIntTest.h for more comprehensive benchmarks.

readFast is as fast as a single load.

readFull, on the other hand, requires acquiring a mutex and iterating through a list to accumulate the values of all the thread local counters, so is significantly slower than readFast.

Usage


Create an instance and increment it with increment or the operator overloads. Read the value with readFast for quick, potentially stale data, or readFull for a more expensive but precise result. There are additional convenience functions as well, such as set.

    ThreadCachedInt<int64_t> val;
EXPECT_EQ(, val.readFast());
++val; // increment in thread local counter only
EXPECT_EQ(, val.readFast()); // increment has not been flushed
EXPECT_EQ(, val.readFull()); // accumulates all thread local counters
val.set();
EXPECT_EQ(, val.readFast());
EXPECT_EQ(, val.readFull());

Implementation


folly::ThreadCachedInt uses folly::ThreadLocal to store thread specific objects that each have a local counter. When incrementing, the thread local instance is incremented. If the local counter passes the cache size, the value is flushed to the global counter with an atomic increment. It is this global counter that is read with readFast via a simple load, but will not count any of the updates that haven't been flushed.

In order to read the exact value, ThreadCachedInt uses the extended readAllThreads() API of folly::ThreadLocal to iterate through all the references to all the associated thread local object instances. This currently requires acquiring a global mutex and iterating through the references, accumulating the counters along with the global counter. This also means that the first use of the object from a new thread will acquire the mutex in order to insert the thread local reference into the list. By default, there is one global mutex per integer type used in ThreadCachedInt. If you plan on using a lot of ThreadCachedInts in your application, considering breaking up the global mutex by introducing additional Tag template parameters.

set simply sets the global counter value, and marks all the thread local instances as needing to be reset. When iterating with readFull, thread local counters that have been marked as reset are skipped. When incrementing, thread local counters marked for reset are set to zero and unmarked for reset.

Upon destruction, thread local counters are flushed to the parent so that counts are not lost after increments in temporary threads. This requires grabbing the global mutex to make sure the parent itself wasn't destroyed in another thread already.

Alternate Implementations


There are of course many ways to skin a cat, and you may notice there is a partial alternate implementation in folly/test/ThreadCachedIntTest.cpp that provides similar performance. ShardedAtomicInt simply uses an array ofstd::atomic<int64_t>'s and hashes threads across them to do low-contention atomic increments, and readFull just sums up all the ints.

This sounds great, but in order to get the contention low enough to get similar performance as ThreadCachedInt with 24 threads, ShardedAtomicInt needs about 2000 ints to hash across. This uses about 20x more memory, and the lock-freereadFull has to sum up all 2048 ints, which ends up being a about 50x slower than ThreadCachedInt in low contention situations, which is hopefully the common case since it's designed for high-write, low read access patterns. Performance of readFull is about the same speed as ThreadCachedInt in high contention environments.

Depending on the operating conditions, it may make more sense to use one implementation over the other. For example, a lower contention environment will probably be able to use a ShardedAtomicInt with a much smaller array without hurting performance, while improving memory consumption and perf of readFull.

最新文章

  1. 离线安装Cloudera Manager 5和CDH5(最新版5.1.3) 完全教程
  2. IOS开发关于测试的好的网址资源
  3. 老鸟需要知道的一些php系统类函数
  4. AppServ设置虚拟主机 及域名连接
  5. mqtt实现自动监听服务器消息
  6. 对于SQL的Join,在学习起来可能是比较乱的。我们知道,SQL的Join语法有很多inner的,有outer的,有left的,有时候,对于Select出来的结果集是什么样子有点不是很清楚。Coding Horror上有一篇文章,通过文氏图 Venn diagrams 解释了SQL的Join。我觉得清楚易懂,转过来。
  7. 一张有料的图片!!!附文件-图片合成器C语言实现算法
  8. HAOI2015 简要题解
  9. solr7.4 tomcat环境下搭建(windows)
  10. couchdb
  11. 数据库存储 datetime,时差问题
  12. RBAC用户角色权限设计方案【转载】
  13. 【刷题】LOJ 6010 「网络流 24 题」数字梯形
  14. C++多线程编程一
  15. ImportError: cannot import name wordnet
  16. ubuntu上安装mysql及导入导出
  17. MITx: 6.00.1x Introduction to Computer Science and Programming Using Python Week 2: Simple Programs 4. Functions
  18. C#windows窗体应用程序如何自适应大小
  19. PHP+phpMyAdmin编程插入数据显示中文乱码的问题
  20. RxJava2 方法总结

热门文章

  1. hdu KiKi&#39;s K-Number 主席树
  2. codeforces 98 div2 C.History 水题
  3. bin log、redo log、undo log和MVVC
  4. SpringBoot创建多模块方式以及打包方式
  5. SpringMVC注解@RequestMapping @RequestParam @ResponseBody 和 @RequestBody 解析
  6. HDU 2860 (模拟+并查集)
  7. 浅谈jsonp
  8. 升级OPENSSH 和 OPENSSL
  9. pycharm 教程(一)安装和首次使用
  10. please complete all spokes before continuing 提示