Data deduplication provides a new approach to store data and eliminate duplicate data in chunk level.
A typical data deduplication workflow can be explained like this.

File metadata describes how to restore the file use unique chunks.

Chunk level deduplication approach has five key stages.

  1. Chunking
  2. Fingerprinting
  3. Indexing fringerprints
  4. Further compression
  5. Storage management

Different Stage has its own challenges, which may become the bottleneck for restoring file or compressing files.

Chunking

At the chunking stage, we should split the data stream into chunks, which can be presented at the fingerprints.
The different method splitting data streaming has different result and different efficiency.

The splitting method can be divided into two categories:
Fixed Size Chunking, which just split the data stream into fixed size chunk, simply and easily .
Content Defined Chunking, which split the data into variable size chunk, depending on the content.

Although fixed size chunking is simple and quick, the biggest problem is Boundary Shift. Boundary Shift Problem is when little part of data stream is modified, all the subsequent chunks will be changed, because of the boundary is shifted.
Content defined chunking uses a sliding-window technique on the content of data stream and computes a hash value of the window. If the hash value is satisfied some predefined conditions, it will generate a chunk.

Chunk's size can also be optimized. If we use CDC (content defined chunking), the size of chunking can not be in charge. On some extremely condition, it will generate too large or too small chunking. If a chunking is too large, the compression ratio will decrease. Because the large chunk can hide duplicates from being detected. If a chunking is too small, the file metadata will increase. What's more, it can cause indexing fingerprints problem. So we can define the max and min chunk size.

Chunking still has some problems such as how to detect the deduplicate accurately, how to accelerate computing time cost.

最新文章

  1. wpf listview
  2. HTTP 传输内容的压缩
  3. 【CLR Via C#】第5章 基元类型、引用类型、值类型
  4. 初识你---------Swift【下篇】
  5. (转)深入浅出 iOS 之生命周期
  6. css 垂直同步的几种方式
  7. JPA 2.1实例(hibernate 实现)
  8. 在ubuntu安装Phabricator(转)
  9. Java基本语法-----java进制的转换
  10. svn 部署
  11. asp.net WebService如何去掉asmx后缀
  12. Redisson实现分布式锁(二)
  13. mysql join left join区别
  14. input不记录之前输入的值
  15. oc的静态函数static
  16. hdu 1158 dp Employment Planning
  17. [启动]Linux启动流程rcN.d rcS.d rc.local等
  18. 读《分布式一致性原理》JAVA客户端API操作
  19. easyui引入
  20. inflate

热门文章

  1. 分库分表(4) ---SpringBoot + ShardingSphere 实现分表
  2. css浮动产生和清除浮动的几种方式
  3. 易错、经典问题:return不可返回指向栈内存的指针
  4. PMP 德尔菲技术
  5. [51nod1670] 打怪兽
  6. 第三篇-分析日志和sensor-data中的数据结构
  7. 数据结构1_C---单链表的逆转
  8. 基于Spring Boot的统一异常处理设计
  9. go-关键字-变量
  10. vue —— 监听