01 幂等性如此重要

Kafka作为分布式MQ,大量用于分布式系统中,如消息推送系统、业务平台系统(如结算平台),就拿结算来说,业务方作为上游把数据打到结算平台,如果一份数据被计算、处理了多次,产生的后果将会特别严重。

02 哪些因素影响幂等性

使用Kafka时,需要保证exactly-once语义。要知道在分布式系统中,出现网络分区是不可避免的,如果kafka broker 在回复ack时,出现网络故障或者是full gc导致ack timeout,producer将会重发,如何保证producer重试时不造成重复or乱序?又或者producer 挂了,新的producer并没有old producer的状态数据,这个时候如何保证幂等?即使Kafka 发送消息满足了幂等,consumer拉取到消息后,把消息交给线程池workers,workers线程对message的处理可能包含异步操作,又会出现以下情况:

  • 先commit,再执行业务逻辑:提交成功,处理失败 。造成丢失
  • 先执行业务逻辑,再commit:提交失败,执行成功。造成重复执行
  • 先执行业务逻辑,再commit:提交成功,异步执行fail。造成丢失

本文将针对以上问题作出讨论

03 Kafka保证发送幂等性

      针对以上的问题,kafka在0.11版新增了幂等型producer和事务型producer。前者解决了单会话幂等性等问题,后者解决了多会话幂等性。

单会话幂等性

    为解决producer重试引起的乱序和重复。Kafka增加了pid和seq。Producer中每个RecordBatch都有一个单调递增的seq; Broker上每个tp也会维护pid-seq的映射,并且每Commit都会更新lastSeq。这样recordBatch到来时,broker会先检查RecordBatch再保存数据:如果batch中 baseSeq(第一条消息的seq)比Broker维护的序号(lastSeq)大1,则保存数据,否则不保存(inSequence方法)。

ProducerStateManager.scala

private def maybeValidateAppend(producerEpoch: Short, firstSeq: Int, offset: Long): Unit = {
validationType match {
case ValidationType.None => case ValidationType.EpochOnly =>
checkProducerEpoch(producerEpoch, offset) case ValidationType.Full =>
checkProducerEpoch(producerEpoch, offset)
checkSequence(producerEpoch, firstSeq, offset)
}
} private def checkSequence(producerEpoch: Short, appendFirstSeq: Int, offset: Long): Unit = {
if (producerEpoch != updatedEntry.producerEpoch) {
if (appendFirstSeq != 0) {
if (updatedEntry.producerEpoch != RecordBatch.NO_PRODUCER_EPOCH) {
throw new OutOfOrderSequenceException(s"Invalid sequence number for new epoch at offset $offset in " +
s"partition $topicPartition: $producerEpoch (request epoch), $appendFirstSeq (seq. number)")
} else {
throw new UnknownProducerIdException(s"Found no record of producerId=$producerId on the broker at offset $offset" +
s"in partition $topicPartition. It is possible that the last message with the producerId=$producerId has " +
"been removed due to hitting the retention limit.")
}
}
} else {
val currentLastSeq = if (!updatedEntry.isEmpty)
updatedEntry.lastSeq
else if (producerEpoch == currentEntry.producerEpoch)
currentEntry.lastSeq
else
RecordBatch.NO_SEQUENCE if (currentLastSeq == RecordBatch.NO_SEQUENCE && appendFirstSeq != 0) {
// We have a matching epoch, but we do not know the next sequence number. This case can happen if
// only a transaction marker is left in the log for this producer. We treat this as an unknown
// producer id error, so that the producer can check the log start offset for truncation and reset
// the sequence number. Note that this check follows the fencing check, so the marker still fences
// old producers even if it cannot determine our next expected sequence number.
throw new UnknownProducerIdException(s"Local producer state matches expected epoch $producerEpoch " +
s"for producerId=$producerId at offset $offset in partition $topicPartition, but the next expected " +
"sequence number is not known.")
} else if (!inSequence(currentLastSeq, appendFirstSeq)) {
throw new OutOfOrderSequenceException(s"Out of order sequence number for producerId $producerId at " +
s"offset $offset in partition $topicPartition: $appendFirstSeq (incoming seq. number), " +
s"$currentLastSeq (current end sequence number)")
}
}
} private def inSequence(lastSeq: Int, nextSeq: Int): Boolean = {
nextSeq == lastSeq + 1L || (nextSeq == 0 && lastSeq == Int.MaxValue)
}

引申:Kafka producer 对有序性做了哪些处理

     假设我们有5个请求,batch1、batch2、batch3、batch4、batch5;如果只有batch2 ack failed,3、4、5都保存了,那2将会随下次batch重发而造成重复。我们可以设置max.in.flight.requests.per.connection=1(客户端在单个连接上能够发送的未响应请求的个数)来解决乱序,但降低了系统吞吐。

新版本kafka设置enable.idempotence=true后能够动态调整max-in-flight-request。正常情况下max.in.flight.requests.per.connection大于1。当重试请求到来且时,batch 会根据 seq重新添加到队列的合适位置,并把max.in.flight.requests.per.connection设为1,这样它 前面的 batch序号都比它小,只有前面的都发完了,它才能发。

    private void insertInSequenceOrder(Deque<ProducerBatch> deque, ProducerBatch batch) {
// When we are requeing and have enabled idempotence, the reenqueued batch must always have a sequence.
if (batch.baseSequence() == RecordBatch.NO_SEQUENCE)
throw new IllegalStateException("Trying to re-enqueue a batch which doesn't have a sequence even " +
"though idempotency is enabled."); if (transactionManager.nextBatchBySequence(batch.topicPartition) == null)
throw new IllegalStateException("We are re-enqueueing a batch which is not tracked as part of the in flight " +
"requests. batch.topicPartition: " + batch.topicPartition + "; batch.baseSequence: " + batch.baseSequence()); ProducerBatch firstBatchInQueue = deque.peekFirst();
if (firstBatchInQueue != null && firstBatchInQueue.hasSequence() && firstBatchInQueue.baseSequence() < batch.baseSequence()) { List<ProducerBatch> orderedBatches = new ArrayList<>();
while (deque.peekFirst() != null && deque.peekFirst().hasSequence() && deque.peekFirst().baseSequence() < batch.baseSequence())
orderedBatches.add(deque.pollFirst()); log.debug("Reordered incoming batch with sequence {} for partition {}. It was placed in the queue at " +
"position {}", batch.baseSequence(), batch.topicPartition, orderedBatches.size());
// Either we have reached a point where there are batches without a sequence (ie. never been drained
// and are hence in order by default), or the batch at the front of the queue has a sequence greater
// than the incoming batch. This is the right place to add the incoming batch.
deque.addFirst(batch); // Now we have to re insert the previously queued batches in the right order.
for (int i = orderedBatches.size() - 1; i >= 0; --i) {
deque.addFirst(orderedBatches.get(i));
} // At this point, the incoming batch has been queued in the correct place according to its sequence.
} else {
deque.addFirst(batch);
}
}

多会话幂等性

     在单会话幂等性中介绍,kafka通过引入pid和seq来实现单会话幂等性,但正是引入了pid,当应用重启时,新的producer并没有old producer的状态数据。可能重复保存。

Kafka事务通过隔离机制来实现多会话幂等性

kafka事务引入了transactionId 和Epoch,设置transactional.id后,一个transactionId只对应一个pid, 且Server 端会记录最新的 Epoch 值。这样有新的producer初始化时,会向TransactionCoordinator发送InitPIDRequest请求, TransactionCoordinator 已经有了这个 transactionId对应的 meta,会返回之前分配的 PID,并把 Epoch 自增 1 返回,这样当old producer恢复过来请求操作时,将被认为是无效producer抛出异常。 如果没有开启事务,TransactionCoordinator会为新的producer返回new pid,这样就起不到隔离效果,因此无法实现多会话幂等。

private def maybeValidateAppend(producerEpoch: Short, firstSeq: Int, offset: Long): Unit = {
validationType match {
case ValidationType.None => case ValidationType.EpochOnly =>
checkProducerEpoch(producerEpoch, offset) case ValidationType.Full => //开始事务,执行这个判断
checkProducerEpoch(producerEpoch, offset)
checkSequence(producerEpoch, firstSeq, offset)
}
} private def checkProducerEpoch(producerEpoch: Short, offset: Long): Unit = {
if (producerEpoch < updatedEntry.producerEpoch) {
throw new ProducerFencedException(s"Producer's epoch at offset $offset is no longer valid in " +
s"partition $topicPartition: $producerEpoch (request epoch), ${updatedEntry.producerEpoch} (current epoch)")
}
}

04 Consumer端幂等性

如上所述,consumer拉取到消息后,把消息交给线程池workers,workers对message的handle可能包含异步操作,又会出现以下情况:

  • 先commit,再执行业务逻辑:提交成功,处理失败 。造成丢失
  • 先执行业务逻辑,再commit:提交失败,执行成功。造成重复执行
  • 先执行业务逻辑,再commit:提交成功,异步执行fail。造成丢失

对此我们常用的方法时,works取到消息后先执行如下code:

if(cache.contain(msgId)){
// cache中包含msgId,已经处理过
continue;
}else {
lock.lock();
cache.put(msgId,timeout);
commitSync();
lock.unLock();
}
// 后续完成所有操作后,删除cache中的msgId,只要msgId存在cache中,就认为已经处理过。Note:需要给cache设置有消息

如果喜欢我的文章,请长按二维码,关注靳刚同学

您的转发是对我最大的支持,谢谢!

最新文章

  1. 使用java读取文件夹中文件的行数
  2. ecshop 实现购物车退出不清空
  3. 从零开始学习Linux(mkdir and rmdir)
  4. nginx 配置 ThinkPHP Rewrite
  5. UISearchController的使用。(iOS8+)
  6. 转--DataTable 修改列名 删除列 调整列顺序
  7. 阿里云OSS Multipart Upload上传实例
  8. C# .net基于Http实现web server(web服务)
  9. jquery选择器项目实例分析
  10. 三、原子变量与CAS算法
  11. Navicat Premium for Mac 破解版地址
  12. Latex graphicx 宏包 scalebox命令
  13. Linux内核之旅
  14. TIMER门控模式控制PWM输出长度
  15. JavaScript 添加新元素
  16. sock基础编程介绍
  17. 探讨ES6的import export default 和CommonJS的require module.exports
  18. House of Roman 实战
  19. 使用 JLINK 的 RTT 功能 进行 调试打印数据
  20. rc_80 tomcat 日志

热门文章

  1. HDU 3879 &amp;&amp; BZOJ 1497:Base Station &amp;&amp; 最大获利 (最大权闭合图)
  2. C# 中的委托和事件本质讲解
  3. kuangbin专题 专题一 简单搜索 Fire Game FZU - 2150
  4. C语言学习书籍推荐《C语言程序设计 现代方法(第2版)》下载
  5. .Net Core 使用Http请求及基于 Polly 的处理故障
  6. Java编程思想:内部类基础部分
  7. 五分钟了解Zabbix
  8. Android使用webService(发送xml数据的方式,不使用jar包)
  9. Error:too many padding sections on bottom border.
  10. liunx基本操作操作与文件和目录的管理