RecordAccumulator 1

Introduction

As covered earlier, the producer stores data in RecordAccumulator and sends it through Sender. RecordAccumulator acts as a queue that holds the data waiting to be sent to the server.

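Several producer parameters are related to RecordAccumulator:

buffer.memory
    The memory used to hold data waiting to be sent. Most of it is used by RecordAccumulator to store records.

compression.type
    The compression format.

batch.size
    The size of each batch to be sent.

linger.ms
    If a batch has not reached batch.size but has already waited for linger.ms, it is sent anyway.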
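As a rough illustration, the four settings could be supplied through a plain Properties object like this. The keys are the producer configuration names listed above; the values are placeholders picked for the example, and the class and method names are made up.

```java
import java.util.Properties;

class AccumulatorConfigExample {
    /** Builds producer properties that touch only the accumulator-related settings. */
    static Properties accumulatorProps() {
        Properties props = new Properties();
        props.put("buffer.memory", "33554432");   // total bytes available for buffering
        props.put("batch.size", "16384");         // size of one batch, in bytes
        props.put("linger.ms", "5");              // wait up to 5 ms for a batch to fill
        props.put("compression.type", "lz4");     // illustrative codec choice
        return props;
    }
}
```

In a real application these entries would sit next to the bootstrap servers and serializer settings before the Properties object is handed to KafkaProducer.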

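From the above we can roughly see what RecordAccumulator does:

1. When data comes in, allocate memory, store the data in batches, and report whether the append succeeded or failed.
2. Find the batches that meet the send conditions and hand them to Sender; when required, preserve the order in which messages are sent.
3. Flush: send out all messages.
4. Force stop: none of the remaining batches are sent.
5. Release memory: after 2, 3 or 4 completes, the memory held by the affected batches must be released.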

All of RecordAccumulator's data lives in a fixed amount of memory, so a memory pool is used to hand out buffers. That pool is the field private final BufferPool free;
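To give a feel for what such a pool does, here is a minimal sketch under stated assumptions; it is not Kafka's BufferPool. SimpleBufferPool, its fields and the IllegalStateException thrown on timeout are invented for illustration. Only the allocate(size, maxTimeToBlock) and deallocate(buffer) call shapes are taken from the append code in this article.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical, simplified pool: tracks a byte budget and blocks when it is used up. */
class SimpleBufferPool {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition memoryFreed = lock.newCondition();
    private long availableBytes;

    SimpleBufferPool(long totalBytes) {
        this.availableBytes = totalBytes;
    }

    /** Waits up to maxTimeToBlockMs for `size` bytes; throws if the wait times out. */
    ByteBuffer allocate(int size, long maxTimeToBlockMs) throws InterruptedException {
        long remainingNs = TimeUnit.MILLISECONDS.toNanos(maxTimeToBlockMs);
        lock.lock();
        try {
            while (availableBytes < size) {
                if (remainingNs <= 0)
                    throw new IllegalStateException("Failed to allocate " + size + " bytes in time");
                remainingNs = memoryFreed.awaitNanos(remainingNs);
            }
            availableBytes -= size;
            return ByteBuffer.allocate(size);
        } finally {
            lock.unlock();
        }
    }

    /** Returns a buffer's capacity to the budget and wakes up waiting threads. */
    void deallocate(ByteBuffer buffer) {
        lock.lock();
        try {
            availableBytes += buffer.capacity();
            memoryFreed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    long availableBytes() {
        lock.lock();
        try {
            return availableBytes;
        } finally {
            lock.unlock();
        }
    }
}
```

The point of the sketch is the blocking behaviour: a caller that asks for more than is currently free waits until another thread gives memory back or the timeout expires.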

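private final ConcurrentMap<TopicPartition, Deque<RecordBatch>> batches; holds the message queues. Each TopicPartition has its own Deque of RecordBatches. A RecordBatch is essentially a ByteBuffer whose size is the batch.size described in the introduction (or larger when a single record does not fit, as the size calculation in append shows).

Figure 1

Figure 1 shows how RecordAccumulator's memory is allocated: most of it goes to batches, and a small part is held by in-flight batches (sent to the server, but no response received yet).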

Adding data: append

In KafkaProducer's doSend function, append is called to write the data into the accumulator.



 private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {

    ....

    RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs);

    if (result.batchIsFull || result.newBatchCreated) {

        log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);

        this.sender.wakeup();

    }

        return result.future;

    ....

 }

append adds the message to a batch of the TopicPartition. If a batch already exists, the message is added to it directly. If not, new memory is requested from the BufferPool and the message is written into a new batch.

    public RecordAppendResult append(TopicPartition tp,

                                     long timestamp,

                                     byte[] key,

                                     byte[] value,

                                     Callback callback,

                                     long maxTimeToBlock) throws InterruptedException {

        // We keep track of the number of appending thread to make sure we do not miss batches in

        // abortIncompleteBatches().

        appendsInProgress.incrementAndGet();

        try {

            // check if we have an in-progress batch

            // Get the Deque for tp, creating it if necessary

            Deque<RecordBatch> dq = getOrCreateDeque(tp);

            // If the Deque has a batch, append to it and return the result; otherwise tryAppend returns null

            synchronized (dq) {

                if (closed)

                    throw new IllegalStateException("Cannot send after the producer is closed.");

                RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);

                if (appendResult != null)

                    return appendResult;

            }

            // we don't have an in-progress record batch try to allocate a new batch

            // No usable batch, so allocate memory for a new one

            int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));

            log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());

            ByteBuffer buffer = free.allocate(size, maxTimeToBlock);

            synchronized (dq) {

                // Need to check if producer is closed again after grabbing the dequeue lock.

                if (closed)

                    throw new IllegalStateException("Cannot send after the producer is closed.");

                // Try to append again. If it succeeds, another thread has already created a batch, so return the freshly allocated buffer to the BufferPool

                RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);

                if (appendResult != null) {

                    // Somebody else found us a batch, return the one we waited for! Hopefully this doesn't happen often...

                    free.deallocate(buffer);

                    return appendResult;

                }

                // Create the batch

                MemoryRecords records = MemoryRecords.emptyRecords(buffer, compression, this.batchSize);

                RecordBatch batch = new RecordBatch(tp, records, time.milliseconds());

                // Append the message to the new batch. A failure here would indicate a bug, so an exception is thrown;
                // a null return should never actually happen

                FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));

                dq.addLast(batch);

                // Mark the batch as incomplete

                incomplete.add(batch);

                return new RecordAppendResult(future, dq.size() > 1 || batch.records.isFull(), true);

            }

        } finally {

            appendsInProgress.decrementAndGet();

        }

    }

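A few points in the function above deserve explanation:

1. The memory allocation is not inside a synchronized block, so several threads may request memory at the same time. Suppose that while thread A is obtaining its buffer, thread B has already obtained one and created a batch (batch creation is inside synchronized, so only one thread does it at a time). Thread A should then try to append again; if that succeeds, it releases its buffer back to the BufferPool.
2. Why is the allocation not inside synchronized? Because BufferPool blocks until enough memory is available for the requesting thread (for at most maxTimeToBlock). If this happened inside synchronized, the whole Deque would stay locked and no other thread could access it.
3. If data is written to batches at the same rate as Sender sends them, every write has to request memory and create a batch. If data is written faster than Sender sends, writes go straight into the existing batch until it is full or it has waited for linger.ms.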
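The allocate-outside-the-lock, re-check-inside-the-lock pattern from points 1 and 2 can be sketched in isolation. This is a simplified model under stated assumptions, not the Kafka code: MiniAccumulator is a hypothetical class, partitions are plain strings, a batch is a bare ByteBuffer, and a direct ByteBuffer.allocate stands in for the possibly blocking BufferPool call.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical miniature of the append path; partitions are plain strings here. */
class MiniAccumulator {
    private final int batchSize;
    private final ConcurrentMap<String, Deque<ByteBuffer>> batches = new ConcurrentHashMap<>();

    MiniAccumulator(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Returns true when this call had to create a new batch. */
    boolean append(String partition, byte[] value) {
        Deque<ByteBuffer> dq = batches.computeIfAbsent(partition, p -> new ArrayDeque<>());
        synchronized (dq) {
            if (tryAppend(dq, value))
                return false; // fast path: the newest batch had room
        }
        // Allocate outside the lock: in the real code this step may block on the pool.
        ByteBuffer buffer = ByteBuffer.allocate(Math.max(batchSize, value.length));
        synchronized (dq) {
            // Re-check: another thread may have added a batch while we were allocating.
            // The real code hands `buffer` back to the pool in that case.
            if (tryAppend(dq, value))
                return false;
            buffer.put(value);
            dq.addLast(buffer);
            return true;
        }
    }

    /** Writes into the newest batch if it exists and has enough space left. */
    private static boolean tryAppend(Deque<ByteBuffer> dq, byte[] value) {
        ByteBuffer last = dq.peekLast();
        if (last == null || last.remaining() < value.length)
            return false;
        last.put(value);
        return true;
    }

    /** Number of batches currently queued for the partition. */
    int batchCount(String partition) {
        Deque<ByteBuffer> dq = batches.get(partition);
        if (dq == null)
            return 0;
        synchronized (dq) {
            return dq.size();
        }
    }
}
```

With a batch size of 8 bytes, appending 5 bytes creates a batch, a following 3 bytes fit into it, and a further 4 bytes force a second batch.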

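Fetching data

Fetching the data happens entirely inside Sender's run function:

1. Find the nodes in the cluster that are ready to be sent to.
2. Get the RecordBatches to send to those nodes.
3. Find the RecordBatches that have expired.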

 void run(long now) {

        // Fetch the current cluster information

        Cluster cluster = metadata.fetch();

        // get the list of partitions with data ready to send

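        // A partition is ready to send when any of the following holds:
        // 1. its record set is full
        // 2. its records have waited for linger.ms
        // 3. the accumulator is out of memory
        // 4. the accumulator is being closed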

        RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

        // if there are any partitions whose leaders are not known yet, force metadata update

        if (!result.unknownLeaderTopics.isEmpty()) {

            // The set of topics with unknown leader contains topics with leader election pending as well as

            // topics which may have expired. Add the topic again to metadata to ensure it is included

            // and request metadata update, since there are messages to send to the topic.

            for (String topic : result.unknownLeaderTopics)

                this.metadata.add(topic);

            this.metadata.requestUpdate();

        }

        // remove any nodes we aren't ready to send to; a node is ready when
        // 1. the current information about the node can be trusted, and
        // 2. messages can be sent to the node

        Iterator<Node> iter = result.readyNodes.iterator();

        long notReadyTimeout = Long.MAX_VALUE;

        while (iter.hasNext()) {

            Node node = iter.next();

            if (!this.client.ready(node, now)) {

                iter.remove();

                notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));

            }

        }

        // create produce requests: first fetch the records to send

        Map<Integer, List<RecordBatch>> batches = this.accumulator.drain(cluster,

                                                                         result.readyNodes,

                                                                         this.maxRequestSize,

                                                                         now);

        // Preserve the send order

        if (guaranteeMessageOrder) {

            // Mute all the partitions drained

            for (List<RecordBatch> batchList : batches.values()) {

                for (RecordBatch batch : batchList)

                    this.accumulator.mutePartition(batch.topicPartition);

            }

        }

        // Check for records that have expired

        List<RecordBatch> expiredBatches = this.accumulator.abortExpiredBatches(this.requestTimeout, now);

        // update sensors

        for (RecordBatch expiredBatch : expiredBatches)

            this.sensors.recordErrors(expiredBatch.topicPartition.topic(), expiredBatch.recordCount);

        sensors.updateProduceRequestMetrics(batches);

        // Build the requests to send

        List<ClientRequest> requests = createProduceRequests(batches, now);

        // If we have any nodes that are ready to send + have sendable data, poll with 0 timeout so this can immediately

        // loop and try sending more data. Otherwise, the timeout is determined by nodes that have partitions with data

        // that isn't yet sendable (e.g. lingering, backing off). Note that this specifically does not include nodes

        // with sendable data that aren't ready to send since they would cause busy looping.

        long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);

        if (result.readyNodes.size() > 0) {

            log.trace("Nodes with data ready to send: {}", result.readyNodes);

            log.trace("Created {} produce requests: {}", requests.size(), requests);

            pollTimeout = 0;

        }

        // Add the requests to the channels

        for (ClientRequest request : requests)

            client.send(request, now);

        // if some partitions are already ready to be sent, the select time would be 0;

        // otherwise if some partition already has some data accumulated but not ready yet,

        // the select time will be the time difference between now and its linger expiry time;

        // otherwise the select time will be the time difference between now and the metadata expiry time;

        // This is where the messages are actually sent

        this.client.poll(pollTimeout, now);

    }

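Before sending, the producer needs to know which nodes can be sent to. This is obtained from public ReadyCheckResult ready(Cluster cluster, long nowMs).

mute

Here we need to look at a member variable of RecordAccumulator: private final Set<TopicPartition> muted;. This set holds every TopicPartition that has a batch sent to the server but not yet acknowledged, commonly called in-flight. As the run function shows, partitions are only muted when guaranteeMessageOrder is set. When the server's response arrives, the TopicPartition is removed from the set. This guarantees the send order within each TopicPartition.

For example, suppose three batches A, B and C are to be sent to one partition of topic1. If A, B and C are simply sent one after another, they may end up at the server in a different order, such as C, B, A, and the order in the log changes. With muting, once A is sent the partition is added to muted, and Sender stops taking batches from that partition's Deque until the producer receives the response for A and removes the partition from muted. Then B is sent, and so on. This way the server receives the batches in the same order the producer sent them.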
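The gating effect of the muted set can be modelled in a few lines. This is a hypothetical helper written for this article, not part of Kafka: MuteTracker, tryStartSend and onResponse are invented names, and partitions are represented as strings.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper showing how a muted set serialises sends per partition. */
class MuteTracker {
    private final Set<String> muted = new HashSet<>();

    /** Returns true and mutes the partition if nothing is in flight for it. */
    synchronized boolean tryStartSend(String partition) {
        return muted.add(partition); // add() returns false when already muted
    }

    /** Called when the response for the in-flight batch arrives. */
    synchronized void onResponse(String partition) {
        muted.remove(partition);
    }
}
```

In terms of the A, B, C example: sending A succeeds and mutes the partition, an attempt to send B is refused while A is in flight, and B goes out only after onResponse has been called for A.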

ready

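Before sending, a few things have to be determined:

1. Which Nodes, among those hosting the TopicPartitions, can be sent to.
2. When to next check whether nodes are ready.
3. Which TopicPartitions have no known leader.

All of this is done in the ready function, and the result is wrapped in a ReadyCheckResult.

In practice, a node is ready when at least one of its partitions is not muted, its first batch is not backing off, and one of the following holds:

1. The batch is full, or more than one batch is queued for the partition.
2. The batch is not full but has already waited for lingerMs.
3. The accumulator is out of memory (threads are waiting for buffers).
4. The accumulator has been closed.
5. A flush is in progress.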

    public ReadyCheckResult ready(Cluster cluster, long nowMs) {

        Set<Node> readyNodes = new HashSet<>();

        long nextReadyCheckDelayMs = Long.MAX_VALUE;

        Set<String> unknownLeaderTopics = new HashSet<>();

        boolean exhausted = this.free.queued() > 0;

        for (Map.Entry<TopicPartition, Deque<RecordBatch>> entry : this.batches.entrySet()) {

            TopicPartition part = entry.getKey();

            Deque<RecordBatch> deque = entry.getValue();

            // For each tp in batches, look up the leader of that tp

            Node leader = cluster.leaderFor(part);

            synchronized (deque) {

                // If leader is null and the deque is not empty, there are messages to send but no known node to receive them,
                // so add the topic to unknownLeaderTopics

                if (leader == null && !deque.isEmpty()) {

                    // This is a partition for which leader is not known, but messages are available to send.

                    // Note that entries are currently not removed from batches when deque is empty.

                    unknownLeaderTopics.add(part.topic());

                // The leader is not yet in readyNodes and the partition is not in flight

                } else if (!readyNodes.contains(leader) && !muted.contains(part)) {

                    RecordBatch batch = deque.peekFirst();

                    if (batch != null) {

                        // If the batch has been retried, check whether its last attempt time + retryBackoffMs is still later than now

                        boolean backingOff = batch.attempts > 0 && batch.lastAttemptMs + retryBackoffMs > nowMs;

                        // Time waited so far

                        long waitedTimeMs = nowMs - batch.lastAttemptMs;

                        // Time the batch has to wait: the backoff when retrying, otherwise lingerMs

                        long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;

                        // Time left to wait

                        long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);

                        // The batch is full (or another batch is queued behind it)

                        boolean full = deque.size() > 1 || batch.records.isFull();

                        // The batch has expired: it has waited at least timeToWaitMs

                        boolean expired = waitedTimeMs >= timeToWaitMs;

                        boolean sendable = full || expired || exhausted || closed || flushInProgress();

                        if (sendable && !backingOff) {

                            readyNodes.add(leader);

                        } else {

                            // Note that this results in a conservative estimate since an un-sendable partition may have

                            // a leader that will later be found to have sendable data. However, this is good enough

                            // since we'll just wake up and then sleep again for the remaining time.

                            nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);

                        }

                    }

                }

            }

        }

        return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);

    }
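The decision made for a single head batch can be isolated as a small pure function, which makes it easy to try concrete timings. The expressions are copied from ready() above; the ReadyCheck class, the isReady name and the flattened parameter list are assumptions made for this sketch.

```java
/** Hypothetical standalone version of the per-batch decision inside ready(). */
class ReadyCheck {
    /** Returns true when the head batch described by the arguments may be sent now. */
    static boolean isReady(boolean full, int attempts, long lastAttemptMs, long nowMs,
                           long lingerMs, long retryBackoffMs,
                           boolean exhausted, boolean closed, boolean flushing) {
        // A retried batch must wait retryBackoffMs after its last attempt.
        boolean backingOff = attempts > 0 && lastAttemptMs + retryBackoffMs > nowMs;
        long waitedTimeMs = nowMs - lastAttemptMs;
        long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
        boolean expired = waitedTimeMs >= timeToWaitMs;
        boolean sendable = full || expired || exhausted || closed || flushing;
        return sendable && !backingOff;
    }
}
```

For instance, with lingerMs = 5 a fresh, non-full batch created at time 1000 is not ready at 1003 but is ready at 1005, while a full batch that was retried at 1000 with retryBackoffMs = 100 is still held back at 1050.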

drain

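Once we know which Nodes to send to, the data to send has to be fetched from the accumulator. Up to max.request.size bytes are fetched per node, made up of batches from several different partitions. A batch can be sent when:

1. The tp of the batch has no data in flight (sent, but no response received yet).
2. The batch is not in its backoff period: either it is not a retry, or it is a retry that has already waited for the backoff time.

The drain function returns the batches to send in this round of requests. Because each Deque is also accessed by the threads calling append, taking a batch from it must be done inside synchronized(deque).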

    public Map<Integer, List<RecordBatch>> drain(Cluster cluster,

                                                 Set<Node> nodes,

                                                 int maxSize,

                                                 long now) {

        if (nodes.isEmpty())

            return Collections.emptyMap();

        Map<Integer, List<RecordBatch>> batches = new HashMap<>();

        for (Node node : nodes) {

            int size = 0;

            // Get the partitions that belong to this node

            List<PartitionInfo> parts = cluster.partitionsForNode(node.id());

            List<RecordBatch> ready = new ArrayList<>();

            /* to make starvation less likely this loop doesn't start at 0 */

            // Avoid always starting from the same partition, which would leave the others no chance to send

            int start = drainIndex = drainIndex % parts.size();

            do {

                PartitionInfo part = parts.get(drainIndex);

                TopicPartition tp = new TopicPartition(part.topic(), part.partition());

                // Only proceed if the partition has no in-flight batches.

                if (!muted.contains(tp)) {

                    Deque<RecordBatch> deque = getDeque(new TopicPartition(part.topic(), part.partition()));

                    if (deque != null) {

                    // Note the synchronized: the deque is shared with appending threads

                        synchronized (deque) {

                            RecordBatch first = deque.peekFirst();

                            if (first != null) {

                                // Check whether the batch is a retry that has not yet waited out its backoff; if not, it can be added

                                boolean backoff = first.attempts > 0 && first.lastAttemptMs + retryBackoffMs > now;

                                // Only drain the batch if it is not during backoff period.

                                if (!backoff) {

                                    // Stop if adding this batch would exceed maxSize and ready already holds something.
                                    // If ready is still empty, the batch is added even though it is larger than maxSize.

                                    if (size + first.records.sizeInBytes() > maxSize && !ready.isEmpty()) {

                                        // there is a rare case that a single batch size is larger than the request size due

                                        // to compression; in this case we will still eventually send this batch in a single

                                        // request

                                        break;

                                    } else {

                                        // Add to ready; note that batch.records has to be closed

                                        RecordBatch batch = deque.pollFirst();

                                        batch.records.close();

                                        size += batch.records.sizeInBytes();

                                        ready.add(batch);

                                        batch.drainedMs = now;

                                    }

                                }

                            }

                        }

                    }

                }

                this.drainIndex = (this.drainIndex + 1) % parts.size();

            } while (start != drainIndex);

            batches.put(node.id(), ready);

        }

        return batches;

    }
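Two details of drain are easy to miss: the rotating start index and the rule that an oversized batch is still taken when the request is empty. The sketch below models only those two points and makes strong simplifications that are not in the original: DrainSketch is a hypothetical class, each partition is reduced to the byte size of its head batch, and muting, backoff and locking are left out. It assumes the node has at least one partition.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of drain(): partitions are reduced to the size of their head batch. */
class DrainSketch {
    private int drainIndex = 0;

    /**
     * headSizes[i] is the byte size of partition i's first batch, or 0 if it has none.
     * Returns the indices of the partitions whose head batch goes into this request.
     */
    List<Integer> drain(int[] headSizes, int maxSize) {
        List<Integer> ready = new ArrayList<>();
        int size = 0;
        // Resume where the previous call stopped so no partition is starved.
        int start = drainIndex = drainIndex % headSizes.length;
        do {
            int batchSize = headSizes[drainIndex];
            if (batchSize > 0) {
                // Stop once the request would overflow, unless it is still empty.
                if (size + batchSize > maxSize && !ready.isEmpty())
                    break;
                size += batchSize;
                ready.add(drainIndex);
            }
            drainIndex = (drainIndex + 1) % headSizes.length;
        } while (start != drainIndex);
        return ready;
    }
}
```

With head sizes {60, 60, 30} and maxSize = 100, the first call takes only partition 0 and stops at partition 1; the next call starts at partition 1 instead of 0, so partitions 1 and 2 get their turn.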

flush

To send out all the data while producing, call KafkaProducer's flush function. After flush is called, all batches are sent out (loosely speaking).

    public void flush() {

        log.trace("Flushing accumulated records in producer.");

        this.accumulator.beginFlush();

        this.sender.wakeup();

        try {

            this.accumulator.awaitFlushCompletion();

        } catch (InterruptedException e) {

            throw new InterruptException("Flush interrupted.", e);

        }

    }

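This is the implementation of flush: it first marks the start of a flush, then wakes up the sender, and finally waits until all messages have been sent.

This involves flushesInProgress, a member variable of RecordAccumulator of type AtomicInteger. While it is greater than 0, every batch becomes sendable.

beginFlush simply increments flushesInProgress.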

    public void beginFlush() {

        this.flushesInProgress.getAndIncrement();

    }

In the ready function, the condition that decides whether a batch can be sent is:

    public ReadyCheckResult ready(Cluster cluster, long nowMs) {

        ....

        for (Map.Entry<TopicPartition, Deque<RecordBatch>> entry : this.batches.entrySet()) {

                        // The deciding condition

                        boolean sendable = full || expired || exhausted || closed || flushInProgress();

                        if (sendable && !backingOff) {

                            readyNodes.add(leader);

                        } else {

                        ....

                        }

        ....

    }

    boolean flushInProgress() {

        return flushesInProgress.get() > 0;

    }

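When append creates a new batch, the batch is added to the incomplete set, and it is removed from incomplete only after the server's response has been received. awaitFlushCompletion waits for every batch currently in incomplete to finish and then returns. This way all the data that was in the accumulator gets sent.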

    public void awaitFlushCompletion() throws InterruptedException {

        try {

            for (RecordBatch batch : this.incomplete.all())

                batch.produceFuture.await();

        } finally {

            this.flushesInProgress.decrementAndGet();

        }

    }
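The interplay of the counter and the incomplete set can be reproduced with standard library types. The following is a sketch under stated assumptions rather than the accumulator's code: FlushTracker and newBatch are invented names, a batch is represented only by a CompletableFuture that completes when its response arrives, and error handling is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of the flush bookkeeping; a batch is just its completion future. */
class FlushTracker {
    private final AtomicInteger flushesInProgress = new AtomicInteger(0);
    private final Set<CompletableFuture<Void>> incomplete = ConcurrentHashMap.newKeySet();

    /** Registers a new batch; it leaves the set when its future completes. */
    CompletableFuture<Void> newBatch() {
        CompletableFuture<Void> done = new CompletableFuture<>();
        incomplete.add(done);
        done.whenComplete((value, error) -> incomplete.remove(done));
        return done;
    }

    void beginFlush() {
        flushesInProgress.getAndIncrement();
    }

    boolean flushInProgress() {
        return flushesInProgress.get() > 0;
    }

    /** Blocks until every batch that was incomplete at call time has finished. */
    void awaitFlushCompletion() {
        try {
            List<CompletableFuture<Void>> snapshot = new ArrayList<>(incomplete);
            for (CompletableFuture<Void> batch : snapshot)
                batch.join();
        } finally {
            flushesInProgress.decrementAndGet();
        }
    }
}
```

A caller would invoke beginFlush, wake whatever thread completes the batches, and then block in awaitFlushCompletion; batches registered after the snapshot was taken are not waited for, which is one reason the "all batches are sent" description is only loosely true.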

abort

Finally, there is the case where Sender has to be shut down forcibly. The memory held by all batches in the accumulator is then released, and the accumulator is closed.
