io.h — Reference-Counted Block-Chain I/O Buffer

Introduction

io.h provides xIOBuffer, a non-contiguous byte buffer composed of a chain of reference-counted memory blocks. It supports zero-copy split, append, and scatter-gather I/O (readv/writev). Inspired by brpc's IOBuf, it is designed for high-throughput network I/O where avoiding memory copies is critical.

Design Philosophy

  1. Block-Chain Architecture — Data is stored across multiple fixed-size blocks (default 8KB each), tracked by an array of block references. This avoids large contiguous allocations and enables zero-copy operations.

  2. Reference Counting — Each xIOBlock is reference-counted. Multiple xIOBuffer instances can share the same block (e.g., after a Cut operation). Blocks are freed (returned to pool) when the last reference is released.

  3. Zero-Copy Operations — xIOBufferAppendIOBuffer() transfers block references without copying data. xIOBufferCut() splits a buffer by adjusting offsets and sharing blocks at the boundary.

  4. Lock-Free Block Pool — Released blocks are returned to a global Treiber stack (lock-free) for reuse, avoiding malloc/free overhead in steady state.

  5. Inline Ref Array — Small buffers (≤ 8 refs) use an inline array, avoiding heap allocation for the ref array itself. Larger buffers transition to a heap-allocated array.

Architecture

graph TD
    subgraph "xIOBuffer API"
        APPEND["Append / AppendStr"]
        APPEND_IO["AppendIOBuffer<br/>(zero-copy)"]
        READ["Read / CopyTo"]
        CUT["Cut<br/>(zero-copy split)"]
        CONSUME["Consume"]
        IO_READ["ReadFd"]
        IO_WRITE["WriteFd<br/>(writev)"]
    end

    subgraph "Block Management"
        ACQUIRE["xIOBlockAcquire"]
        RETAIN["xIOBlockRetain"]
        RELEASE["xIOBlockRelease"]
    end

    subgraph "Block Pool (Treiber Stack)"
        POOL["g_pool_head"]
        WARMUP["xIOBlockPoolWarmup"]
        DRAIN["xIOBlockPoolDrain"]
    end

    APPEND --> ACQUIRE
    IO_READ --> ACQUIRE
    CUT --> RETAIN
    CONSUME --> RELEASE
    READ --> RELEASE
    ACQUIRE --> POOL
    RELEASE --> POOL
    WARMUP --> POOL
    DRAIN --> POOL

    style POOL fill:#f5a623,color:#fff

Implementation Details

Block Structure

XDEF_STRUCT(xIOBlock) {
    size_t refs;                       // Reference count (atomic)
    size_t size;                       // Usable data size
    char   data[XIOBUFFER_BLOCK_SIZE]; // 8KB inline data
};

Reference Structure

XDEF_STRUCT(xIOBufferRef) {
    xIOBlock *block;   // Pointer to the underlying block
    size_t    offset;  // Start offset within block->data
    size_t    length;  // Number of valid bytes from offset
};

IOBuffer Structure

XDEF_STRUCT(xIOBuffer) {
    xIOBufferRef  inlined[XIOBUFFER_INLINE_REFS]; // Inline ref storage (8)
    xIOBufferRef *refs;    // Pointer to ref array (inlined or heap)
    size_t        nrefs;   // Number of active refs
    size_t        cap;     // Capacity of refs array
    size_t        nbytes;  // Total logical byte count (cached)
};
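The inline-to-heap transition for the ref array can be sketched as follows. This is a minimal illustration only: Ref, Buf, buf_init, and buf_reserve are stand-in names, not the library's actual internals, and the real code presumably also handles overflow and growth policy differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_REFS 8

typedef struct { void *block; size_t offset, length; } Ref;

typedef struct {
    Ref    inlined[INLINE_REFS]; /* inline storage: no heap needed for small buffers */
    Ref   *refs;                 /* points at `inlined` or a heap-allocated array */
    size_t nrefs, cap;
} Buf;

static void buf_init(Buf *io) {
    io->refs  = io->inlined;
    io->nrefs = 0;
    io->cap   = INLINE_REFS;
}

/* Grow the ref array when `need` exceeds capacity (doubling policy). */
static int buf_reserve(Buf *io, size_t need) {
    if (need <= io->cap) return 0;
    size_t newcap = io->cap;
    while (newcap < need) newcap *= 2;
    Ref *arr = malloc(newcap * sizeof(Ref));
    if (!arr) return -1;
    memcpy(arr, io->refs, io->nrefs * sizeof(Ref));
    if (io->refs != io->inlined) free(io->refs); /* inline array is never freed */
    io->refs = arr;
    io->cap  = newcap;
    return 0;
}
```

The key property is that `refs` starts out pointing at `inlined`, so a buffer with at most 8 refs never touches the heap for its ref array.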

Block-Chain Architecture

graph TD
    subgraph "xIOBuffer"
        REF1["Ref 0<br/>block=A, off=0, len=8192"]
        REF2["Ref 1<br/>block=B, off=0, len=8192"]
        REF3["Ref 2<br/>block=C, off=0, len=3000"]
    end

    subgraph "Shared Blocks"
        A["xIOBlock A<br/>refs=1, 8KB"]
        B["xIOBlock B<br/>refs=2, 8KB"]
        C["xIOBlock C<br/>refs=1, 8KB"]
    end

    REF1 --> A
    REF2 --> B
    REF3 --> C

    subgraph "Another xIOBuffer (after Cut)"
        REF4["Ref 0<br/>block=B, off=4096, len=4096"]
    end

    REF4 --> B

    style A fill:#4a90d9,color:#fff
    style B fill:#f5a623,color:#fff
    style C fill:#50b86c,color:#fff

Treiber Stack Block Pool

The global block pool uses a lock-free Treiber stack:

// Pool node overlays xIOBlock memory
XDEF_STRUCT(PoolNode_) {
    PoolNode_ *next;
};

static PoolNode_ *volatile g_pool_head = NULL;

Push (return to pool):

do {
    head = atomic_load(g_pool_head)
    node->next = head
} while (!CAS(g_pool_head, head, node))

Pop (acquire from pool):

do {
    head = atomic_load(g_pool_head)
    if (!head) return malloc(new block)
    next = head->next
} while (!CAS(g_pool_head, head, next))
return head
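As a concrete sketch, the push/pop pseudocode above maps onto C11 stdatomic roughly as follows. The names pool_push/pool_pop are illustrative (not the library's), the empty-pool malloc fallback is left to the caller, and a production Treiber stack must also consider the ABA problem, which this single-threaded-testable sketch does not address.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct PoolNode_ { struct PoolNode_ *next; } PoolNode_;

static _Atomic(PoolNode_ *) g_pool_head = NULL;

/* Push: return a node to the pool. */
static void pool_push(PoolNode_ *node) {
    PoolNode_ *head = atomic_load_explicit(&g_pool_head, memory_order_relaxed);
    do {
        node->next = head; /* a failed CAS refreshed `head`; re-link and retry */
    } while (!atomic_compare_exchange_weak_explicit(
        &g_pool_head, &head, node,
        memory_order_release, memory_order_relaxed));
}

/* Pop: take a node, or NULL when empty (caller falls back to malloc). */
static PoolNode_ *pool_pop(void) {
    PoolNode_ *head = atomic_load_explicit(&g_pool_head, memory_order_acquire);
    while (head != NULL &&
           !atomic_compare_exchange_weak_explicit(
               &g_pool_head, &head, head->next,
               memory_order_acquire, memory_order_acquire))
        ; /* `head` was reloaded by the failed CAS; retry */
    return head;
}
```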

Zero-Copy Cut

xIOBufferCut(io, dst, n) moves the first n bytes from io to dst:

  1. Fully consumed refs — Ownership transfers directly (no refcount change).
  2. Boundary ref — The block is shared: xIOBlockRetain() increments the refcount, and both buffers hold a ref with different offset/length.

flowchart TD
    CUT["xIOBufferCut(io, dst, n)"]
    LOOP{"More bytes to cut?"}
    FULL{"ref.length <= remaining?"}
    TRANSFER["Transfer entire ref to dst<br/>(no refcount change)"]
    SPLIT["Share block: Retain + split ref<br/>dst gets [offset, chunk]<br/>io keeps [offset+chunk, rest]"]
    SHIFT["Shift consumed refs out of io"]
    DONE["Update nbytes for both"]

    CUT --> LOOP
    LOOP -->|Yes| FULL
    FULL -->|Yes| TRANSFER --> LOOP
    FULL -->|No| SPLIT --> SHIFT --> DONE
    LOOP -->|No| SHIFT

    style TRANSFER fill:#50b86c,color:#fff
    style SPLIT fill:#f5a623,color:#fff
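The transfer/split loop in the flowchart above can be sketched with simplified stand-in types (Block, Ref, Buf, and buf_cut are assumed names; bounds checks on the dst ref array are omitted for brevity):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { int refs; char data[8]; } Block;            /* simplified */
typedef struct { Block *block; size_t offset, length; } Ref;
typedef struct { Ref refs[8]; size_t nrefs, nbytes; } Buf;

/* Move the first n bytes of src into dst by transferring or splitting refs. */
static size_t buf_cut(Buf *src, Buf *dst, size_t n) {
    size_t remaining = n < src->nbytes ? n : src->nbytes;
    size_t moved = remaining, consumed = 0;
    for (size_t i = 0; i < src->nrefs && remaining > 0; i++) {
        Ref *r = &src->refs[i];
        if (r->length <= remaining) {
            /* Fully consumed ref: transfer ownership, no refcount change. */
            dst->refs[dst->nrefs++] = *r;
            remaining -= r->length;
            consumed++;
        } else {
            /* Boundary ref: share the block, split offset/length. */
            r->block->refs++; /* retain */
            dst->refs[dst->nrefs++] = (Ref){ r->block, r->offset, remaining };
            r->offset += remaining;
            r->length -= remaining;
            remaining = 0;
        }
    }
    /* Shift fully-consumed refs out of src, then update cached byte counts. */
    memmove(src->refs, src->refs + consumed,
            (src->nrefs - consumed) * sizeof(Ref));
    src->nrefs  -= consumed;
    src->nbytes -= moved;
    dst->nbytes += moved;
    return moved;
}
```

Note that no byte of `data` is ever copied: only ref metadata moves, which is why Cut stays flat-cost regardless of payload size.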

Append Strategy

xIOBufferAppend(io, data, len):

  1. First tries to fill the tail block's remaining space (avoids allocating a new block for small appends).
  2. Allocates new blocks for remaining data, each up to XIOBUFFER_BLOCK_SIZE bytes.
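The two-step strategy can be sketched as below, with tiny stand-in types and an inline "pool" in place of xIOBlockAcquire (all names here are illustrative, and BLOCK_SIZE is shrunk to 8 bytes so the block-spanning path is easy to exercise):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8 /* tiny for illustration; the real default is 8192 */

typedef struct { int refs; size_t size; char data[BLOCK_SIZE]; } Block;
typedef struct { Block *block; size_t offset, length; } Ref;
typedef struct {
    Ref    refs[8];
    size_t nrefs, nbytes;
    Block  pool[8];   /* stand-in for the global block pool */
    size_t used;
} Buf;

static int buf_append(Buf *io, const void *data, size_t len) {
    const char *p = data;
    /* 1. Fill the tail block's remaining space first. */
    if (io->nrefs > 0) {
        Ref   *tail = &io->refs[io->nrefs - 1];
        Block *blk  = tail->block;
        size_t room = BLOCK_SIZE - (tail->offset + tail->length);
        /* Only safe when the ref ends exactly at the block's write cursor. */
        if (room > 0 && tail->offset + tail->length == blk->size) {
            size_t chunk = len < room ? len : room;
            memcpy(blk->data + blk->size, p, chunk);
            blk->size    += chunk;
            tail->length += chunk;
            io->nbytes   += chunk;
            p += chunk; len -= chunk;
        }
    }
    /* 2. Allocate new blocks for the rest, up to BLOCK_SIZE bytes each. */
    while (len > 0) {
        Block *blk = &io->pool[io->used++]; /* stand-in for xIOBlockAcquire */
        blk->refs = 1; blk->size = 0;
        size_t chunk = len < BLOCK_SIZE ? len : BLOCK_SIZE;
        memcpy(blk->data, p, chunk);
        blk->size = chunk;
        io->refs[io->nrefs++] = (Ref){ blk, 0, chunk };
        io->nbytes += chunk;
        p += chunk; len -= chunk;
    }
    return 0;
}
```

The tail check (`tail->offset + tail->length == blk->size`) matters: if the block is shared after a Cut, appending into it could corrupt the other buffer's view, so a shared or non-tail-aligned block must not be written in place.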

API Reference

Configuration

| Macro | Default | Description |
| --- | --- | --- |
| XIOBUFFER_BLOCK_SIZE | 8192 | Block data size in bytes |
| XIOBUFFER_INLINE_REFS | 8 | Inline ref array capacity |

Block API

| Function | Signature | Description | Thread Safety |
| --- | --- | --- | --- |
| xIOBlockAcquire | xIOBlock *xIOBlockAcquire(void) | Get a block from pool (or malloc). refs=1. | Thread-safe (lock-free pool) |
| xIOBlockRetain | void xIOBlockRetain(xIOBlock *blk) | Increment refcount. | Thread-safe (atomic) |
| xIOBlockRelease | void xIOBlockRelease(xIOBlock *blk) | Decrement refcount; return to pool at 0. | Thread-safe (atomic + lock-free pool) |
| xIOBlockPoolWarmup | xErrno xIOBlockPoolWarmup(size_t n) | Pre-allocate n blocks into pool. | Thread-safe |
| xIOBlockPoolDrain | void xIOBlockPoolDrain(void) | Free all pooled blocks. Call at shutdown. | Not thread-safe (no concurrent use) |
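The retain/release pair reduces to two atomic operations. A hedged sketch with a simplified Block type (block_retain/block_release and the memory orders shown are assumptions, not the library's verified implementation):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    _Atomic size_t refs; /* reference count */
} Block;

static void block_retain(Block *blk) {
    /* Relaxed is enough here: the caller already holds a reference,
     * so the count cannot concurrently drop to zero. */
    atomic_fetch_add_explicit(&blk->refs, 1, memory_order_relaxed);
}

/* Returns true when this call dropped the last reference,
 * i.e. the caller should return the block to the pool. */
static bool block_release(Block *blk) {
    /* acq_rel: the releasing thread's writes to the block must become
     * visible to whichever thread recycles it next. */
    return atomic_fetch_sub_explicit(&blk->refs, 1, memory_order_acq_rel) == 1;
}
```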

IOBuffer Lifecycle

| Function | Signature | Description | Thread Safety |
| --- | --- | --- | --- |
| xIOBufferInit | void xIOBufferInit(xIOBuffer *io) | Initialize an empty IOBuffer. | Not thread-safe |
| xIOBufferDeinit | void xIOBufferDeinit(xIOBuffer *io) | Release all refs and free ref array. | Not thread-safe |
| xIOBufferReset | void xIOBufferReset(xIOBuffer *io) | Release all refs, keep ref array. | Not thread-safe |

IOBuffer Query

| Function | Signature | Description | Thread Safety |
| --- | --- | --- | --- |
| xIOBufferLen | size_t xIOBufferLen(const xIOBuffer *io) | Total readable bytes. | Not thread-safe |
| xIOBufferEmpty | bool xIOBufferEmpty(const xIOBuffer *io) | True if no data. | Not thread-safe |
| xIOBufferRefCount | size_t xIOBufferRefCount(const xIOBuffer *io) | Number of block refs. | Not thread-safe |

IOBuffer Write

| Function | Signature | Description | Thread Safety |
| --- | --- | --- | --- |
| xIOBufferAppend | xErrno xIOBufferAppend(xIOBuffer *io, const void *data, size_t len) | Append bytes (allocates blocks as needed). | Not thread-safe |
| xIOBufferAppendStr | xErrno xIOBufferAppendStr(xIOBuffer *io, const char *str) | Append C string. | Not thread-safe |
| xIOBufferAppendIOBuffer | xErrno xIOBufferAppendIOBuffer(xIOBuffer *io, xIOBuffer *other) | Zero-copy: move all refs from other. | Not thread-safe |

IOBuffer Read

| Function | Signature | Description | Thread Safety |
| --- | --- | --- | --- |
| xIOBufferRead | size_t xIOBufferRead(xIOBuffer *io, void *out, size_t len) | Copy and consume bytes. | Not thread-safe |
| xIOBufferCut | size_t xIOBufferCut(xIOBuffer *io, xIOBuffer *dst, size_t n) | Zero-copy split: move first n bytes to dst. | Not thread-safe |
| xIOBufferConsume | size_t xIOBufferConsume(xIOBuffer *io, size_t n) | Discard first n bytes. | Not thread-safe |
| xIOBufferCopyTo | size_t xIOBufferCopyTo(const xIOBuffer *io, void *out) | Linearize: copy all data to contiguous buffer. | Not thread-safe |

IOBuffer I/O

| Function | Signature | Description | Thread Safety |
| --- | --- | --- | --- |
| xIOBufferReadIov | int xIOBufferReadIov(const xIOBuffer *io, struct iovec *iov, int max_iov) | Fill iovecs for writev(). | Not thread-safe |
| xIOBufferReadFd | ssize_t xIOBufferReadFd(xIOBuffer *io, int fd) | Read from fd into IOBuffer. | Not thread-safe |
| xIOBufferWriteFd | ssize_t xIOBufferWriteFd(xIOBuffer *io, int fd) | Write to fd using writev(). | Not thread-safe |

Usage Examples

Basic Usage

#include <stdio.h>
#include <xbuf/io.h>

int main(void) {
    xIOBuffer io;
    xIOBufferInit(&io);

    // Append data (may span multiple blocks)
    xIOBufferAppend(&io, "Hello, ", 7);
    xIOBufferAppend(&io, "IOBuffer!", 9);

    printf("Length: %zu, Refs: %zu\n",
           xIOBufferLen(&io), xIOBufferRefCount(&io));

    // Linearize for processing
    char buf[64];
    xIOBufferCopyTo(&io, buf);
    printf("Content: %.*s\n", (int)xIOBufferLen(&io), buf);

    xIOBufferDeinit(&io);
    return 0;
}

Zero-Copy Split (Protocol Parsing)

#include <xbuf/io.h>

void parse_protocol(xIOBuffer *io) {
    // Cut the 4-byte header from the front
    xIOBuffer header;
    xIOBufferInit(&header);

    size_t cut = xIOBufferCut(io, &header, 4);
    if (cut == 4) {
        char hdr[4];
        xIOBufferRead(&header, hdr, 4);
        // Parse header...
        // io now contains only the body (zero-copy!)
    }

    xIOBufferDeinit(&header);
}

High-Throughput Network I/O

#include <xbuf/io.h>

void handle_data(int sockfd) {
    // Pre-warm the block pool (do this once at startup, not per connection)
    xIOBlockPoolWarmup(64);

    xIOBuffer io;
    xIOBufferInit(&io);

    // Read from socket (allocates blocks from pool)
    ssize_t n = xIOBufferReadFd(&io, sockfd);
    if (n > 0) {
        // Write back using scatter-gather I/O
        xIOBufferWriteFd(&io, sockfd);
    }

    xIOBufferDeinit(&io);

    // At shutdown
    xIOBlockPoolDrain();
}

Use Cases

  1. HTTP Response Body — The xhttp module uses xIOBuffer to accumulate response chunks from libcurl without copying between buffers.

  2. Protocol Framing — Use xIOBufferCut() to split headers from body in a zero-copy fashion, then process each part independently.

  3. Data Pipeline — Chain multiple processing stages that each append to or cut from xIOBuffer instances, sharing blocks to minimize copies.

Best Practices

  • Call xIOBlockPoolWarmup() at startup to pre-allocate blocks and avoid allocation spikes during initial traffic.
  • Call xIOBlockPoolDrain() at shutdown for clean valgrind reports.
  • Use xIOBufferAppendIOBuffer() instead of copying when combining buffers. It transfers ownership without data copies.
  • Use xIOBufferCut() for protocol parsing. It's more efficient than xIOBufferRead() when you need to pass the cut data to another component.
  • Monitor xIOBufferRefCount() to understand memory fragmentation. Many small refs may indicate suboptimal block utilization.

Comparison with Other Libraries

| Feature | xbuf io.h | brpc IOBuf | Netty ByteBuf | Go bytes.Buffer |
| --- | --- | --- | --- | --- |
| Architecture | Block-chain (ref array) | Block-chain (linked list) | Composite buffer | Contiguous slice |
| Block Size | 8KB (configurable) | 8KB | Configurable | N/A |
| Reference Counting | Atomic (per block) | Atomic (per block) | Atomic (per buffer) | GC |
| Zero-Copy Split | xIOBufferCut | cutn | slice | No |
| Zero-Copy Append | xIOBufferAppendIOBuffer | append(IOBuf) | addComponent | No |
| Block Pool | Treiber stack (lock-free) | Thread-local + global | Arena allocator | N/A |
| Scatter-Gather I/O | writev via ReadIov | writev via pappend | nioBuffers | No |
| Inline Optimization | 8 inline refs | No | No | N/A |
| Language | C99 | C++ | Java | Go |

Key Differentiator: xbuf's xIOBuffer combines brpc-style block-chain architecture with a lock-free Treiber stack block pool and inline ref optimization. The zero-copy Cut and AppendIOBuffer operations make it ideal for protocol parsing and data pipeline scenarios in C.

Benchmark

Environment: Apple M3 Pro, 36 GB RAM, macOS 26.4, Release build (-O2). Source: xbuf/io_bench.cpp

| Benchmark | Size | Time (ns) | CPU (ns) | Throughput |
| --- | --- | --- | --- | --- |
| BM_IOBuffer_Append | 64 | 3,720 | 3,720 | 16.0 GiB/s |
| BM_IOBuffer_Append | 256 | 7,569 | 7,568 | 31.5 GiB/s |
| BM_IOBuffer_Append | 1,024 | 22,341 | 22,340 | 42.7 GiB/s |
| BM_IOBuffer_Append | 4,096 | 79,796 | 79,794 | 47.8 GiB/s |
| BM_IOBuffer_Append | 8,192 | 187,167 | 187,165 | 40.8 GiB/s |
| BM_IOBuffer_AppendConsume | 64 | 5,230 | 5,230 | 11.4 GiB/s |
| BM_IOBuffer_AppendConsume | 256 | 8,232 | 8,232 | 29.0 GiB/s |
| BM_IOBuffer_AppendConsume | 1,024 | 23,040 | 23,040 | 41.4 GiB/s |
| BM_IOBuffer_Cut | 8,192 | 167 | 167 | 45.6 GiB/s |
| BM_IOBuffer_Cut | 65,536 | 1,651 | 1,651 | 37.0 GiB/s |
| BM_IOBuffer_Cut | 262,144 | 8,122 | 8,122 | 30.1 GiB/s |
| BM_IOBuffer_AppendIOBuffer | 1,024 | 3,196 | 3,196 | 29.8 GiB/s |
| BM_IOBuffer_AppendIOBuffer | 4,096 | 9,307 | 9,307 | 41.0 GiB/s |
| BM_IOBuffer_AppendIOBuffer | 8,192 | 17,604 | 17,602 | 43.3 GiB/s |
| BM_IOBuffer_BlockPool | | 8.91 | 8.89 | |

Key Observations:

  • Append peaks at ~48 GiB/s for 4KB chunks. The slight drop at 8KB reflects block boundary crossing overhead.
  • Cut (zero-copy split) is extremely fast — 167ns for 8KB — because it only manipulates reference metadata, not data. This validates the block-chain architecture for protocol parsing.
  • AppendIOBuffer (zero-copy concatenation) achieves ~43 GiB/s, confirming that block ownership transfer avoids data copies.
  • BlockPool acquire/release cycle takes ~9ns, showing the lock-free Treiber stack's efficiency for block recycling.