# io.h — Reference-Counted Block-Chain I/O Buffer

## Introduction

`io.h` provides `xIOBuffer`, a non-contiguous byte buffer composed of a chain of reference-counted memory blocks. It supports zero-copy split, append, and scatter-gather I/O (`readv`/`writev`). Inspired by brpc's IOBuf, it is designed for high-throughput network I/O where avoiding memory copies is critical.
## Design Philosophy

- **Block-Chain Architecture** — Data is stored across multiple fixed-size blocks (default 8KB each), linked through a reference array. This avoids large contiguous allocations and enables zero-copy operations.
- **Reference Counting** — Each `xIOBlock` is reference-counted. Multiple `xIOBuffer` instances can share the same block (e.g., after a `Cut` operation). Blocks are freed (returned to the pool) when the last reference is released.
- **Zero-Copy Operations** — `xIOBufferAppendIOBuffer()` transfers block references without copying data. `xIOBufferCut()` splits a buffer by adjusting offsets and sharing blocks at the boundary.
- **Lock-Free Block Pool** — Released blocks are returned to a global Treiber stack (lock-free) for reuse, avoiding `malloc`/`free` overhead in steady state.
- **Inline Ref Array** — Small buffers (≤ 8 refs) use an inline array, avoiding heap allocation for the ref array itself. Larger buffers transition to a heap-allocated array.
## Architecture

```mermaid
graph TD
    subgraph "xIOBuffer API"
        APPEND["Append / AppendStr"]
        APPEND_IO["AppendIOBuffer<br/>(zero-copy)"]
        READ["Read / CopyTo"]
        CUT["Cut<br/>(zero-copy split)"]
        CONSUME["Consume"]
        IO_READ["ReadFd"]
        IO_WRITE["WriteFd<br/>(writev)"]
    end
    subgraph "Block Management"
        ACQUIRE["xIOBlockAcquire"]
        RETAIN["xIOBlockRetain"]
        RELEASE["xIOBlockRelease"]
    end
    subgraph "Block Pool (Treiber Stack)"
        POOL["g_pool_head"]
        WARMUP["xIOBlockPoolWarmup"]
        DRAIN["xIOBlockPoolDrain"]
    end
    APPEND --> ACQUIRE
    IO_READ --> ACQUIRE
    CUT --> RETAIN
    CONSUME --> RELEASE
    READ --> RELEASE
    ACQUIRE --> POOL
    RELEASE --> POOL
    WARMUP --> POOL
    DRAIN --> POOL
    style POOL fill:#f5a623,color:#fff
```
## Implementation Details

### Block Structure

```c
XDEF_STRUCT(xIOBlock) {
    size_t refs;                     // Reference count (atomic)
    size_t size;                     // Usable data size
    char data[XIOBUFFER_BLOCK_SIZE]; // 8KB inline data
};
```

### Reference Structure

```c
XDEF_STRUCT(xIOBufferRef) {
    xIOBlock *block;  // Pointer to the underlying block
    size_t offset;    // Start offset within block->data
    size_t length;    // Number of valid bytes from offset
};
```

### IOBuffer Structure

```c
XDEF_STRUCT(xIOBuffer) {
    xIOBufferRef inlined[XIOBUFFER_INLINE_REFS]; // Inline ref storage (8)
    xIOBufferRef *refs;  // Pointer to ref array (inlined or heap)
    size_t nrefs;        // Number of active refs
    size_t cap;          // Capacity of refs array
    size_t nbytes;       // Total logical byte count (cached)
};
```
### Block-Chain Architecture

```mermaid
graph TD
    subgraph "xIOBuffer"
        REF1["Ref 0<br/>block=A, off=0, len=8192"]
        REF2["Ref 1<br/>block=B, off=0, len=8192"]
        REF3["Ref 2<br/>block=C, off=0, len=3000"]
    end
    subgraph "Shared Blocks"
        A["xIOBlock A<br/>refs=1, 8KB"]
        B["xIOBlock B<br/>refs=2, 8KB"]
        C["xIOBlock C<br/>refs=1, 8KB"]
    end
    REF1 --> A
    REF2 --> B
    REF3 --> C
    subgraph "Another xIOBuffer (after Cut)"
        REF4["Ref 0<br/>block=B, off=4096, len=4096"]
    end
    REF4 --> B
    style A fill:#4a90d9,color:#fff
    style B fill:#f5a623,color:#fff
    style C fill:#50b86c,color:#fff
```
### Treiber Stack Block Pool

The global block pool uses a lock-free Treiber stack:

```c
// Pool node overlays xIOBlock memory
XDEF_STRUCT(PoolNode_) {
    PoolNode_ *next;
};

static PoolNode_ *volatile g_pool_head = NULL;
```

Push (return to pool):

```c
do {
    head = atomic_load(&g_pool_head);
    node->next = head;
} while (!CAS(&g_pool_head, head, node));
```

Pop (acquire from pool):

```c
do {
    head = atomic_load(&g_pool_head);
    if (!head)
        return /* malloc a fresh block */;
    next = head->next;
} while (!CAS(&g_pool_head, head, next));
return head;
```
### Zero-Copy Cut

`xIOBufferCut(io, dst, n)` moves the first `n` bytes from `io` to `dst`:

- **Fully consumed refs** — Ownership transfers directly (no refcount change).
- **Boundary ref** — The block is shared: `xIOBlockRetain()` increments the refcount, and both buffers hold a ref with a different offset/length.

```mermaid
flowchart TD
    CUT["xIOBufferCut(io, dst, n)"]
    LOOP{"More bytes to cut?"}
    FULL{"ref.length <= remaining?"}
    TRANSFER["Transfer entire ref to dst<br/>(no refcount change)"]
    SPLIT["Share block: Retain + split ref<br/>dst gets [offset, chunk]<br/>io keeps [offset+chunk, rest]"]
    SHIFT["Shift consumed refs out of io"]
    DONE["Update nbytes for both"]
    CUT --> LOOP
    LOOP -->|Yes| FULL
    FULL -->|Yes| TRANSFER --> LOOP
    FULL -->|No| SPLIT --> SHIFT --> DONE
    LOOP -->|No| SHIFT
    style TRANSFER fill:#50b86c,color:#fff
    style SPLIT fill:#f5a623,color:#fff
```
### Append Strategy

`xIOBufferAppend(io, data, len)`:

- First tries to fill the tail block's remaining space (avoiding a new block allocation for small appends).
- Allocates new blocks for any remaining data, each holding up to `XIOBUFFER_BLOCK_SIZE` bytes.
## API Reference

### Configuration

| Macro | Default | Description |
|---|---|---|
| `XIOBUFFER_BLOCK_SIZE` | 8192 | Block data size in bytes |
| `XIOBUFFER_INLINE_REFS` | 8 | Inline ref array capacity |

### Block API

| Function | Signature | Description | Thread Safety |
|---|---|---|---|
| `xIOBlockAcquire` | `xIOBlock *xIOBlockAcquire(void)` | Get a block from the pool (or malloc). `refs=1`. | Thread-safe (lock-free pool) |
| `xIOBlockRetain` | `void xIOBlockRetain(xIOBlock *blk)` | Increment refcount. | Thread-safe (atomic) |
| `xIOBlockRelease` | `void xIOBlockRelease(xIOBlock *blk)` | Decrement refcount; return to pool at 0. | Thread-safe (atomic + lock-free pool) |
| `xIOBlockPoolWarmup` | `xErrno xIOBlockPoolWarmup(size_t n)` | Pre-allocate `n` blocks into the pool. | Thread-safe |
| `xIOBlockPoolDrain` | `void xIOBlockPoolDrain(void)` | Free all pooled blocks. Call at shutdown. | Not thread-safe (no concurrent use) |

### IOBuffer Lifecycle

| Function | Signature | Description | Thread Safety |
|---|---|---|---|
| `xIOBufferInit` | `void xIOBufferInit(xIOBuffer *io)` | Initialize an empty IOBuffer. | Not thread-safe |
| `xIOBufferDeinit` | `void xIOBufferDeinit(xIOBuffer *io)` | Release all refs and free the ref array. | Not thread-safe |
| `xIOBufferReset` | `void xIOBufferReset(xIOBuffer *io)` | Release all refs, keep the ref array. | Not thread-safe |

### IOBuffer Query

| Function | Signature | Description | Thread Safety |
|---|---|---|---|
| `xIOBufferLen` | `size_t xIOBufferLen(const xIOBuffer *io)` | Total readable bytes. | Not thread-safe |
| `xIOBufferEmpty` | `bool xIOBufferEmpty(const xIOBuffer *io)` | True if no data. | Not thread-safe |
| `xIOBufferRefCount` | `size_t xIOBufferRefCount(const xIOBuffer *io)` | Number of block refs. | Not thread-safe |

### IOBuffer Write

| Function | Signature | Description | Thread Safety |
|---|---|---|---|
| `xIOBufferAppend` | `xErrno xIOBufferAppend(xIOBuffer *io, const void *data, size_t len)` | Append bytes (allocates blocks as needed). | Not thread-safe |
| `xIOBufferAppendStr` | `xErrno xIOBufferAppendStr(xIOBuffer *io, const char *str)` | Append a C string. | Not thread-safe |
| `xIOBufferAppendIOBuffer` | `xErrno xIOBufferAppendIOBuffer(xIOBuffer *io, xIOBuffer *other)` | Zero-copy: move all refs from `other`. | Not thread-safe |

### IOBuffer Read

| Function | Signature | Description | Thread Safety |
|---|---|---|---|
| `xIOBufferRead` | `size_t xIOBufferRead(xIOBuffer *io, void *out, size_t len)` | Copy and consume bytes. | Not thread-safe |
| `xIOBufferCut` | `size_t xIOBufferCut(xIOBuffer *io, xIOBuffer *dst, size_t n)` | Zero-copy split: move the first `n` bytes to `dst`. | Not thread-safe |
| `xIOBufferConsume` | `size_t xIOBufferConsume(xIOBuffer *io, size_t n)` | Discard the first `n` bytes. | Not thread-safe |
| `xIOBufferCopyTo` | `size_t xIOBufferCopyTo(const xIOBuffer *io, void *out)` | Linearize: copy all data to a contiguous buffer. | Not thread-safe |

### IOBuffer I/O

| Function | Signature | Description | Thread Safety |
|---|---|---|---|
| `xIOBufferReadIov` | `int xIOBufferReadIov(const xIOBuffer *io, struct iovec *iov, int max_iov)` | Fill iovecs for `writev()`. | Not thread-safe |
| `xIOBufferReadFd` | `ssize_t xIOBufferReadFd(xIOBuffer *io, int fd)` | Read from an fd into the IOBuffer. | Not thread-safe |
| `xIOBufferWriteFd` | `ssize_t xIOBufferWriteFd(xIOBuffer *io, int fd)` | Write to an fd using `writev()`. | Not thread-safe |
## Usage Examples

### Basic Usage

```c
#include <stdio.h>
#include <xbuf/io.h>

int main(void) {
    xIOBuffer io;
    xIOBufferInit(&io);

    // Append data (may span multiple blocks)
    xIOBufferAppend(&io, "Hello, ", 7);
    xIOBufferAppend(&io, "IOBuffer!", 9);
    printf("Length: %zu, Refs: %zu\n",
           xIOBufferLen(&io), xIOBufferRefCount(&io));

    // Linearize for processing
    char buf[64];
    xIOBufferCopyTo(&io, buf);
    printf("Content: %.*s\n", (int)xIOBufferLen(&io), buf);

    xIOBufferDeinit(&io);
    return 0;
}
```
### Zero-Copy Split (Protocol Parsing)

```c
#include <xbuf/io.h>

void parse_protocol(xIOBuffer *io) {
    // Cut the 4-byte header from the front
    xIOBuffer header;
    xIOBufferInit(&header);
    size_t cut = xIOBufferCut(io, &header, 4);
    if (cut == 4) {
        char hdr[4];
        xIOBufferRead(&header, hdr, 4);
        // Parse header...
        // io now contains only the body (zero-copy!)
    }
    xIOBufferDeinit(&header);
}
```
### High-Throughput Network I/O

```c
#include <xbuf/io.h>

void handle_data(int sockfd) {
    // Pre-warm the block pool (typically done once at process startup)
    xIOBlockPoolWarmup(64);

    xIOBuffer io;
    xIOBufferInit(&io);

    // Read from the socket (allocates blocks from the pool)
    ssize_t n = xIOBufferReadFd(&io, sockfd);
    if (n > 0) {
        // Write back using scatter-gather I/O
        xIOBufferWriteFd(&io, sockfd);
    }

    xIOBufferDeinit(&io);

    // Free all pooled blocks (typically done once at process shutdown)
    xIOBlockPoolDrain();
}
```
## Use Cases

- **HTTP Response Body** — The `xhttp` module uses `xIOBuffer` to accumulate response chunks from libcurl without copying between buffers.
- **Protocol Framing** — Use `xIOBufferCut()` to split headers from the body in a zero-copy fashion, then process each part independently.
- **Data Pipeline** — Chain multiple processing stages that each append to or cut from `xIOBuffer` instances, sharing blocks to minimize copies.
## Best Practices

- Call `xIOBlockPoolWarmup()` at startup to pre-allocate blocks and avoid allocation spikes during initial traffic.
- Call `xIOBlockPoolDrain()` at shutdown for clean valgrind reports.
- Use `xIOBufferAppendIOBuffer()` instead of copying when combining buffers; it transfers ownership without data copies.
- Use `xIOBufferCut()` for protocol parsing; it is more efficient than `xIOBufferRead()` when you need to pass the cut data to another component.
- Monitor `xIOBufferRefCount()` to understand memory fragmentation; many small refs may indicate suboptimal block utilization.
## Comparison with Other Libraries

| Feature | xbuf io.h | brpc IOBuf | Netty ByteBuf | Go bytes.Buffer |
|---|---|---|---|---|
| Architecture | Block-chain (ref array) | Block-chain (linked list) | Composite buffer | Contiguous slice |
| Block Size | 8KB (configurable) | 8KB | Configurable | N/A |
| Reference Counting | Atomic (per block) | Atomic (per block) | Atomic (per buffer) | GC |
| Zero-Copy Split | `xIOBufferCut` | `cutn` | `slice` | No |
| Zero-Copy Append | `xIOBufferAppendIOBuffer` | `append(IOBuf)` | `addComponent` | No |
| Block Pool | Treiber stack (lock-free) | Thread-local + global | Arena allocator | N/A |
| Scatter-Gather I/O | `writev` via `ReadIov` | `writev` via `pappend` | `nioBuffers` | No |
| Inline Optimization | 8 inline refs | No | No | N/A |
| Language | C99 | C++ | Java | Go |

**Key Differentiator:** xbuf's `xIOBuffer` combines brpc-style block-chain architecture with a lock-free Treiber stack block pool and inline ref optimization. The zero-copy `Cut` and `AppendIOBuffer` operations make it ideal for protocol parsing and data pipeline scenarios in C.
## Benchmark

Environment: Apple M3 Pro, 36 GB RAM, macOS 26.4, Release build (`-O2`). Source: `xbuf/io_bench.cpp`
| Benchmark | Size | Time (ns) | CPU (ns) | Throughput |
|---|---|---|---|---|
| `BM_IOBuffer_Append` | 64 | 3,720 | 3,720 | 16.0 GiB/s |
| `BM_IOBuffer_Append` | 256 | 7,569 | 7,568 | 31.5 GiB/s |
| `BM_IOBuffer_Append` | 1,024 | 22,341 | 22,340 | 42.7 GiB/s |
| `BM_IOBuffer_Append` | 4,096 | 79,796 | 79,794 | 47.8 GiB/s |
| `BM_IOBuffer_Append` | 8,192 | 187,167 | 187,165 | 40.8 GiB/s |
| `BM_IOBuffer_AppendConsume` | 64 | 5,230 | 5,230 | 11.4 GiB/s |
| `BM_IOBuffer_AppendConsume` | 256 | 8,232 | 8,232 | 29.0 GiB/s |
| `BM_IOBuffer_AppendConsume` | 1,024 | 23,040 | 23,040 | 41.4 GiB/s |
| `BM_IOBuffer_Cut` | 8,192 | 167 | 167 | 45.6 GiB/s |
| `BM_IOBuffer_Cut` | 65,536 | 1,651 | 1,651 | 37.0 GiB/s |
| `BM_IOBuffer_Cut` | 262,144 | 8,122 | 8,122 | 30.1 GiB/s |
| `BM_IOBuffer_AppendIOBuffer` | 1,024 | 3,196 | 3,196 | 29.8 GiB/s |
| `BM_IOBuffer_AppendIOBuffer` | 4,096 | 9,307 | 9,307 | 41.0 GiB/s |
| `BM_IOBuffer_AppendIOBuffer` | 8,192 | 17,604 | 17,602 | 43.3 GiB/s |
| `BM_IOBuffer_BlockPool` | — | 8.91 | 8.89 | — |
**Key Observations:**

- **Append** peaks at ~48 GiB/s for 4KB chunks; the slight drop at 8KB reflects block-boundary-crossing overhead.
- **Cut** (zero-copy split) is extremely fast — 167 ns for 8KB — because it only manipulates reference metadata, not data. This validates the block-chain architecture for protocol parsing.
- **AppendIOBuffer** (zero-copy concatenation) achieves ~43 GiB/s, confirming that block ownership transfer avoids data copies.
- **BlockPool** acquire/release takes ~9 ns per cycle, showing the lock-free Treiber stack's efficiency for block recycling.