mpsc.h — Lock-Free MPSC Queue

Introduction

mpsc.h provides a lock-free, intrusive multi-producer single-consumer (MPSC) queue. Multiple threads can push nodes concurrently without locks, while a single consumer thread pops nodes. It is the backbone of xbase's poll-mode timer dispatch and the event loop's offload completion queue.

Design Philosophy

  1. Intrusive Design — Nodes embed an xMpsc struct directly, avoiding heap allocation per enqueue. This is critical for hot paths like timer expiry and offload completion where allocation overhead would be unacceptable.

  2. Lock-Free Push — xMpscPush() uses a single atomic exchange (xAtomicXchg) on the tail pointer, making it wait-free for producers. No mutex, no CAS retry loop.

  3. Single-Consumer Pop — xMpscPop() is designed for exactly one consumer thread. It uses atomic loads and a single CAS for the edge case of popping the last element. This simplification avoids the ABA problem that plagues multi-consumer designs.

  4. Minimal Memory Ordering — The implementation uses xAtomicAcqRel for the exchange and xAtomicAcquire/xAtomicRelease for loads/stores, providing the minimum ordering needed for correctness without the overhead of sequential consistency.

Architecture

graph LR
    P1["Producer 1"] -->|"xMpscPush"| TAIL["tail"]
    P2["Producer 2"] -->|"xMpscPush"| TAIL
    P3["Producer 3"] -->|"xMpscPush"| TAIL

    HEAD["head"] -->|"xMpscPop"| C["Consumer"]

    subgraph "Queue"
        HEAD --> N1["Node 1"] --> N2["Node 2"] --> N3["Node 3"]
        N3 --- TAIL
    end

    style P1 fill:#4a90d9,color:#fff
    style P2 fill:#4a90d9,color:#fff
    style P3 fill:#4a90d9,color:#fff
    style C fill:#50b86c,color:#fff

Implementation Details

Data Structure

XDEF_STRUCT(xMpsc) {
    xMpsc *volatile next;  // Pointer to next node
};

The queue is represented by two external pointers:

  • head — Points to the oldest node (consumer reads from here)
  • tail — Points to the newest node (producers append here)

Push Algorithm

void xMpscPush(xMpsc **head, xMpsc **tail, xMpsc *node) {
    node->next = NULL;
    xMpsc *prev_tail = xAtomicXchg(tail, node, xAtomicAcqRel);
    if (prev_tail)
        prev_tail->next = node;  // Link to previous tail
    else
        xAtomicStore(head, node, xAtomicRelease);  // First node
}

The key insight: xAtomicXchg atomically replaces the tail and returns the old value. If the old tail was non-NULL, we link it to the new node. If it was NULL (empty queue), we also update the head.

Pop Algorithm

The pop operation handles three cases:

  1. Empty queue — head is NULL, return NULL.
  2. Multiple nodes — Advance head to head->next, return old head.
  3. Single node — CAS tail to NULL. If CAS succeeds, also CAS head to NULL. If CAS fails (concurrent push in progress), spin until head->next becomes non-NULL.

flowchart TD
    START["xMpscPop()"]
    CHECK_HEAD{"head == NULL?"}
    EMPTY["Return NULL"]
    CHECK_NEXT{"head->next == NULL?"}
    MULTI["Advance head<br/>Return old head"]
    CAS_TAIL{"CAS tail → NULL?"}
    CAS_HEAD["CAS head → NULL<br/>Return old head"]
    SPIN["Spin until head->next != NULL"]
    ADVANCE["Advance head<br/>Return old head"]

    START --> CHECK_HEAD
    CHECK_HEAD -->|Yes| EMPTY
    CHECK_HEAD -->|No| CHECK_NEXT
    CHECK_NEXT -->|No| MULTI
    CHECK_NEXT -->|Yes| CAS_TAIL
    CAS_TAIL -->|Success| CAS_HEAD
    CAS_TAIL -->|Fail: concurrent push| SPIN
    SPIN --> ADVANCE

    style EMPTY fill:#e74c3c,color:#fff
    style MULTI fill:#50b86c,color:#fff
    style CAS_HEAD fill:#50b86c,color:#fff
    style ADVANCE fill:#50b86c,color:#fff

Memory Ordering Analysis

| Operation | Ordering | Reason |
|---|---|---|
| xAtomicXchg(tail, node) | AcqRel | Acquire: see the previous tail's next field. Release: make the node visible to the consumer. |
| xAtomicStore(head, node) | Release | Make the new head visible to the consumer. |
| xAtomicLoad(head) | Acquire | See the node written by the producer. |
| xAtomicLoad(&head->next) | Acquire | See the next pointer written by the producer. |
| xAtomicCasStrong(tail, ...) | Release | Publish the NULL tail to concurrent pushers. |

Thread Safety

  • xMpscPush() — Thread-safe (multiple producers).
  • xMpscPop() — Single-consumer only. Must not be called concurrently.
  • xMpscEmpty() — Thread-safe (atomic load).

API Reference

Types

| Type | Description |
|---|---|
| xMpsc | Intrusive queue node. Embed it in your struct and use xContainerOf() to recover the enclosing struct. |

Functions

| Function | Signature | Description | Thread Safety |
|---|---|---|---|
| xMpscPush | void xMpscPush(xMpsc **head, xMpsc **tail, xMpsc *node) | Push a node. Wait-free for producers. | Thread-safe (multi-producer) |
| xMpscPop | xMpsc *xMpscPop(xMpsc **head, xMpsc **tail) | Pop the oldest node. Returns NULL if empty. | Single-consumer only |
| xMpscEmpty | bool xMpscEmpty(xMpsc **head) | Check whether the queue is empty. | Thread-safe |

Usage Examples

Basic Producer-Consumer

#include <stdio.h>
#include <pthread.h>
#include <xbase/mpsc.h>
#include <xbase/base.h>

typedef struct {
    xMpsc node;   // Must embed xMpsc
    int   value;
} Message;

static xMpsc *g_head = NULL;
static xMpsc *g_tail = NULL;

static void *producer(void *arg) {
    Message *msg = (Message *)arg;
    xMpscPush(&g_head, &g_tail, &msg->node);
    return NULL;
}

int main(void) {
    Message msgs[] = {
        { .value = 1 },
        { .value = 2 },
        { .value = 3 },
    };

    // Push from multiple threads
    pthread_t threads[3];
    for (int i = 0; i < 3; i++)
        pthread_create(&threads[i], NULL, producer, &msgs[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(threads[i], NULL);

    // Pop from single consumer
    xMpsc *node;
    while ((node = xMpscPop(&g_head, &g_tail)) != NULL) {
        Message *msg = xContainerOf(node, Message, node);
        printf("Received: %d\n", msg->value);
    }

    return 0;
}

Use Cases

  1. Timer Poll Mode — timer.h uses the MPSC queue in poll mode to pass expired timer entries from the timer thread to the polling thread without locks.

  2. Event Loop Offload — The event loop's offload mechanism (event.h) uses an MPSC queue to deliver completed work items from worker threads to the event loop thread.

  3. xlog Async Logger — logger.h uses the MPSC queue to pass log messages from application threads to the logger's flush thread.

Best Practices

  • Embed xMpsc in your struct. Don't allocate xMpsc nodes separately. Use xContainerOf() to recover the enclosing struct after popping.
  • Initialize head and tail to NULL. An empty queue has both pointers set to NULL.
  • Only one thread may call xMpscPop(). The single-consumer constraint is fundamental to the algorithm's correctness. Violating it causes data races.
  • Don't access a node after pushing it. Once pushed, the node is owned by the queue until popped.

Comparison with Other Libraries

| Feature | xbase mpsc.h | Dmitry Vyukov MPSC | concurrentqueue (C++) | Linux llist |
|---|---|---|---|---|
| Design | Intrusive, lock-free | Intrusive, lock-free | Non-intrusive, lock-free | Intrusive, lock-free |
| Push | Wait-free (1 atomic xchg) | Wait-free (1 atomic xchg) | Lock-free (CAS loop) | Wait-free (1 atomic xchg) |
| Pop | Lock-free (single consumer) | Lock-free (single consumer) | Lock-free (multi-consumer) | Batch pop (splice) |
| Memory Ordering | AcqRel / Acquire / Release | SeqCst | Relaxed + fences | Varies |
| Allocation | None (intrusive) | None (intrusive) | Per-element (internal) | None (intrusive) |
| Multi-Consumer | No | No | Yes | No (batch only) |
| Language | C99 | C/C++ | C++11 | C (kernel) |

Key Differentiator: xbase's MPSC queue is minimal and intrusive — zero allocation overhead, wait-free push, and carefully chosen memory orderings. It's designed specifically for the single-consumer patterns found in event loops and timer systems.

Benchmark

Environment: Apple M3 Pro, 36 GB RAM, macOS 26.4, Release build (-O2). Source: xbase/mpsc_bench.cpp

| Benchmark | Time (ns) | CPU (ns) | Iterations | Throughput |
|---|---|---|---|---|
| BM_Mpsc_SingleProducer | 3,712 | 3,712 | 187,897 | 275.9 M items/s |
| BM_Mpsc_MultiProducer/2 | 609,432 | 87,797 | 8,075 | 227.8 M items/s |
| BM_Mpsc_MultiProducer/4 | 1,327,965 | 148,356 | 4,768 | 269.6 M items/s |
| BM_Mpsc_MultiProducer/8 | 4,466,805 | 292,260 | 1,000 | 273.7 M items/s |

Key Observations:

  • Single-producer push/pop achieves ~276M items/s, demonstrating the minimal overhead of the lock-free algorithm.
  • Multi-producer scaling maintains ~270M items/s aggregate throughput even with 8 concurrent producers, showing excellent scalability. The wall-clock time increases due to thread synchronization overhead, but per-CPU throughput remains stable.
  • The gap between wall-clock time and CPU time in multi-producer benchmarks reflects the cost of thread creation and barrier synchronization, not the queue operations themselves.