Event Loop — Benchmark Report

Micro-benchmark comparison of moo's xEventLoop against libuv 1.52.1 across three dimensions: cross-thread wake latency, timer scheduling, and offload round-trip (submit work → done callback on loop thread).

Test Environment

Item	Value
CPU	Apple M3 Pro (12 cores)
Memory	36 GB
OS	macOS 26.4 (Darwin)
Compiler	Apple Clang 17.0.0
Build	Release (`-O2`)
Framework	Google Benchmark
Event Backend	kqueue (moo), kqueue (libuv)
Workers	4 threads (for offload benchmarks)

Results

Core Operations (moo only)

Benchmark	Time (ns)	CPU (ns)	Iterations
`BM_EventLoop_CreateDestroy`	700	700	974,157
`BM_EventLoop_WakeLatency`	413	413	1,717,088
`BM_EventLoop_PipeAddDel`	1,144	1,144	612,118

Create/Destroy takes ~700ns — reduced from ~2.8µs after eliminating the wake pipe (no more pipe() + two extra fds). Reflects only kqueue fd creation + internal structure allocation.
Wake latency is ~413ns per wake+wait cycle via EVFILT_USER, down from ~879ns with the old pipe mechanism — a 2.1× improvement.
Add/Del cycle (register + unregister a pipe fd) takes ~1.1µs — low overhead for dynamic fd management.

Wake Latency — moo vs libuv

	moo	libuv	Ratio
Time	413 ns	417 ns	moo 1.01× faster

moo now uses EVFILT_USER on kqueue (macOS) and eventfd on epoll (Linux) for wake notification, replacing the previous pipe-based mechanism. Combined with an atomic wake_pending flag for coalescing, this eliminates all pipe overhead. The result is effectively tied with libuv (413ns vs 417ns), closing the previous 2.1× gap entirely.

Timer Scheduling

moo — Timer

Benchmark	Time (ns)	CPU (ns)	Throughput
`BM_EventLoop_TimerSingle`	461	461	2.17M items/s
`BM_EventLoop_TimerBatch/10`	750	750	13.34M items/s
`BM_EventLoop_TimerBatch/100`	3,714	3,714	26.93M items/s
`BM_EventLoop_TimerBatch/1000`	43,550	43,545	22.96M items/s

libuv — Timer

Benchmark	Time (ns)	CPU (ns)	Throughput
`BM_Libuv_TimerSingle`	12,361	1,517	659.2k items/s
`BM_Libuv_TimerBatch/10`	12,613	1,787	5.60M items/s
`BM_Libuv_TimerBatch/100`	16,412	5,311	18.83M items/s
`BM_Libuv_TimerBatch/1000`	79,721	68,659	14.56M items/s

Comparison — Timer (CPU time)

Batch Size	moo (CPU ns)	libuv (CPU ns)	Ratio
1	461	1,517	moo 3.29× faster
10	750	1,787	moo 2.38× faster
100	3,714	5,311	moo 1.43× faster
1,000	43,545	68,659	moo 1.58× faster

Analysis:

Single timer — moo wins at ~461ns vs libuv's ~1.5µs (3.3× faster). moo's timer path is simpler: heap push + xEventWait pops and fires in one call. libuv's uv_timer_start + uv_run(UV_RUN_ONCE) has more overhead per invocation.
Batch timers — moo now wins across all batch sizes, a dramatic reversal from the previous results where libuv was 4–5× faster. The key optimizations that closed the gap:
1. Batch pop with single lock: Timer dispatch now acquires timer_mu once, pops all expired timers into a local list, releases the lock, then fires them — eliminating N lock/unlock cycles.
2. Timer struct freelist: Timer structs are recycled via a lock-free freelist, eliminating malloc/free per timer operation.
3. Throughput: At batch size 1000, moo achieves 22.96M items/s vs libuv's 14.56M items/s — 1.58× faster.

Offload Round-Trip (Submit → Done Callback)

moo — Offload

Benchmark	Time (ns)	CPU (ns)	Throughput
`BM_EventLoop_OffloadSingle`	6,401	3,785	264.2k items/s
`BM_EventLoop_OffloadBatch/10`	14,989	12,243	816.8k items/s
`BM_EventLoop_OffloadBatch/100`	56,563	46,534	2.15M items/s
`BM_EventLoop_OffloadBatch/1000`	496,393	456,426	2.19M items/s

libuv — Offload

Benchmark	Time (ns)	CPU (ns)	Throughput
`BM_Libuv_OffloadSingle`	5,843	3,449	290.0k items/s
`BM_Libuv_OffloadBatch/10`	13,909	10,239	976.7k items/s
`BM_Libuv_OffloadBatch/100`	35,838	30,061	3.33M items/s
`BM_Libuv_OffloadBatch/1000`	242,694	218,513	4.58M items/s

Comparison — Offload (CPU time)

Batch Size	moo (CPU ns)	libuv (CPU ns)	Ratio
1	3,785	3,449	libuv 1.10× faster
10	12,243	10,239	libuv 1.20× faster
100	46,534	30,061	libuv 1.55× faster
1,000	456,426	218,513	libuv 2.09× faster

Analysis:

Single offload — Nearly tied (~1.10× gap, narrowed from 1.16×). Both are dominated by the same bottleneck: waking a sleeping worker thread via kernel syscall.
Batch offload — libuv remains ~2× faster at scale. The gap has narrowed slightly at smaller batch sizes (1.20× at 10, down from 1.45×) thanks to wake coalescing and work item pooling. The remaining gap is primarily due to:
1. Completion notification: libuv workers post to an async handle and the loop drains all completions in one uv__work_done() call. moo uses an MPSC queue with atomic wake coalescing.
2. Allocation model: libuv's uv_work_t is caller-allocated (stack or embedded). moo uses a lock-free freelist pool, which is faster than malloc but still has CAS overhead.

Summary

Dimension	Before Optimization	After Optimization	vs libuv
Wake Latency	879 ns (libuv 2.1× faster)	413 ns	Tied (moo 1.01× faster)
Timer (single)	974 ns (moo 1.6× faster)	461 ns	moo 3.3× faster
Timer (batch ×1000)	318,805 ns (libuv 4.3× faster)	43,545 ns	moo 1.6× faster
Offload (single)	4,110 ns (libuv 1.2× faster)	3,785 ns	libuv 1.1× faster (tied)
Offload (batch ×1000)	507,346 ns (libuv 1.95× faster)	456,426 ns	libuv 2.1× faster

Key Improvements

Optimization	Impact
`EVFILT_USER` / `eventfd` wake	Wake latency 2.1× faster (879→413ns), closed gap with libuv
Timer batch-pop (single lock)	Timer batch/1000 7.3× faster (318µs→43µs), now beats libuv
Timer struct freelist	Eliminated per-timer malloc, contributes to batch improvement
Work item freelist (Treiber stack)	Reduced offload overhead, narrowed gap at small batch sizes
Wake coalescing (atomic flag)	Reduced redundant wake syscalls from N to 1 in batch scenarios

Completed Optimizations

~~Timer dispatch without per-pop locking~~: ✅ Done — Acquire timer_mu once, pop all expired timers into a local list, release the lock, then fire them. Eliminates N lock/unlock cycles for N expired timers.
~~Timer struct pooling~~: ✅ Done — Timer structs are recycled via a lock-free freelist (event_timer_alloc() / event_timer_free()), eliminating malloc/free per timer.
~~Wake coalescing for offload~~: ✅ Done — An atomic wake_pending flag ensures only the first completing worker performs the actual wake syscall. Subsequent workers see the flag already set and skip the syscall entirely.
~~Caller-allocated work items~~: ✅ Done — Work items are pooled via a lock-free Treiber stack (event_work_alloc() / event_work_free()), eliminating per-submit malloc. Equivalent to libuv's zero-alloc model.
~~Lighter wake mechanism~~: ✅ Done — kqueue backend uses EVFILT_USER (zero fd, no pipe) for wake; epoll backend uses eventfd (single fd) instead of a pipe pair. Poll backend retains the pipe as a POSIX fallback.

moo Documentation