Micro-benchmark comparison of moo's xEventLoop against libuv 1.52.1 across three dimensions: cross-thread wake latency, timer scheduling, and offload round-trip (submit work → done callback on loop thread).
Create/Destroy takes ~700ns — reduced from ~2.8µs after eliminating the wake pipe (no more pipe() + two extra fds). Reflects only kqueue fd creation + internal structure allocation.
Wake latency is ~413ns per wake+wait cycle via EVFILT_USER, down from ~879ns with the old pipe mechanism — a 2.1× improvement.
Add/Del cycle (register + unregister a pipe fd) takes ~1.1µs — low overhead for dynamic fd management.
moo now uses EVFILT_USER on kqueue (macOS) and eventfd on epoll (Linux) for wake notification, replacing the previous pipe-based mechanism. Combined with an atomic wake_pending flag for coalescing, this eliminates all pipe overhead. The result is effectively tied with libuv (413ns vs 417ns), closing the previous 2.1× gap entirely.
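A minimal sketch of this wake path, assuming C11 atomics and a kqueue created elsewhere. Only wake_pending, EVFILT_USER, and the coalescing behavior come from the notes above; the loop_t layout and function names are illustrative:

```c
#include <sys/event.h>
#include <stdatomic.h>
#include <stdbool.h>

#define WAKE_IDENT 1  /* arbitrary ident for the user event */

typedef struct {
    int kq;                    /* assumed created with kqueue() */
    atomic_bool wake_pending;  /* coalesces concurrent wakes */
} loop_t;

/* Register the EVFILT_USER event once at loop creation.
 * EV_CLEAR makes the event auto-reset after delivery. */
static int loop_init_wake(loop_t *l) {
    struct kevent kev;
    EV_SET(&kev, WAKE_IDENT, EVFILT_USER, EV_ADD | EV_CLEAR, 0, 0, NULL);
    return kevent(l->kq, &kev, 1, NULL, 0, NULL);
}

/* Cross-thread wake: only the first caller after a wait pays the
 * kevent() syscall; later callers see wake_pending already set. */
static void loop_wake(loop_t *l) {
    if (atomic_exchange_explicit(&l->wake_pending, true,
                                 memory_order_acq_rel))
        return;  /* a wake is already in flight */
    struct kevent kev;
    EV_SET(&kev, WAKE_IDENT, EVFILT_USER, 0, NOTE_TRIGGER, 0, NULL);
    kevent(l->kq, &kev, 1, NULL, 0, NULL);
}

/* On the loop thread, after kevent() returns the user event,
 * clear the flag so the next wake triggers a fresh syscall. */
static void loop_ack_wake(loop_t *l) {
    atomic_store_explicit(&l->wake_pending, false, memory_order_release);
}
```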
Single timer — moo wins at ~461ns vs libuv's ~1.5µs (3.3× faster). moo's timer path is simpler: heap push + xEventWait pops and fires in one call. libuv's uv_timer_start + uv_run(UV_RUN_ONCE) has more overhead per invocation.
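For reference, the libuv side of that measurement reduces to the following pattern (real libuv API; the iteration loop and timing harness are omitted):

```c
#include <uv.h>

static void on_timer(uv_timer_t *handle) {
    /* timer fired; nothing else to do for the benchmark */
}

int main(void) {
    uv_loop_t *loop = uv_default_loop();
    uv_timer_t timer;
    uv_timer_init(loop, &timer);

    /* One iteration of the measured cycle: arm a 0ms one-shot
     * timer, then spin the loop once so it expires and fires. */
    uv_timer_start(&timer, on_timer, 0, 0);
    uv_run(loop, UV_RUN_ONCE);

    uv_close((uv_handle_t *)&timer, NULL);
    uv_run(loop, UV_RUN_ONCE);  /* let the close callback run */
    return 0;
}
```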
Batch timers — moo now wins across all batch sizes, a dramatic reversal from the previous results where libuv was 4–5× faster. The key optimizations that closed the gap:
Batch pop with single lock: Timer dispatch now acquires timer_mu once, pops all expired timers into a local list, releases the lock, then fires them — eliminating N lock/unlock cycles (sketched after this list).
Timer struct freelist: Timer structs are recycled via a lock-free freelist, eliminating malloc/free per timer operation.
Throughput: At batch size 1000, moo achieves 22.96M items/s vs libuv's 14.56M items/s — 1.58× faster.
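A sketch of the single-lock batch pop described above. timer_mu is the name these notes use; the heap helpers and struct layout are illustrative stand-ins, not moo's internals:

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

typedef struct xtimer {
    uint64_t deadline_ns;
    void (*cb)(struct xtimer *);
    struct xtimer *next;          /* links the local expired list */
} xtimer;

/* Hypothetical min-heap helpers, assumed to exist elsewhere. */
extern xtimer *heap_min(void);
extern xtimer *heap_pop_min(void);

static pthread_mutex_t timer_mu = PTHREAD_MUTEX_INITIALIZER;

static void dispatch_expired(uint64_t now_ns) {
    xtimer *expired = NULL, **tail = &expired;
    xtimer *t;

    /* One lock acquisition for the whole batch. */
    pthread_mutex_lock(&timer_mu);
    while ((t = heap_min()) != NULL && t->deadline_ns <= now_ns) {
        heap_pop_min();
        t->next = NULL;
        *tail = t;                /* append, preserving deadline order */
        tail = &t->next;
    }
    pthread_mutex_unlock(&timer_mu);

    /* Fire callbacks outside the lock: user code can re-arm timers
     * without contending on timer_mu, and N timers cost one lock. */
    while (expired) {
        xtimer *next = expired->next;
        expired->cb(expired);
        expired = next;
    }
}
```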
Single offload — Nearly tied (~1.10× gap, narrowed from 1.16×). Both are dominated by the same bottleneck: waking a sleeping worker thread via a kernel syscall.
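To see why the syscall dominates, consider a generic condition-variable submit path (neither library's actual code): the enqueue is a handful of instructions, but signaling a sleeping worker enters the kernel (futex on Linux, psynch on macOS).

```c
#include <pthread.h>

typedef struct work {
    struct work *next;
    void (*fn)(struct work *);
} work;

static pthread_mutex_t q_mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cv = PTHREAD_COND_INITIALIZER;
static work *q_head;

/* Submit side: the list push is nanoseconds; the signal is the
 * cost. If a worker is blocked in pthread_cond_wait, this enters
 * the kernel to wake it; if all workers are busy, it is cheap. */
static void submit(work *w) {
    pthread_mutex_lock(&q_mu);
    w->next = q_head;
    q_head = w;
    pthread_mutex_unlock(&q_mu);
    pthread_cond_signal(&q_cv);
}
```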
Batch offload — libuv remains ~2× faster at scale. The gap has narrowed slightly at smaller batch sizes (1.20× at 10, down from 1.45×) thanks to wake coalescing and work item pooling. The remaining gap is primarily due to:
Completion notification: libuv workers post to an async handle and the loop drains all completions in one uv__work_done() call. moo uses an MPSC queue with atomic wake coalescing.
Allocation model: libuv's uv_work_t is caller-allocated (stack or embedded). moo uses a lock-free freelist pool, which is faster than malloc but still has CAS overhead (libuv's model is sketched after this list).
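The libuv allocation model from the list above, for comparison (real libuv API; the callbacks are placeholders): the uv_work_t lives wherever the caller embeds it, so submission allocates nothing.

```c
#include <uv.h>

typedef struct {
    uv_work_t req;   /* embedded: no allocation at submit time */
    int input, output;
} job_t;

static void do_work(uv_work_t *req) {
    job_t *job = (job_t *)req;     /* req is the first member */
    job->output = job->input * 2;  /* runs on a pool thread */
}

static void on_done(uv_work_t *req, int status) {
    /* runs back on the loop thread; libuv drains the completion
     * queue for all finished jobs in one pass before calling us */
}

int main(void) {
    uv_loop_t *loop = uv_default_loop();
    job_t job = { .input = 21 };   /* stack-allocated work item */
    uv_queue_work(loop, &job.req, do_work, on_done);
    return uv_run(loop, UV_RUN_DEFAULT);
}
```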
Timer dispatch without per-pop locking: ✅ Done — Acquire timer_mu once, pop all expired timers into a local list, release the lock, then fire them. Eliminates N lock/unlock cycles for N expired timers.
Timer struct pooling: ✅ Done — Timer structs are recycled via a lock-free freelist (event_timer_alloc() / event_timer_free()), eliminating malloc/free per timer.
Wake coalescing for offload: ✅ Done — An atomic wake_pending flag ensures only the first completing worker performs the actual wake syscall. Subsequent workers see the flag already set and skip the syscall entirely.
Caller-allocated work items: ✅ Done — Work items are pooled via a lock-free Treiber stack (event_work_alloc() / event_work_free()), eliminating per-submit malloc (pool sketch below). Equivalent to libuv's zero-alloc model.
Lighter wake mechanism: ✅ Done — kqueue backend uses EVFILT_USER (zero fd, no pipe) for wake; epoll backend uses eventfd (single fd) instead of a pipe pair. Poll backend retains the pipe as a POSIX fallback.
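A minimal Treiber-stack pool in the spirit of event_work_alloc() / event_work_free() and the timer freelist (illustrative; moo's actual definitions may differ). The single CAS per push/pop is the residual overhead noted in the offload section; a production version also needs an ABA guard, which this sketch only flags.

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct wnode {
    struct wnode *next;
    /* ... work item payload ... */
} wnode;

static _Atomic(wnode *) pool_head;  /* top of the freelist */

/* Pop one node with a single CAS; fall back to malloc when empty.
 * NB: a bare Treiber stack is ABA-prone under concurrent pops; a
 * production version adds a generation tag or equivalent guard.
 * Dereferencing n->next is safe only because pooled nodes are
 * never returned to the allocator. */
static wnode *work_alloc(void) {
    wnode *n = atomic_load_explicit(&pool_head, memory_order_acquire);
    while (n && !atomic_compare_exchange_weak_explicit(
                    &pool_head, &n, n->next,
                    memory_order_acq_rel, memory_order_acquire))
        ;  /* n is reloaded on failure */
    return n ? n : malloc(sizeof(wnode));
}

/* Push is a single CAS loop; nothing is ever handed to free(). */
static void work_free(wnode *n) {
    wnode *head = atomic_load_explicit(&pool_head, memory_order_relaxed);
    do {
        n->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &pool_head, &head, n,
                 memory_order_release, memory_order_relaxed));
}
```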
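And the epoll-side wake described in the last item, using the real eventfd API (error handling omitted; the wake_pending coalescing from the kqueue sketch earlier applies unchanged):

```c
#include <sys/eventfd.h>
#include <stdint.h>
#include <unistd.h>

/* One fd instead of a pipe pair; nonblocking so a saturated
 * counter never stalls the waker. */
int wake_fd_create(void) {
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Wake: add 1 to the counter; epoll reports the fd readable. */
void wake_signal(int efd) {
    uint64_t one = 1;
    (void)!write(efd, &one, sizeof one);
}

/* Drain on the loop thread: one read resets the counter, so any
 * number of coalesced wakes costs a single read(). */
void wake_drain(int efd) {
    uint64_t n;
    (void)!read(efd, &n, sizeof n);
}
```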