Event Loop — Benchmark Report

Micro-benchmark comparison of moo's xEventLoop against libuv 1.52.1 across three dimensions: cross-thread wake latency, timer scheduling, and offload round-trip (submit work → done callback on loop thread).

Test Environment

ItemValue
CPUApple M3 Pro (12 cores)
Memory36 GB
OSmacOS 26.4 (Darwin)
CompilerApple Clang 17.0.0
BuildRelease (-O2)
FrameworkGoogle Benchmark
Event Backendkqueue (moo), kqueue (libuv)
Workers4 threads (for offload benchmarks)

Results

Core Operations (moo only)

BenchmarkTime (ns)CPU (ns)Iterations
BM_EventLoop_CreateDestroy700700974,157
BM_EventLoop_WakeLatency4134131,717,088
BM_EventLoop_PipeAddDel1,1441,144612,118
  • Create/Destroy takes ~700ns — reduced from ~2.8µs after eliminating the wake pipe (no more pipe() + two extra fds). Reflects only kqueue fd creation + internal structure allocation.
  • Wake latency is ~413ns per wake+wait cycle via EVFILT_USER, down from ~879ns with the old pipe mechanism — a 2.1× improvement.
  • Add/Del cycle (register + unregister a pipe fd) takes ~1.1µs — low overhead for dynamic fd management.

Wake Latency — moo vs libuv

moolibuvRatio
Time413 ns417 nsmoo 1.01× faster

moo now uses EVFILT_USER on kqueue (macOS) and eventfd on epoll (Linux) for wake notification, replacing the previous pipe-based mechanism. Combined with an atomic wake_pending flag for coalescing, this eliminates all pipe overhead. The result is effectively tied with libuv (413ns vs 417ns), closing the previous 2.1× gap entirely.

Timer Scheduling

moo — Timer

BenchmarkTime (ns)CPU (ns)Throughput
BM_EventLoop_TimerSingle4614612.17M items/s
BM_EventLoop_TimerBatch/1075075013.34M items/s
BM_EventLoop_TimerBatch/1003,7143,71426.93M items/s
BM_EventLoop_TimerBatch/100043,55043,54522.96M items/s

libuv — Timer

BenchmarkTime (ns)CPU (ns)Throughput
BM_Libuv_TimerSingle12,3611,517659.2k items/s
BM_Libuv_TimerBatch/1012,6131,7875.60M items/s
BM_Libuv_TimerBatch/10016,4125,31118.83M items/s
BM_Libuv_TimerBatch/100079,72168,65914.56M items/s

Comparison — Timer (CPU time)

Batch Sizemoo (CPU ns)libuv (CPU ns)Ratio
14611,517moo 3.29× faster
107501,787moo 2.38× faster
1003,7145,311moo 1.43× faster
1,00043,54568,659moo 1.58× faster

Analysis:

  • Single timer — moo wins at ~461ns vs libuv's ~1.5µs (3.3× faster). moo's timer path is simpler: heap push + xEventWait pops and fires in one call. libuv's uv_timer_start + uv_run(UV_RUN_ONCE) has more overhead per invocation.
  • Batch timers — moo now wins across all batch sizes, a dramatic reversal from the previous results where libuv was 4–5× faster. The key optimizations that closed the gap:
    1. Batch pop with single lock: Timer dispatch now acquires timer_mu once, pops all expired timers into a local list, releases the lock, then fires them — eliminating N lock/unlock cycles.
    2. Timer struct freelist: Timer structs are recycled via a lock-free freelist, eliminating malloc/free per timer operation.
    3. Throughput: At batch size 1000, moo achieves 22.96M items/s vs libuv's 14.56M items/s — 1.58× faster.

Offload Round-Trip (Submit → Done Callback)

moo — Offload

BenchmarkTime (ns)CPU (ns)Throughput
BM_EventLoop_OffloadSingle6,4013,785264.2k items/s
BM_EventLoop_OffloadBatch/1014,98912,243816.8k items/s
BM_EventLoop_OffloadBatch/10056,56346,5342.15M items/s
BM_EventLoop_OffloadBatch/1000496,393456,4262.19M items/s

libuv — Offload

BenchmarkTime (ns)CPU (ns)Throughput
BM_Libuv_OffloadSingle5,8433,449290.0k items/s
BM_Libuv_OffloadBatch/1013,90910,239976.7k items/s
BM_Libuv_OffloadBatch/10035,83830,0613.33M items/s
BM_Libuv_OffloadBatch/1000242,694218,5134.58M items/s

Comparison — Offload (CPU time)

Batch Sizemoo (CPU ns)libuv (CPU ns)Ratio
13,7853,449libuv 1.10× faster
1012,24310,239libuv 1.20× faster
10046,53430,061libuv 1.55× faster
1,000456,426218,513libuv 2.09× faster

Analysis:

  • Single offload — Nearly tied (~1.10× gap, narrowed from 1.16×). Both are dominated by the same bottleneck: waking a sleeping worker thread via kernel syscall.
  • Batch offload — libuv remains ~2× faster at scale. The gap has narrowed slightly at smaller batch sizes (1.20× at 10, down from 1.45×) thanks to wake coalescing and work item pooling. The remaining gap is primarily due to:
    1. Completion notification: libuv workers post to an async handle and the loop drains all completions in one uv__work_done() call. moo uses an MPSC queue with atomic wake coalescing.
    2. Allocation model: libuv's uv_work_t is caller-allocated (stack or embedded). moo uses a lock-free freelist pool, which is faster than malloc but still has CAS overhead.

Summary

DimensionBefore OptimizationAfter Optimizationvs libuv
Wake Latency879 ns (libuv 2.1× faster)413 nsTied (moo 1.01× faster)
Timer (single)974 ns (moo 1.6× faster)461 nsmoo 3.3× faster
Timer (batch ×1000)318,805 ns (libuv 4.3× faster)43,545 nsmoo 1.6× faster
Offload (single)4,110 ns (libuv 1.2× faster)3,785 nslibuv 1.1× faster (tied)
Offload (batch ×1000)507,346 ns (libuv 1.95× faster)456,426 nslibuv 2.1× faster

Key Improvements

OptimizationImpact
EVFILT_USER / eventfd wakeWake latency 2.1× faster (879→413ns), closed gap with libuv
Timer batch-pop (single lock)Timer batch/1000 7.3× faster (318µs→43µs), now beats libuv
Timer struct freelistEliminated per-timer malloc, contributes to batch improvement
Work item freelist (Treiber stack)Reduced offload overhead, narrowed gap at small batch sizes
Wake coalescing (atomic flag)Reduced redundant wake syscalls from N to 1 in batch scenarios

Completed Optimizations

  1. Timer dispatch without per-pop locking: ✅ Done — Acquire timer_mu once, pop all expired timers into a local list, release the lock, then fire them. Eliminates N lock/unlock cycles for N expired timers.

  2. Timer struct pooling: ✅ Done — Timer structs are recycled via a lock-free freelist (event_timer_alloc() / event_timer_free()), eliminating malloc/free per timer.

  3. Wake coalescing for offload: ✅ Done — An atomic wake_pending flag ensures only the first completing worker performs the actual wake syscall. Subsequent workers see the flag already set and skip the syscall entirely.

  4. Caller-allocated work items: ✅ Done — Work items are pooled via a lock-free Treiber stack (event_work_alloc() / event_work_free()), eliminating per-submit malloc. Equivalent to libuv's zero-alloc model.

  5. Lighter wake mechanism: ✅ Done — kqueue backend uses EVFILT_USER (zero fd, no pipe) for wake; epoll backend uses eventfd (single fd) instead of a pipe pair. Poll backend retains the pipe as a POSIX fallback.