HTTP Server Benchmark

End-to-end HTTP/1.1 server benchmark comparing xKit (single-threaded event-loop) against Go net/http (goroutine-per-connection).

Test Environment

  Item            Value
  --------------  -------------------------------
  CPU             Apple M3 Pro (12 cores)
  Memory          36 GB
  OS              macOS 26.4 (Darwin)
  Compiler        Apple Clang 17.0.0
  Build           Release (-O2)
  Load generator  wrk — 4 threads, 10 s duration

Server Implementations

xKit (bench/http_bench_server.cpp)

Single-threaded event-loop HTTP/1.1 server built on xbase/event.h + xhttp/server.h. Uses kqueue on macOS, epoll on Linux. All I/O is handled in one thread — no thread pool, no goroutines.

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DXK_BUILD_BENCHMARKS=ON
cmake --build build --parallel
./build/bench/http_bench_server 8080

Go (bench/http_bench_server.go)

Standard net/http server with default settings. Go's runtime spawns one goroutine per connection and uses its own epoll/kqueue poller internally.

go build -o build/bench/go_http_bench bench/http_bench_server.go
./build/bench/go_http_bench 8081

Routes

Both servers implement identical routes:

  Route          Method  Description
  -------------  ------  --------------------------------------------------------
  /ping          GET     Returns "pong" (4 bytes) — minimal response-latency test
  /echo?size=N   GET     Returns N bytes of 'x' — variable response-size test
  /echo          POST    Echoes the request body — request-body throughput test

Benchmark Methodology

All benchmarks use wrk with the following defaults unless noted:

  • 4 threads (-t4)
  • 100 connections (-c100)
  • 10 seconds (-d10s)

POST benchmarks use a wrk Lua script to set the request body:

wrk.method = "POST"
wrk.headers["Content-Type"] = "application/octet-stream"
wrk.body = string.rep("x", BODY_SIZE)

Results

GET /ping — Minimal Response Latency

Tests raw request/response overhead with a 4-byte "pong" response. Varies connection count to measure scalability.

  Connections   xKit Req/s   Go Req/s   xKit Latency   Go Latency   Δ
  -----------   ----------   --------   ------------   ----------   ---------
  50            151,935      128,639    315 μs         365 μs       xKit +18%
  100           152,316      128,915    658 μs         761 μs       xKit +18%
  200           151,007      128,162    1.33 ms        1.55 ms      xKit +18%
  500           155,486      125,471    3.20 ms        3.96 ms      xKit +24%

Analysis:

  • xKit maintains ~152K req/s regardless of connection count, showing excellent scalability of the single-threaded event loop.
  • Go's throughput degrades slightly at 500 connections, likely due to goroutine scheduling overhead.
  • xKit's advantage grows from +18% to +24% as connection count increases — the event loop's O(1) dispatch scales better than goroutine context switching.

GET /echo — Variable Response Size

Tests response serialization throughput with different payload sizes. Fixed at 100 connections.

  Response Size   xKit Req/s   Go Req/s   xKit Latency   Go Latency   Δ
  -------------   ----------   --------   ------------   ----------   ---------
  64 B            150,592      127,432    666 μs         771 μs       xKit +18%
  256 B           146,487      126,907    682 μs         774 μs       xKit +15%
  1 KiB           144,831      125,729    689 μs         785 μs       xKit +15%
  4 KiB           141,511      91,886     707 μs         1.08 ms      xKit +54%

Analysis:

  • xKit throughput degrades gracefully from 151K to 142K req/s as the response size grows from 64 B to 4 KiB — only a 6% drop.
  • Go drops sharply at 4 KiB (92K req/s, -27% from 64 B), likely due to bytes.Repeat allocation pressure and GC overhead.
  • xKit's largest advantage (+54%) appears at 4 KiB, where Go's per-request heap allocation becomes the bottleneck.

POST /echo — Request Body Throughput

Tests request body parsing and echo throughput. Fixed at 100 connections.

  Body Size   xKit Req/s   Go Req/s   xKit Transfer/s   Go Transfer/s   Δ
  ---------   ----------   --------   ---------------   -------------   ---------
  1 KiB       141,495      122,584    152.35 MB/s       133.51 MB/s     xKit +15%
  4 KiB       133,935      83,512     536.60 MB/s       337.13 MB/s     xKit +60%
  16 KiB      82,231       53,828     1.26 GB/s         848.10 MB/s     xKit +53%
  64 KiB      35,908       31,124     2.20 GB/s         1.90 GB/s       xKit +15%

Analysis:

  • xKit achieves a 2.20 GB/s transfer rate at the 64 KiB body size — impressive for a single-threaded server.
  • The largest advantage (+60%) appears at 4 KiB, consistent with the GET /echo pattern — Go's allocation overhead dominates at medium payload sizes.
  • At 64 KiB the gap narrows to +15% as both servers become I/O bound (kernel socket buffer management dominates).

Summary

                    xKit vs Go net/http (Release build)
                    ====================================

  GET /ping:     xKit +18% ~ +24%   (consistent across all concurrency levels)
  GET /echo:     xKit +15% ~ +54%   (advantage grows with response size)
  POST /echo:    xKit +15% ~ +60%   (advantage peaks at medium body sizes)

  Peak throughput:  xKit 155K req/s (GET /ping, 500 connections)
  Peak transfer:    xKit 2.20 GB/s  (POST /echo, 64KB body)

Key Takeaways:

  1. xKit wins every scenario. A single-threaded C++ event loop outperforms Go's goroutine-per-connection runtime across all request types and payload sizes.
  2. Scalability. xKit's throughput is nearly flat from 50 to 500 connections. Go degrades under high connection counts due to goroutine scheduling overhead.
  3. Payload efficiency. xKit's advantage is most pronounced at medium payloads (1–4 KiB) where Go's per-request heap allocation and GC pressure become significant.
  4. Architecture matters. xKit's single-threaded design eliminates per-request synchronization overhead entirely; Go pays for goroutine scheduling and garbage collection on every request, plus a goroutine spawn per connection.

Reproducing

# Build xKit server
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DXK_BUILD_BENCHMARKS=ON
cmake --build build --parallel

# Build Go server
go build -o build/bench/go_http_bench bench/http_bench_server.go

# Run xKit benchmark
./build/bench/http_bench_server 8080 &
wrk -t4 -c100 -d10s http://127.0.0.1:8080/ping
wrk -t4 -c100 -d10s "http://127.0.0.1:8080/echo?size=64"
wrk -t4 -c100 -d10s "http://127.0.0.1:8080/echo?size=4096"

# POST with lua script
cat > /tmp/post.lua << 'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/octet-stream"
wrk.body = string.rep("x", 4096)
EOF
wrk -t4 -c100 -d10s -s /tmp/post.lua http://127.0.0.1:8080/echo

# Run Go benchmark (same wrk commands, different port)
./build/bench/go_http_bench 8081 &
wrk -t4 -c100 -d10s http://127.0.0.1:8081/ping