# HTTP Server Benchmark
End-to-end HTTP/1.1 server benchmark comparing xKit (single-threaded event-loop) against Go net/http (goroutine-per-connection).
## Test Environment
| Item | Value |
|---|---|
| CPU | Apple M3 Pro (12 cores) |
| Memory | 36 GB |
| OS | macOS 26.4 (Darwin) |
| Compiler | Apple Clang 17.0.0 |
| Build | Release (-O2) |
| Load Generator | wrk — 4 threads, 10s duration |
## Server Implementations

### xKit (`bench/http_bench_server.cpp`)

Single-threaded event-loop HTTP/1.1 server built on `xbase/event.h` + `xhttp/server.h`. Uses kqueue on macOS, epoll on Linux. All I/O is handled in one thread: no thread pool, no goroutines.
```sh
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DXK_BUILD_BENCHMARKS=ON
cmake --build build --parallel
./build/bench/http_bench_server 8080
```
### Go (`bench/http_bench_server.go`)

Standard `net/http` server with default settings. Go's runtime spawns one goroutine per connection and uses its own epoll/kqueue poller internally.

```sh
go build -o build/bench/go_http_bench bench/http_bench_server.go
./build/bench/go_http_bench 8081
```
## Routes

Both servers implement identical routes:

| Route | Method | Description |
|---|---|---|
| `/ping` | GET | Returns `"pong"` (4 bytes); minimal response latency test |
| `/echo?size=N` | GET | Returns N bytes of `'x'`; variable response size test |
| `/echo` | POST | Echoes request body; request body throughput test |
## Benchmark Methodology

All benchmarks use wrk with the following defaults unless noted:

- 4 threads (`-t4`)
- 100 connections (`-c100`)
- 10 seconds (`-d10s`)

POST benchmarks use a Lua script to set the request body:

```lua
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/octet-stream"
wrk.body = string.rep("x", BODY_SIZE)
```
## Results

### GET /ping — Minimal Response Latency
Tests raw request/response overhead with a 4-byte "pong" response. Varies connection count to measure scalability.
| Connections | xKit Req/s | Go Req/s | xKit Latency | Go Latency | Δ |
|---|---|---|---|---|---|
| 50 | 151,935 | 128,639 | 315 μs | 365 μs | xKit +18% |
| 100 | 152,316 | 128,915 | 658 μs | 761 μs | xKit +18% |
| 200 | 151,007 | 128,162 | 1.33 ms | 1.55 ms | xKit +18% |
| 500 | 155,486 | 125,471 | 3.20 ms | 3.96 ms | xKit +24% |
Analysis:
- xKit maintains ~152K req/s regardless of connection count, showing excellent scalability of the single-threaded event loop.
- Go's throughput slightly degrades at 500 connections due to goroutine scheduling overhead.
- xKit's advantage grows from +18% to +24% as connection count increases — the event loop's O(1) dispatch scales better than goroutine context switching.
### GET /echo — Variable Response Size
Tests response serialization throughput with different payload sizes. Fixed at 100 connections.
| Response Size | xKit Req/s | Go Req/s | xKit Latency | Go Latency | Δ |
|---|---|---|---|---|---|
| 64 B | 150,592 | 127,432 | 666 μs | 771 μs | xKit +18% |
| 256 B | 146,487 | 126,907 | 682 μs | 774 μs | xKit +15% |
| 1 KiB | 144,831 | 125,729 | 689 μs | 785 μs | xKit +15% |
| 4 KiB | 141,511 | 91,886 | 707 μs | 1.08 ms | xKit +54% |
Analysis:
- xKit throughput degrades gracefully from ~151K to ~142K req/s as response size grows from 64 B to 4 KiB, only a 6% drop.
- Go drops sharply at 4 KiB (92K req/s, -27% from 64 B), likely due to `bytes.Repeat` allocation pressure and GC overhead.
- xKit's largest advantage (+54%) appears at 4 KiB, where Go's per-request heap allocation becomes the bottleneck.
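The allocation-pressure hypothesis is easy to spot-check with the standard library's `testing.AllocsPerRun`. The micro-measurement below is illustrative and not code from either benchmark server; it assumes the Go handler builds its response with `bytes.Repeat` on every request, and compares that against slicing a buffer built once at startup:

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// responseAllocs measures average heap allocations per simulated request
// for two response strategies: build-per-request vs. a precomputed buffer.
func responseAllocs() (perRequest, precomputed float64) {
	// Strategy 1: construct a fresh 4 KiB response on every request.
	// bytes.Repeat heap-allocates the result slice each call.
	perRequest = testing.AllocsPerRun(1000, func() {
		_ = bytes.Repeat([]byte("x"), 4096)
	})

	// Strategy 2: build the buffer once and reslice it per request.
	// Slicing never allocates, so this measures zero allocs per call.
	static := bytes.Repeat([]byte("x"), 4096)
	precomputed = testing.AllocsPerRun(1000, func() {
		_ = static[:4096]
	})
	return perRequest, precomputed
}

func main() {
	per, pre := responseAllocs()
	fmt.Printf("build-per-request: %.0f allocs/req, precomputed: %.0f allocs/req\n", per, pre)
}
```

Each per-request allocation is garbage the GC must later collect, which is one plausible source of the throughput drop at larger response sizes.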
### POST /echo — Request Body Throughput
Tests request body parsing and echo throughput. Fixed at 100 connections.
| Body Size | xKit Req/s | Go Req/s | xKit Transfer/s | Go Transfer/s | Δ |
|---|---|---|---|---|---|
| 1 KiB | 141,495 | 122,584 | 152.35 MB/s | 133.51 MB/s | xKit +15% |
| 4 KiB | 133,935 | 83,512 | 536.60 MB/s | 337.13 MB/s | xKit +60% |
| 16 KiB | 82,231 | 53,828 | 1.26 GB/s | 848.10 MB/s | xKit +53% |
| 64 KiB | 35,908 | 31,124 | 2.20 GB/s | 1.90 GB/s | xKit +15% |
Analysis:
- xKit achieves a 2.20 GB/s transfer rate at a 64 KiB body size, impressive for a single-threaded server.
- The largest advantage (+60%) appears at 4 KiB, consistent with the GET /echo pattern: Go's allocation overhead dominates at medium payload sizes.
- At 64 KiB, the gap narrows to +15% as both servers become I/O bound (kernel socket buffer management dominates).
## Summary

```text
xKit vs Go net/http (Release build)
====================================
GET /ping:   xKit +18% ~ +24%   (consistent across all concurrency levels)
GET /echo:   xKit +15% ~ +54%   (advantage grows with response size)
POST /echo:  xKit +15% ~ +60%   (advantage peaks at medium body sizes)

Peak throughput: xKit 155K req/s (GET /ping, 500 connections)
Peak transfer:   xKit 2.20 GB/s  (POST /echo, 64 KiB body)
```
Key Takeaways:
- xKit wins every scenario. A single-threaded C++ event loop outperforms Go's goroutine-per-connection runtime across all request types and payload sizes.
- Scalability. xKit's throughput is nearly flat from 50 to 500 connections. Go degrades under high connection counts due to goroutine scheduling overhead.
- Payload efficiency. xKit's advantage is most pronounced at medium payloads (1–4 KiB) where Go's per-request heap allocation and GC pressure become significant.
- Architecture matters. xKit's single-threaded design eliminates all synchronization overhead. Go pays for goroutine creation, scheduling, and garbage collection on every request.
## Reproducing

```sh
# Build xKit server
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DXK_BUILD_BENCHMARKS=ON
cmake --build build --parallel

# Build Go server
go build -o build/bench/go_http_bench bench/http_bench_server.go

# Run xKit benchmark
./build/bench/http_bench_server 8080 &
wrk -t4 -c100 -d10s http://127.0.0.1:8080/ping
wrk -t4 -c100 -d10s "http://127.0.0.1:8080/echo?size=64"
wrk -t4 -c100 -d10s "http://127.0.0.1:8080/echo?size=4096"

# POST with Lua script
cat > /tmp/post.lua << 'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/octet-stream"
wrk.body = string.rep("x", 4096)
EOF
wrk -t4 -c100 -d10s -s /tmp/post.lua http://127.0.0.1:8080/echo

# Run Go benchmark (same wrk commands, different port)
./build/bench/go_http_bench 8081 &
wrk -t4 -c100 -d10s http://127.0.0.1:8081/ping
```