HTTP/2 Server Benchmark

End-to-end HTTP/2 (h2c, cleartext) server benchmark comparing xKit (single-threaded event-loop) against Go net/http + x/net/http2/h2c (goroutine-per-connection).

Test Environment

Item             Value
CPU              Apple M3 Pro (12 cores)
Memory           36 GB
OS               macOS 26.4 (Darwin)
Compiler         Apple Clang 17.0.0
Build            Release (-O2)
Load Generator   h2load (nghttp2 1.68.1) — 4 threads, 10s duration, 10 max concurrent streams per connection

Server Implementations

xKit (bench/http_bench_server.cpp)

Single-threaded event-loop HTTP/2 server built on xbase/event.h + xhttp/server.h. Supports h2c (cleartext HTTP/2) via Prior Knowledge — the same binary as the HTTP/1.1 benchmark, since xKit auto-detects the protocol on the first bytes of each connection.

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DXK_BUILD_BENCHMARKS=ON
cmake --build build --parallel
./build/bench/http_bench_server 8080

Go (bench/h2c_bench_server.go)

Standard net/http server wrapped with golang.org/x/net/http2/h2c.NewHandler() to support cleartext HTTP/2 via Prior Knowledge. Go's runtime spawns one goroutine per connection and uses its own epoll/kqueue poller internally.

cd bench && go build -o ../build/bench/go_h2c_bench h2c_bench_server.go
./build/bench/go_h2c_bench 8081

Routes

Both servers implement identical routes:

Route          Method   Description
/ping          GET      Returns "pong" (4 bytes) — minimal response latency test
/echo?size=N   GET      Returns N bytes of 'x' — variable response size test
/echo          POST     Echoes request body — request body throughput test

Benchmark Methodology

All benchmarks use h2load with the following defaults unless noted:

  • 4 threads (-t4)
  • 100 connections (-c100)
  • 10 max concurrent streams per connection (-m10)
  • 10 seconds (-D 10)

POST benchmarks use -d <file> to specify the request body. With -c100 and -m10, up to 1,000 streams can be in flight at once.

Why h2load? Unlike wrk (HTTP/1.1 only), h2load is purpose-built for HTTP/2 benchmarking. It supports stream multiplexing (-m), h2c Prior Knowledge, and reports per-stream latency.

Results

GET /ping — Minimal Response Latency

Tests raw request/response overhead with a 4-byte "pong" response. Varies connection count to measure scalability under HTTP/2 multiplexing.

Connections   xKit Req/s   Go Req/s   xKit Latency   Go Latency   Δ
50            576,249      141,655    863 μs         3.51 ms      xKit +307%
100           561,825      120,732    1.78 ms        8.27 ms      xKit +365%
200           555,800      110,143    3.59 ms        18.10 ms     xKit +405%
500           538,905      136,719    9.22 ms        36.21 ms     xKit +294%

Analysis:

  • xKit sustains ~560K req/s across all connection counts — a massive improvement over its HTTP/1.1 numbers (~152K) thanks to HTTP/2 stream multiplexing on fewer TCP connections.
  • Go's h2c throughput (~110–142K) is comparable to its HTTP/1.1 numbers, suggesting Go's HTTP/2 implementation doesn't benefit as much from multiplexing.
  • xKit's advantage ranges from +294% to +405% — far larger than the +18–24% gap seen in HTTP/1.1. The single-threaded event loop excels at handling multiplexed streams without context-switching overhead.
  • At 200 connections, xKit's advantage peaks at +405%. Go's throughput degrades more steeply under high connection counts due to goroutine scheduling and HTTP/2 flow control overhead.

GET /echo — Variable Response Size

Tests response serialization throughput with different payload sizes under HTTP/2 framing. Fixed at 100 connections.

Response Size   xKit Req/s   Go Req/s   xKit Latency   Go Latency   Δ
64 B            518,176      123,386    1.92 ms        8.08 ms      xKit +320%
256 B           511,276      116,267    1.95 ms        8.60 ms      xKit +340%
1 KiB           493,405      115,267    2.03 ms        8.64 ms      xKit +328%
4 KiB           383,507      107,457    2.59 ms        9.23 ms      xKit +257%

Analysis:

  • xKit throughput degrades gracefully from 518K to 384K req/s as the response size grows from 64 B to 4 KiB — a 26% drop, mostly due to HTTP/2 DATA frame serialization overhead.
  • Go stays relatively flat (~107–123K) but at a much lower baseline. The bytes.Repeat allocation + GC pressure is compounded by HTTP/2 framing overhead.
  • xKit's advantage is consistently +257% to +340% — HTTP/2's HPACK header compression and binary framing amplify xKit's architectural advantage over Go.

POST /echo — Request Body Throughput

Tests request body parsing and echo throughput under HTTP/2. Fixed at 100 connections.

Body Size   xKit Req/s   Go Req/s   xKit Transfer/s   Go Transfer/s   Δ
1 KiB       401,047      119,739    399.45 MB/s       119.82 MB/s     xKit +235%
4 KiB       195,221      90,585     766.61 MB/s       356.84 MB/s     xKit +115%
16 KiB      57,304       41,313     896.83 MB/s       648.24 MB/s     xKit +39%
64 KiB      19,040       16,557     1.16 GB/s         1.01 GB/s       xKit +15%

Analysis:

  • xKit reaches a 1.16 GB/s transfer rate at the 64 KiB body size, roughly half its HTTP/1.1 figure (2.20 GB/s); the gap is attributable to HTTP/2 flow control and framing overhead.
  • The advantage narrows from +235% (1KB) to +15% (64KB) as both servers become I/O bound. HTTP/2 flow control (default 64KB window) becomes the bottleneck at large payloads.
  • At small payloads (1KB), xKit's +235% advantage shows the efficiency of its nghttp2-based H2 implementation vs Go's x/net/http2.

HTTP/2 vs HTTP/1.1 Comparison

How does HTTP/2 compare to HTTP/1.1 for each server? (GET /ping, 100 connections)

Server   HTTP/1.1 Req/s   HTTP/2 Req/s   Δ
xKit     152,316          561,825        +269%
Go       128,915          120,732        −6%

Key Insight: xKit's single-threaded event loop benefits enormously from HTTP/2 multiplexing — handling multiple streams on fewer connections eliminates per-connection overhead. Go's goroutine-per-connection model doesn't gain from multiplexing because it already handles concurrency at the goroutine level; the added HTTP/2 framing overhead actually causes a slight regression.

Summary

                    xKit vs Go h2c (Release build, h2load -m10)
                    =============================================

  GET /ping:     xKit +294% ~ +405%   (massive advantage across all concurrency)
  GET /echo:     xKit +257% ~ +340%   (consistent across all response sizes)
  POST /echo:    xKit +15%  ~ +235%   (advantage narrows as payloads grow)

  Peak throughput:  xKit 576K req/s  (GET /ping, 50 connections)
  Peak transfer:    xKit 1.16 GB/s   (POST /echo, 64KB body)

Key Takeaways:

  1. HTTP/2 amplifies xKit's advantage. The gap widens from +18–24% (HTTP/1.1) to +294–405% (HTTP/2) on GET /ping. Stream multiplexing plays to the strengths of a single-threaded event loop.
  2. xKit scales with multiplexing. xKit's throughput jumps from 152K (HTTP/1.1) to 576K (HTTP/2) req/s — a 3.8× improvement. Go's throughput stays flat or slightly regresses.
  3. Payload efficiency. At small-to-medium payloads, xKit's nghttp2-based H2 implementation is dramatically faster. At large payloads (64KB), both servers converge as I/O and flow control dominate.
  4. Architecture matters even more for H2. HTTP/2's stream multiplexing, HPACK compression, and flow control add complexity that a lean C event loop handles more efficiently than Go's runtime.

Reproducing

# Build xKit server
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DXK_BUILD_BENCHMARKS=ON
cmake --build build --parallel

# Build Go h2c server
cd bench && go build -o ../build/bench/go_h2c_bench h2c_bench_server.go && cd ..

# Install h2load (macOS)
brew install nghttp2

# Start servers
./build/bench/http_bench_server 8080 &
./build/bench/go_h2c_bench 8081 &

# GET /ping benchmark
h2load -t4 -c100 -m10 -D 10 http://127.0.0.1:8080/ping
h2load -t4 -c100 -m10 -D 10 http://127.0.0.1:8081/ping

# GET /echo benchmark
h2load -t4 -c100 -m10 -D 10 "http://127.0.0.1:8080/echo?size=1024"
h2load -t4 -c100 -m10 -D 10 "http://127.0.0.1:8081/echo?size=1024"

# POST /echo benchmark (create body file first)
dd if=/dev/zero bs=4096 count=1 | tr '\0' 'x' > /tmp/body_4k.bin
h2load -t4 -c100 -m10 -D 10 -d /tmp/body_4k.bin http://127.0.0.1:8080/echo
h2load -t4 -c100 -m10 -D 10 -d /tmp/body_4k.bin http://127.0.0.1:8081/echo

# Cleanup
pkill -f http_bench_server
pkill -f go_h2c_bench