HTTP/2 Server Benchmark

End-to-end HTTP/2 (h2c, cleartext) server benchmark comparing xKit (single-threaded event-loop) against Go net/http + x/net/http2/h2c (goroutine-per-connection).

Test Environment

Item             Value
CPU              Apple M3 Pro (12 cores)
Memory           36 GB
OS               macOS 26.4 (Darwin)
Compiler         Apple Clang 17.0.0
Build            Release (-O2)
Load Generator   h2load (nghttp2 1.68.1) — 4 threads, 10s duration, 10 max concurrent streams per connection

Server Implementations

xKit (bench/http_bench_server.cpp)

Single-threaded event-loop HTTP/2 server built on xbase/event.h + xhttp/server.h. Supports h2c (cleartext HTTP/2) via Prior Knowledge — the same binary as the HTTP/1.1 benchmark, since xKit auto-detects the protocol on the first bytes of each connection.

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DXK_BUILD_BENCHMARKS=ON
cmake --build build --parallel
./build/bench/http_bench_server 8080

Go (bench/h2c_bench_server.go)

Standard net/http server wrapped with golang.org/x/net/http2/h2c.NewHandler() to support cleartext HTTP/2 via Prior Knowledge. Go's runtime spawns one goroutine per connection and uses its own epoll/kqueue poller internally.

cd bench && go build -o ../build/bench/go_h2c_bench h2c_bench_server.go
./build/bench/go_h2c_bench 8081

Routes

Both servers implement identical routes:

Route          Method   Description
/ping          GET      Returns "pong" (4 bytes) — minimal response latency test
/echo?size=N   GET      Returns N bytes of 'x' — variable response size test
/echo          POST     Echoes request body — request body throughput test

Benchmark Methodology

All benchmarks use h2load with the following defaults unless noted:

  • 4 threads (-t4)
  • 100 connections (-c100)
  • 10 max concurrent streams per connection (-m10)
  • 10 seconds (-D 10)

POST benchmarks use -d <file> to specify the request body. With -c100 and -m10, up to 1,000 streams can be in flight at once.

Why h2load? Unlike wrk (HTTP/1.1 only), h2load is purpose-built for HTTP/2 benchmarking. It supports stream multiplexing (-m), h2c Prior Knowledge, and reports per-stream latency.

Results

GET /ping — Minimal Response Latency

Tests raw request/response overhead with a 4-byte "pong" response. Varies connection count to measure scalability under HTTP/2 multiplexing.

Connections   xKit Req/s   Go Req/s   xKit Latency   Go Latency   Δ
50            576,249      141,655    863 μs         3.51 ms      xKit +307%
100           561,825      120,732    1.78 ms        8.27 ms      xKit +365%
200           555,800      110,143    3.59 ms        18.10 ms     xKit +405%
500           538,905      136,719    9.22 ms        36.21 ms     xKit +294%

Analysis:

  • xKit sustains ~560K req/s across all connection counts — a massive improvement over its HTTP/1.1 numbers (~152K) thanks to HTTP/2 stream multiplexing on fewer TCP connections.
  • Go's h2c throughput (~110–142K) is comparable to its HTTP/1.1 numbers, suggesting Go's HTTP/2 implementation doesn't benefit as much from multiplexing.
  • xKit's advantage ranges from +294% to +405% — far larger than the +18–24% gap seen in HTTP/1.1. The single-threaded event loop excels at handling multiplexed streams without context-switching overhead.
  • At 200 connections, xKit's advantage peaks at +405%. Go's throughput degrades more steeply under high connection counts due to goroutine scheduling and HTTP/2 flow control overhead.

GET /echo — Variable Response Size

Tests response serialization throughput with different payload sizes under HTTP/2 framing. Fixed at 100 connections.

Response Size   xKit Req/s   Go Req/s   xKit Latency   Go Latency   Δ
64 B            518,176      123,386    1.92 ms        8.08 ms      xKit +320%
256 B           511,276      116,267    1.95 ms        8.60 ms      xKit +340%
1 KiB           493,405      115,267    2.03 ms        8.64 ms      xKit +328%
4 KiB           383,507      107,457    2.59 ms        9.23 ms      xKit +257%

Analysis:

  • xKit throughput degrades gracefully from 518K to 384K req/s as the response size grows from 64 B to 4 KiB — a 26% drop, mostly due to HTTP/2 DATA frame serialization overhead.
  • Go stays relatively flat (~107–123K) but at a much lower baseline. The bytes.Repeat allocation + GC pressure is compounded by HTTP/2 framing overhead.
  • xKit's advantage is consistently +257% to +340% — HTTP/2's HPACK header compression and binary framing amplify xKit's architectural advantage over Go.

POST /echo — Request Body Throughput

Tests request body parsing and echo throughput under HTTP/2. Fixed at 100 connections.

Body Size   xKit Req/s   Go Req/s   xKit Transfer/s   Go Transfer/s   Δ
1 KiB       401,047      119,739    399.45 MB/s       119.82 MB/s     xKit +235%
4 KiB       195,221      90,585     766.61 MB/s       356.84 MB/s     xKit +115%
16 KiB      57,304       41,313     896.83 MB/s       648.24 MB/s     xKit +39%
64 KiB      19,040       16,557     1.16 GB/s         1.01 GB/s       xKit +15%

Analysis:

  • xKit reaches a 1.16 GB/s transfer rate at the 64 KiB body size, roughly half its HTTP/1.1 figure (2.20 GB/s); the gap is attributable to HTTP/2 flow control and framing overhead.
  • The advantage narrows from +235% (1KB) to +15% (64KB) as both servers become I/O bound. HTTP/2 flow control (default 64KB window) becomes the bottleneck at large payloads.
  • At small payloads (1KB), xKit's +235% advantage shows the efficiency of its nghttp2-based H2 implementation vs Go's x/net/http2.

HTTP/2 vs HTTP/1.1 Comparison

How does HTTP/2 compare to HTTP/1.1 for each server? (GET /ping, 100 connections)

Server   HTTP/1.1 Req/s   HTTP/2 Req/s   Δ
xKit     152,316          561,825        +269%
Go       128,915          120,732        −6%

Key Insight: xKit's single-threaded event loop benefits enormously from HTTP/2 multiplexing — handling multiple streams on fewer connections eliminates per-connection overhead. Go's goroutine-per-connection model doesn't gain from multiplexing because it already handles concurrency at the goroutine level; the added HTTP/2 framing overhead actually causes a slight regression.

Summary

                    xKit vs Go h2c (Release build, h2load -m10)
                    =============================================

  GET /ping:     xKit +294% ~ +405%   (massive advantage across all concurrency)
  GET /echo:     xKit +257% ~ +340%   (consistent across all response sizes)
  POST /echo:    xKit +15%  ~ +235%   (advantage narrows as payloads grow)

  Peak throughput:  xKit 576K req/s  (GET /ping, 50 connections)
  Peak transfer:    xKit 1.16 GB/s   (POST /echo, 64KB body)

Key Takeaways:

  1. HTTP/2 amplifies xKit's advantage. The gap widens from +18–24% (HTTP/1.1) to +294–405% (HTTP/2) on GET /ping. Stream multiplexing plays to the strengths of a single-threaded event loop.
  2. xKit scales with multiplexing. xKit's throughput jumps from 152K (HTTP/1.1) to 576K (HTTP/2) req/s — a 3.8× improvement. Go's throughput stays flat or slightly regresses.
  3. Payload efficiency. At small-to-medium payloads, xKit's nghttp2-based H2 implementation is dramatically faster. At large payloads (64KB), both servers converge as I/O and flow control dominate.
  4. Architecture matters even more for H2. HTTP/2's stream multiplexing, HPACK compression, and flow control add complexity that a lean C event loop handles more efficiently than Go's runtime.

Reproducing

# Build xKit server
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DXK_BUILD_BENCHMARKS=ON
cmake --build build --parallel

# Build Go h2c server
cd bench && go build -o ../build/bench/go_h2c_bench h2c_bench_server.go && cd ..

# Install h2load (macOS)
brew install nghttp2

# Start servers
./build/bench/http_bench_server 8080 &
./build/bench/go_h2c_bench 8081 &

# GET /ping benchmark
h2load -t4 -c100 -m10 -D 10 http://127.0.0.1:8080/ping
h2load -t4 -c100 -m10 -D 10 http://127.0.0.1:8081/ping

# GET /echo benchmark
h2load -t4 -c100 -m10 -D 10 "http://127.0.0.1:8080/echo?size=1024"
h2load -t4 -c100 -m10 -D 10 "http://127.0.0.1:8081/echo?size=1024"

# POST /echo benchmark (create body file first)
dd if=/dev/zero bs=4096 count=1 | tr '\0' 'x' > /tmp/body_4k.bin
h2load -t4 -c100 -m10 -D 10 -d /tmp/body_4k.bin http://127.0.0.1:8080/echo
h2load -t4 -c100 -m10 -D 10 -d /tmp/body_4k.bin http://127.0.0.1:8081/echo

# Cleanup
pkill -f http_bench_server
pkill -f go_h2c_bench