WSS Async TLS Offload — Performance Regression Report
This document records the benchmark results after introducing async TLS offload (BIO pair + thread pool) to the OpenSSL backend, compared against the previous synchronous TLS baseline from ws_server.md.
Changes Under Test
The following changes were applied to the OpenSSL TLS transport:
- **Async TLS offload**: TLS encryption/decryption is offloaded from the event loop thread to a worker thread pool via `xEventLoopSubmit`. The event loop thread handles socket I/O and BIO data transfer, while worker threads perform `SSL_read`/`SSL_write`.
- **BIO pair transport**: Replaced direct `SSL_read(fd)`/`SSL_write(fd)` with a BIO pair architecture (sketched below): `read(fd)` → `BIO_write(bio_net)` → worker `SSL_read` → `BIO_read(bio_int)` → callback.
- **xRingBuffer replaces xMemBIO_**: In `transport_mbedtls.c`, the custom `xMemBIO_` ring buffer was replaced with the shared `xRingBuffer` from `xbuf/`.
- **xRingBufferWrite semantic change**: `xRingBufferWrite` changed from all-or-nothing (returning `xErrno`) to partial-write (returning `size_t`), merging the old `xRingBufferWritePartial`.
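The inbound half of this flow, reduced to a minimal sketch (OpenSSL 1.1+; only the OpenSSL calls are real, everything else is illustrative, and error handling plus the outbound `BIO_read` → `write(fd)` half are omitted):

```c
/* Minimal sketch of the inbound BIO-pair flow described above.
 * Only the OpenSSL calls are real; names, sizes, and the job
 * hand-off are illustrative, and error handling is omitted. */
#include <openssl/bio.h>
#include <openssl/ssl.h>
#include <unistd.h>

/* bio_int is attached to the SSL object; bio_net stays with the
 * event loop, which shuttles ciphertext between it and the socket. */
static int setup_bio_pair(SSL *ssl, BIO **bio_int, BIO **bio_net)
{
    if (BIO_new_bio_pair(bio_int, 0, bio_net, 0) != 1)  /* 0 = default size */
        return -1;
    SSL_set_bio(ssl, *bio_int, *bio_int);
    return 0;
}

/* Event loop thread: socket readable -> push ciphertext into the pair,
 * then submit ssl_read_job() to the worker pool (submission omitted). */
static void on_readable(int fd, BIO *bio_net)
{
    char buf[16 * 1024];
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0)
        BIO_write(bio_net, buf, (int)n);
}

/* Worker thread: run the SSL state machine; plaintext (or a
 * WANT_READ/WANT_WRITE condition) comes back via the done callback. */
static int ssl_read_job(SSL *ssl, char *plain, int cap)
{
    return SSL_read(ssl, plain, cap);
}
```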
Test Environment
| Item | Value |
|---|---|
| CPU | Apple M3 Pro (12 cores) |
| Memory | 36 GB |
| OS | macOS 26.4 (Darwin) |
| Compiler | Apple Clang 17.0.0 |
| Build | Release (-O2) |
| TLS Backend | OpenSSL (system) |
| Certificate | RSA 2048-bit, self-signed, TLSv1.3 |
| Load Generator | ws_bench_client.go (gorilla/websocket) |
Results
WSS Echo 64B — Varying Connection Count
| Connections | Sync TLS (baseline) | Async TLS Offload | Δ Throughput | Δ Latency |
|---|---|---|---|---|
| 50 | 186,513 msg/s, 268 μs | 56,737 msg/s, 881 μs | −70% | +229% |
| 100 | 186,068 msg/s, 537 μs | 56,692 msg/s, 1.76 ms | −70% | +228% |
| 200 | 184,066 msg/s, 1.09 ms | 57,223 msg/s, 3.49 ms | −69% | +220% |
| 500 | 167,019 msg/s, 2.99 ms | 55,144 msg/s, 9.06 ms | −67% | +203% |
WSS Echo — Varying Message Size (100 connections)
| Message Size | Sync TLS (baseline) | Async TLS Offload | Δ Throughput |
|---|---|---|---|
| 64 B | 165,952 msg/s | 56,692 msg/s | −66% |
| 256 B | 174,475 msg/s | 54,170 msg/s | −69% |
| 1 KiB | 149,246 msg/s | 54,589 msg/s | −63% |
| 4 KiB | 92,686 msg/s | 51,142 msg/s | −45% |
Transfer Rate (100 connections)
| Message Size | Sync TLS | Async TLS Offload | Δ |
|---|---|---|---|
| 64 B | 20.26 MB/s | 6.92 MB/s | −66% |
| 256 B | 85.19 MB/s | 26.45 MB/s | −69% |
| 1 KiB | 291.50 MB/s | 106.62 MB/s | −63% |
| 4 KiB | 723.95 MB/s | 399.55 MB/s | −45% |
Latency (100 connections, varying message size)
| Message Size | Sync TLS | Async TLS Offload | Δ |
|---|---|---|---|
| 64 B | 537 μs | 1.76 ms | +228% |
| 256 B | — | 1.85 ms | — |
| 1 KiB | — | 1.83 ms | — |
| 4 KiB | — | 1.95 ms | — |
Analysis
Performance is severely degraded
Across all test cases, the async TLS offload shows a large throughput reduction (65–70% for 64 B–1 KiB messages, 45% at 4 KiB) and a 2–3× latency increase compared to the synchronous TLS baseline. The degradation is consistent across connection counts and message sizes.
Root causes
- **Thread pool scheduling overhead dominates small-message TLS cost.** For 64-byte messages, AES-GCM encryption/decryption takes on the order of nanoseconds, but each `xEventLoopSubmit` → worker thread → done-callback round trip costs tens of microseconds due to context switching, mutex contention, and cache invalidation. The scheduling overhead is orders of magnitude larger than the crypto work itself.
- **Extra data copies through the BIO pair.** The synchronous path does `SSL_read(fd)` directly: one syscall, zero copies between buffers. The async path requires `read(fd)` → memcpy into `xRingBuffer` (inbound) → worker-thread `SSL_read` from the BIO → `BIO_write` output → memcpy into `xRingBuffer` (outbound) → `write(fd)`. This adds at least two extra memcpy operations per message direction.
- **Serialization bottleneck not eliminated.** The async offload was intended to free the event loop thread from TLS work, but the event loop still must (a) `read(fd)` ciphertext, (b) feed it into the inbound ring buffer, (c) drain the outbound ring buffer, and (d) `write(fd)` ciphertext. The worker thread only runs the SSL state machine. For a single-threaded event loop, this splits one thread's work into two threads' serial work (event loop → worker → event loop), adding synchronization overhead without gaining parallelism.
- **Throughput ceiling around 57K msg/s.** The async path's throughput is remarkably stable across connection counts (55K–57K msg/s), suggesting the bottleneck is the per-message offload overhead rather than I/O or crypto. This is consistent with a fixed per-message cost of ~17 μs (1/57,000 s), which matches typical thread pool dispatch latency (see the sketch after this list).
- **4 KiB messages show the smallest regression (−45%).** As message size grows, the crypto cost increases relative to the fixed scheduling overhead, making the offload less wasteful. This confirms that the overhead is per-message, not per-byte.
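To make the fixed per-message cost concrete, the round trip has roughly this shape; `xEventLoopSubmit` and the function-pointer types are assumptions inferred from the description above, not the project's actual signatures:

```c
/* Shape of the per-message offload round trip. xEventLoopSubmit and
 * the callback types are ASSUMED for illustration; the real project
 * API may differ. */
typedef void (*xJobFn)(void *arg);    /* runs on a worker thread     */
typedef void (*xDoneFn)(void *arg);   /* runs back on the event loop */

int xEventLoopSubmit(xJobFn job, xDoneFn done, void *arg);

/* Per inbound 64 B message:
 *   read(fd) + BIO_write   event loop: syscall + copy into the pair
 *   xEventLoopSubmit       enqueue + worker wakeup (context switch)
 *   SSL_read               worker: nanoseconds of AES-GCM
 *   done callback          event loop wakeup, drain + write(fd)
 *
 * The measured ~57,000 msg/s plateau implies a fixed cost of about
 * 1/57,000 s ~= 17.5 us per message, dominated by the two
 * cross-thread wakeups rather than by the crypto itself. */
```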
Comparison with Go goroutine-parallel TLS
For reference, gorilla/websocket achieves ~180K msg/s on WSS with virtually zero TLS overhead compared to plain WS. Go's crypto/tls runs per-goroutine, parallelizing encryption across all CPU cores without the BIO-pair indirection. This is the model that async TLS offload was trying to approximate, but the single event loop + thread pool architecture cannot match it.
Conclusion
The async TLS offload architecture is a net negative for the WSS echo workload. The per-message thread dispatch overhead far exceeds the TLS crypto cost for small-to-medium messages (64 B–4 KiB).
Recommendations
- **Revert to synchronous TLS for the default path.** The synchronous `SSL_read(fd)`/`SSL_write(fd)` model is 3× faster for this workload. The event loop thread can handle TLS inline without issue.
- **Consider async offload only for large payloads.** If async TLS is desired, gate it behind a message-size threshold (e.g., >16 KiB) where the crypto cost justifies the dispatch overhead; a minimal gating sketch follows this list.
- **Explore multi-threaded event loops instead.** Rather than offloading TLS from a single event loop, run multiple event loop threads (one per core), each handling its own connections with synchronous TLS. This is how Go achieves parallelism: not by offloading crypto, but by running independent I/O loops in parallel.
- **If async TLS is kept, optimize the dispatch path.** Reduce per-message overhead by batching multiple SSL operations per dispatch, using lock-free queues, or coalescing small messages before offloading.
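A minimal sketch of the size-gated dispatch from the second recommendation; the threshold value, `struct conn`, and both helpers are hypothetical placeholders, not existing project code:

```c
/* Sketch of gating TLS offload behind a message-size threshold.
 * The threshold, struct conn, and both helpers are HYPOTHETICAL;
 * only the gating logic itself is the point. */
#include <stddef.h>

/* Crypto cost must outweigh the ~17 us dispatch cost; the exact
 * value should be tuned by benchmark, 16 KiB is an assumed start. */
#define TLS_OFFLOAD_THRESHOLD (16 * 1024)

struct conn;  /* per-connection state; opaque in this sketch */

static void submit_async_ssl_write(struct conn *c, const void *buf, size_t len);
static void ssl_write_inline(struct conn *c, const void *buf, size_t len);

static void tls_send(struct conn *c, const void *buf, size_t len)
{
    if (len > TLS_OFFLOAD_THRESHOLD) {
        /* Large record: dispatch SSL_write to the worker pool; the
         * round-trip overhead is small relative to the crypto work. */
        submit_async_ssl_write(c, buf, len);
    } else {
        /* Small record: encrypt inline on the event loop thread,
         * the synchronous path measured ~3x faster above. */
        ssl_write_inline(c, buf, len);
    }
}
```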