
Performance

Where the runtime currently sits, how the numbers are measured, and the workloads used to measure them.

Note: the numbers on this page are placeholders from development benchmarks on the configuration described below. Treat them as directional. The bench/ tree in the repository contains the harnesses, so you can reproduce them on your own hardware.

// headline numbers

ECHO ROUND-TRIP — RDMA

2.8µs

p50, 64-byte payload, RoCE v2, single core in poll mode.

ECHO ROUND-TRIP — KERNEL TCP

14µs

p50, 64-byte payload, kernel sockets baseline, single core.

SUSTAINED THROUGHPUT

390Gbps

Single-thread RDMA SEND, 1 MiB messages, zero-copy from pool.

SMALL-MSG RATE

14.2M/s

Send + completion notifications, single core in poll mode, 256-byte messages.

BLOCK I/O — IO_URING

2.6M IOPS

4 KiB random read, one loop, NVMe via io_uring, queue depth 64.

EVENT→POLL TRANSITION

< 5µs

Time from first work arriving to loop reaching poll-mode steady state.
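The headline figures can be cross-checked against each other with unit arithmetic. The sketch below is not part of the bench/ tree, and the helper names are made up for illustration. It converts between bandwidth and message rate, and applies Little's law to the block I/O figure. The Little's law step assumes the queue stays full at depth 64 for the whole run, which the page does not state.

```python
# Back-of-the-envelope cross-checks for the headline numbers.
# Inputs are copied from the page; the helper names are illustrative only.

MIB = 1024 * 1024


def messages_per_second(gbps: float, message_bytes: int) -> float:
    """Message rate implied by a payload bandwidth in Gbps (10^9 bits/s)."""
    return gbps * 1e9 / (message_bytes * 8)


def gbps_from_rate(messages_per_s: float, message_bytes: int) -> float:
    """Payload bandwidth in Gbps implied by a message (or I/O) rate."""
    return messages_per_s * message_bytes * 8 / 1e9


def mean_latency_us(iops: float, queue_depth: int) -> float:
    """Little's law: mean time in system = requests in flight / throughput.

    Assumes the queue is kept full at `queue_depth` (an assumption, not a
    statement from the page).
    """
    return queue_depth / iops * 1e6


if __name__ == "__main__":
    print(f"390 Gbps at 1 MiB   : {messages_per_second(390, MIB):,.0f} msg/s")
    print(f"14.2 M/s at 256 B   : {gbps_from_rate(14.2e6, 256):.1f} Gbps")
    print(f"2.6 M IOPS at 4 KiB : {gbps_from_rate(2.6e6, 4096):.1f} Gbps")
    print(f"2.6 M IOPS at QD 64 : {mean_latency_us(2.6e6, 64):.1f} us mean")
```

Read this way, the throughput figure is about 46,500 one-MiB messages per second, the small-message rate carries about 29 Gbps of payload, and 2.6M IOPS at 4 KiB is about 85 Gbps (roughly 10.6 GB/s). At a full queue depth of 64, that I/O rate implies a mean per-request latency near 24.6 µs.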

// kernel sockets vs accelerators

echo p50 latency · 64 B payload · single core · lower is better

kernel tcp

14.0 µs

io_uring tcp

9.6 µs

xlio tcp

4.9 µs

rdma · event

3.6 µs

rdma · poll

2.8 µs

// placeholder values from the development bench rig; expect these to change as the implementation matures.
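To read the chart as ratios, the sketch below divides the kernel TCP baseline by each transport's p50. The values are the placeholder figures from the chart; the function name is illustrative and not taken from the repository.

```python
# p50 echo latencies in microseconds, copied from the chart above.
P50_US = {
    "kernel tcp": 14.0,
    "io_uring tcp": 9.6,
    "xlio tcp": 4.9,
    "rdma · event": 3.6,
    "rdma · poll": 2.8,
}


def speedup_vs_baseline(latencies_us: dict, baseline: str) -> dict:
    """Return baseline latency divided by each entry's latency."""
    base = latencies_us[baseline]
    return {name: base / value for name, value in latencies_us.items()}


if __name__ == "__main__":
    for name, ratio in speedup_vs_baseline(P50_US, "kernel tcp").items():
        print(f"{name:<14} {ratio:.2f}x")
```

With these placeholder values, poll-mode RDMA comes out at about 5x the kernel TCP baseline and event-mode RDMA at about 3.9x. The ratios move whenever the underlying numbers do.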

// methodology

configuration · echo-stream / echo-message harnesses, bench/ tree

cpu: 2× AMD EPYC 9354P · 32C / 64T per socket, SMT off in test runs
memory: 512 GiB DDR5-4800, NUMA-pinned to the NIC’s socket
nic — rdma / xlio: NVIDIA ConnectX-7 400 GbE · RoCE v2 · OFED 24.x
nic — kernel sockets: Same NIC, kernel TCP path
kernel: Linux 6.x · CONFIG_PREEMPT_NONE · isolcpus on benchmark cores
nvme: Samsung PM1743 Gen5 NVMe · io_uring backend; VFIO numbers run separately
governance: CPU governor set to performance · turbo locked · C-states limited to C1 · IRQ affinity steered away from benchmark cores
measurement: Bench harness reports raw histograms; at minimum, p50, p99, and p99.9 are reported. Warmup ≥ 30 s, run ≥ 60 s.
harnesses: echo-stream · echo-message · custom block bench in bench/block
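The measurement rule above (discard warmup, then report p50/p99/p99.9) can be sketched as follows. This is a minimal illustration, not the harness in bench/. The sample format of (timestamp in seconds, latency in µs) pairs and the nearest-rank percentile definition are assumptions made here. Percentiles are given in per-mille so that p99.9 is computed with exact integer arithmetic.

```python
from typing import Iterable, List, Tuple

# Percentiles expressed in per-mille: 500 = p50, 990 = p99, 999 = p99.9.
DEFAULT_PERMILLE = (500, 990, 999)


def nearest_rank(sorted_values: List[float], permille: int) -> float:
    """Nearest-rank percentile: the smallest value with at least
    permille/1000 of the samples at or below it."""
    n = len(sorted_values)
    rank = max(1, (permille * n + 999) // 1000)  # integer ceil(permille*n/1000)
    return sorted_values[rank - 1]


def summarize(
    samples: Iterable[Tuple[float, float]],
    warmup_s: float = 30.0,
    permille: Tuple[int, ...] = DEFAULT_PERMILLE,
) -> dict:
    """Drop samples taken during warmup, then report the requested percentiles.

    `samples` is an iterable of (timestamp_s, latency_us) pairs (assumed format).
    Warmup is measured from the earliest timestamp in the run.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("no samples")
    start = min(t for t, _ in samples)
    kept = sorted(lat for t, lat in samples if t - start >= warmup_s)
    if not kept:
        raise ValueError("all samples fall inside the warmup window")
    return {p / 10: nearest_rank(kept, p) for p in permille}


if __name__ == "__main__":
    # Synthetic run: one sample per second for 130 s, latency equal to the time.
    run = [(float(t), float(t)) for t in range(130)]
    print(summarize(run))
```

Nearest-rank always returns a value that was actually observed, which matters at p99.9 where interpolation between sparse tail samples can mislead. A harness that keeps bucketed histograms would compute the same ranks over bucket counts instead.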

// what we test for

Latency under light load. The event-mode loop should sleep and wake on a single completion in microseconds.
Latency under saturating load. The poll-mode loop should land within a small multiple of raw NIC RTT.
Transition behavior. When continuous work appears, the loop should move to poll mode without a latency cliff.
Buffer lifecycle. Registered buffers in and out, with no allocations or re-registrations on the hot path.
Block + network in one loop. Block I/O shouldn’t stall network completions, or the reverse.
Doorbell wake. Cross-loop signaling latency, sleeping versus polling.
Scaling. N loops on N cores should scale roughly linearly until the NIC or storage saturates.
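The transition item in the list above describes a loop that moves from event mode to poll mode once work becomes continuous. The sketch below shows one hypothetical way such a switch could be decided, using streak counters with hysteresis. It is not the runtime's actual policy: `ModeController`, `busy_threshold`, and `idle_threshold` are names and parameters invented for illustration, and the model ignores time entirely.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    EVENT = "event"  # sleep until a completion wakes the loop
    POLL = "poll"    # spin on the completion queue


@dataclass
class ModeController:
    """Hypothetical event/poll switch driven by streaks of busy or idle passes."""

    busy_threshold: int = 4    # consecutive non-empty passes before polling
    idle_threshold: int = 64   # consecutive empty passes before sleeping again
    mode: Mode = Mode.EVENT
    _busy_streak: int = 0
    _idle_streak: int = 0

    def observe(self, completions: int) -> Mode:
        """Record how many completions one loop pass found; return the mode
        the loop should use for the next pass."""
        if completions > 0:
            self._busy_streak += 1
            self._idle_streak = 0
            if self.mode is Mode.EVENT and self._busy_streak >= self.busy_threshold:
                self.mode = Mode.POLL
        else:
            self._idle_streak += 1
            self._busy_streak = 0
            if self.mode is Mode.POLL and self._idle_streak >= self.idle_threshold:
                self.mode = Mode.EVENT
        return self.mode


if __name__ == "__main__":
    controller = ModeController(busy_threshold=3, idle_threshold=2)
    for found in [1, 1, 1, 0, 1, 0, 0]:
        print(found, controller.observe(found).value)
```

The asymmetric thresholds are the design point: a low busy threshold keeps the transition short, while a higher idle threshold stops a single empty pass from dropping the loop back to sleep and paying the wake-up cost again.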

// reproducibility

Every number on this page comes from a harness in the public repository, so anyone with comparable hardware can reproduce it and anyone without can still check the methodology. For help with configuration, open an issue or ask on Discord.
