Architecture · libevpl · chimeraproject

§ 01 The event-driven model

In event-driven programming, application logic is structured as a collection of callback functions that execute in response to events: a new connection arrives, data becomes available on a socket, a write completes, a timer expires, another thread signals work to be done.

Rather than blocking threads while they wait for I/O to complete, applications register callbacks with an event loop. The loop monitors many I/O sources simultaneously and invokes the matching callback as each event occurs. A single thread can then handle thousands of concurrent connections without the memory overhead or context-switching cost of a thread-per-connection design.
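As a rough illustration of the callback model described above, here is a minimal sketch built on POSIX poll(). The names (mini_loop, mini_loop_add, mini_loop_run_once, count_bytes) are hypothetical and are not libevpl's API; the sketch only shows the register-then-dispatch shape.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_SOURCES 8

/* Hypothetical names for illustration; this is not libevpl's API. */
typedef void (*event_cb)(int fd, void *arg);

struct mini_loop {
    struct pollfd fds[MAX_SOURCES];
    event_cb      cbs[MAX_SOURCES];
    void         *args[MAX_SOURCES];
    nfds_t        count;
};

/* Register a callback to run whenever fd becomes readable. */
static int mini_loop_add(struct mini_loop *loop, int fd, event_cb cb, void *arg)
{
    assert(cb != NULL);
    if (loop->count >= MAX_SOURCES) {
        return -1;
    }
    loop->fds[loop->count].fd      = fd;
    loop->fds[loop->count].events  = POLLIN;
    loop->fds[loop->count].revents = 0;
    loop->cbs[loop->count]         = cb;
    loop->args[loop->count]        = arg;
    loop->count++;
    return 0;
}

/* One loop iteration: wait up to timeout_ms, then dispatch ready callbacks.
 * Returns the number of callbacks invoked, or -1 if poll() failed. */
static int mini_loop_run_once(struct mini_loop *loop, int timeout_ms)
{
    int dispatched = 0;

    if (poll(loop->fds, loop->count, timeout_ms) < 0) {
        return -1;
    }
    for (nfds_t i = 0; i < loop->count; i++) {
        if (loop->fds[i].revents & POLLIN) {
            loop->cbs[i](loop->fds[i].fd, loop->args[i]);
            dispatched++;
        }
    }
    return dispatched;
}

/* Example callback: read what is available and add it to a byte counter. */
static void count_bytes(int fd, void *arg)
{
    char    buf[64];
    ssize_t got = read(fd, buf, sizeof(buf));

    if (got > 0) {
        *(size_t *)arg += (size_t)got;
    }
}
```

A caller would register one end of a pipe or socket with mini_loop_add and then call mini_loop_run_once in a loop; all application logic lives in the callbacks.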

libevpl uses this model throughout, including for block I/O. Everything is asynchronous; nothing in the API blocks the loop.

§ 02 The event loop

The core of libevpl is the event loop, represented by struct evpl, which manages all asynchronous operations within a single thread. Each thread that performs I/O runs its own independent event loop. Within it, the application:

Initiates asynchronous operations (connect, send, receive, timer creation).
Registers callbacks invoked on completion.
Runs the loop, which repeatedly checks for completed events and dispatches.
All application logic executes inside those callbacks.

This eliminates locking on per-connection state — all operations on a given connection happen sequentially in the same thread. Applications can maintain complex state machines without worrying about concurrent access.

→ See the Core API for loop creation and management.

§ 03 Hybrid event & poll modes

A fundamental challenge in high-performance I/O is choosing between two competing approaches for detecting completions: system-call event notification and busy polling.

// epoll · kqueue · ...

How: register interest, sleep in a wait syscall, kernel wakes the thread.
Wins: energy efficient — sleeping threads consume nothing.
Wins: scales to thousands of idle connections.
Costs: context switch per wakeup; scheduling latency; syscall overhead per wait.
Costs: events accumulate while the thread sleeps.

// busy polling

How: tight loop reading hardware completion queues; no syscalls.
Wins: completion visible to userspace within nanoseconds.
Wins: no context switches; no scheduler latency.
Costs: 100% of a core, even when idle.
Costs: scalability bounded by core count.

libevpl’s hybrid approach

libevpl combines both, switching between them based on load. The name is a contraction of event and poll.

Light load: the loop waits in a syscall; threads sleep and use little CPU.
Heavy load: work is consistently available, so the loop switches to polling completion queues directly, and latency drops to near-hardware levels.
The transition is automatic, and the thresholds are configurable.
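One way to picture the switch between the two modes is a small hysteresis policy: poll while work keeps arriving, and fall back to sleeping after a run of empty polls. The sketch below is an assumption made for illustration; hybrid_state, hybrid_init, hybrid_update and the max_idle_spins threshold are hypothetical, and libevpl's actual heuristics and tunables may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy for illustration; not libevpl's real heuristic. */
struct hybrid_state {
    unsigned idle_spins;     /* consecutive polls that found no work */
    unsigned max_idle_spins; /* empty polls tolerated before sleeping again */
    bool     polling;        /* true while in busy-poll mode */
};

static void hybrid_init(struct hybrid_state *s, unsigned max_idle_spins)
{
    s->idle_spins     = 0;
    s->max_idle_spins = max_idle_spins;
    s->polling        = false;
}

/* Call once per loop iteration with the number of events just processed.
 * Returns true if the next iteration should poll without blocking,
 * false if it should block in a wait syscall. */
static bool hybrid_update(struct hybrid_state *s, int events_found)
{
    assert(events_found >= 0);
    if (events_found > 0) {
        /* Work is available: enter (or stay in) polling mode. */
        s->idle_spins = 0;
        s->polling    = true;
    } else if (s->polling && ++s->idle_spins >= s->max_idle_spins) {
        /* Too many empty polls in a row: go back to sleeping. */
        s->polling    = false;
        s->idle_spins = 0;
    }
    return s->polling;
}
```

In an epoll-based loop, the return value could select the timeout passed to epoll_wait: 0 (return immediately) while polling, -1 (block until an event) otherwise.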

§ 04 Memory registration & zero-copy I/O

High-performance backends like RDMA and VFIO require memory registration, which lets hardware access application memory directly without kernel involvement. It’s worth understanding before using libevpl.

Why registration exists

Kernel-based I/O usually copies data between userspace and kernel buffers. Kernel bypass removes those copies, but raises a question: how does the hardware safely reach virtual memory that may be paged out, moved, or freed? Memory registration:

Pins pages so they can’t be swapped or relocated.
Creates DMA mappings translating virtual to physical for the device.
Tracks per-region permissions for DMA access.
Programs the IOMMU when present.

The cost, and the fix

A registration takes hundreds of microseconds, far longer than the I/O it enables at 400 Gbps, so registering per-operation would be too expensive. libevpl keeps pools of pre-registered memory: large buffers registered once at startup, from which applications take chunks for I/O and return them after completion. The registration cost is paid once, not on the hot path.
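The register-once idea can be sketched as a fixed-size chunk pool carved out of one large allocation. Everything here is hypothetical (chunk_pool, pool_init, pool_get, pool_put, and the fake_register stand-in that merely counts calls); a real backend would pin pages and create DMA mappings at that point, and libevpl's own allocator may be organised differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names for illustration; not libevpl's API. */
struct chunk {
    struct chunk *next;
};

struct chunk_pool {
    void         *slab;          /* one large region, registered once */
    struct chunk *free_list;     /* chunks currently available */
    size_t        chunk_size;
    size_t        registrations; /* how many times the slab was registered */
};

/* Stand-in for the expensive step (pinning pages, creating DMA mappings). */
static void fake_register(struct chunk_pool *pool)
{
    pool->registrations++;
}

static int pool_init(struct chunk_pool *pool, size_t chunk_size, size_t nchunks)
{
    /* Each free chunk stores a list pointer, so it must be big enough
     * and pointer-aligned. */
    assert(chunk_size >= sizeof(struct chunk));
    assert(chunk_size % sizeof(void *) == 0 && nchunks > 0);

    pool->slab = malloc(chunk_size * nchunks);
    if (pool->slab == NULL) {
        return -1;
    }
    pool->chunk_size    = chunk_size;
    pool->free_list     = NULL;
    pool->registrations = 0;
    fake_register(pool); /* paid once, at startup */

    for (size_t i = 0; i < nchunks; i++) {
        struct chunk *c = (struct chunk *)((char *)pool->slab + i * chunk_size);
        c->next         = pool->free_list;
        pool->free_list = c;
    }
    return 0;
}

/* Hot path: take a chunk without any registration work. NULL if exhausted. */
static void *pool_get(struct chunk_pool *pool)
{
    struct chunk *c = pool->free_list;

    if (c != NULL) {
        pool->free_list = c->next;
    }
    return c;
}

/* Hot path: return a chunk after the I/O that used it has completed. */
static void pool_put(struct chunk_pool *pool, void *buf)
{
    struct chunk *c = buf;

    c->next         = pool->free_list;
    pool->free_list = c;
}

static void pool_destroy(struct chunk_pool *pool)
{
    free(pool->slab);
    pool->slab      = NULL;
    pool->free_list = NULL;
}
```

However many times pool_get and pool_put are called, the registration counter stays at one, which is the property the pool exists to provide.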

Zero-copy in practice

Sending data:

Application allocates a buffer from libevpl’s pre-registered pool.
Fills the buffer; submits for transmission.
Hardware reads directly from the buffer via DMA.
Buffer returns to the pool on completion.

Receiving works the same way: buffers are pre-posted to the hardware, which DMAs incoming data into them. No copies between hardware and application.

For kernel-socket backends that don’t require registration, libevpl still uses the same pool mechanism for consistency. Application code does not branch on backend.

→ See the Memory Management API.

§ 05 Protocol backends

libevpl exposes a transport-agnostic API. At bind time the application names a transport, usually from a config file or CLI flag rather than hardcoded. The application logic doesn’t need per-backend changes: the same C written against kernel sockets compiles and runs against io_uring or RDMA. Switching backends is a deployment change, not a code change — though it is still a deliberate choice, because each backend has its own requirements (RDMA needs the other side to speak RDMA, XLIO wants a Mellanox NIC, VFIO needs an IOMMU, and so on).
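A common way to get this kind of transport independence in C is a table of function pointers selected by name at startup. The sketch below shows the pattern only: the backend_ops structure, the "tcp" and "rdma" strings, and the stub send functions are all invented for illustration and do not reflect libevpl's actual identifiers or internals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical vtable; names and strings are illustrative only. */
struct backend_ops {
    const char *name;
    int       (*send)(const void *buf, size_t len);
};

/* Stub transports: each pretends to send and reports the byte count. */
static int sockets_send(const void *buf, size_t len)
{
    (void)buf;
    return (int)len;
}

static int rdma_send(const void *buf, size_t len)
{
    (void)buf;
    return (int)len;
}

static const struct backend_ops backends[] = {
    { "tcp",  sockets_send },
    { "rdma", rdma_send },
};

/* Resolve a transport named in a config file or CLI flag.
 * Returns NULL if the name is unknown. */
static const struct backend_ops *backend_lookup(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(backends) / sizeof(backends[0]); i++) {
        if (strcmp(backends[i].name, name) == 0) {
            return &backends[i];
        }
    }
    return NULL;
}

/* Application logic: identical regardless of which backend was chosen. */
static int app_send(const struct backend_ops *be, const char *msg)
{
    return be->send(msg, strlen(msg));
}
```

The application code in app_send never inspects which backend it was handed; only the lookup at startup depends on the deployment.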

kernel sockets

TCP/UDP via BSD sockets. Works everywhere; the baseline against which everything else is measured.

io_uring

Modern Linux async I/O. Drives both kernel sockets and block devices — the same submission/completion queues, the same buffer pools.

nvidia xlio

Userspace TCP for Mellanox NICs. Same socket-like surface, hardware-accelerated TCP/IP. Drop-in for sockets when the NIC supports it.

rdma cm

Reliable Connection & Unreliable Datagram QPs over RoCE v2. Zero-copy and kernel-bypass, for when latency matters.

// more backends over time; open an issue if you want one.

→ See the Bind API for protocol selection.

§ 06 Connection management & steering

In a multi-threaded server, spreading incoming connections across threads matters for load balancing and NUMA placement. libevpl handles this explicitly, through a listener attachment model:

A listener (struct evpl_listener) is created to accept connections.
Individual worker threads explicitly attach to the listener.
When a connection arrives, libevpl selects one of the attached threads.
The connection is accepted directly into that thread’s event loop.
All I/O for the connection happens in that same thread for the lifetime of the connection.
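The attach-then-select steps above can be sketched with a simple round-robin picker. The rr_listener type and its functions are hypothetical and track plain integer thread IDs; they stand in for whatever bookkeeping struct evpl_listener does internally, which this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ATTACHED 8

/* Hypothetical bookkeeping for illustration; not struct evpl_listener. */
struct rr_listener {
    int    thread_ids[MAX_ATTACHED]; /* threads that attached explicitly */
    size_t count;
    size_t next;                     /* index of the next thread to pick */
};

/* A worker thread attaches itself to the listener. */
static int rr_attach(struct rr_listener *l, int thread_id)
{
    assert(thread_id >= 0); /* -1 is reserved as the "no thread" result */
    if (l->count >= MAX_ATTACHED) {
        return -1;
    }
    l->thread_ids[l->count++] = thread_id;
    return 0;
}

/* Called when a connection arrives: choose the thread that will own it.
 * Returns -1 if no thread has attached. */
static int rr_pick(struct rr_listener *l)
{
    int id;

    if (l->count == 0) {
        return -1;
    }
    id      = l->thread_ids[l->next];
    l->next = (l->next + 1) % l->count;
    return id;
}
```

Because only attached threads are ever candidates, attaching the threads of one NUMA node to a listener is enough to keep its connections on that node.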

Why this matters

Placement control: threads on a NUMA node attach to a listener whose connections will touch memory on that node.
Load balancing: libevpl chooses which attached thread receives each connection. The policy is round-robin today; load-aware policies are planned.
Flexibility: different listeners can have different attachment policies; specialized threads can take specialized listeners.
No socket sharing: unlike with SO_REUSEPORT, libevpl has visibility into which threads are busy. The attachment model also enables load balancing on backends that don’t support SO_REUSEPORT (or any equivalent).

This matters most for RDMA, where the hardware resources (queue pairs) sit close to specific CPU cores and memory domains.

§ 07 Timers · deferrals · doorbells

Three primitives for embedding application logic into the event-driven context.

Timers

Periodic by default; the callback removes itself if a one-shot is needed. Microsecond precision (subject to system timer resolution and loop load). Uses: keepalives, timeouts, periodic stats, background maintenance.
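The periodic-by-default behaviour, with a callback that cancels itself for a one-shot, can be sketched against a caller-supplied clock. The ptimer type and functions are hypothetical, and the catch-up behaviour when several intervals have elapsed is an assumption of this sketch rather than documented libevpl behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not libevpl's timer API. */
struct ptimer;
typedef void (*ptimer_cb)(struct ptimer *t, void *arg);

struct ptimer {
    uint64_t  interval_us;
    uint64_t  next_due_us;
    bool      active;
    ptimer_cb cb;
    void     *arg;
};

static void ptimer_start(struct ptimer *t, uint64_t now_us,
                         uint64_t interval_us, ptimer_cb cb, void *arg)
{
    assert(interval_us > 0 && cb != NULL);
    t->interval_us = interval_us;
    t->next_due_us = now_us + interval_us;
    t->active      = true;
    t->cb          = cb;
    t->arg         = arg;
}

/* Called by the loop each iteration with the current time.
 * Fires once per elapsed interval (an assumption of this sketch) and
 * re-arms automatically. Returns the number of times the callback ran. */
static int ptimer_poll(struct ptimer *t, uint64_t now_us)
{
    int fired = 0;

    while (t->active && now_us >= t->next_due_us) {
        t->next_due_us += t->interval_us;
        t->cb(t, t->arg);
        fired++;
    }
    return fired;
}

/* Periodic callback: just counts ticks. */
static void count_tick(struct ptimer *t, void *arg)
{
    (void)t;
    ++*(int *)arg;
}

/* One-shot callback: does its work, then removes itself. */
static void fire_once(struct ptimer *t, void *arg)
{
    ++*(int *)arg;
    t->active = false;
}
```

Passing the time in as a parameter keeps the sketch deterministic; a real loop would read a monotonic clock once per iteration.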

Deferrals

Schedule work to run at the end of the current loop iteration. Same thread, no synchronization. Use them to break up long callbacks, flush batches, avoid deep recursion, or order logic after the current event burst.
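A deferral queue can be as small as an array of function pointers that the loop runs after it has dispatched the current batch of events. The sketch below uses hypothetical names (defer_queue, defer_add, defer_run_all) and assumes that work deferred while the queue is running waits for the next iteration; libevpl's behaviour in that case is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEFERRED 16

/* Hypothetical names for illustration; not libevpl's deferral API. */
struct deferral {
    void (*fn)(void *arg);
    void  *arg;
};

struct defer_queue {
    struct deferral items[MAX_DEFERRED];
    size_t          count;
};

/* Queue fn(arg) to run at the end of the current loop iteration.
 * Same thread only, so no synchronization is needed. */
static int defer_add(struct defer_queue *q, void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    if (q->count >= MAX_DEFERRED) {
        return -1;
    }
    q->items[q->count].fn  = fn;
    q->items[q->count].arg = arg;
    q->count++;
    return 0;
}

/* Run, in FIFO order, everything queued before this call. Items added by
 * the callbacks themselves stay queued for the next iteration (an
 * assumption of this sketch). Returns the number of callbacks run. */
static size_t defer_run_all(struct defer_queue *q)
{
    size_t n = q->count;

    for (size_t i = 0; i < n; i++) {
        q->items[i].fn(q->items[i].arg);
    }
    /* Keep anything that was appended while running. */
    memmove(q->items, q->items + n, (q->count - n) * sizeof(q->items[0]));
    q->count -= n;
    return n;
}

/* Example callback: records the order in which deferrals ran. */
static int run_order;

static void note_order(void *arg)
{
    *(int *)arg = ++run_order;
}
```

The loop would call defer_run_all once per iteration, after event callbacks have returned, which is what lets a long callback hand off the rest of its work without recursing.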

Doorbells

Thread-safe cross-loop signaling. Thread A queues a work item and rings thread B’s doorbell; B’s loop wakes, whether it was sleeping in epoll_wait or polling, and its callback drains the queue. Higher overhead than deferrals because of the synchronization, so use it when you actually need to cross loops.
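On Linux, a doorbell of this kind is commonly built from an eventfd plus a lock-protected queue: the eventfd is what the receiving loop waits on, and the lock covers only the hand-off. The sketch below takes that approach as an assumption; the doorbell type and its functions are hypothetical, and it is Linux-only because of eventfd.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

#define DOORBELL_QCAP 16

/* Hypothetical names for illustration; not libevpl's doorbell API. */
struct doorbell {
    int             efd;  /* readable whenever the bell has been rung */
    pthread_mutex_t lock; /* protects items[] and count */
    int             items[DOORBELL_QCAP];
    size_t          count;
};

static int doorbell_init(struct doorbell *d)
{
    d->efd = eventfd(0, EFD_NONBLOCK);
    if (d->efd < 0) {
        return -1;
    }
    pthread_mutex_init(&d->lock, NULL);
    d->count = 0;
    return 0;
}

/* Callable from any thread: queue a work item, then ring the bell. */
static int doorbell_post(struct doorbell *d, int item)
{
    uint64_t one = 1;
    int      rc  = -1;

    pthread_mutex_lock(&d->lock);
    if (d->count < DOORBELL_QCAP) {
        d->items[d->count++] = item;
        rc = 0;
    }
    pthread_mutex_unlock(&d->lock);

    if (rc == 0 && write(d->efd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
        rc = -1;
    }
    return rc;
}

/* Called by the owning loop once efd is readable. Copies every queued
 * item into out[] and returns how many there were. */
static size_t doorbell_drain(struct doorbell *d, int out[DOORBELL_QCAP])
{
    uint64_t rings;
    size_t   n;
    ssize_t  r = read(d->efd, &rings, sizeof(rings)); /* resets the counter */

    (void)r; /* EAGAIN just means nobody rang since the last drain */

    pthread_mutex_lock(&d->lock);
    n = d->count;
    for (size_t i = 0; i < n; i++) {
        out[i] = d->items[i];
    }
    d->count = 0;
    pthread_mutex_unlock(&d->lock);
    return n;
}

static void doorbell_close(struct doorbell *d)
{
    close(d->efd);
    pthread_mutex_destroy(&d->lock);
}
```

The eventfd counter coalesces rings, so several posts between wakeups cost the receiver a single read; the file descriptor can sit in the same epoll set as the loop's sockets.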

§ 08 Block I/O integration

libevpl integrates block-device I/O into the same event loop that handles network I/O. A storage server manages everything in one place: a read from disk, a write to the wire, and a timer all wake the same loop.

Submit read/write requests asynchronously.
Continue processing network events.
When the block op completes, a callback fires.

Backends include io_uring (kernel-mediated async) and VFIO-NVMe (direct userspace NVMe, lower latency and fewer copies). Block buffers come from the same pre-registered pools as network buffers, so zero-copy works end-to-end where the backend supports it.

§ 09 Threading model

libevpl follows a strict single-threaded event-loop model:

Each I/O-performing thread runs exactly one event loop.
All operations on objects owned by that loop must occur in that thread.
Multiple threads run independent loops in parallel.
Cross-thread communication uses doorbells or other thread-safe mechanisms.

This is what eliminates fine-grained locking in the hot path. Each loop has exclusive access to its state. Applications scale by adding loops, not by sharing one.

→ See the Threading API.

§ 10 Design principles

Asynchronous only

All operations are non-blocking; completion arrives via callback. There is no blocking surface.

Protocol agnostic

A unified API across kernel sockets, io_uring, XLIO, RDMA, VFIO. The application doesn’t branch on backend.

Zero-copy optimized

Pre-registered memory pools enable direct hardware access end-to-end. Even backends that don’t require registration use the same pools.

Hybrid event/poll

Automatic transitions between sleeping in epoll_wait and polling completion queues, balancing energy and latency.

Single-threaded event loops

No locks in the hot path. Loops scale by being added, not by being shared.

Explicit thread attachment

Applications control connection steering for NUMA awareness; load balancing decisions are made by libevpl with full visibility.


// libevpl is under active development; some interfaces may still change. Issues and patches welcome.