Intro — Chimera

§ 01 Framing

Most NAS stacks assume the thing they sit on is an existing filesystem with a POSIX or NTFS-style API. NAS protocols carry more intent than those APIs can express, so anything that doesn’t map cleanly tends to get dropped or approximated.

Chimera is aimed at the other case: filesystems that are willing to implement NAS protocol intent directly. Its VFS is modeled on what the protocols request rather than on POSIX. A backend that can serve a given operation natively does; one that can’t says so, and the VFS supplies a fallback.
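The native-or-fallback split can be sketched in a few lines. This is a toy illustration, not Chimera's interface: the `Cap` flag, the two backend classes, and `vfs_copy` are all hypothetical names chosen for the example, and a server-side copy stands in for any operation a backend may or may not support.

```python
from enum import Flag, auto


class Cap(Flag):
    """Hypothetical capability bits a backend might advertise."""
    NONE = 0
    SERVER_SIDE_COPY = auto()


class MemoryBackend:
    """Toy backend: a dict of path -> bytes, with no native copy."""
    caps = Cap.NONE

    def __init__(self):
        self.files = {}

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = bytes(data)


class CloningBackend(MemoryBackend):
    """Toy backend that declares it can copy a file natively."""
    caps = Cap.SERVER_SIDE_COPY

    def copy(self, src, dst):
        # A real backend might clone extents; here we just share the bytes.
        self.files[dst] = self.files[src]


def vfs_copy(backend, src, dst):
    """Serve a copy natively when the backend can, else fall back."""
    if Cap.SERVER_SIDE_COPY in backend.caps:
        backend.copy(src, dst)
        return "native"
    # Fallback supplied by the VFS: read and write through the generic calls.
    backend.write(dst, backend.read(src))
    return "fallback"
```

The point of the shape is that the backend states what it cannot do, so the fallback is an explicit, visible path rather than a silent approximation.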

The intent is better efficiency, since less work goes into translating intent the backend could have used directly; better correctness, since atomic operations stay atomic, file identities stay stable, and replayed requests stay sound; and better behavior across protocols, since NFS, SMB, and S3 share one VFS instead of running as three adapters with separate caches and locks.

§ 02 Asynchronous

POSIX and the UNIX file APIs behind it were designed for disks that took milliseconds to respond. When a read() moved a disk head and waited ten or fifteen milliseconds for the data, blocking the calling thread was a reasonable design. The API stayed simple, and the cost of one idle thread per outstanding request was small next to the I/O itself. That still holds for rotating disks.

Flash changes the balance. An NVMe device can sustain millions of IOPS at tens of microseconds of latency. The seek is gone and the queue is deep, so the limiting cost moves from the device to the work of issuing each operation: the context switch, the scheduler pass, and the kernel wake-up that a blocking call incurs.

Chimera’s VFS is asynchronous throughout. An operation is dispatched and returns immediately, and its result arrives in a callback. A worker thread doesn’t block while other work is available; it stays busy across many in-flight requests and sleeps only when there is nothing left to do. The benefit is thread occupancy more than per-request latency: on a fast backend a blocking call wastes a thread, and a synchronous stack ends up with most of its threads parked at the kernel boundary.
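A minimal sketch of the dispatch-then-callback pattern follows. It assumes a toy backend whose `pending` queue stands in for a device completion queue; `ToyAsyncBackend`, `poll`, and `run_worker` are hypothetical names, and nothing here reflects Chimera's actual API.

```python
from collections import deque


class ToyAsyncBackend:
    """Hypothetical backend: dispatch returns at once, results arrive later."""

    def __init__(self, files):
        self.files = files
        self.pending = deque()   # stands in for a device completion queue

    def read(self, path, callback):
        # Queue the request and return immediately; nothing blocks here.
        self.pending.append((path, callback))

    def poll(self):
        """Deliver queued completions; return how many callbacks ran."""
        n = 0
        while self.pending:
            path, callback = self.pending.popleft()
            callback(path, self.files.get(path))
            n += 1
        return n


def run_worker(backend, paths):
    results = {}

    def on_read(path, data):
        results[path] = data

    # Issue every request before waiting on any of them.
    for p in paths:
        backend.read(p, on_read)
    in_flight = len(backend.pending)

    # The worker goes idle only once poll() reports nothing left to do.
    while backend.poll():
        pass
    return in_flight, results
```

One thread keeps all requests in flight at once; with blocking calls the same work would need one parked thread per outstanding read.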

§ 03 User Space

A local filesystem lives in the kernel mainly for security. The kernel sits between users and the device and decides what each process may read, write, and own. Permissions, the page cache, and mount handling are all enforced at that boundary.

A NAS server is a different case. When storage is reached over a network, the server is the security boundary, not the kernel on either end. It authenticates the client, decides what the client may do, and returns data accordingly. That enforcement happens at the protocol level, and a server can do it from user space as well as from a kernel module.

Operating systems have historically put NAS stacks in the kernel for proximity to the local filesystems they export, since the path from the kernel VFS to the disk is shortest there. That is a consequence of where the local filesystem already lived, not a security requirement.

For a distributed storage system, user space has practical advantages. Development is faster, a crash is contained to a process rather than the host, and the edit-build-test cycle is a restart rather than a kernel-module rebuild against several kernel versions. It also makes portability across operating systems possible. The security model is unchanged: the server is still the boundary, now running as a process.

Chimera’s position is that, without a specific reason to be in the kernel, a NAS stack should run in user space. The main historical reason to be in the kernel was access to a mature VFS — its operation set, caches, and locking. Chimera provides a comparable VFS in user space, so a user-space storage system can be served over NFS, SMB, and S3 without giving that up.

The same applies to clients. Applications usually reach NAS through the kernel client, because that is what the OS provides. That adds a syscall and context switch per operation, and it complicates deployment: the mount needs privileged kernel configuration, which is awkward on shared hosts and inside containers. An application can instead link a NAS client directly and issue RPCs from its own event loop, which makes NFS or SMB about as easy to use as HTTPS. Client and server share the same async runtime, buffer pool, and zero-copy paths.

§ 04 Zero Copy

Network and storage bandwidth in servers have grown much faster than memory bandwidth. A host can move data to and from the network and NVMe at hundreds of gigabits per second, but DRAM bandwidth, which every CPU copy consumes twice (one read and one write), has not kept pace. This is the memory wall: for a storage server, throughput depends less on how fast the device or NIC delivers data and more on how many times the CPU copies it.

The goal is to copy it as little as possible, ideally not at all. Data arriving from the network should land where the disk can read it directly, and data leaving the disk should pass through memory once on its way to the NIC. The CPU reads the protocol headers and arranges the transfer but doesn’t touch the file data.

This is difficult for two reasons. Without an RDMA side channel, NAS protocols frame file data with segmentation and headers, so the payload in a network buffer isn’t aligned to a page boundary by the time you’d want to hand it to storage. And storage devices have traditionally required page-aligned DMA buffers. Most current NVMe controllers will DMA to and from arbitrary byte offsets, but the legacy APIs in between make that hard to use.

libevpl helps here because it owns the buffer lifecycle. It can use zero-copy TCP paths — NVIDIA XLIO, or io_uring with registered buffers — to marshal and unmarshal NAS RPCs in the buffers behind the NIC’s send and receive queues, so file data is never copied into a separate application buffer. Chimera’s direct NVMe backend, via VFIO, then reads and writes against those same buffers. With suitable hardware the CPU never reads the file data itself: it inspects headers and programs the NIC and SSD to move bytes between memory regions it owns.
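The "inspect headers, never copy the payload" idea can be shown in miniature with a buffer view. This is only an analogy in a high-level language: the 4-byte length-prefixed frame is a made-up wire format, `split_frame` is a hypothetical helper, and real NAS RPC framing and libevpl's buffer handling are considerably richer.

```python
import struct

# Hypothetical wire format for illustration only: a 4-byte big-endian
# payload length followed by the payload itself.
HEADER = struct.Struct(">I")


def split_frame(buf):
    """Return (length, payload_view) without copying the payload bytes."""
    view = memoryview(buf)
    (length,) = HEADER.unpack_from(view, 0)
    payload = view[HEADER.size:HEADER.size + length]
    return length, payload


rx = bytearray(HEADER.pack(5) + b"hello")   # stands in for a receive buffer
length, payload = split_frame(rx)

# The view aliases the receive buffer: a write through the buffer is
# visible through the view, which shows that no copy was made.
rx[HEADER.size] = ord("J")
```

Note that the payload starts at byte offset 4 of the buffer, not on a page boundary, which is the alignment problem described above: the storage side has to accept that offset for the data to stay in place.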

§ 05 Compound Operations

NFSv4 and SMB both let a client send a compound request: several dependent operations in a single RPC. A common sequence (walk a directory path, open the file at the end, read the first block, close it) can be one round trip instead of four or five. Each round trip saved removes one network latency from the critical path, which matters most for time-to-first-byte on a cold access.

The capability is often only partly used, for two reasons.

The client application usually reaches the network through a file API — open(), read(), close() — and can’t express a whole sequence to its kernel in one call. The kernel client issues one RPC per call, so the wire carries what the API could express, not what the protocol allows.

Even when a full compound reaches the server, a NAS stack on a conventional local filesystem can’t run it as one unit. The local filesystem offers transactions only at the granularity of a single operation, so the server runs the operations in sequence, serializing on the filesystem’s locks and paying per-operation latency for each. The cost saved on the wire reappears inside the server.

A compound is not required to be atomic; neither NFSv4 nor SMB promises that. But where the backend can run the sequence as one transaction, the savings are real: one lock acquisition, one metadata pass, one commit. Chimera passes compound operations to the backend as a structured request, so a backend that can plan and execute them together may do so, while others handle them one at a time. The clearest benefit is time-to-first-data on a cold access, the path-walk-open-read sequence that interactive applications do constantly.
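As a rough sketch of handing a compound to the backend as structured data: the operation tuples, the two backend classes, and `vfs_compound` below are hypothetical, and the lock counter is just a stand-in for per-operation overhead.

```python
class StepwiseBackend:
    """Toy backend that only executes single operations."""

    def __init__(self, files):
        self.files = files
        self.lock_acquisitions = 0

    def _apply(self, op, arg, state):
        if op == "lookup":
            state["path"] = arg
        elif op == "open":
            state["data"] = self.files[state["path"]]
        elif op == "read":
            return state["data"][:arg]
        elif op == "close":
            state.pop("data", None)
        return None

    def run_op(self, op, arg, state):
        self.lock_acquisitions += 1      # one lock round per operation
        return self._apply(op, arg, state)


class PlanningBackend(StepwiseBackend):
    """Toy backend that can execute a whole compound under one lock."""

    def run_compound(self, ops):
        self.lock_acquisitions += 1      # a single lock for the sequence
        state = {}
        return [self._apply(op, arg, state) for op, arg in ops]


def vfs_compound(backend, ops):
    """Pass the compound through whole if the backend can plan it."""
    if hasattr(backend, "run_compound"):
        return backend.run_compound(ops)
    state = {}
    return [backend.run_op(op, arg, state) for op, arg in ops]
```

For the lookup, open, read, close sequence, both backends return the same results; the stepwise one takes its lock four times and the planning one takes it once.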

§ 06 Richer Semantics

Some correctness requirements that NAS protocols place on the server can’t be expressed through a POSIX-shaped backend interface, and as a result they’re often not met, even by stacks that otherwise look correct. Two examples show the pattern.

SMB3 persistent handles require the server to persist some SMB-specific state — handle identity, lease and oplock context, client identity — atomically with the successful open of the file. If the open succeeds but that state is lost, the client’s recovery contract is broken: it expects to restore a handle the server no longer has.

NFSv4 session replay requires the server to persist the results of certain successful non-idempotent operations, so that a retransmitted RPC returns the same result without repeating the operation. If those results aren’t stored with the operation, a replay is either re-applied, corrupting state, or rejected, failing a client that was retrying correctly.

In both cases, protocol-specific metadata has to be committed in the same transaction as the file operation it describes, and a POSIX-shaped interface can’t express that. The usual workaround is to store the protocol state separately and close in time to the operation, which is not the same as atomic.

Chimera makes these metadata items part of the VFS operations. The call that creates a persistent handle carries the handle state; the call that performs a session-tracked operation carries the reply to record. A backend that can commit them with the file operation does; one that can’t declares so, and Chimera takes the conservative path explicitly.
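To make "committed in the same transaction" concrete, here is a small sketch that uses SQLite as a stand-in for a transactional backend. The table layout, the `create_exclusive` function, and the reply strings are assumptions made for the example; they are not how Chimera or any NFSv4 server stores session state.

```python
import sqlite3


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (name TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE replay (session TEXT, seq INTEGER, reply TEXT, "
               "PRIMARY KEY (session, seq))")
    return db


def create_exclusive(db, session, seq, name):
    """Create a file and record the reply in one transaction."""
    cached = db.execute(
        "SELECT reply FROM replay WHERE session=? AND seq=?",
        (session, seq)).fetchone()
    if cached:
        return cached[0]                 # retransmission: return stored reply
    try:
        with db:                         # commits both rows or neither
            db.execute("INSERT INTO files VALUES (?)", (name,))
            db.execute("INSERT INTO replay VALUES (?, ?, ?)",
                       (session, seq, "OK"))
        return "OK"
    except sqlite3.IntegrityError:
        return "EXIST"                   # the name was already present
```

A retransmitted request with the same session and sequence number gets the recorded "OK" back. Without the replay row it would be re-executed and fail with "EXIST", which is the failure mode described above for a client that was retrying correctly.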

§ 07 Coherency

Each NAS protocol has its own mechanism for client caching and cross-client coherency. NFSv3 has no locking of its own and relies on the companion NLM protocol for POSIX-style advisory file locks. NFSv4 adds open and lock state and delegations, which let a client cache while the server keeps the right to recall. SMB has oplocks, leases, and directory leases, each with its own break and upgrade rules. S3 specifies that PUTs are atomic and GETs are self-consistent, and expects the server to hold to that under concurrent access.

Supporting all of these against the same data is easier when the protocols aren’t separate daemons each keeping its own state. Running them in one stack, on one VFS, over a single lease primitive lets the question of whether a client may cache something be answered in one place.

On capable backends the lease primitive can be implemented in the filesystem itself. In a parallel or distributed filesystem, the filesystem is the authority on leases — it already coordinates state across nodes and is the only layer that can guarantee coherency across them. A separate lease layer on top of such a filesystem can’t, because it doesn’t own the underlying state.

Chimera’s lease primitive takes protocol type, scope, and owner as arguments, and provides a callback for the backend to recall or downgrade a lease when its state changes. A parallel filesystem can implement it directly; a single-node backend uses Chimera’s in-VFS implementation. Either way the protocols share one authority.
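A single-node lease authority with those arguments might look roughly like the sketch below. Everything in it is hypothetical (the `Proto` enum, the `Lease` fields, the `exclusive` flag, and the conflict rule), and it recalls synchronously for brevity, whereas a real recall is asynchronous and waits for the holder to flush and acknowledge.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Proto(Enum):
    NFS4 = "nfs4"
    SMB = "smb"


@dataclass
class Lease:
    proto: Proto
    scope: str                  # e.g. a file path; hypothetical scope key
    owner: str
    exclusive: bool
    on_recall: Callable[["Lease"], None]


class LeaseTable:
    """Toy single-node lease authority shared by every protocol."""

    def __init__(self):
        self.by_scope = {}

    def acquire(self, proto, scope, owner, exclusive, on_recall):
        held = self.by_scope.setdefault(scope, [])
        # Recall conflicting leases held by other owners, whatever the protocol.
        for lease in list(held):
            if lease.owner != owner and (exclusive or lease.exclusive):
                lease.on_recall(lease)
                held.remove(lease)
        new = Lease(proto, scope, owner, exclusive, on_recall)
        held.append(new)
        return new
```

Because SMB and NFSv4 clients go through the same table, an NFSv4 open of a file on which an SMB client holds an exclusive lease triggers that client's recall callback, with no cross-daemon coordination involved.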

§ 08 Security

Security on a NAS server has two parts: who the caller is (authentication) and what they may do (access rights on each object). The protocols differ on both. The difference in access-rights models is the one that tends to cause visible problems, because each protocol expresses them differently.

The models in play:

NFSv3 — classic UNIX mode bits: owner, group, other, three permission triples. That is the entire model.
NFSv4 — keeps the mode bits, but adds NFSv4 ACLs: ordered entries, fine-grained rights, inheritance.
POSIX ACLs — a separate extension layered over the mode bits, with their own well-known mask interactions.
Windows ACLs — the model SMB carries on the wire: ordered ACEs, deny-before-allow rules, and inheritance semantics that don’t line up with NFSv4 ACLs even when they look superficially similar.
S3 — its own object-grant model with bucket policy and object ACL, its own principal and condition syntax, and a separate notion of anonymous access.

These aren’t variations on one model. They differ in meaning: what a deny entry does, how inheritance works, whether entry order matters, and whether a principal can hold more than one kind of grant. There is no single mapping that is faithful to all of them.

Supporting them together needs a defined, configurable policy: what an NFSv3 chmod does to an existing Windows ACL, how an SMB ACL appears through an NFSv4 GETATTR, what an S3 PUT does to existing metadata. The right answer depends on the deployment and on which protocol is treated as authoritative for the data. Running every protocol on one VFS lets Chimera apply those rules in one place, with configurable precedence, where the deciding code can see all the views at once.
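One of those policy questions, what an NFSv3 chmod does to a file that already carries a Windows ACL, can be sketched as a configurable rule. The two policy names, the `Inode` fields, and the return strings are invented for illustration; they show the shape of a single decision point, not the policies Chimera actually offers.

```python
from dataclasses import dataclass, field
from enum import Enum


class ChmodPolicy(Enum):
    """Hypothetical policies for chmod on a file that has a Windows ACL."""
    ACL_WINS = "acl_wins"       # keep the ACL, ignore the mode change
    MODE_WINS = "mode_wins"     # drop the ACL, mode bits become authoritative


@dataclass
class Inode:
    mode: int = 0o644
    windows_acl: list = field(default_factory=list)


def nfs3_chmod(inode, new_mode, policy):
    """Apply a mode change according to the configured precedence."""
    if not inode.windows_acl:
        inode.mode = new_mode
        return "applied"
    if policy is ChmodPolicy.ACL_WINS:
        return "ignored"
    inode.windows_acl.clear()
    inode.mode = new_mode
    return "acl-replaced"
```

The function sees both the mode bits and the ACL at once, which is only possible when the protocols share one VFS; two separate daemons would each see only their own view.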

§ 09 Open Source

For all of the above, Chimera is mostly glue. There is a lot of it, because the protocols are large and inconsistent, but it is glue. The value is at the two ends of the channel: the application and the storage system. Chimera’s job is to connect them correctly.

The hard part isn’t the specifications. NFS, SMB, and S3 are well documented and have mature reference implementations. The hard part is behavioral correctness in the field: what a given Windows installer does on STATUS_DELETE_PENDING, how an NFSv4 client handles an inheritance flag that changes between two GETATTR calls, what an editor assumes about the timing of a lease recall. That isn’t in the specs; it’s in deployed clients. In-house testing doesn’t fully cover it; real applications do.

This is why Chimera is LGPL-2.1. A correct NAS implementation is most likely to come from many projects using the same code, each exercising different parts of the protocol surface and fixing what they hit.

The obvious objection is that open-sourcing it gives the work away. That only follows if the NAS stack is where the value is, and it usually isn’t. The differentiation in a storage product is in the distributed system, the data layout, durability, and operations. The protocol layer is something every project at this level has to build, and there is little benefit to each one building it privately.

Why build another NAS stack?