Layered software architecture

A well-layered RDMA host software architecture is shown in Fig. 1, which assumes the use of IT-API for applications and of a Verbs-compatible interface for RNIC drivers. The clean separation of generic/OS and verbs-provider-specific software functionality into user/kernel Access Layer (uAL/kAL) and user/kernel Verbs Provider (uVP/kVP) components, respectively, has a number of advantages, including:

(A1) Generic implementation of it_ep_connect() within uAL/kAL reduces code bloat
(A2) Generic implementation of it_socket_convert() within uAL/kAL reduces code bloat
(A3) Generic event handling provided by uAL/kAL reduces code bloat
(A4) Memory management extensions for RDMA in kAL provide memory pinning both at VMA level and page level
(A5) A single device file for the kAL is sufficient for multi-device/multi-vendor RDMA system call support including event handling and simplifies the auditing of VP-specific software
(A6) OS-wide unique object IDs for identifying RDMA objects can be managed by the kAL, which simplifies RDMA syscalls

Advantages (A1) to (A3) were discussed in the description of the IT-API above.

Memory management (MM) extensions for RDMA (A4) are an excellent example of generic functionality that belongs in the kAL. For instance, we have extended Linux MM such that memory is pinned consistently at both VMA level and page level. This resolves a number of issues related to the pinning of read-only address intervals, the handling of copy-on-write (COW) situations, and overlapping pinnings.
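To make the overlapping-pinnings issue concrete, the following is a minimal sketch of page-level pin counting, in which a page remains pinned until every interval covering it has been unpinned. It is not the Linux MM extension described above: the page size, the toy address space, and all sketch_* names are assumptions chosen for illustration, and VMA-level bookkeeping is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumed 4 KiB pages */
#define SKETCH_NUM_PAGES  64   /* toy address space, for illustration only */

/* Hypothetical per-page pin counters; not the Linux MM data structures. */
static unsigned int sketch_pin_count[SKETCH_NUM_PAGES];

/* Compute the first and last page touched by [addr, addr + len).
 * Returns 0 on success, -1 if the range is empty, wraps, or is out of bounds. */
static int sketch_page_span(uintptr_t addr, size_t len, size_t *first, size_t *last)
{
    if (len == 0)
        return -1;
    *first = (size_t)(addr >> SKETCH_PAGE_SHIFT);
    *last  = (size_t)((addr + len - 1) >> SKETCH_PAGE_SHIFT);
    if (*last < *first || *last >= SKETCH_NUM_PAGES)
        return -1;
    return 0;
}

/* Pin every page in the interval by incrementing its counter. */
static int sketch_pin_range(uintptr_t addr, size_t len)
{
    size_t first, last;
    if (sketch_page_span(addr, len, &first, &last) != 0)
        return -1;
    for (size_t p = first; p <= last; p++)
        sketch_pin_count[p]++;
    return 0;
}

/* Unpin an interval; fails without side effects if any page is not pinned. */
static int sketch_unpin_range(uintptr_t addr, size_t len)
{
    size_t first, last;
    if (sketch_page_span(addr, len, &first, &last) != 0)
        return -1;
    for (size_t p = first; p <= last; p++)
        if (sketch_pin_count[p] == 0)
            return -1;
    for (size_t p = first; p <= last; p++)
        sketch_pin_count[p]--;
    return 0;
}

/* A page is pinned while at least one interval still covers it. */
static int sketch_page_is_pinned(size_t page)
{
    return page < SKETCH_NUM_PAGES && sketch_pin_count[page] > 0;
}
```

With two overlapping pinnings, releasing the first leaves the shared page pinned on behalf of the second, which is the behavior an application registering overlapping memory regions would expect.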

Regarding (A5), the use of a single device file for the kAL simplifies the auditing of verbs-provider-specific software, since all RDMA syscalls pass through the uAL as well as the kAL’s syscall handler for parameter validation.

Consider now the creation of an RDMA resource, which typically consists of a uAL object, a kAL object, and corresponding uVP and kVP objects. The uAL and kAL objects are used for the OS-wide organization and identification of RDMA resources and are typically small; the uVP and kVP objects contain data structures specific to an RNIC implementation.

Figure 2 illustrates the creation of an endpoint and its associated queue pair. An it_ep_rc_create() call (1) into the uAL results in an ri_qp_create() call (2) that is passed via NP-RNICPI to the uVP, which in turn calls the uAL’s ri_sys_qp_create() (3). This SYS-RNICPI upcall identifies the uAL’s endpoint context via the os_data opaque and invokes the kAL-provided syscall (4) for creating an IT-API endpoint, which passes userspace context information of both uAL and uVP (for the endpoint and the corresponding queue pair, respectively) down to the kAL.

As an illustration of (A6), the kAL’s syscall handler now generates an OS-wide unique endpoint ID, which is used for object identification in all subsequent kAL syscalls referring to the endpoint. These syscalls are simplified because partially redundant verbs information, such as the triple (selected VP, RNIC handle, QP handle), is replaced with a single, OS-wide unique endpoint ID.
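The idea behind (A6) can be sketched as a small ID table that maps one endpoint ID to the verbs triple it replaces. This is an illustrative sketch only: the table size, the field types, and all sketch_* names are assumptions, and a real kAL would also need locking and per-process ownership checks.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_EP 16   /* assumed table size, for illustration only */

/* The verbs triple that a single endpoint ID stands in for. */
struct sketch_ep_entry {
    int      in_use;
    uint32_t vp_index;     /* selected verbs provider */
    uint64_t rnic_handle;
    uint64_t qp_handle;
};

static struct sketch_ep_entry sketch_ep_table[SKETCH_MAX_EP];

/* Allocate an ID for the triple. Returns an ID >= 1, or 0 if the table is full. */
static uint32_t sketch_ep_id_alloc(uint32_t vp_index, uint64_t rnic_handle,
                                   uint64_t qp_handle)
{
    for (uint32_t i = 0; i < SKETCH_MAX_EP; i++) {
        if (!sketch_ep_table[i].in_use) {
            sketch_ep_table[i] = (struct sketch_ep_entry){
                .in_use = 1,
                .vp_index = vp_index,
                .rnic_handle = rnic_handle,
                .qp_handle = qp_handle,
            };
            return i + 1;   /* 0 is reserved as the invalid ID */
        }
    }
    return 0;
}

/* Resolve an ID passed in a later syscall; NULL means the ID is invalid. */
static const struct sketch_ep_entry *sketch_ep_lookup(uint32_t id)
{
    if (id == 0 || id > SKETCH_MAX_EP || !sketch_ep_table[id - 1].in_use)
        return NULL;
    return &sketch_ep_table[id - 1];
}

/* Release an ID when the endpoint is destroyed. */
static void sketch_ep_id_free(uint32_t id)
{
    if (id != 0 && id <= SKETCH_MAX_EP)
        sketch_ep_table[id - 1].in_use = 0;
}
```

A subsequent syscall then carries only the ID, and a single lookup both validates the argument and recovers the provider, RNIC, and QP to dispatch to.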

Next, the kAL calls the kVP’s ri_qp_create() (5) via P-RNICPI. Optionally, and selectable by the verbs provider, the kAL can pass the unique endpoint ID to ri_qp_create() as a replacement for a verbs-provider-generated QP handle; using the same OS-wide unique ID for the endpoint and the corresponding queue pair simplifies resource management. The kVP’s implementation of ri_qp_create() makes an upcall (6) into the kAL either to map work queues in device memory into userspace or to allocate work queues in main memory as dual user/kernel mappings, i.e., mappings that are simultaneously visible to uVP and kVP.

It should be noted that the uAL’s it_ep_rc_create() call, upon returning from ri_qp_create() (13), can easily audit the uVP by checking whether or not it called back into the uAL as required (A5).
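A minimal sketch of such an audit is given below, assuming the uAL records the upcall in its endpoint context and inspects that record once the provider call returns. The sketch_* functions are hypothetical stand-ins for it_ep_rc_create(), ri_qp_create(), and ri_sys_qp_create(); their signatures are simplified and do not reproduce the IT-API or RNICPI definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical uAL endpoint context, reachable by the uVP only as opaque os_data. */
struct sketch_ual_ep {
    bool upcall_seen;
};

/* Stand-in for the uAL's SYS-RNICPI upcall: note that the uVP called back.
 * A real uAL would issue the kAL syscall from here. */
static int sketch_ual_sys_qp_create(void *os_data)
{
    struct sketch_ual_ep *ep = os_data;
    ep->upcall_seen = true;
    return 0;
}

/* Simplified shape of a uVP's QP-creation entry point. */
typedef int (*sketch_uvp_qp_create_fn)(void *os_data);

/* Stand-in for the uAL's endpoint creation: call the uVP, then audit it. */
static int sketch_ual_ep_create(struct sketch_ual_ep *ep,
                                sketch_uvp_qp_create_fn uvp_qp_create)
{
    ep->upcall_seen = false;
    int rc = uvp_qp_create(ep);
    if (rc != 0)
        return rc;
    /* Audit: a provider that skipped the upcall bypassed the kAL. */
    return ep->upcall_seen ? 0 : -1;
}

/* A well-behaved provider calls back into the uAL as required. */
static int sketch_good_uvp(void *os_data)
{
    return sketch_ual_sys_qp_create(os_data);
}

/* A misbehaving provider reports success without the upcall. */
static int sketch_bad_uvp(void *os_data)
{
    (void)os_data;
    return 0;
}
```

The check costs one flag per endpoint and runs only on the slow path of resource creation, so it does not affect work request posting.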

The layered architecture also has a few potential disadvantages:

(D1) On the fast path, an IT-API work request such as it_post_send() or it_post_rdma_write() must be converted to the corresponding RNICPI work request, namely ri_qp_post_send() or ri_qp_post_rdma_write().
(D2) On the fast path, an RNICPI work completion dequeued with ri_cq_poll() must be converted to the format expected by IT-API’s it_evd_wait().

However, since the work request and work completion formats of IT-API and RNICPI are similar, an implementation can convert between the two quite efficiently. Further optimizations are possible, and the IT-API and RNICPI work groups are open to suggestions.
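To indicate why the conversion in (D1) can be cheap, the following sketch copies a list of data segments from an IT-API-style layout to an RNICPI-style layout. Both structures are hypothetical and deliberately simplified; the actual IT-API and RNICPI definitions differ in names and fields, so this shows only the shape of the per-request work.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IT-API-style data segment (simplified, not the real definition). */
struct sketch_it_segment {
    uint64_t addr;
    uint32_t length;
    uint32_t lmr;    /* local memory region handle */
};

/* Hypothetical RNICPI-style scatter/gather element (simplified likewise). */
struct sketch_ri_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t stag;   /* steering tag naming the same region */
};

/* Convert n segments into out[]. Returns the number converted,
 * or -1 if out[] is too small. Cost is one field-wise copy per segment. */
static int sketch_convert_segments(const struct sketch_it_segment *in, size_t n,
                                   struct sketch_ri_sge *out, size_t out_cap)
{
    if (n > out_cap)
        return -1;
    for (size_t i = 0; i < n; i++) {
        out[i].addr   = in[i].addr;
        out[i].length = in[i].length;
        out[i].stag   = in[i].lmr;   /* assumes the handle maps directly to an STag */
    }
    return (int)n;
}
```

Where the two layouts match exactly, an implementation could avoid even this copy by passing the segment array through unchanged, which is one of the further optimizations the similarity of the formats makes possible.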