The layered RDMA host software architecture shown in Fig. 10 cleanly
separates generic/OS functionality from verbs-provider-specific
functionality: the former resides in the user/kernel Access Layer
(uAL/kAL) components, the latter in the user/kernel Verbs Provider
(uVP/kVP) components. This separation has a number of advantages, including:
(A1) Generic implementation of it_ep_connect() within uAL/kAL reduces code bloat.
(A2) Generic implementation of it_socket_convert() within uAL/kAL reduces code bloat.
(A3) Generic event handling provided by uAL/kAL reduces code bloat.
(A4) Memory management extensions for RDMA in kAL provide memory
pinning both at VMA level and page level.
(A5) A single device file for the kAL is sufficient for multi-device/multi-vendor
RDMA system call support including event handling and simplifies
the auditing of VP-specific software.
(A6) OS-wide unique object IDs for identifying RDMA objects can
be managed by the kAL, which simplifies RDMA syscalls.
(A1) to (A3) were discussed in the description of the IT-API.
Memory management (MM) extensions for RDMA (A4) are an excellent
example of generic functionality that belongs in the kAL. For instance,
we have extended Linux MM such that memory is pinned consistently
at both VMA level and page level. This resolves a number of issues
related to pinning of read-only address intervals, handling of copy-on-write
(COW) situations, and overlapping pinnings.
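
To make the issue of overlapping pinnings concrete, the following toy userspace model keeps a pin count per page, so that a page stays pinned until every pinning that covers it has been released. This is only a sketch under stated assumptions and not the Linux MM extension described above: the names pin_range(), unpin_range(), and page_is_pinned(), the 4 KiB page size, and the 64-page address space are all hypothetical, and VMA-level bookkeeping, read-only intervals, and COW handling are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define PAGE_SHIFT_SKETCH 12   /* assumption: 4 KiB pages */
#define NPAGES_SKETCH     64   /* assumption: toy address space of 64 pages */

/* Hypothetical per-page pin counters; a page is pinned while its count > 0. */
static unsigned pin_count[NPAGES_SKETCH];

/* Validate [addr, addr + len) and compute the first and last page it touches. */
int range_to_pages(size_t addr, size_t len, size_t *first, size_t *last)
{
    size_t limit = (size_t)NPAGES_SKETCH << PAGE_SHIFT_SKETCH;

    if (len == 0 || addr >= limit || len > limit - addr)
        return -1;
    *first = addr >> PAGE_SHIFT_SKETCH;
    *last  = (addr + len - 1) >> PAGE_SHIFT_SKETCH;
    return 0;
}

/* Pin every page touched by the interval; overlapping pinnings just add up. */
int pin_range(size_t addr, size_t len)
{
    size_t first, last;

    if (range_to_pages(addr, len, &first, &last) != 0)
        return -1;
    for (size_t p = first; p <= last; p++) {
        assert(pin_count[p] != UINT_MAX);   /* counter must not wrap */
        pin_count[p]++;
    }
    return 0;
}

/* Release one pinning; fails without side effects if a page is not pinned. */
int unpin_range(size_t addr, size_t len)
{
    size_t first, last;

    if (range_to_pages(addr, len, &first, &last) != 0)
        return -1;
    for (size_t p = first; p <= last; p++)
        if (pin_count[p] == 0)
            return -1;
    for (size_t p = first; p <= last; p++)
        pin_count[p]--;
    return 0;
}

int page_is_pinned(size_t page)
{
    return page < NPAGES_SKETCH && pin_count[page] > 0;
}
```

In this model, pinning pages 1 to 3 and then pages 2 to 3 leaves pages 2 and 3 pinned after the first pinning is released, which is the behavior one would expect from consistent handling of overlapping pinnings.
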
Regarding (A5), the use of a single device file for the kAL simplifies
the auditing of verbs-provider-specific software, since all RDMA
syscalls pass through the uAL as well as the kAL's syscall handler
for parameter validation.
Consider now the creation of an RDMA resource, which typically
consists of a uAL object, a kAL object, and corresponding uVP and
kVP objects. The uAL and kAL objects are used for the OS-wide organization
and identification of RDMA resources and are typically small; the
uVP and kVP objects contain data structures specific to an RNIC
implementation.
Fig. 11 illustrates the creation of an endpoint and its associated
queue pair. An it_ep_rc_create() call (1) into
the uAL results in an ri_qp_create() call (2)
that is passed via NP-RNICPI to the uVP, which in turn calls the
uAL's ri_sys_qp_create() (3). This SYS-RNICPI
upcall identifies the uAL's endpoint context via the os_data
opaque and invokes the kAL-provided syscall (4) for creating an
IT-API endpoint, which passes userspace context information of both
uAL and uVP (for the endpoint and the corresponding queue pair,
respectively) down to the kAL.
As an illustration of (A6), the kAL's syscall handler now generates
an OS-wide unique endpoint ID, which is used to identify the object
in all subsequent kAL syscalls that refer to the endpoint. These
syscalls are simplified because partially redundant verbs
information, such as the triple (selected VP, RNIC handle, QP handle),
is replaced with a single, OS-wide unique endpoint ID.
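
The idea behind (A6) can be sketched as a small table that maps one ID to the triple it stands for. The code below is a minimal illustration, not the kAL's actual data structure: struct ep_entry, ep_id_alloc(), ep_id_lookup(), ep_id_free(), the fixed table size, and the 32-bit ID width are all assumptions made for the example, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EP_SKETCH 16   /* assumption: small fixed-size table */

/* Hypothetical record of the verbs triple that one endpoint ID replaces. */
struct ep_entry {
    int      in_use;
    uint32_t id;            /* OS-wide unique endpoint ID */
    int      vp_index;      /* selected verbs provider */
    uint64_t rnic_handle;
    uint64_t qp_handle;
};

static struct ep_entry ep_table[MAX_EP_SKETCH];
static uint32_t next_ep_id = 1;   /* 0 is reserved as "invalid" */

/* Register a triple and return its new ID, or 0 if the table is full. */
uint32_t ep_id_alloc(int vp_index, uint64_t rnic_handle, uint64_t qp_handle)
{
    for (size_t i = 0; i < MAX_EP_SKETCH; i++) {
        if (!ep_table[i].in_use) {
            assert(next_ep_id != 0);   /* sketch ignores counter wrap-around */
            ep_table[i].in_use      = 1;
            ep_table[i].id          = next_ep_id;
            ep_table[i].vp_index    = vp_index;
            ep_table[i].rnic_handle = rnic_handle;
            ep_table[i].qp_handle   = qp_handle;
            return next_ep_id++;
        }
    }
    return 0;
}

/* Resolve an ID back to its triple; later syscalls need only pass the ID. */
const struct ep_entry *ep_id_lookup(uint32_t id)
{
    if (id == 0)
        return NULL;
    for (size_t i = 0; i < MAX_EP_SKETCH; i++)
        if (ep_table[i].in_use && ep_table[i].id == id)
            return &ep_table[i];
    return NULL;
}

/* Release an ID; IDs are never reused in this sketch. */
int ep_id_free(uint32_t id)
{
    for (size_t i = 0; i < MAX_EP_SKETCH; i++) {
        if (ep_table[i].in_use && ep_table[i].id == id) {
            ep_table[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}
```

A syscall handler built this way would take a single integer argument and look up the provider, RNIC handle, and QP handle itself, instead of validating three separately supplied values on every call.
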
Next, the kAL calls the kVP's ri_qp_create()
(5) via P-RNICPI. Optionally, at the verbs provider's choice,
the kAL can pass the unique endpoint ID to ri_qp_create()
as a replacement for a verbs-provider-generated QP handle; using
the same OS-wide unique ID for the endpoint and the corresponding
queue pair simplifies resource management. The kVP's implementation
of ri_qp_create() makes an upcall (6) into the
kAL either to map work queues in device memory into userspace or
to allocate work queues in main memory as dual user/kernel mappings,
i.e., mappings that are simultaneously visible to uVP and kVP. See
SoftRDMA for the use of
dual user/kernel mappings.
Note that the uAL's it_ep_rc_create()
call, upon returning from ri_qp_create() (13),
can easily audit the uVP by checking whether it called back
into the uAL as required (A5).
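
The audit step can be pictured as a flag in the uAL's endpoint context that only the upcall sets. The sketch below is a simplified model of steps (1) to (3) and (13), not the real IT-API or RNICPI code: all function names, the struct ual_ep_ctx layout, and the placeholder endpoint ID are hypothetical, and the syscall into the kAL is replaced by a stub.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical uAL endpoint context, handed to the uVP as an opaque pointer
 * (it plays the role of os_data in the text). */
struct ual_ep_ctx {
    int      sys_upcall_seen;   /* set when the uVP calls back into the uAL */
    unsigned ep_id;             /* stand-in for the ID returned by the kAL */
};

/* Stand-in for the uAL's SYS upcall (step 3). A real implementation would
 * issue the kAL syscall here; this stub only records that it was reached. */
int sketch_sys_qp_create(void *os_data)
{
    struct ual_ep_ctx *ctx = os_data;

    assert(ctx != NULL);
    ctx->sys_upcall_seen = 1;
    ctx->ep_id = 42;            /* placeholder value, not a real ID */
    return 0;
}

/* Hypothetical entry point that a uVP exposes for queue pair creation. */
typedef int (*uvp_qp_create_fn)(void *os_data);

/* A well-behaved uVP calls back into the uAL as required. */
int good_uvp_qp_create(void *os_data)
{
    return sketch_sys_qp_create(os_data);
}

/* A misbehaving uVP skips the upcall and still reports success. */
int rogue_uvp_qp_create(void *os_data)
{
    (void)os_data;
    return 0;
}

/* Stand-in for the uAL's endpoint creation (step 1): call the uVP (step 2)
 * and, once it returns (step 13), audit that the upcall really happened. */
int sketch_ep_create(uvp_qp_create_fn uvp_qp_create, struct ual_ep_ctx *ctx)
{
    int rc;

    ctx->sys_upcall_seen = 0;
    ctx->ep_id = 0;
    rc = uvp_qp_create(ctx);
    if (rc != 0)
        return rc;
    if (!ctx->sys_upcall_seen)
        return -1;              /* audit failed: uVP bypassed the uAL */
    return 0;
}
```

With this structure, a provider that bypasses the access layer is detected by the generic code rather than by provider-specific checks.
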
The layered architecture also has a few potential disadvantages:
(D1) On the fast path, an IT-API work request such as it_post_send()
or it_post_rdma_write() must be converted
to a corresponding RNICPI work request, viz., ri_qp_post_send()
or ri_qp_post_rdma_write().
(D2) On the fast path, an RNICPI work completion dequeued with
ri_cq_poll() must be converted to the format
expected by IT-API's it_evd_wait().
However, since the work request and work completion formats of
IT-API and RNICPI are similar, an implementation can convert quite
efficiently between the two. Further optimizations are possible,
though, and the IT-API and RNICPI work groups are open to suggestions.
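
To give a feel for why the conversion in (D1) is cheap when the two formats are similar, the sketch below copies a send work request field by field between two invented layouts. Neither struct is the real IT-API or RNICPI definition: struct api_send_wr, struct nic_send_wr, their fields, the flag value, and the limit of four scatter/gather elements are assumptions chosen only to show the shape of such a conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SGE_SKETCH  4      /* assumption: at most 4 scatter/gather elements */
#define NIC_WR_SIGNALED 0x1u   /* assumption: hypothetical completion flag */

/* Hypothetical API-level (IT-API-like) work request. */
struct api_segment {
    void    *addr;
    uint32_t length;
    uint32_t lmr;              /* local memory region handle */
};

struct api_send_wr {
    uint64_t                  cookie;
    size_t                    num_segments;
    const struct api_segment *segments;
    int                       signaled;
};

/* Hypothetical provider-level (RNICPI-like) work request. */
struct nic_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t stag;
};

struct nic_send_wr {
    uint64_t       wr_id;
    uint32_t       num_sge;
    struct nic_sge sge[MAX_SGE_SKETCH];
    uint32_t       flags;
};

/* Convert one send work request; the cost is a few stores per segment. */
int convert_send_wr(const struct api_send_wr *in, struct nic_send_wr *out)
{
    assert(in != NULL && out != NULL);
    if (in->num_segments > MAX_SGE_SKETCH)
        return -1;

    out->wr_id   = in->cookie;
    out->num_sge = (uint32_t)in->num_segments;
    for (size_t i = 0; i < in->num_segments; i++) {
        out->sge[i].addr   = (uint64_t)(uintptr_t)in->segments[i].addr;
        out->sge[i].length = in->segments[i].length;
        out->sge[i].stag   = in->segments[i].lmr;
    }
    out->flags = in->signaled ? NIC_WR_SIGNALED : 0u;
    return 0;
}
```

The conversion of work completions in (D2) would follow the same pattern in the opposite direction.
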