The RDMAC and the IBTA have defined so-called verbs
interfaces [VERBS-RDMAC,
IB-R1.2] that
semantically (but not syntactically) describe the interface between
a consumer (application or OS service) and an
iWARP RNIC or an InfiniBand HCA, respectively. RNICPI
is a programming interface for RNIC devices compliant with the RDMAC
verbs. The common characteristics of these programming interfaces
include:
- Hierarchical organization of RDMA objects.
- Asynchronous interfaces with work request / work completion
  semantics, supporting multiple outstanding operations.
- Explicit communication buffer management by the consumer to
  enable zero-copy implementations. A buffer can be registered
  for RDMA by creating a memory region (MR).
  Alternatively, a buffer can be registered for RDMA by binding
  (a.k.a. linking) a memory window (MW) to
  a segment of a previously created MR.
- Two-sided Send/Receive communication semantics as well as
  one-sided RDMA communication semantics.
Not surprisingly, these characteristics can also be found in
RDMA APIs with explicit communication buffer management, such as
the IT-API described below.
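
The work request / work completion pattern listed above can be sketched as a toy model. This is only an illustration of the asynchronous semantics: `ToyQueue`, `WorkRequest`, `WorkCompletion` and their methods are hypothetical names invented here, not part of the verbs, RNICPI or IT-API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    wr_id: int       # consumer-chosen identifier, echoed in the completion
    payload: bytes


@dataclass
class WorkCompletion:
    wr_id: int
    status: str


class ToyQueue:
    """Hypothetical stand-in for a work queue plus its completion queue."""

    def __init__(self):
        self.pending = deque()       # posted, not yet processed
        self.completions = deque()   # processed, not yet reaped

    def post(self, wr):
        # Posting only enqueues the request and returns immediately.
        self.pending.append(wr)

    def process_one(self):
        # Stands in for the device working asynchronously.
        wr = self.pending.popleft()
        self.completions.append(WorkCompletion(wr.wr_id, "success"))

    def poll(self):
        # Non-blocking: returns None when no completion is available.
        return self.completions.popleft() if self.completions else None


q = ToyQueue()
for i in range(3):
    q.post(WorkRequest(wr_id=i, payload=b"x"))
outstanding = len(q.pending)   # several operations outstanding at once
q.process_one()
first = q.poll()
```

The consumer keeps three operations outstanding and reaps completions separately from posting, which is the property the list above calls "multiple outstanding operations".
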
The IT-API [IT-API-V2.1]
is the first RDMA-capable application programming interface to fully
support the iWARP and InfiniBand 1.2 transports. It provides RDMA
services to applications that need high-performance, zero-copy communication,
explicit memory management semantics for communication buffers,
and asynchronous interfaces. As shown for iWARP in Fig. 4,
the IT-API is located between consumers and RDMA OS support.
The IT-API is transport neutral for maximum portability. For
connection management, however, where the iWARP transport has
special requirements, it features both a transport-independent
interface (TII) and a transport-dependent
interface (TDI).
The TII is based on the it_ep_connect() interface.
For the iWARP transport, the implementation of it_ep_connect()
immediately transitions the connection to RDMA mode after TCP's
SYN/SYN+ACK/ACK handshake. As described in [IT-API-V2.1],
Appendix B, Implementer's Guide to Connection Management
for iWARP, implementations are possible that will interoperate
with both 'permissive' and 'non-permissive' RNICs [INTEROP-IETF].
The implementation of it_ep_connect() is a good
example of OS-supported functionality that should reside within
generic software layers shared by all RDMA devices, referred to
as the user/kernel Access Layer (uAL/kAL).
The TDI provides the it_socket_convert() socket
conversion interface for iWARP, which allows converting a live TCP
connection to an RDMA-enabled endpoint after exchanging data in
TCP streaming mode. Socket conversion was specifically designed
to support applications such as iSCSI Extensions for RDMA (iSER)
and the Sockets Direct Protocol (SDP). The IT-API also supports interoperability
with RDMAC-compliant RNICs that are unable to perform the IETF's MPA
Request/Reply handshake. Socket conversion is another example of
functionality that is best implemented within the generic uAL/kAL
for use on any RNIC.
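
The difference between the two connection paths can be sketched as a small state machine. This is a toy model under stated assumptions: `ToyConnection`, `connect_immediate` and `connect_then_convert` are hypothetical names, the MPA Request/Reply exchange is not modeled, and nothing here reflects the actual signatures of it_ep_connect() or it_socket_convert().

```python
from enum import Enum


class Mode(Enum):
    CLOSED = "closed"
    TCP_STREAMING = "tcp-streaming"
    RDMA = "rdma"


class ToyConnection:
    """Hypothetical model of a connection's mode, not a real endpoint."""

    def __init__(self):
        self.mode = Mode.CLOSED
        self.streamed = []   # data exchanged before entering RDMA mode

    def tcp_handshake(self):
        # Stands in for TCP's SYN/SYN+ACK/ACK exchange.
        self.mode = Mode.TCP_STREAMING

    def stream(self, data):
        if self.mode is not Mode.TCP_STREAMING:
            raise RuntimeError("streaming requires TCP streaming mode")
        self.streamed.append(data)

    def to_rdma(self):
        if self.mode is not Mode.TCP_STREAMING:
            raise RuntimeError("only a live TCP connection can be converted")
        self.mode = Mode.RDMA


def connect_immediate():
    # Models the TII path: RDMA mode right after the handshake.
    conn = ToyConnection()
    conn.tcp_handshake()
    conn.to_rdma()
    return conn


def connect_then_convert(preamble):
    # Models the TDI path: stream first, convert the live connection later.
    conn = ToyConnection()
    conn.tcp_handshake()
    for message in preamble:
        conn.stream(message)
    conn.to_rdma()
    return conn
```

In this sketch, the only difference between the two paths is whether any data passes in streaming mode before the transition, which is the case that protocols such as iSER rely on.
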
Fig. 5 shows IT-API's RDMA object hierarchy, which closely resembles
the hierarchical organization of the underlying verbs interfaces.
For instance, an IT-API interface adapter (IA)
corresponds to a logical RNIC (and its associated
RNIC handle) in the verbs interface. Similarly, an endpoint
(EP) corresponds to an underlying queue pair
(QP). On the other hand, an event dispatcher
(EVD) has a corresponding completion queue (CQ)
but additionally provides OS-supported event handling through the
it_evd_wait() and it_evd_callback_*()
calls. Event handling is another example of OS-supported functionality
that should reside within the uAL/kAL.
Based on the RDMAC Verbs [VERBS-RDMAC]
and the InfiniBand Verbs [IB-R1.2],
the RNICPI [RNICPI-V1.0]
defines a generic programming interface for the vendor-independent
integration of both iWARP RNICs and InfiniBand HCAs into Unix-like
operating systems and is supported by a majority of OS and RNIC
vendors. RNICPI takes into account requirements specific to the
iWARP transport such as connection setup with immediate or deferred
transition to RDMA mode. As shown for iWARP in Fig. 4, RNICPI is
located between RDMA OS support and the implementation of the iWARP
protocol layers.
Regarding RNICPI's RDMA object hierarchy, the distinction between
Verbs Providers (VPs) and logical
RNICs is worth mentioning because it affects the organization
and identification of RDMA objects. Each VP provides both a kVP
module and a uVP library. A consumer first creates a logical RNIC
within the VP and then creates RDMA objects such as protection domains,
queue pairs, etc., within the logical RNIC. This lets a VP organize
RDMA objects per logical RNIC.
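
One way to picture this two-level hierarchy is the following toy model. It is only a sketch: `VerbsProvider`, `LogicalRnic` and the (RNIC handle, object id) identification scheme are assumptions made for illustration, not structures taken from the RNICPI specification.

```python
class LogicalRnic:
    """Hypothetical container for the RDMA objects of one logical RNIC."""

    def __init__(self, handle):
        self.handle = handle
        self.objects = {}    # object id -> kind, e.g. "pd" or "qp"
        self._next_id = 0

    def create(self, kind):
        # Object ids are scoped to this logical RNIC (an assumption here).
        object_id = self._next_id
        self._next_id += 1
        self.objects[object_id] = kind
        return (self.handle, object_id)


class VerbsProvider:
    """Hypothetical VP that hands out logical RNICs."""

    def __init__(self, name):
        self.name = name
        self.rnics = {}      # RNIC handle -> LogicalRnic

    def create_logical_rnic(self):
        handle = len(self.rnics)
        self.rnics[handle] = LogicalRnic(handle)
        return self.rnics[handle]


vp = VerbsProvider("toy-vp")
rnic0 = vp.create_logical_rnic()
rnic1 = vp.create_logical_rnic()
pd = rnic0.create("pd")      # protection domain in the first logical RNIC
qp = rnic0.create("qp")      # queue pair in the same logical RNIC
other_pd = rnic1.create("pd")
```

Because every object is created inside a logical RNIC, the same object id can appear in two logical RNICs without ambiguity, as `pd` and `other_pd` show.
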
The following examples simultaneously illustrate the programming
model and the underlying protocol actions for Send/Receive and RDMA
Write operations.
Figs. 6 and 7 show a Send and its matching Receive operation, also
referred to as two-sided communication semantics. In Fig. 6, the
application is assumed to have preregistered two source buffers
(MRs) with STags s1 and s2
and to have posted a Send work request (WR) to endpoint EP1
(more precisely, to its SQ). The implementation gets the WR, transfers
the data directly from buffer s1 to the RNIC,
and forms the Send message. Fig. 6 illustrates a case where one
TCP segment contains multiple DDP segments. When the implementation
can guarantee that data will be delivered reliably, it indicates
completion by posting a work completion to the EVD (CQ). At this
time, control is returned to the consumer waiting on the EVD. For
Fig. 7, the application has already preregistered destination buffers
(MRs) d1 and d2 and posted
a Receive WR to EP2. When the Send message from
EP1 comes in, the implementation gets the WR and transfers the data
directly from the RNIC to buffer d1. In our animation
(Fig. 7), the DDP segments are placed in order, but an implementation
may also place DDP segments out of order. Finally, the
implementation indicates delivery by posting a work completion to
the EVD. At this time, control is returned to the consumer waiting
on the EVD.
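
The remark about out-of-order placement can be illustrated with a short sketch. It assumes that each segment carries its offset within the message, so the receiver can place segments in any arrival order. The helper names `segment` and `place_segments` are hypothetical, and headers, work completions and error handling are left out.

```python
import random


def segment(message, size):
    # Split a message into (offset, chunk) pairs.
    return [(off, message[off:off + size])
            for off in range(0, len(message), size)]


def place_segments(segments, recv_buffer, seed=0):
    # Place segments in a shuffled order; the offsets make the order irrelevant.
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    for off, chunk in shuffled:
        recv_buffer[off:off + len(chunk)] = chunk
    return sum(len(chunk) for _, chunk in shuffled)


message = b"two-sided send/receive"
d1 = bytearray(64)   # stands in for the posted destination buffer d1
placed = place_segments(segment(message, 5), d1)
```

Whatever order the shuffle produces, the buffer ends up holding the message contiguously. Delivery to the consumer, by contrast, is only signaled once the whole message has been placed.
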
Figs. 8 and 9 depict an RDMA Write operation at the data source
and data sink, respectively. Since RDMA operations have one-sided
communication semantics, the consumer at the data sink does not
need to post a WR in order to receive. Instead, the consumer at the
data sink has preregistered a destination buffer d1
and advertised d1 to its peer in advance. The consumer
at the data source posts an RDMA Write WR targeting the destination
buffer with STag d1, which is sent along with
the RDMA Write message. When the incoming RDMA Write has been placed,
delivery is indicated to RDMAP, but no completion is generated at
the data sink.
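
The one-sided semantics can be summarized in a toy model. `ToySink` and `ToySource` are hypothetical names, STags are modeled as plain integers, and the work completion recorded at the data source is an assumption about a signaled WR. None of this is IT-API or RNICPI code.

```python
class ToySink:
    """Hypothetical data sink that only registers and advertises buffers."""

    def __init__(self):
        self.regions = {}       # STag -> registered buffer
        self.completions = []   # stays empty for incoming RDMA Writes
        self._next_stag = 1

    def register(self, size):
        stag = self._next_stag
        self._next_stag += 1
        self.regions[stag] = bytearray(size)
        return stag             # the value advertised to the peer

    def place(self, stag, offset, data):
        buf = self.regions[stag]   # an unknown STag raises KeyError
        if offset < 0 or offset + len(data) > len(buf):
            raise ValueError("write outside the registered region")
        buf[offset:offset + len(data)] = data


class ToySource:
    """Hypothetical data source that posts RDMA Write requests."""

    def __init__(self):
        self.completions = []

    def rdma_write(self, sink, stag, offset, data, wr_id):
        # The STag travels with the write and selects the sink buffer.
        sink.place(stag, offset, data)
        self.completions.append(wr_id)


sink = ToySink()
d1 = sink.register(16)          # preregistered, then advertised
source = ToySource()
source.rdma_write(sink, d1, 4, b"data", wr_id=7)
```

After the write, the data is in the sink's buffer, the sink's completion list is still empty, and only the source has a completion to reap.
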