Programming interfaces for RDMA

Common characteristics of RDMA programming interfaces

The RDMAC and the IBTA have defined so-called verbs interfaces [VERBS-RDMAC] [IB-R1.2] that semantically (but not syntactically) describe the interface between a consumer (application or OS service) and an iWARP RNIC or an InfiniBand HCA, respectively. RNICPI is a programming interface for RNIC devices compliant with the RDMAC verbs [RNICPI]. The common characteristics of these programming interfaces include:

  • Hierarchical organization of RDMA objects
  • Asynchronous interfaces with work request / work completion semantics, supporting multiple outstanding operations
  • Explicit communication buffer management by the consumer to enable zero-copy implementations. A buffer can be registered for RDMA by creating a memory region (MR). Alternatively, a buffer can be registered for RDMA by binding (a.k.a. linking) a memory window (MW) to a segment of a previously created MR.
  • Two-sided Send/Receive communication semantics, as well as one-sided RDMA communication semantics

Not surprisingly, these characteristics can also be found in RDMA APIs with explicit communication buffer management, such as the IT-API described below.
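The two registration paths named in the list above (creating an MR, or binding an MW to a segment of an existing MR) can be sketched with a small toy model. This is an illustration only, not any verbs or IT-API binding: the names MemoryRegion, MemoryWindow, register_mr and bind_mw are hypothetical, and STags are handed out by a simple counter rather than by an RNIC.

```python
from dataclasses import dataclass
from itertools import count

_stags = count(1)  # toy STag allocator (assumption; a real RNIC assigns STags)


@dataclass
class MemoryRegion:
    """A registered buffer, identified by a steering tag (STag)."""
    stag: int
    buf: bytearray


@dataclass
class MemoryWindow:
    """A window bound to the segment [offset, offset + length) of an MR."""
    stag: int
    mr: MemoryRegion
    offset: int
    length: int

    def write(self, pos: int, data: bytes) -> None:
        # Accesses through the window must stay inside the bound segment.
        if pos < 0 or pos + len(data) > self.length:
            raise ValueError("access outside memory window")
        start = self.offset + pos
        self.mr.buf[start:start + len(data)] = data


def register_mr(buf: bytearray) -> MemoryRegion:
    """Register a whole buffer for RDMA by creating a memory region."""
    return MemoryRegion(next(_stags), buf)


def bind_mw(mr: MemoryRegion, offset: int, length: int) -> MemoryWindow:
    """Bind a memory window to a segment of a previously created MR."""
    if offset < 0 or length < 0 or offset + length > len(mr.buf):
        raise ValueError("window exceeds memory region")
    return MemoryWindow(next(_stags), mr, offset, length)


buf = bytearray(16)
mr = register_mr(buf)
mw = bind_mw(mr, offset=4, length=8)
mw.write(0, b"abcd")  # placed at buf[4:8], directly in the registered buffer
```

The window gets its own STag and exposes only part of the region, which is why a peer holding the window's STag cannot reach bytes outside the bound segment in this model.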

Interconnect Transport API (IT-API)

The IT-API [IT-API-V2.1] is the first RDMA-capable application programming interface to fully support the iWARP and InfiniBand 1.2 transports. It provides RDMA services to applications that need high-performance, zero-copy communication, explicit memory management semantics for communication buffers, and asynchronous interfaces. As shown for iWARP in Fig. 4, the IT-API is located between consumers and RDMA OS support.

The IT-API is transport neutral for maximum portability, but features both a transport-independent interface (TII) and a transport-dependent interface (TDI) for connection management, where special requirements exist for the iWARP transport.

The TII is based on the it_ep_connect() interface. For the iWARP transport, the implementation of it_ep_connect() immediately transitions the connection to RDMA mode after TCP’s SYN/SYN+ACK/ACK handshake. As described in [IT-API-V2.1], Appendix B, Implementer’s Guide to Connection Management for iWARP, implementations are possible that will interoperate with both ‘permissive’ and ‘non-permissive’ RNICs [INTEROP-IETF]. The implementation of it_ep_connect() is a good example of OS-supported functionality that should reside within generic software layers shared by all RDMA devices, referred to as the user/kernel Access Layer (uAL/kAL).

The TDI provides the it_socket_convert() socket conversion interface for iWARP, which converts a live TCP connection to an RDMA-enabled endpoint after data has been exchanged in TCP streaming mode. Socket conversion was specifically designed to support applications such as iSCSI Extensions for RDMA (iSER) and Sockets Direct Protocol (SDP). IT-API provides support for interoperability with RDMAC-compliant RNICs that are unable to perform IETF’s MPA Request/Reply handshake. Socket conversion is another example of functionality that is best implemented within the generic uAL/kAL for use on any RNIC.
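The difference between the two connection management paths (immediate transition to RDMA mode versus conversion of a live TCP connection) can be illustrated with a toy state machine. The sketch below does not use or reproduce the IT-API signatures; ToyConnection, connect_immediate and convert_deferred are hypothetical names, and the MPA exchange is not modeled.

```python
from enum import Enum, auto


class Mode(Enum):
    CLOSED = auto()
    TCP_STREAMING = auto()
    RDMA = auto()


class ToyConnection:
    """Hypothetical model of a connection's mode; not an IT-API type."""

    def __init__(self):
        self.mode = Mode.CLOSED
        self.streamed = []  # payloads exchanged while in streaming mode

    def tcp_handshake(self):
        # Stands in for TCP's SYN / SYN+ACK / ACK exchange.
        if self.mode is not Mode.CLOSED:
            raise RuntimeError("already connected")
        self.mode = Mode.TCP_STREAMING

    def stream_send(self, data):
        if self.mode is not Mode.TCP_STREAMING:
            raise RuntimeError("streaming data requires TCP streaming mode")
        self.streamed.append(data)

    def to_rdma(self):
        if self.mode is not Mode.TCP_STREAMING:
            raise RuntimeError("need a live TCP connection to enter RDMA mode")
        self.mode = Mode.RDMA


def connect_immediate():
    """Analogue of the TII path: handshake, then RDMA mode right away."""
    conn = ToyConnection()
    conn.tcp_handshake()
    conn.to_rdma()
    return conn


def convert_deferred(conn):
    """Analogue of the TDI path: convert an already live TCP connection."""
    conn.to_rdma()
    return conn


tii = connect_immediate()

tdi = ToyConnection()
tdi.tcp_handshake()
tdi.stream_send(b"login negotiation")  # exchange done in streaming mode first
convert_deferred(tdi)
```

Both connections end in RDMA mode; only the deferred one has carried streaming data beforehand, which is the case an upper-layer protocol with a streaming-mode login phase would need.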

Figure 1 shows IT-API’s RDMA object hierarchy, which closely resembles the hierarchical organization of the underlying verbs interfaces. For instance, an IT-API interface adapter (IA) corresponds to a logical RNIC (and its associated RNIC handle) in the verbs interface. Similarly, an endpoint (EP) corresponds to an underlying queue pair (QP). On the other hand, an event dispatcher (EVD) has a corresponding completion queue (CQ) but additionally provides OS-supported event handling through the it_evd_wait() and it_evd_callback_*() calls. Event handling is another example of OS-supported functionality that should reside within the uAL/kAL.
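What an EVD adds on top of a bare CQ, namely blocking wait and callback delivery, can be sketched as follows. This is a loose analogue built on Python's standard queue and threading modules, not the IT-API calls themselves; ToyCQ and ToyEVD are hypothetical names.

```python
import queue
import threading


class ToyCQ:
    """Bare completion queue: the consumer has to poll it."""

    def __init__(self):
        self._q = queue.Queue()

    def push(self, completion):
        self._q.put(completion)

    def poll(self):
        # Non-blocking: returns None when no completion is available.
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


class ToyEVD:
    """A CQ plus event handling: blocking wait or callback delivery."""

    def __init__(self, callback=None):
        self._cq = ToyCQ()
        self._ready = threading.Semaphore(0)
        self._callback = callback

    def post(self, completion):
        if self._callback is not None:
            self._callback(completion)   # callback-style delivery
        else:
            self._cq.push(completion)
            self._ready.release()        # wake up one waiter

    def wait(self, timeout=None):
        # Blocks until a completion arrives; raises TimeoutError otherwise.
        if not self._ready.acquire(timeout=timeout):
            raise TimeoutError("no completion within timeout")
        return self._cq.poll()


# Blocking style: the completion is posted from another thread.
evd = ToyEVD()
threading.Timer(0.05, evd.post, args=("send done",)).start()
result = evd.wait(timeout=2.0)

# Callback style: completions are handed to a consumer-supplied function.
seen = []
evd_cb = ToyEVD(callback=seen.append)
evd_cb.post("recv done")
```

The sleeping and waking of the waiter is exactly the part that needs operating system support, which is why the text places event handling in the shared access layer rather than in each device driver.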

RNIC Programming Interface (RNICPI)

Based on the RDMAC Verbs [VERBS-RDMAC] and the InfiniBand Verbs [IB-R1.2], the RNICPI [RNICPI-V1.0] defines a generic programming interface for the vendor-independent integration of both iWARP RNICs and InfiniBand HCAs into Unix-like operating systems and is supported by a majority of OS and RNIC vendors. RNICPI takes into account requirements specific to the iWARP transport, such as connection setup with immediate or deferred transition to RDMA mode. As shown for iWARP in Fig. 4, RNICPI is located between RDMA OS support and the implementation of the iWARP protocol layers.

Regarding RNICPI’s RDMA object hierarchy, the distinction between Verbs Providers (VPs) and logical RNICs is worth mentioning because it affects the organization and identification of RDMA objects. Each VP provides both a kVP module and a uVP library. A consumer first creates a logical RNIC within the VP and then creates RDMA objects such as protection domains and queue pairs within the logical RNIC. A VP thus organizes its RDMA objects per logical RNIC.
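The VP / logical RNIC / RDMA object nesting can be pictured with a small data-structure sketch. It is an assumption-laden illustration, not RNICPI's actual types: VerbsProvider, LogicalRNIC and the (rnic_id, object_id) handle scheme are hypothetical and only show how objects can be scoped and identified per logical RNIC.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalRNIC:
    rnic_id: int
    objects: dict = field(default_factory=dict)  # handle -> object kind
    next_id: int = 1

    def create(self, kind):
        # Handles are scoped by the logical RNIC that owns the object
        # (assumed scheme for illustration).
        handle = (self.rnic_id, self.next_id)
        self.next_id += 1
        self.objects[handle] = kind
        return handle


@dataclass
class VerbsProvider:
    name: str
    rnics: dict = field(default_factory=dict)    # rnic_id -> LogicalRNIC

    def create_rnic(self):
        rnic = LogicalRNIC(rnic_id=len(self.rnics) + 1)
        self.rnics[rnic.rnic_id] = rnic
        return rnic


vp = VerbsProvider("toy-vp")
rnic_a = vp.create_rnic()
rnic_b = vp.create_rnic()
pd_a = rnic_a.create("protection domain")
qp_a = rnic_a.create("queue pair")
pd_b = rnic_b.create("protection domain")
```

In this model the first object created in each logical RNIC has the same local number, and it is the enclosing logical RNIC that keeps the two handles distinct.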

Examples

The following examples simultaneously illustrate the programming model and the underlying protocol actions for Send/Receive and RDMA Write operations.

Figures 2 and 3 show a Send and its matching Receive operation, also referred to as two-sided communication semantics. In Fig. 2, the application is assumed to have preregistered two source buffers (MRs) with STags s1 and s2 and to have posted a Send work request (WR) to endpoint EP1 (more precisely, to its SQ). The implementation gets the WR, transfers the data directly from buffer s1 to the RNIC, and forms the Send message. Fig. 2 illustrates a case where one TCP segment contains multiple DDP segments. When the implementation can guarantee that data will be delivered reliably, it indicates completion by posting a work completion to the EVD (CQ). At this time, control is returned to the consumer waiting on the EVD.

In Fig. 3, the application has already preregistered destination buffers (MRs) d1 and d2 and posted a Receive WR to EP2. When the Send message from EP1 comes in, the implementation gets the WR and transfers the data directly from the RNIC to buffer d1. In our animation (Fig. 3), the DDP segments are placed in order, but the implementation may perform out-of-order placement of DDP segments. Finally, the implementation indicates delivery by posting a work completion to the EVD. At this time, control is returned to the consumer waiting on the EVD.
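The Send/Receive sequence described above can be traced in a short simulation. It is a simplified sketch under stated assumptions: ToyEndpoint and segment are hypothetical names, the segment size is a toy constant, TCP and the RNIC are collapsed into direct method calls, and the sender's completion is posted as soon as the segments are handed off.

```python
import random
from collections import deque

MAX_DDP_PAYLOAD = 4  # toy value (assumption); real sizes depend on the TCP MSS


def segment(message, size=MAX_DDP_PAYLOAD):
    """Split a Send message into (offset, payload) DDP-like segments."""
    return [(off, message[off:off + size]) for off in range(0, len(message), size)]


class ToyEndpoint:
    def __init__(self):
        self.recv_queue = deque()  # posted Receive WRs (destination buffers)
        self.evd = deque()         # work completions for this endpoint

    def post_recv(self, buf):
        self.recv_queue.append(buf)

    def post_send(self, src, peer, reorder=False):
        segs = segment(src)
        if reorder:
            random.shuffle(segs)   # segments may arrive in any order
        peer._deliver(segs, len(src))
        self.evd.append(("send", len(src)))  # completion at the data source

    def _deliver(self, segs, total):
        if not self.recv_queue:
            raise RuntimeError("no Receive WR posted")
        buf = self.recv_queue.popleft()      # one Receive WR per Send message
        if total > len(buf):
            raise ValueError("receive buffer too small")
        for off, payload in segs:
            # Each segment carries its offset, so it can be placed directly
            # into the destination buffer regardless of arrival order.
            buf[off:off + len(payload)] = payload
        self.evd.append(("recv", total))     # delivery after all placement


ep1, ep2 = ToyEndpoint(), ToyEndpoint()
d1 = bytearray(16)
ep2.post_recv(d1)                  # Receive WR must be posted beforehand
s1 = b"hello, rdma"
ep1.post_send(s1, ep2, reorder=True)
```

Placement (copying a segment into the buffer) and delivery (posting the work completion) are separate steps here, which is what allows out-of-order placement while the consumer still sees the message only once it is complete.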

Figures 4 and 5 depict an RDMA Write operation at the data source and data sink, respectively. Since RDMA operations have one-sided communication semantics, the consumer at the data sink does not post a WR in order to receive. Instead, the consumer at the data sink has preregistered a destination buffer d1 and preadvertised it to its peer. The consumer at the data source posts an RDMA Write WR targeting the destination buffer with STag d1, which is sent along with the RDMA Write message. When the incoming RDMA Write has been placed, delivery is indicated to RDMAP, but no completion is generated at the data sink.
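The one-sided nature of RDMA Write can be sketched in the same toy style. ToyRnic and its methods are hypothetical names, STags come from a simple counter, and the advertisement of the STag to the peer is modeled as passing an integer.

```python
class ToyRnic:
    def __init__(self):
        self.stag_table = {}    # STag -> registered destination buffer
        self.completions = []   # work completions seen by the local consumer
        self._next_stag = 1     # toy allocator (assumption)

    def register(self, buf):
        """Register a buffer and return the STag to advertise to the peer."""
        stag = self._next_stag
        self._next_stag += 1
        self.stag_table[stag] = buf
        return stag

    def rdma_write(self, peer, stag, offset, data):
        """Data source side: the WR names the remote buffer by STag."""
        peer._place(stag, offset, data)
        self.completions.append(("rdma_write", len(data)))

    def _place(self, stag, offset, data):
        """Data sink side: place the payload into the tagged buffer."""
        buf = self.stag_table.get(stag)
        if buf is None:
            raise KeyError("unknown STag")
        if offset < 0 or offset + len(data) > len(buf):
            raise ValueError("write outside registered buffer")
        buf[offset:offset + len(data)] = data
        # Deliberately no completion here: the sink's consumer is not notified.


source, sink = ToyRnic(), ToyRnic()
d1 = bytearray(8)
stag_d1 = sink.register(d1)                 # preregister, then preadvertise
source.rdma_write(sink, stag_d1, 2, b"data")
```

Because the sink's consumer sees no completion, a real application typically learns that the data has arrived through some other message from the source; that signaling is outside this sketch.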