In contemporary operating systems, TCP/IP processing is typically implemented within the OS kernel’s networking stack. With the advent of 1 Gb/s and 10 Gb/s Ethernet technology, the CPU load due to the corresponding network I/O operations has become a serious bottleneck, e.g., for servers handling bulk traffic such as storage servers in a SAN. In order to reduce the CPU load, context switch overhead and latency associated with network I/O, and to improve throughput at both low and high message rates, the networking industry and the IETF defined RDMA transports supporting (remote) direct data placement. These include the InfiniBand transport [IB-R1.2] and the IETF’s iWARP transport for (remote) direct data placement over the ubiquitous IP networks [RDMAP-IETF] [DDP-IETF] [MPA-IETF].
Comparison of network stack implementations
Figure 1 shows typical CPU load distributions due to high-speed network I/O for different network stack implementations, assuming 64 KB buffers. For Fig. 1(a), TCP/IP runs on the CPU and the load is dominated by data copying to and from temporary buffer space in a pool of kernel buffers; smaller contributions come from TCP/IP processing, context switches and system calls, and device-driver processing for the NIC. For Fig. 1(b), TCP/IP runs off the CPU, for instance on an Ethernet NIC with a TCP offload engine (TOE) or on a second CPU in an SMP system; the load due to TCP/IP processing is eliminated, but the data-copy overhead remains. Finally, for Fig. 1(c), the RDMA and TCP/IP protocols run off the CPU, which allows a zero-copy implementation through direct data placement and reduces the context-switch and system-call overhead through new, lightweight communication semantics (see Programming Interfaces for RDMA).
Figure 2 depicts network stack implementations corresponding to Fig. 1(a) and Fig. 1(c). With TCP/IP running on the CPU, an incoming packet is processed as follows. First, the packet is transferred from the NIC into a kernel buffer in DRAM, at a speed limited by the bandwidth of the I/O subsystem. Second, the CPU loads the packet from DRAM, processes the header and checks data integrity. Finally, the CPU stores the packet to the application buffer in DRAM. With an iWARP RNIC, however, the incoming packet is directly placed into a pre-registered application buffer, which eliminates at least two DRAM transfers per data byte.
Direct data placement
RDMA-enabled transports are designed to enable (remote) direct data placement at both the data source and the data sink, as shown in Fig. 3 for iWARP. The RNIC at the data source uses DMA to transfer data from a pre-registered application buffer (Src Buffer) into a protocol message, without involving the kernel or CPU. The RNIC at the data sink likewise uses DMA to move the data from a protocol message into a pre-registered application buffer (Sink Buffer). Pre-registration of an application buffer passes memory translation and protection information to the RNIC, which the RNIC later uses to transfer data without kernel or CPU involvement. Applications initiate data transfer operations by posting so-called work requests to a send queue (SQ) or receive queue (RQ).
Figure 3 also shows that direct data placement is supported for both the traditional two-sided communication semantics (Send and matching Receive) and the new one-sided communication semantics (RDMA Write and RDMA Read). The RDMA Write message in Fig. 3 carries data and a tag ‘Ta’ identifying the data sink, previously advertised by the remote application. For the RDMA Read in Fig. 3, an RDMA Read Request message carries tags ‘Tb’ and ‘Tc’ selecting a sink buffer and source buffer, respectively, while the RDMA Read Response message carries data and the tag ‘Tb’ of the sink buffer. RDMA operations are called one-sided because they neither involve a “matching” work request at the remote side nor (in the absence of an error) an event notifying the remote application.
Figure 4 is a simplified view of interfaces and protocol layers for the iWARP transport. The illustration disregards the split between user-space and kernel-space software components. Applications can use a variety of APIs with explicit RDMA support (IT-API), implicit RDMA support (Sockets using SDP), or no RDMA support at all (Sockets using TCP/IP). Legacy applications that use the Sockets API and are not optimized for RDMA can still benefit implicitly from RDMA by running over SDP. Another intermediate solution is the use of eSockets, which provide asynchronous I/O and pre-registration of memory regions, but no explicit RDMA semantics.
RDMA OS support is a generic software layer providing RDMA resource management (including sysfs support for Linux), memory management extensions, event handling, and connection management services for RDMA consumers. The RDMAC Verbs Interface or RNICPI provides the device interface to RNICs. For transport over TCP, an iWARP RNIC implements RDMAP/DDP/MPA [RDMAP-IETF] [DDP-IETF] [MPA-IETF] on top of TCP/IP.
Enabling applications for RDMA
High-performance applications are increasingly being optimized for RDMA-enabled transports. Figure 4 also illustrates how applications can take advantage of explicit or implicit RDMA support by using different APIs.
With the iSCSI protocol, an efficient mapping of SCSI over the TCP/IP stack was defined. By carrying SCSI commands over IP networks, this technology provides access to storage over intranets and even over long distances. Because of the ubiquity of IP networks, iSCSI can enable location-independent data storage and retrieval. By providing explicit RDMA support for iSCSI, the iSER protocol for iWARP is an attractive enhancement for a high-performance, IP-based SAN [ISER].
Enhancements of NFS for RDMA, dubbed NFSeR, were defined by the IETF [NFS]. These enhancements provide a binding of NFS to RPC/RDMA. RPC uses XDR as a canonical, host-independent data representation. For XDR encoding, RPC data are marshaled into transport buffers, and for XDR decoding, data from transport buffers are unmarshaled into RPC result buffers. While NFS implementations over conventional RPC involve a data copy during both marshaling and unmarshaling, NFSeR uses RPC/RDMA, which avoids data copies through explicit RDMA support, i.e., by including placement information in RPC calls and replies.
RDMA also plays an increasingly important role in high-performance cluster computing applications, many of which are based on MPI. MPI’s parallel computing performance can be improved, e.g., by using RDMA Writes for implementing MPI Sends with long messages, or by speeding up the expensive MPI collective (synchronization) operations through RDMA operations.
More on RDMA in this article
We describe the common characteristics of programming interfaces for RDMA, give a short introduction to IT-API and RNICPI, and illustrate the relationship between the programming model and the underlying RDMAP/DDP protocol actions through examples.
A layered host software architecture for RDMA is presented and several reasons are given for a clean separation of generic and verbs-provider-specific software.