|
On servers handling heavy network traffic, an offloaded transport
protocol stack with support for Remote Direct Memory Access (RDMA)
can eliminate a bottleneck in network input/output (I/O) by avoiding
data copies between the operating system and application buffers.
The Internet Engineering Task Force (IETF) is defining a set of
protocols for (remote) direct data placement over IP networks. The
RDMA Consortium (RDMAC) has defined the semantics of an interface
to an RDMA-capable network interface card (RNIC), the so-called
RDMA protocol verbs. The IETF's RDMA protocol stack, also known as
the iWARP transport, is implemented on RNICs or, more generally, by
verbs providers. The InfiniBand Trade Association (IBTA) is defining
another transport providing RDMA services.
OS extensions and programming interfaces for RDMA represent a significant
portion of the RDMA infrastructure, and their availability is a
key requirement for the success of RDMA technology.
Within the Interconnect Software Consortium (ICSC) of The Open
Group, we contributed to the standardization of RDMA-enabled
programming interfaces, co-chairing both the Interconnect
Transport API (IT-API) and the RNIC Programming
Interface (RNICPI) work groups. We helped define a modular,
layered, and transport-neutral host software architecture for RDMA
through contributions [JAMENE-04]
to an industry-driven Linux open-source project called OpenRDMA
[OPENRDMA].
For the portability of high-performance RDMA-enabled applications,
it is desirable for OSes to provide an open, standardized, transport-neutral
and up-to-date RDMA API such as ICSC's IT-API. Similarly, for the
portability of RDMA device drivers, it is desirable to converge
on a standardized, syntactic programming interface that includes
the iWARP feature set of ICSC's RNICPI and takes care to reconcile
the semantic differences between iWARP and InfiniBand.
We have implemented elements of a host software
architecture for RDMA that provides the operating system (OS)
integration for both iWARP and InfiniBand, supporting IT-API and
an enhanced version of RNICPI. A key property of such an architecture
is a clean separation of generic/OS functionality and verbs-provider-specific
software functionality into user/kernel Access Layer
(uAL/kAL) and user/kernel Verbs Provider (uVP/kVP)
components, respectively. This approach permits a wide range of
RNICs and other verbs providers to register themselves through a
standard programming interface, and it minimizes code bloat by keeping
generic functionality such as OS-wide resource management, event
handling, and IT-API's socket conversion within the uAL or kAL.
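
The separation between a generic access layer and provider-specific code
can be illustrated with a minimal sketch. It does not reproduce IT-API or
RNICPI; every name in it (AccessLayer, VerbsProvider, register_provider,
create_qp, max_qps) is hypothetical and chosen only for illustration. The
sketch assumes that providers register a small set of entry points and that
resource accounting and event logging stay in the generic layer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class VerbsProvider:
    """Hypothetical provider-specific entry points (the uVP/kVP role)."""
    name: str
    transport: str                    # e.g. "iWARP" or "InfiniBand"
    create_qp: Callable[[int], str]   # provider-specific queue pair creation


@dataclass
class AccessLayer:
    """Hypothetical generic layer (the uAL/kAL role) shared by all providers."""
    max_qps: int                      # assumed OS-wide queue pair limit
    providers: Dict[str, VerbsProvider] = field(default_factory=dict)
    qps_in_use: int = 0
    events: List[str] = field(default_factory=list)

    def register_provider(self, vp: VerbsProvider) -> None:
        # Providers plug in through one common registration call.
        if vp.name in self.providers:
            raise ValueError(f"provider {vp.name!r} is already registered")
        self.providers[vp.name] = vp
        self.events.append(f"registered {vp.name} ({vp.transport})")

    def create_qp(self, provider_name: str) -> str:
        # Generic resource accounting lives here, not in each provider.
        if self.qps_in_use >= self.max_qps:
            raise RuntimeError("OS-wide queue pair limit reached")
        vp = self.providers[provider_name]
        handle = vp.create_qp(self.qps_in_use)  # delegate the device-specific part
        self.qps_in_use += 1
        self.events.append(f"created {handle}")
        return handle
```

In this sketch, an iWARP RNIC and an InfiniBand adapter would each supply
their own create_qp callable, while the limit check and the event log are
written once in the access layer.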
We are currently developing open-source software components for
the RDMA host software infrastructure on Linux, initially focusing
on memory management issues such as pinning and on the iWARP transport
with its special requirements for connection management.
We are also working on SoftRDMA, a software implementation of the
IETF's iWARP protocol stack (RDMAP/DDP/MPA). SoftRDMA enables RDMA
on clients without RDMA hardware (supporting servers that rely on
RDMA for performance) and allows RNICs to be tested for iWARP
protocol conformance.
|