Field programmable gate arrays (FPGAs) are making their way into data centers (DCs). They serve to offload and accelerate service-oriented tasks such as web-page ranking, memory caching, deep learning, network encryption, video conversion and high-frequency trading.

However, FPGAs are not yet available at scale to general cloud users who want to accelerate their own workload processing. This puts the cloud deployment of compute-intensive workloads at a disadvantage compared with on-site infrastructure installations, where the performance and energy efficiency of FPGAs are increasingly being exploited.

cloudFPGA solves this issue by offering FPGAs as an IaaS resource to cloud users. Using the cloudFPGA system, users can rent FPGAs — similarly to renting VMs in the cloud — thus paving the way for large-scale utilization of FPGAs in DCs.

The cloudFPGA system is built on three main pillars:

  • the use of standalone network-attached FPGAs,
  • a hyperscale infrastructure for deploying the above FPGAs at large scale and in a cost-effective way,
  • an accelerator service that integrates and manages the standalone network-attached FPGAs in the cloud.

Hyperscale infrastructure

To enable cloud users to rent, use and release large numbers of FPGAs on the cloud, the FPGA resource must become plentiful in DCs.

The cloudFPGA infrastructure is the key enabler of such a large-scale deployment of FPGAs in DCs. It was designed from the ground up to provide the world’s highest-density and most energy-efficient rack unit of FPGAs.

The infrastructure combines passive and active water cooling to pack 64 FPGAs into one 19"×2U chassis. Such a chassis is made up of two Sleds, each with 32 FPGAs and one 64-port 10 GbE switch providing 640 Gb/s of bisection bandwidth.

In all, 16 such chassis fit into a 42U rack for a total of 1024 FPGAs and 16 TB of DRAM.
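The density figures above follow from simple arithmetic; the short sanity check below reproduces them, assuming (as the totals imply, though the text does not state it) 16 GB of DRAM per FPGA:

```python
# Sanity-check the cloudFPGA rack density figures quoted above.

FPGAS_PER_SLED = 32
SLEDS_PER_CHASSIS = 2
CHASSIS_PER_RACK = 16      # sixteen 2U chassis in a 42U rack
DRAM_PER_FPGA_GB = 16      # assumption: implied by 16 TB / 1024 FPGAs

fpgas_per_chassis = FPGAS_PER_SLED * SLEDS_PER_CHASSIS        # 64
fpgas_per_rack = fpgas_per_chassis * CHASSIS_PER_RACK         # 1024
dram_per_rack_tb = fpgas_per_rack * DRAM_PER_FPGA_GB / 1024   # 16 TB

print(fpgas_per_chassis, fpgas_per_rack, dram_per_rack_tb)
```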

Accelerator service: management of cloud FPGAs at scale

Today, the prevailing way to incorporate an FPGA into a server is to connect it to the CPU over a high-speed, point-to-point interconnect such as the PCIe bus, and to treat that FPGA resource as a co-processor worker under the control of the server CPU.

However, because of this master–slave programming paradigm, such an FPGA is typically exposed in the cloud only as an add-on to the primary host compute resource to which it belongs. As a result, bus-attached FPGAs are usually made available in the cloud indirectly, via Virtual Machines (VMs) or Containers.

In our deployment, in contrast, a standalone, network-attached FPGA can be requested independently of a host via the cloudFPGA Resource Manager (cFRM, see figure). The cFRM provides a RESTful (Representational State Transfer) API for integration into the DC management stack (e.g. OpenStack).
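A request for a standalone FPGA through such a REST API might then be built as in the sketch below. The base address, endpoint path and field names are illustrative assumptions, not the actual cFRM API:

```python
# Illustrative sketch: asking a resource manager such as the cFRM to
# provision one network-attached FPGA over REST. The address, path and
# JSON fields are hypothetical, chosen only to show the pattern.
import json
from urllib import request

CFRM_BASE = "http://cfrm.example.com:8080"  # placeholder address

def build_fpga_request(user_id: str, image_id: str) -> request.Request:
    """Build a POST request asking the manager to provision one FPGA
    loaded with a previously uploaded user image."""
    body = json.dumps({"user_id": user_id, "image_id": image_id}).encode()
    return request.Request(
        url=f"{CFRM_BASE}/instances",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_fpga_request("alice", "img-0042")
# request.urlopen(req) would submit it; the manager's reply would carry
# the network address of the allocated FPGA.
```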

Cloud integration is the process of making a resource available in the cloud. In the case of cloudFPGA, this is accomplished by the combination of three levels of management (see figure): a cloudFPGA Resource Manager (cFRM), a cloudFPGA Sled Manager (cFSM), and a cloudFPGA Manager Core (cFMC).

  1. There is one resource manager per DC, controlling many Sleds. The cFRM handles the user images and maintains a database of FPGA resources.
  2. There is one sled manager for every 32 FPGAs. The cFSM runs on a service processor that is part of the Sled. It powers the FPGAs on and off, monitors their physical parameters, and runs the software management stack of the Ethernet switch.
  3. There is one cFMC per FPGA. The cFMC contains a simplified HTTP server that serves the REST API calls issued by the cFRM.
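The per-FPGA level of this hierarchy can be pictured as a thin dispatch layer. The sketch below mimics how a simplified HTTP server in the spirit of the cFMC might route the cFRM's REST calls to management actions; the paths and handler behavior are assumptions for illustration only:

```python
# Minimal routing sketch in the spirit of the cFMC's simplified HTTP
# server. Paths, parameters and handler behavior are hypothetical.

def get_status():
    # In the real system this would report the FPGA's current state.
    return {"state": "idle"}

def post_configure(image_id):
    # In the real system this would trigger loading of the requested
    # user image onto the FPGA.
    return {"state": "configuring", "image": image_id}

ROUTES = {
    ("GET", "/status"): lambda params: get_status(),
    ("POST", "/configure"): lambda params: post_configure(params["image_id"]),
}

def dispatch(method, path, params=None):
    """Map an incoming (method, path) pair to its management action."""
    handler = ROUTES.get((method, path))
    if handler is None:
        return {"error": "not found"}
    return handler(params or {})
```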

Together, the components of all three levels provide the requested FPGA resources in a fast and secure way.

FPGA architecture

System architecture of the cloudFPGA platform: 32 FPGAs, one switch and a service processor are combined on one carrier board, called a Sled. The management tasks are split into three levels — the cloudFPGA Resource Manager (cFRM), the cloudFPGA Sled Manager (cFSM), and the cloudFPGA Manager Core (cFMC). A Sled is half of a 2U chassis. CPU nodes from the OpenStack compute service (Nova) are also available for creating heterogeneous clusters.

Ask the experts

François Abel

IBM Research Scientist

Dionysios Diamantopoulos

IBM Research Staff Member

Christoph Hagleitner

IBM Research Scientist

Burkhard Ringlein

Predoctoral Researcher

Publications

  1. B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner and D. Fey,
    “Programming Reconfigurable Heterogeneous Computing Clusters Using MPI With Transpilation,”
    in 2020 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC), 2020. PDF
  2. B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner and D. Fey,
    “ZRLMPI: A Unified Programming Model for Reconfigurable Heterogeneous Computing Clusters,”
    in IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2020. PDF
  3. B. Ringlein, F. Abel, A. Ditter, B. Weiss, C. Hagleitner and D. Fey,
    “System Architecture for Network-Attached FPGAs in the Cloud Using Partial Reconfiguration,”
    in 29th International Conference on Field Programmable Logic and Applications (FPL), pp. 293–300, 2019. PDF
  4. F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss, S. Paredes,
    “An FPGA Platform for Hyperscalers,”
    in IEEE 25th Annual Symposium on High-Performance Interconnects (HOTI), Santa Clara, CA, pp. 29–32, 2017. PDF
  5. J. Weerasinghe, F. Abel, C. Hagleitner, A. Herkersdorf,
    “Disaggregated FPGAs: Network Performance Comparison Against Bare-Metal Servers, Virtual Machines and Linux Containers,”
    in IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Luxembourg, 2016. PDF
  6. J. Weerasinghe, R. Polig, F. Abel,
    “Network-Attached FPGAs for Data Center Applications,”
    in IEEE International Conference on Field-Programmable Technology (FPT ’16), Xi’an, China, 2016. PDF
  7. J. Weerasinghe, F. Abel, C. Hagleitner, A. Herkersdorf,
    “Enabling FPGAs in Hyperscale Data Centers,”
    in IEEE International Conference on Cloud and Big Data Computing (CBDCom), Beijing, China, pp. 1078–1086, 2015. PDF