Master’s student or intern

Prototyping Efficient Collective Operations for Cloud HPC

Ref. 2021_007

In classical high-performance computing (HPC), collective communication operations such as Reduce, Scatter, Gather, and Broadcast are at the core of HPC-type workloads. Their efficient execution is crucial for overall job efficiency, and they are therefore provided by communication libraries that can be heavily optimized for the underlying hardware, such as MPI or NCCL. With the convergence of big data analytics, machine learning, and HPC workloads in a single next-generation cloud infrastructure, a cloud-native communication service with similar functionality and performance becomes a necessity.

The project aims to investigate the integration of a collective communications service with a highly efficient ephemeral object store. It will target the introduction of such a service into the Ray distributed computation framework. Currently, we plan to prototype the service as an extension of Ray's 'Plasma' object store. To build a usable prototype, the project will comprise the definition of an API and the implementation of a selected subset of the basic collective operations. The intention is to follow, and potentially contribute to, the Ray community's effort on this topic; see the relevant RFC.

Diversity

IBM is committed to diversity at the workplace. With us you will find an open, multicultural environment. Excellent flexible working arrangements enable all genders to strike the desired balance between their professional development and their personal lives.

How to apply

We invite applications from students who wish to conduct their master’s thesis work or an internship project at the IBM Research lab in Zurich on this exciting new topic. The research will focus on exploring techniques for providing efficient communication primitives for HPC-like workloads running in cloud environments. It also involves interaction with several researchers working on various aspects of the project. The ideal candidate is well versed in distributed systems and has strong programming skills (C++, Python). Hands-on experience with distributed container orchestration systems (Kubernetes) and serverless environments (Knative) is desirable.

For more information on technical questions, please contact Dr. Bernard Metzler.