Master’s student or intern

Prototyping Efficient Collective Operations for Cloud HPC

Ref. 2021_041

In classical high-performance computing (HPC), collective communication operations such as Reduce, Scatter, Gather, or Broadcast are at the core of HPC-type workloads. Their efficient execution is crucial for overall job efficiency, and they are therefore provided by communication libraries that can be highly optimized for the underlying hardware, such as MPI or NCCL. With the convergence of big data analytics, machine learning, and HPC workloads in one next-generation cloud infrastructure, a cloud-native communication service with similar functionality and performance becomes a necessity.

The project aims to investigate the integration of a collective communications service with a highly efficient ephemeral cloud object store. In contrast to the static allocation of resources and the fixed communication groups of classical HPC workload execution, it will target flexible, ad-hoc adaptation to changing resource requirements and changing communication group membership. We plan to prototype such a service as an extension of an available ephemeral object store, such as Ray’s “Plasma” data store, but are not limited to it. To build a usable prototype, the project will comprise the definition of an API and the implementation of selected, but not all, basic collective operations. We intend to contribute the findings of the project to the open-source community where possible. See the RFC for an example of the community discussion of the problem.
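To make the idea more concrete, the following is a minimal, purely illustrative sketch of what collective operations over an object store with changing group membership could look like. It is not the project's API and does not use Ray or Plasma: the names ToyObjectStore and CollectiveGroup, the key layout, and all method signatures are assumptions made up for this example, and the store is a plain in-memory dictionary in a single process.

```python
from functools import reduce as fold
from typing import Any, Callable, Dict, List


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an ephemeral object store."""

    def __init__(self) -> None:
        self._objects: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._objects[key] = value

    def get(self, key: str) -> Any:
        return self._objects[key]


class CollectiveGroup:
    """Hypothetical communication group whose membership can change between calls."""

    def __init__(self, store: ToyObjectStore, name: str) -> None:
        self.store = store
        self.name = name
        self.members: List[str] = []

    def join(self, member: str) -> None:
        if member not in self.members:
            self.members.append(member)

    def leave(self, member: str) -> None:
        self.members.remove(member)

    def _key(self, member: str, tag: str) -> str:
        # Assumed key layout: <group>/<tag>/<member>
        return f"{self.name}/{tag}/{member}"

    def contribute(self, member: str, tag: str, value: Any) -> None:
        """A member publishes its local value for the operation named by tag."""
        self.store.put(self._key(member, tag), value)

    def broadcast(self, tag: str, value: Any) -> None:
        """Make the same value available to every current member."""
        for member in self.members:
            self.store.put(self._key(member, tag), value)

    def gather(self, tag: str) -> List[Any]:
        """Collect the values of all current members, in join order."""
        return [self.store.get(self._key(m, tag)) for m in self.members]

    def reduce(self, tag: str, op: Callable[[Any, Any], Any]) -> Any:
        """Combine the values of all current members with a binary operator."""
        return fold(op, self.gather(tag))


store = ToyObjectStore()
group = CollectiveGroup(store, "job-1")
for rank, worker in enumerate(["w0", "w1", "w2"]):
    group.join(worker)
    group.contribute(worker, "partial_sum", rank + 1)

total = group.reduce("partial_sum", lambda a, b: a + b)

# Membership changes between operations; the next reduce covers only w0 and w1.
group.leave("w2")
total_after_leave = group.reduce("partial_sum", lambda a, b: a + b)
```

In this sketch each operation is evaluated over whichever members belong to the group at call time, which is the property that distinguishes it from a fixed communicator. A real service would additionally need synchronization, data placement, and failure handling, none of which are modeled here.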

We are inviting applications from students to conduct their master’s thesis work or an internship project at the IBM Research lab in Zurich on this exciting new topic. The research focus will be on exploring techniques for providing efficient communication primitives for HPC-like workloads running in cloud environments. The work also involves interaction with several researchers focusing on various aspects of the project. The ideal candidate should be well versed in distributed systems and have strong programming skills (C++, Python). Hands-on experience with distributed container orchestration systems (Kubernetes) and serverless environments (Knative) would be desirable.

Diversity

IBM is committed to diversity at the workplace. With us you will find an open, multicultural environment. Excellent flexible working arrangements enable all genders to strike the desired balance between their professional development and their personal lives.

How to apply

If you are interested in this position, please submit your application below.

For more information on technical questions, please contact
Dr. Bernard Metzler.