Master’s thesis or internship

Scaling of a Deep Learning Framework to Support Machine Learning Models with up to 100 Quadrillion Parameters

Ref. 2022_003

Job Description

Deep learning-based models have become the standard method for building production recommender systems. The recommendation quality of such systems generally improves with model complexity, and in recent years we have witnessed exponential growth in model scale, from Google’s 2016 model with 1 billion parameters to Facebook’s latest model with 12 trillion parameters. PERSIA is a recent state-of-the-art open-source framework designed for training such large deep learning-based recommender systems. The goal of this framework is to provide a fast, efficient, and scalable architecture that allows parallel distributed training of deep neural network models with a very large number of parameters.

Training of machine learning models with up to 100 trillion parameters, corresponding to a model size of ~200 TB, has already been demonstrated using the PERSIA framework on a cluster of 138 servers. The next challenge is to scale the framework to handle a parameter space that is an order of magnitude larger. The main obstacle is storing the large number of model parameters in secondary storage. Currently, the PERSIA framework relies on storing all model data in DRAM, which is extremely memory-intensive and cost-inefficient. The goal of this project is to explore ways to partition a multi-petabyte model across multiple servers, store the model partitions on fast storage devices such as NVMe SSDs, and then efficiently query and update the model parameters remotely over a fast network.
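To make the scale concrete, the following is a rough back-of-envelope sketch in Python. Only the 100 trillion parameter, ~200 TB, and 138 server figures come from the description above. Decimal terabytes, the modulo placement scheme, and all function names are illustrative assumptions and do not describe how PERSIA actually lays out parameters.

```python
# Back-of-envelope sizing. Only the three figures below come from the text;
# everything else is an illustrative assumption.
PARAMS_DEMONSTRATED = 100 * 10**12        # 100 trillion parameters
MODEL_BYTES_DEMONSTRATED = 200 * 10**12   # ~200 TB (decimal TB assumed)
SERVERS_DEMONSTRATED = 138

# Implied storage cost per parameter: 200 TB / 100 trillion = 2 bytes.
BYTES_PER_PARAM = MODEL_BYTES_DEMONSTRATED // PARAMS_DEMONSTRATED


def model_bytes(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Total model size in bytes for a given parameter count."""
    return num_params * bytes_per_param


def per_server_bytes(num_params, num_servers):
    """Bytes each server must hold if the model is split evenly (rounded up)."""
    return -(-model_bytes(num_params) // num_servers)


def shard_for(param_id, num_servers):
    """Hypothetical placement: plain modulo partitioning by parameter ID."""
    return param_id % num_servers


def local_offset(param_id, num_servers, bytes_per_param=BYTES_PER_PARAM):
    """Hypothetical byte offset of a parameter inside its server's partition."""
    return (param_id // num_servers) * bytes_per_param


if __name__ == "__main__":
    target_params = 10 * PARAMS_DEMONSTRATED  # one order of magnitude larger
    print("bytes per parameter:", BYTES_PER_PARAM)
    print("target model size (PB):", model_bytes(target_params) / 10**15)
    print("per-server share today (TB):",
          per_server_bytes(PARAMS_DEMONSTRATED, SERVERS_DEMONSTRATED) / 10**12)
```

Under these assumptions, a model ten times larger would occupy about 2 PB, which is why keeping every partition in DRAM becomes impractical and why placement and offset calculations like the hypothetical ones above would have to target SSD-resident partitions instead.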

Requirements

We are inviting applications from students to conduct their master’s thesis work or an internship project at the IBM Research lab in Zurich on this exciting new topic. The research focus will be on exploring how to extend the PERSIA data path to enable model training when the model parameters reside on secondary storage. The ideal candidate should be well versed in distributed systems and operating systems and have strong programming skills (C/C++ or Rust). Hands-on experience with complex software systems such as machine learning frameworks, databases, or storage systems is a plus.

Diversity

IBM is committed to diversity in the workplace. With us you will find an open, multicultural environment. Excellent flexible working arrangements enable all genders to strike the desired balance between their professional development and their personal lives.

How to apply

If you are interested in this exciting position, please submit your most recent curriculum vitae.

For technical questions, please contact
Dr. Radu Stoica.