Master’s student or intern

Efficient Scale-out Execution of Distributed Serverless Applications

Ref. 2021_044

Serverless computing is a cloud-computing execution model in which the cloud provider dynamically manages the allocation of machine resources. As a cloud service, it is becoming increasingly popular due to its high elasticity and fine-grained billing. Serverless platforms such as AWS Lambda, Google Cloud Functions, Azure Functions and IBM Code Engine enable users to quickly launch thousands of lightweight tasks (as opposed to entire virtual machines), while automatically scaling compute, storage and memory according to application demands at millisecond granularity. While serverless platforms were originally developed for web microservices, their elasticity advantages in particular make them appealing for a wider range of applications, such as interactive big-data analytics and machine learning.

These workloads typically consist of a large number of tightly coupled, interdependent tasks (as opposed to traditional serverless applications, where functions run independently of each other), which adds significant complexity when scaling applications: a task scheduler needs to ensure that task dependencies are met, execute tasks in the optimal order, and select the right resource for each task based on state (e.g., cached data), location (e.g., same or different node) and availability.

Scaling out, i.e., using many distributed resources, is the primary way of accelerating such workloads. The further a workload is scaled out, the more tasks are needed to make use of the additional resources and the less data each task processes. This, however, comes at the cost of increased scheduler overheads and relatively higher per-task execution overheads, among others. Hence, the potential acceleration from scaling out is eventually overshadowed by the increasing overheads, at which point the workload cannot be accelerated any further by scaling out. This is further complicated by the fact that this point is different for each workload. To address these problems, one can reduce the overheads involved in scaling out, thereby pushing the maximal acceleration potential further, and automatically determine the workload-specific maximal scale-out level, e.g., using artificial intelligence.
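The trade-off described above can be sketched with a toy cost model. Note that this is purely illustrative: the constants, the linear-growth assumption for scheduler overhead, and the names `runtime` and `best_scale_out` are all hypothetical and not taken from any real scheduler.

```python
# Illustrative-only model of the scale-out trade-off.
# All constants are hypothetical; real overheads are workload-specific.

def runtime(n_tasks: int,
            total_work: float = 1000.0,
            per_task_overhead: float = 0.5,
            scheduler_overhead: float = 0.02) -> float:
    """Estimated runtime when the workload is split into n_tasks tasks.

    total_work         -- total compute time at scale-out level 1
    per_task_overhead  -- fixed cost paid regardless of scale-out (startup, I/O)
    scheduler_overhead -- per-task cost of dependency tracking and placement,
                          assumed here to grow linearly with the task count
    """
    compute = total_work / n_tasks                       # perfectly parallel part
    overhead = per_task_overhead + scheduler_overhead * n_tasks
    return compute + overhead

def best_scale_out(max_tasks: int = 1000) -> int:
    """Workload-specific optimum: beyond it, overheads dominate."""
    return min(range(1, max_tasks + 1), key=runtime)

if __name__ == "__main__":
    n_opt = best_scale_out()
    print(f"optimal scale-out under this toy model: {n_opt} tasks")
    # Scaling past the optimum makes the workload slower again:
    assert runtime(n_opt) < runtime(1)
    assert runtime(n_opt) < runtime(1000)
```

Under these made-up constants the runtime curve first falls as compute is divided among more tasks, then rises again as the per-task and scheduler overheads take over, which is exactly the workload-specific "sweet spot" the project aims to find automatically.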

We have been successfully developing a serverless resource manager that uses novel resource management and task scheduling techniques to execute Apache Spark applications faster and more efficiently on shared Kubernetes clusters. It powers a serverless Spark service for our clients in the IBM cloud. In this context, we would like to explore novel techniques to address the above-mentioned issues and to build a truly serverless Spark service that utilizes resources even more efficiently and scales out even small workloads to the optimal level without the need for user guidance.

We are inviting applications from students to conduct their master’s thesis work or an internship project at the IBM Research lab in Zurich on this exciting new topic. The research focus will be on exploring techniques for efficient auto-scaling of Spark applications and/or techniques to reduce overheads related to scale-out.


The ideal candidate should be well versed in distributed systems, resource management and scheduling, and have strong programming skills (Scala/Java, C++). Hands-on experience with distributed container orchestration systems (Kubernetes) and Apache Spark is highly desirable.


IBM is committed to diversity at the workplace. With us you will find an open, multicultural environment. Excellent flexible working arrangements enable all genders to strike the desired balance between their professional development and their personal lives.

How to apply

If you are interested in this position, please submit your application below.

For more information or technical questions, please contact
Dr. Michael Kaufmann ().