Master Thesis, Internship

Internship or Master Thesis on ML-based optimization of LLM kernels for multiple platforms

Ref. 2024_024

Project description

Manually optimized kernels (e.g., FlashAttention) are critical to the performance of LLM inference and training. However, most of these kernels are carefully tuned for a specific GPU platform, which poses a serious obstacle to the portability of LLM applications. Consequently, to achieve high performance on different GPUs, LLM kernels must be re-implemented or manually re-optimized.

OpenAI Triton (https://github.com/triton-lang/triton) has recently emerged as a promising open-source alternative to writing custom CUDA kernels. It lets developers write GPU kernels in simple Python code. Triton kernels can be both highly performant and portable across different GPU architectures. For this reason, Triton is growing in popularity, and many LLM inference frameworks, e.g., vLLM (https://github.com/vllm-project/vllm), already include several kernels written in Triton.
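To give a flavor of what "GPU kernels in simple Python code" means, below is a minimal, illustrative Triton kernel for element-wise vector addition. The names (`add_kernel`, `BLOCK_SIZE`) are our own choices, not taken from any project kernel, and the launch only runs when a CUDA device is available:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # Launch enough program instances to cover all n elements.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if torch.cuda.is_available():
    a = torch.rand(4096, device="cuda")
    b = torch.rand(4096, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```

Note that `BLOCK_SIZE` and the launch grid are exactly the kind of platform-sensitive tuning knobs this project targets: the same kernel source runs on different GPUs, but the best parameter values differ per architecture.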

Although Triton promises adaptability across many GPU platforms, in practice this still requires manual performance fine-tuning. In this context, we aim to answer the following research questions:

  1. Can ML-based performance models of Triton kernels predict their performance on different hardware with sufficiently high accuracy? Can these performance models then recommend the best adaptation strategies?
  2. How can we adapt existing Triton kernels to new hardware features such as NVIDIA's Tensor Memory Accelerator, potentially using established compiler-based optimizations?
  3. Can we build an autonomous pipeline for fine-tuning Triton kernels for heterogeneous hardware?
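As a toy illustration of research question 1, one could fit a simple regression model that maps kernel launch parameters to measured runtimes and use it to rank candidate configurations without running them all. The feature choice (log-scaled `BLOCK_SIZE` and `num_warps`) and the runtime numbers below are made up for illustration; a real pipeline would profile actual Triton kernels on each target GPU:

```python
import numpy as np

# Synthetic training data: (BLOCK_SIZE, num_warps) -> measured runtime (ms).
# These values are invented for illustration, not real measurements.
configs = np.array([
    [128, 2], [128, 4], [256, 4], [256, 8], [512, 4], [512, 8],
], dtype=float)
runtimes = np.array([0.90, 0.75, 0.60, 0.55, 0.58, 0.50])

# Fit a linear model runtime ~ w . [log2(block), log2(warps), 1]
# with ordinary least squares.
X = np.hstack([np.log2(configs), np.ones((len(configs), 1))])
w, *_ = np.linalg.lstsq(X, runtimes, rcond=None)

def predict(block_size: int, num_warps: int) -> float:
    """Predict runtime (ms) for an unseen launch configuration."""
    feats = np.array([np.log2(block_size), np.log2(num_warps), 1.0])
    return float(feats @ w)

# Rank candidate configurations by predicted runtime and pick the best,
# instead of benchmarking every candidate on every GPU.
candidates = [(1024, 4), (1024, 8), (256, 2)]
best = min(candidates, key=lambda c: predict(*c))
```

A practical version of this idea would use richer features (occupancy, memory traffic, hardware counters) and a stronger learned model, but the workflow is the same: predict, rank, then verify only the top candidates on real hardware.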

Qualifications:

  • Enrolled in or having completed a Master's degree program in computer science, with a keen interest in data storage research, cloud computing, and performance engineering.
  • Excellent coding skills: familiarity with Python, PyTorch, or other ML frameworks.
  • Familiarity with Linux environments and software development tools (git/GitHub, IDEs, virtual machines, containers, etc.).
  • A high degree of creativity and outstanding problem-solving ability.

Preferred Qualifications:

  • Experience with CUDA or Triton.
  • Experience with LLM inference applications.
  • Experience in machine learning.
  • Excellent oral and written English with good presentation skills.
  • Strong interpersonal skills.

Diversity

IBM is committed to diversity at the workplace. With us you will find an open, multicultural environment. Excellent flexible working arrangements enable all genders to strike the desired balance between their professional development and their personal lives.

How to apply

Please submit your application through the link below. This position is available starting immediately or at a later date.