Logistic regression classifier trained in 91.5 seconds, 46× faster than the best result reported so far

A growing number of small and medium enterprises rely on machine learning as part of their everyday business.

—Celestine Dünner, IBM scientist

We have developed an efficient, scalable machine-learning library that enables very fast training of generalized linear models. We have demonstrated that our library can remove training time as a bottleneck for machine-learning workloads, paving the way for a range of new applications. For instance, it allows more agile development, faster and more fine-grained exploration of the hyper-parameter space, scaling to massive datasets, and frequent retraining of models to adapt to events as they occur.
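
When a single fit completes in seconds rather than hours, sweeping the regularization path or retraining on fresh data becomes routine. The sketch below illustrates that workflow with scikit-learn's LogisticRegression standing in for any estimator that follows the same fit/score convention; the synthetic dataset, parameter grid and timing loop are illustrative assumptions, not part of Snap ML itself.

```python
# Illustrative only: scikit-learn's LogisticRegression stands in for any
# estimator with a fit/score interface. Fast training makes a fine-grained
# regularization sweep like this cheap enough to run on every retraining cycle.
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data as a placeholder for a real workload.
X, y = make_classification(n_samples=100_000, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fine-grained sweep over the regularization strength.
for C in np.logspace(-3, 3, 13):
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=200)
    t0 = time.perf_counter()
    clf.fit(X_train, y_train)
    elapsed = time.perf_counter() - t0
    acc = clf.score(X_test, y_test)
    print(f"C={C:8.3f}  train_time={elapsed:6.2f}s  test_acc={acc:.4f}")
```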

Cloud resources are typically billed by the hour, so the time required to train machine-learning models is directly related to outgoing costs.

—Thomas Parnell, IBM scientist

Our library, called Snap Machine Learning (Snap ML), combines recent advances in machine-learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern distributed systems. This allows us to leverage available network, memory and heterogeneous compute resources effectively. On a terabyte-scale publicly available dataset for click-through-rate prediction in computational advertising, we demonstrate the training of a logistic regression classifier in 1.53 minutes.

The three main features that distinguish Snap ML are

  • Distributed training: We built our system as a data-parallel framework, enabling us to scale out and train on massive datasets that exceed the memory capacity of a single machine, which is crucial for large-scale applications.
  • GPU acceleration: We implemented specialized solvers designed to leverage the massively parallel architecture of GPUs while respecting data locality in GPU memory to avoid large data-transfer overheads. To make this approach scalable, we take advantage of recent developments in heterogeneous learning to achieve GPU acceleration even when only a small fraction of the data fits in the accelerator memory.
  • Sparse data structures: Many machine-learning datasets are sparse. We therefore employ new optimizations for the algorithms used in our system when they operate on sparse data structures; a generic illustration of such a sparse representation follows this list.
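
Click-through-rate data of the kind used in our benchmark is extremely sparse: each example activates only a handful of one-hot-encoded features. The snippet below is a generic illustration of such a compressed sparse row (CSR) representation using SciPy and scikit-learn; it does not show Snap ML's internal data structures or optimizations, and the dimensions and estimator choice are assumptions made for the example.

```python
# Generic illustration of sparse training data (not Snap ML internals):
# one-hot encoded categorical features yield a CSR matrix in which only
# the non-zero entries are stored and traversed during training.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_features, nnz_per_row = 10_000, 100_000, 20

# Build a CSR matrix with roughly 20 active (one-hot) features per example.
indptr = np.arange(0, (n_samples + 1) * nnz_per_row, nnz_per_row)
indices = rng.integers(0, n_features, size=n_samples * nnz_per_row)
data = np.ones(n_samples * nnz_per_row, dtype=np.float32)
X = csr_matrix((data, indices, indptr), shape=(n_samples, n_features))
X.sum_duplicates()  # merge any repeated column indices within a row

# Labels generated from a random linear model, purely for illustration.
w_true = rng.standard_normal(n_features)
y = (X @ w_true > 0).astype(int)

print(f"density of X: {X.nnz / (n_samples * n_features):.2e}")  # about 2e-04

# Linear models can train directly on the CSR structure, touching only non-zeros.
clf = LogisticRegression(max_iter=200).fit(X, y)
```
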
Snap ML

System description

Data parallelism across worker nodes in a cluster

The first level of parallelism spans the individual worker nodes in a cluster. The data is stored in a distributed manner across multiple worker nodes that are connected over a network interface. This data-parallel approach enables training on large-scale datasets that exceed the memory capacity of a single machine.
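
As a toy illustration of this first level, the sketch below shards the training examples row-wise across a set of simulated workers; each worker computes the gradient of the logistic loss on its local shard, and the contributions are then summed, as an all-reduce would do in a real cluster. The solver and communication pattern are deliberately simplified and are not Snap ML's actual implementation.

```python
# Toy simulation of data parallelism for logistic regression (level 1):
# examples are sharded row-wise across workers, each worker computes the
# gradient on its local shard, and the shards' contributions are summed.
import numpy as np

def local_gradient(w, X_local, y_local):
    """Gradient of the logistic loss on one worker's shard (labels in {-1, +1})."""
    margins = y_local * (X_local @ w)
    coeffs = -y_local / (1.0 + np.exp(margins))
    return X_local.T @ coeffs

rng = np.random.default_rng(0)
n, d, n_workers = 8_000, 50, 4
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.where(X @ w_star + 0.1 * rng.standard_normal(n) > 0, 1.0, -1.0)

shards = np.array_split(np.arange(n), n_workers)  # row-wise partition of the data
w = np.zeros(d)
for step in range(100):
    # Computed sequentially here; on a cluster each shard's gradient runs on its own node.
    grads = [local_gradient(w, X[idx], y[idx]) for idx in shards]
    w -= 0.1 * (sum(grads) / n)  # aggregate ("all-reduce") and take a gradient step

acc = np.mean(np.where(X @ w > 0, 1.0, -1.0) == y)
print(f"training accuracy after 100 steps: {acc:.3f}")
```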

Parallelism across heterogeneous compute units within one worker node

On the individual worker nodes, we can leverage one or more accelerator units, such as GPUs, by systematically splitting the workload between the host and the accelerators. The resulting workloads are executed in parallel, enabling full utilization of all hardware resources on each worker and thereby providing the second level of parallelism, across heterogeneous compute units.
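
The sketch below conveys the idea at this second level: when only a fraction of the data fits in accelerator memory, the columns judged most useful for progress are staged on the GPU while the remainder stays in host memory. The importance score and capacity used here are simple placeholders, not the selection criterion or memory model used in Snap ML.

```python
# Conceptual sketch of the heterogeneous split (level 2): only a subset of
# the data fits in accelerator memory, so the most "important" columns are
# staged there while the rest remains on the host. The importance score is
# a random placeholder, not Snap ML's actual selection scheme.
import numpy as np

def select_for_accelerator(importance, gpu_capacity):
    """Indices of the `gpu_capacity` highest-scoring columns to copy to the GPU."""
    return np.argsort(importance)[::-1][:gpu_capacity]

rng = np.random.default_rng(0)
n_features, gpu_capacity = 1_000, 100          # assume only 10% of columns fit on the GPU

importance = rng.random(n_features)            # placeholder per-column scores
on_gpu = select_for_accelerator(importance, gpu_capacity)
on_host = np.setdiff1d(np.arange(n_features), on_gpu)

# A real implementation would run the GPU solver on the staged subset,
# refresh the scores, and periodically swap columns in and out.
print(f"{len(on_gpu)} columns staged on the accelerator, {len(on_host)} remain on the host")
```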

Multi-core parallelism within individual compute units

To execute the workloads assigned to the individual compute units efficiently, we leverage the parallelism provided by the respective compute architecture. We use specially designed solvers to take full advantage of the massively parallel architecture of modern GPUs, and multi-threaded code to process the workload on the CPU. This yields the additional, third level of parallelism, across the cores within each compute unit.
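
As a toy illustration of this third level, the sketch below splits the CPU's share of the work into one chunk per core and processes the chunks on a thread pool. In practice this is native multi-threaded code; Python threads merely stand in here, and the concurrency relies on NumPy's BLAS-backed matrix products releasing the GIL. The assumed core count of eight and the prediction workload are illustrative choices.

```python
# Toy sketch of the third level: the CPU's share of the work is split across
# cores. In the library this is native multi-threaded code; here Python threads
# stand in, and NumPy's BLAS-backed products release the GIL so the chunks
# can actually run concurrently.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
X_host = rng.standard_normal((200_000, 64))   # the partition assigned to the CPU
w = rng.standard_normal(64)

def chunk_predictions(rows):
    """Compute the linear predictions for one chunk of rows."""
    return X_host[rows] @ w

chunks = np.array_split(np.arange(X_host.shape[0]), 8)  # one chunk per core (assumed 8)
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(chunk_predictions, chunks))

predictions = np.concatenate(parts)
print(predictions.shape)  # (200000,)
```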


Publications

C. Dünner, T. Parnell, D. Sarigiannis, N. Ioannou, A. Anghel and H. Pozidis,
“Snap ML: A Hierarchical Framework for Machine Learning,”
to appear at NIPS, 2018.

C. Dünner, M. Gargiani, A. Lucchi, A. Bian, T. Hofmann and M. Jaggi,
“A Distributed Second-Order Algorithm You Can Trust,”
ICML, 2018.

T. Parnell, C. Dünner, K. Atasu, M. Sifalakis and H. Pozidis,
“Tera-Scale Coordinate Descent on GPUs,”
FGCS, 2018.

C. Dünner, T. Parnell and M. Jaggi,
“Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems,”
NIPS, 2017.

C. Dünner, T. Parnell, K. Atasu, M. Sifalakis and H. Pozidis,
“Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark,”
IEEE Big Data, 2017.

T. Parnell, C. Dünner, K. Atasu, M. Sifalakis and H. Pozidis,
“Large-Scale Stochastic Learning using GPUs,”
IPDPSW – ParLearning, 2017.

C. Dünner, S. Forte, M. Takac and M. Jaggi,
“Primal-Dual Rates and Certificates,”
ICML, 2016.