Knowledge discovery through data mining and machine learning have recently proliferated in our digital universe. The generation of huge amounts of data and the high computational power now available have fueled an inexorable desire to capture and analyze this data to uncover hidden patterns that promise to lead to better insights and improved decision making.

However, as data volumes increase faster than our ability to process them, it becomes crucial to revisit traditional methods of computing. This has led to the emergence of new data processing frameworks such as MapReduce and Spark, that are better suited to the new data-centric computing paradigm.

Our research in data-centric computing for analytics applications is built on two main pillars, namely infrastructure and algorithms. On the infrastructure front, we are designing new user-level I/O architectures that enable high-performance data movement and integrate seamlessly with existing data processing frameworks. On the algorithmic front, we are developing new software infrastructure and algorithms to accelerate large-scale machine-learning workloads with a focus on improving performance and achieving scalability.

Scalable large-scale machine learning

We live in a world where data is continuously collected from various sources, whether by individuals or by businesses of all sizes. Businesses would ideally like to mine all this data to obtain knowledge and act upon it to improve their competitive edge.

However, data volumes are increasing at a much faster rate than processing power, and this is creating the need for parallelization and distributed processing. This has led to the development and rapid proliferation of distributed computing frameworks such as MapReduce and Spark.

These open-source frameworks offer extensive library support for machine learning, analytics and a growing ecosystem of developers. However, deployment of machine-learning workloads on Spark does not in itself guarantee scalability, and faster processing may not always be achieved. In fact, the development of learning algorithms that scale gracefully with the dataset size and exhibit linear convergence speed-up with the amount of computing resources is currently an open issue.

For certain computationally intensive problems, such as deep learning, as well as for very large datasets, such as collections of high-resolution medical images, large text corpora or large graphs, even the most scalable learning algorithms may not suffice to obtain acceptable solutions within a reasonable time.

In such cases, acceleration by specialized engines such as GPUs or FPGAs is one possible solution. However, although acceleration of machine-learning tasks on directly attached GPUs and, to a lesser extent, FPGAs is well established, acceleration on distributed, heterogeneous platforms is still largely unexplored territory.

 

Our focus

In this project we develop new software infrastructure and algorithms for the acceleration of large-scale machine learning workloads, with a focus on performance improvement and scalability. In particular, we strive to develop new distributed machine learning algorithms, optimization schemes and associated runtime libraries that exhibit fast convergence and linear speed-up with the number of available computing resources. We target online learning paradigms (mostly variants of stochastic gradient descent or stochastic coordinate descent) to solve large-scale problems.

Furthermore, we aim to accelerate selected computational modules of key large-scale workloads by offloading them to accelerators (GPUs and FPGAs) that are federated and clustered around the computing cores. We leverage the powerful and established Spark framework for parallel and distributed processing on general purpose CPU clusters, and extend Spark on distributed clusters of heterogeneous components, namely GPUs and FPGAs, to boost its performance and scalability.

Equally important, we rely on the computing power, large memory bandwidth and cache coherency of the IBM Power architecture to achieve performance benefits on key analytics workloads. We also leverage our deep expertise in Flash and next-generation nonvolatile memory to create optimized learning algorithms that make the best use of the available memory and storage resources.

 

Recent results

On the algorithmic front, our main focus is on the optimization framework that forms the basis of any machine-learning algorithm.

In other words, we are studying variants of stochastic gradient descent and coordinate descent, primal/­dual problems and variance reduction. We are also exploring parallelization strategies through delayed or asynchronous gradient updates, parameter server or mini-batch schemes.

One important objective is to understand tradeoffs between computation, communication, memory and I/O access, and to optimize them to achieve theoretically optimal performance [1].

On the acceleration front, we start by offloading parts of each worker’s computation on a GPU. For this we build on existing knowledge and flexible programming frameworks such as CUDA.

In order to train machine-learning models quickly on large datasets that do not fit within the memory of a single GPU, we have developed techniques that combine GPU acceleration and distributed learning [2]. Using these techniques, one can attain a high degree of training accuracy in a few seconds on datasets comprising up to 200 million examples and 75 million features.

 

Ask the experts

Thomas Parnell

Thomas Parnell

IBM Research scientist

Kubilay Atasu

Kubilay Atasu

IBM Research scientist

Haris Pozidis

Haris Pozidis

IBM Research scientist

Celestine Duenner

Celestine Duenner

IBM Research scientist

Manolis Sifalakis

Manolis Sifalakis

IBM Research scientist