Optimizing Machine Learning
Snap ML is a library for high-speed training of machine learning models. It provides:

- Multi-threaded CPU solvers, as well as GPU and multi-GPU solvers, that offer significant acceleration over established libraries.
- Distributed solvers (currently for generalized linear models) that scale gracefully to train on TB-scale datasets in mere seconds.
- A novel gradient boosting machine that achieves state-of-the-art generalization accuracy on a majority of datasets tested.
- The ability to complete large training jobs with fewer resources and high resource utilization.
- Familiar Python scikit-learn APIs for single-server solvers and an Apache Spark API for distributed solvers.
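Because Snap ML exposes scikit-learn-compatible APIs, single-server usage follows the familiar fit/predict pattern. A minimal sketch of that pattern, using scikit-learn's own `GradientBoostingClassifier` as a stand-in (the Snap ML estimator class names are not given in this text, so none are assumed here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stand-in estimator; a Snap ML estimator would slot in the same way,
# since it follows the same fit/predict/score interface.
clf = GradientBoostingClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(f"test accuracy: {score:.3f}")
```

Swapping in a Snap ML estimator should require no other changes to such a pipeline, which is the point of mirroring the scikit-learn interface.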
Gradient Boosting Machine
Gradient boosting models comprise an ensemble of decision trees, similar to a random forest (RF). Although deep neural networks achieve state-of-the-art accuracy on image, audio, and NLP tasks, on structured (tabular) datasets gradient boosting usually outperforms all other models in terms of accuracy. Some of the most popular boosting libraries are XGBoost, LightGBM, and CatBoost. Snap ML introduces SnapBoost, which targets high generalization accuracy through a stochastic combination of base learners, including decision trees and kernel ridge regression models. Benchmarks of SnapBoost against LightGBM and XGBoost, comparing accuracy across a collection of 48 datasets, show that SnapBoost learns a better model on roughly two out of three of the datasets tested.
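The boosting idea behind such ensembles can be sketched in a few lines: for squared loss, each new tree is fit to the residuals (the negative gradient) of the current ensemble's predictions, and its output is added with a small learning rate. A toy illustration with scikit-learn decision trees (this is the generic stagewise algorithm, not SnapBoost's actual implementation, which also mixes in kernel ridge regression base learners):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Stagewise boosting for squared loss: each tree fits the residuals,
# which equal the negative gradient of the L2 loss.
pred = np.zeros_like(y)
learning_rate = 0.1
for _ in range(50):
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, y - pred)              # fit current residuals
    pred += learning_rate * tree.predict(X)

mse_start = np.mean(y ** 2)            # error of the initial zero model
mse_end = np.mean((y - pred) ** 2)     # error after 50 boosting rounds
print(f"MSE: {mse_start:.3f} -> {mse_end:.3f}")
```

Each round shrinks the training error, which is why depth-limited "weak" trees combine into a strong model.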
OpenML (www.openml.org) is a platform for collaborative data science. Snap ML's gradient boosting machine was benchmarked against XGBoost and LightGBM using 48 binary classification datasets from OpenML. Hyper-parameter tuning and generalization estimation were performed using 3x3 nested cross-validation. Snap ML provides best-in-class accuracy for a majority of these datasets.
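3x3 nested cross-validation means an inner 3-fold loop that tunes hyper-parameters wrapped in an outer 3-fold loop that estimates generalization, so test folds never influence tuning. A minimal sketch of this protocol with scikit-learn (the dataset, estimator, and parameter grid here are illustrative stand-ins, not those used in the benchmark):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner 3-fold loop: hyper-parameter search (illustrative grid).
param_grid = {"max_depth": [2, 4], "learning_rate": [0.05, 0.1]}
inner = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
)

# Outer 3-fold loop: each fold retunes on its training split, then
# scores on its held-out split, giving an unbiased accuracy estimate.
scores = cross_val_score(inner, X, y, cv=3)
print(f"nested-CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Averaging the outer-fold scores yields the generalization estimate reported per dataset in such benchmarks.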