Building State-Of-The-Art Machine Learning Models With AutoGluon

Multi-layer stack ensemble architectures have been shown to perform really well on certain machine learning problems. In this post, I provide a gentle introduction to AutoGluon, an open-source AutoML framework, and an example of how you can quickly build state-of-the-art models that are good enough to get you a top 10% score on Kaggle!

AutoGluon and AutoML

AutoGluon is an open-source AutoML framework built by AWS that aims to make AutoML easy to use and easy to extend. It lets you achieve state-of-the-art predictive accuracy by using state-of-the-art deep learning techniques without requiring expert knowledge. It is also a quick way to prototype what you can achieve with your dataset and to get an initial baseline for your machine learning problem. AutoGluon currently supports tabular data, text prediction, image classification, and object detection.

AutoML frameworks exist to lower the bar for getting started with machine learning. They take care of heavy-lifting tasks like data preprocessing, feature engineering, algorithm selection, and hyperparameter tuning. In practice, this means that, given a dataset and a machine learning problem, the framework keeps training different models with different combinations of hyperparameters until it finds the best combination of model and hyperparameters, a problem also referred to as CASH (combined algorithm selection and hyperparameter optimization). Existing AutoML frameworks include SageMaker Autopilot, Auto-WEKA, and Auto-sklearn.

AutoGluon differs from other (traditional) AutoML frameworks in that it does more than CASH.

Ensemble Machine Learning and Stacking

Before diving into AutoGluon, it is useful to revisit ensemble machine learning and stacking. Ensemble learning is a machine learning technique in which many (purposefully) weak models are trained in parallel to solve the same problem. An ensemble consists of a set of individually trained classifiers, such as neural networks or decision trees, whose predictions are combined when classifying new instances. The basic idea is that many models are better than few, and that models that learn differently can boost accuracy even if they perform worse in isolation.

In most cases, a single base algorithm is selected to build multiple models, whose results are then aggregated. This is known as the homogeneous method of ensemble learning. The random forest algorithm is one of the most common and popular homogeneous ensemble learning techniques: multiple trees are trained to solve the same problem, and then a majority vote is taken among them. Other examples of homogeneous methods include bagging, rotation forest, random subspace, etc.
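
To make the majority vote concrete, here is a minimal sketch in plain Python. The three single-feature rules and the passenger record are hypothetical stand-ins for independently trained weak models; they are not something AutoGluon produces.

```python
from collections import Counter

def majority_vote(models, x):
    """Return the label predicted by most of the models for input x."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical weak classifiers: each one looks at a single feature only.
weak_models = [
    lambda p: 1 if p["sex"] == "female" else 0,
    lambda p: 1 if p["age"] < 16 else 0,
    lambda p: 1 if p["pclass"] == 1 else 0,
]

passenger = {"sex": "female", "age": 29, "pclass": 1}
prediction = majority_vote(weak_models, passenger)  # two of the three models vote 1
```

Each rule is a poor classifier on its own; the vote only needs most of them to be right on any given input.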

In contrast, heterogeneous methods use different base algorithms, such as decision trees and artificial neural networks, to create the models that make up the ensemble. Stacking is a common heterogeneous ensemble learning technique.
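
As a rough illustration of stacking, the sketch below trains two different base models on one part of a toy dataset and then trains a second-layer model on their predictions for the held-out part. The base learners, the blend-weight search, and the data are all assumptions made for this example; AutoGluon's own stacker is considerably more elaborate.

```python
def fit_mean(xs, ys):
    """Base model 1: always predict the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_nearest(xs, ys):
    """Base model 2: predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def fit_blender(base_preds, ys):
    """Meta-model: pick the blend weight with the lowest squared error."""
    def error(w):
        return sum((w * a + (1 - w) * b - y) ** 2
                   for (a, b), y in zip(base_preds, ys))
    best_w = min((i / 10 for i in range(11)), key=error)
    return lambda a, b: best_w * a + (1 - best_w) * b

# Toy data following y = 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2 * x for x in xs]

# Layer 1: train the base models on one part of the data
base_models = [fit(xs[::2], ys[::2]) for fit in (fit_mean, fit_nearest)]

# Layer 2: train the meta-model on base predictions for the held-out part
held_x, held_y = xs[1::2], ys[1::2]
base_preds = [tuple(model(x) for model in base_models) for x in held_x]
blender = fit_blender(base_preds, held_y)

def stacked_predict(x):
    """Feed the base models' predictions into the meta-model."""
    return blender(*(model(x) for model in base_models))
```

The second layer never sees the raw features here, only what the first layer predicted, which is the defining trait of a stack.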

AutoGluon uses a multi-layer stack ensemble, and we will look at how that works next.

How AutoGluon Works

AutoGluon operates in the supervised machine learning domain. This means that you need labeled input data to train on. AutoGluon takes care of the preprocessing and feature engineering, and generates models based on the machine learning problem you are trying to solve.

A major part of AutoML relies on hyperparameter tuning for generating and selecting the best models. Hyperparameter tuning involves finding the combination of hyperparameters for a machine learning algorithm that yields the best model. The search strategy for the best set of hyperparameters is typically based on random search, grid search, or Bayesian optimization (which SageMaker uses).
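
The difference between grid search and random search can be sketched in a few lines. The validation_score function below is a hypothetical stand-in for "train a model with these hyperparameters and score it on validation data", and the hyperparameter names and ranges are made up for illustration.

```python
import itertools
import random

def validation_score(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it."""
    return -(learning_rate - 0.1) ** 2 - 0.01 * (depth - 6) ** 2

def grid_search(grid):
    """Try every combination of the listed values."""
    names = list(grid)
    candidates = (dict(zip(names, values))
                  for values in itertools.product(*grid.values()))
    return max(candidates, key=lambda params: validation_score(**params))

def random_search(space, num_trials, seed=0):
    """Try num_trials combinations sampled at random from the ranges."""
    rng = random.Random(seed)
    candidates = [
        {"learning_rate": rng.uniform(*space["learning_rate"]),
         "depth": rng.randint(*space["depth"])}
        for _ in range(num_trials)
    ]
    return max(candidates, key=lambda params: validation_score(**params))

best_grid = grid_search({"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 6, 10]})
best_random = random_search({"learning_rate": (0.01, 1.0), "depth": (2, 10)},
                            num_trials=20)
```

Either way, every candidate costs one full training run, and only the single best candidate is kept.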

However, traditional hyperparameter tuning approaches consume a lot of compute resources, and since most of the not-so-well-tuned models end up being discarded, much of that effort is wasted. There is also a risk of overfitting the validation (hold-out) data: every time you run a tuning process, you check against the same validation dataset, so you end up overfitting to it.

A key difference between AutoGluon and other AutoML frameworks is that AutoGluon uses (almost) every model that was trained to generate the final prediction (instead of selecting the best candidate model after hyperparameter tuning).

Memory and State

AutoGluon is memory aware: it ensures that trained models do not exceed the memory resources available to it.

AutoGluon is state aware: it expects models to fail or time out during training, gracefully skips the failed ones, and moves on to the next. As long as at least one model has been trained successfully, AutoGluon is ready to go.
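
The general pattern behind this kind of fault tolerance might look like the sketch below. It is only an illustration of the idea, not AutoGluon's internal code; the trainer callables, their names, and the exception types that are caught are assumptions.

```python
import time

def train_all(trainers, time_limit):
    """Run each trainer in turn, skipping failures and respecting a time budget."""
    deadline = time.monotonic() + time_limit
    trained, skipped = {}, []
    for name, trainer in trainers.items():
        if time.monotonic() >= deadline:
            skipped.append(name)  # out of time: do not start another model
            continue
        try:
            trained[name] = trainer()
        except (MemoryError, RuntimeError):
            skipped.append(name)  # one failed model does not stop the run
    if not trained:
        raise RuntimeError("no model was trained successfully")
    return trained, skipped

def failing_trainer():
    raise MemoryError("hypothetical out-of-memory failure")

trained, skipped = train_all(
    {
        "tree": lambda: "tree-model",
        "big_net": failing_trainer,
        "linear": lambda: "linear-model",
    },
    time_limit=60,
)
```

The run only fails outright when no model at all could be trained.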

AutoGluon relies on strategies like multi-layer stack ensembles. It automatically does k-fold bagging with out-of-fold prediction generation to reduce the risk of overfitting. It also leverages modern deep learning techniques and does not require you to preprocess the data yourself.
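
Out-of-fold prediction is easier to see in code. In this minimal sketch, every row is predicted by a model that was trained without that row, so the predictions can be passed on to a later stack layer without leaking the labels. The fold assignment, the fit_mean learner, and the data are illustrative assumptions.

```python
def out_of_fold_predictions(xs, ys, fit, k=4):
    """Predict every row with a model that never saw that row during training."""
    n = len(xs)
    oof = [None] * n
    for fold in range(k):
        held_out = set(range(fold, n, k))  # rows reserved for this fold
        train_idx = [i for i in range(n) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in held_out:
            oof[i] = model(xs[i])
    return oof

def fit_mean(xs, ys):
    """Hypothetical base learner: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

xs = list(range(8))
ys = [float(x) for x in xs]
oof = out_of_fold_predictions(xs, ys, fit_mean, k=4)
```

With k folds, each base learner is trained k times, which is the price paid for predictions that are honest about unseen data.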

AutoGluon in Action

AutoGluon also supports text and image data, but in this post we focus on AutoGluon Tabular. AutoGluon Tabular works on supervised classification and regression problems. You can either specify the type of problem up front or let AutoGluon pick one automatically based on your dataset.

The Dataset

For the dataset, we are using the popular, open-source Titanic dataset from Kaggle. The dataset contains training data, which consists of labeled records for 891 passengers on board the RMS Titanic, indicating whether they survived the disaster or not. The dataset also includes a test set of 418 passengers without the label, that is, without the information on whether they survived.

The challenge is to predict whether a passenger survived based on the included features, such as name, age, gender, and socio-economic class.

Setup

AutoGluon installation is straightforward and takes just a few lines:

pip install -U setuptools wheel
pip install -U "mxnet<2.0.0"
pip install autogluon

Training

To start the training, we begin by importing TabularDataset and TabularPredictor from AutoGluon and then loading the data. AutoGluon can currently operate on data tables already loaded into Python as pandas DataFrames, or on tables stored in CSV or Parquet files.

from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset('train.csv')

Once you have loaded the data, you can start training immediately. All you need to do is point to the training data and specify the name of the column you want to predict.

The training is started with the .fit() method:

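# eval_metric is set on the predictor itself; the other options go to fit()
predictor = TabularPredictor(label='Survived', eval_metric='roc_auc').fit(
    train_data,
    auto_stack=True,          # let AutoGluon manage bagging and stacking
    presets='best_quality',   # favor accuracy over training and inference time
    time_limit=600            # training budget in seconds
)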

Setting auto_stack=True lets AutoGluon automatically manage the number of stack layers it creates. You can optionally specify the number of stack layers yourself via the num_stack_levels parameter.

The presets parameter lets you choose the type of models you want to generate. For instance, if latency and training time are not a constraint, presets='best_quality' will often generate more accurate models. On the other hand, if you know latency is going to be a constraint, you can set presets=['good_quality_faster_inference_only_refit', 'optimize_for_deployment'] to generate models that are better optimized for deployment and inference.

You can optionally specify how long you want the training to run via time_limit (in seconds), and AutoGluon will automatically wrap up all training jobs within that time.

The eval_metric parameter allows you to specify the evaluation metric AutoGluon will use for validating the models. For classification problems, the default is accuracy.
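
One way to keep these choices in one place is a small helper like the sketch below. The helper names (build_fit_kwargs, train) are hypothetical, the preset names are the ones quoted in this post and may differ between AutoGluon releases, and the sketch assumes that eval_metric and problem_type are passed to the TabularPredictor constructor while presets and time_limit are passed to .fit().

```python
def build_fit_kwargs(latency_sensitive=False, time_limit=600):
    """Collect keyword arguments for TabularPredictor.fit()."""
    if latency_sensitive:
        presets = ['good_quality_faster_inference_only_refit',
                   'optimize_for_deployment']
    else:
        presets = 'best_quality'
    return {'presets': presets, 'time_limit': time_limit}

def train(train_data, label, eval_metric='roc_auc', problem_type=None, **options):
    """Train a predictor; problem_type=None lets AutoGluon infer it from the label."""
    # Imported here so build_fit_kwargs() can be used without AutoGluon installed.
    from autogluon.tabular import TabularPredictor
    predictor = TabularPredictor(
        label=label, problem_type=problem_type, eval_metric=eval_metric
    )
    return predictor.fit(train_data, **build_fit_kwargs(**options))

# Settings for a deployment where inference latency matters
fit_kwargs = build_fit_kwargs(latency_sensitive=True, time_limit=300)
```

With this in place, a call such as train(train_data, 'Survived', latency_sensitive=True) would train with the deployment-oriented presets.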

Once training is started, AutoGluon will start logging messages as it proceeds with different stages of the training.

As the training proceeds, AutoGluon will also log the evaluation scores for the various models it generates.

It is important to note here that unless you have an explicit need to specify a validation dataset, you should directly send all the training data to AutoGluon. This allows AutoGluon to automatically choose a random training/validation split of the data in an efficient manner.

After training is completed, you can start making predictions with the .predict() method.

test_data = TabularDataset('test.csv')
predictions = predictor.predict(test_data)
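
To score the predictions on Kaggle, they have to be written to a submission file. The helper below is a sketch using only the standard library; write_submission is a hypothetical name, and it assumes the Titanic competition's two-column layout of PassengerId and Survived.

```python
import csv

def write_submission(passenger_ids, predictions, path='submission.csv'):
    """Write PassengerId/Survived pairs, one row per test passenger."""
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['PassengerId', 'Survived'])
        for passenger_id, survived in zip(passenger_ids, predictions):
            writer.writerow([int(passenger_id), int(survived)])

# With the objects from the snippet above, this could be called as:
# write_submission(test_data['PassengerId'], predictions)
```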

In the above example, we trained on the Titanic dataset with only a handful of settings and without any manual preprocessing, feature engineering, or hyperparameter tuning.

The results we achieved were state of the art: an accuracy of ~78% and a place in the top 8%-10% of the competition on Kaggle.

What If You Want to Use One Or Some of The Models

By default, AutoGluon automatically chooses the best multi-layer stack ensemble to run your predictions. However, you can also get a listing of all the models AutoGluon generated, along with their performance metrics, by generating a leaderboard with a single line of code:

predictor.leaderboard(extra_info=True, silent=True)

Using this leaderboard, you can select a specific model or stack ensemble to run: look it up in the leaderboard, for example by its index, and pass its name to .predict() via the model parameter.
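
A small helper for this might look like the following sketch. predict_with_model is a hypothetical name; the sketch assumes that the leaderboard is returned as a pandas DataFrame with a 'model' column and that .predict() accepts a model argument.

```python
def predict_with_model(predictor, data, index):
    """Predict with the model found at the given row of the leaderboard."""
    leaderboard = predictor.leaderboard(silent=True)
    model_name = leaderboard['model'].iloc[index]
    return predictor.predict(data, model=model_name)

# For example, to use the second entry of the leaderboard:
# predictions = predict_with_model(predictor, test_data, 1)
```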

What About Hyperparameter Tuning

Contrary to what you may have experienced working with other machine learning frameworks, you may not need to do any hyperparameter tuning with AutoGluon. In most cases, you will get the best accuracy by setting auto_stack=True or by manually specifying num_stack_levels along with num_bag_folds.

However, AutoGluon does support hyperparameter tuning via the hyperparameter_tune_kwargs parameter of .fit() (hp_tune=True in early releases). When you enable hyperparameter tuning, AutoGluon will only create models with the base algorithms for which you have specified hyperparameter settings. It will skip the rest.

For example, if you specify search spaces for a neural network and for tree-based models, AutoGluon will train those models and tune the hyperparameters of each one within the search space specified.
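
A configuration of this kind could be sketched as follows. This is an illustration under assumptions, not the exact setup used for this post: it assumes a recent AutoGluon release in which search spaces live in autogluon.common.space and the neural network is selected with the 'NN_TORCH' key (older releases used different names), and the search ranges and the tune_and_train helper name are made up for the example.

```python
def tune_and_train(train_data, label, num_trials=5, time_limit=600):
    """Tune a neural network and a gradient-boosted tree model; skip other model types."""
    # Imported here so the sketch can be loaded without AutoGluon installed.
    from autogluon.common import space
    from autogluon.tabular import TabularPredictor

    # Only the model types listed here are trained.
    hyperparameters = {
        'NN_TORCH': {
            'num_epochs': 10,
            'learning_rate': space.Real(1e-4, 1e-2, default=5e-4, log=True),
            'dropout_prob': space.Real(0.0, 0.5, default=0.1),
        },
        'GBM': {
            'num_boost_round': 100,
            'num_leaves': space.Int(lower=26, upper=66, default=36),
        },
    }
    predictor = TabularPredictor(label=label)
    return predictor.fit(
        train_data,
        time_limit=time_limit,
        hyperparameters=hyperparameters,
        hyperparameter_tune_kwargs={
            'num_trials': num_trials,  # configurations tried per model type
            'scheduler': 'local',
            'searcher': 'auto',
        },
    )
```

Fixed values such as num_epochs are used as given, while the space objects mark the hyperparameters that are actually searched.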

Concluding Notes

While AutoGluon can build state-of-the-art machine learning models directly, I find it most useful as my new go-to baselining method.

Though AutoGluon does the data preprocessing and feature engineering (and does it really well), you will find that you can get better performance if you preprocess and feature engineer the data yourself before training with AutoGluon. In general, better data almost always leads to better models.

samx18.io