Version your machine learning models with Sacred

Over the last few years the machine learning landscape has been very active and a lot of developments have taken place. From the point of view of a software developer and architect, though, there seems to be a lack of structured software development tools. The typical data scientist (I know that I don't have enough data to prove this claim) will test a lot of things inside a Jupyter notebook, and when the perfect parameters are found the model is put into production. Finding the best parameters involves a lot of trial and error, and it is not easy to keep track of everything that was tried. But what if I want to go back to that thing that was kinda working a few days ago? Enter Sacred, a tool that versions your parameters. On top of that, Sacredboard lets you visualize the metrics of each experiment. In this blog post I will show you how to use Sacred to version an MNIST classifier.


Both Sacred and Sacredboard can be installed with pip. I always use virtual environments and I suggest you do the same. Once you have switched to your machine learning environment, just type pip install sacred and pip install sacredboard. I also use MongoDB to store the results of the experiments. To connect to MongoDB, Sacred uses pymongo, which you can install with pip install pymongo. For the machine learning model I will use Keras with TensorFlow as the backend. Sacred does not care which framework you use, so you can just as well bring your own model.
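Before moving on it can be handy to confirm that the packages really ended up in the environment you are working in. The following is a small sketch using only the standard library. The helper name `check_packages` is made up for this post, and the package list simply mirrors the installs above.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each top-level package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # These are the import names, which happen to match the pip names here.
    packages = ["sacred", "sacredboard", "pymongo", "keras"]
    for name, found in check_packages(packages).items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

If one of the lines says MISSING, you are most likely in a different virtual environment than the one you installed into.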

The model

MNIST is one of the famous Hello World datasets in machine learning. When it comes to image recognition and CNNs (convolutional neural networks) it is THE Hello World dataset. The goal is to classify handwritten digits from zero to nine. The dataset is used all over the place, so you can find a lot of help. The Keras project has a lot of examples on GitHub and we use one of them. It is a simple CNN that gets good results on the dataset.

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

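model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))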
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

If you copy and paste this script and let it run (this may take some time depending on your machine) you will get the first results. But now you want to tweak some parameters and compare the models.

Importing Sacred

The cool thing about Sacred is that you can just sprinkle some decorators on your script and suddenly you can version your models. First of all copy your script to a new file. At the very beginning of the new script a Sacred experiment has to be created. The Experiment class handles everything Sacred related and can be instantiated by giving it just a name. I will use the name mnist_cnn.

from sacred import Experiment
ex = Experiment("mnist_cnn")

I want to store the information in a MongoDB. To do that I need a MongoObserver class. You can find a lot of information on how to connect to a MongoDB instance on the internet so it is not in the scope of this post. I use mlab to create free MongoDB instances for testing purposes.

from sacred.observers import MongoObserver
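The import alone does not store anything yet; the observer still has to be attached to the experiment. Here is a minimal sketch, assuming Sacred 0.8 or newer, where `MongoObserver` accepts `url` and `db_name` directly (older releases used `MongoObserver.create(url=..., db_name=...)` instead). The helper name `attach_mongo_observer` is made up for this post, and the default database name `sacred` is the one I use with sacredboard further down.

```python
def attach_mongo_observer(ex, url, db_name="sacred"):
    """Attach a MongoDB observer to an existing Sacred experiment and return it."""
    # Imported here so that a missing Sacred or pymongo installation
    # shows up when the helper is called, not when this file is loaded.
    from sacred.observers import MongoObserver

    ex.observers.append(MongoObserver(url=url, db_name=db_name))
    return ex
```

In the script you would call it once, right after creating the experiment, for example `attach_mongo_observer(ex, "mongodb://localhost:27017")`. That URL is only a placeholder for a local instance; use the connection string of your own database.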

If you already ran the original script you might have seen that there is a lot of output in the terminal. Sacred captures this output and logs it. Because we want the log to look the way the terminal did, with the progress bar updated in place rather than repeated line after line, we add a captured out filter. Feel free to try leaving out this option to see the difference.

from sacred.utils import apply_backspaces_and_linefeeds
ex.captured_out_filter = apply_backspaces_and_linefeeds
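To get a feeling for what such a filter does: Keras redraws its progress bar by printing backspaces and carriage returns, which look fine in a terminal but clutter a stored log. The following is a simplified standard-library sketch of the idea, not Sacred's actual implementation. The function name `render_like_terminal` is made up for illustration.

```python
def render_like_terminal(text):
    """Apply backspaces and carriage returns the way a terminal would display them."""
    finished_lines = []
    line = []      # characters of the line currently being written
    cursor = 0     # write position inside `line`
    for ch in text:
        if ch == "\n":
            finished_lines.append("".join(line))
            line, cursor = [], 0
        elif ch == "\r":
            cursor = 0                      # jump back to the start of the line
        elif ch == "\b":
            cursor = max(cursor - 1, 0)     # step one character to the left
        else:
            if cursor < len(line):
                line[cursor] = ch           # overwrite what was there
            else:
                line.append(ch)
            cursor += 1
    finished_lines.append("".join(line))
    return "\n".join(finished_lines)


if __name__ == "__main__":
    raw = "Epoch 1/2\n 10%\r 50%\r100%\nabc\b\bXY"
    print(render_like_terminal(raw))
```

Only the final state of each redrawn line survives, which is what you want to read in a log afterwards.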

Decorating the script

The imports above are all we need, so now we can start decorating our script. You may wonder: "But there aren't any functions yet?" That is right. Just wrap your complete script in a function and call it my_main. Decorate it with @ex.automain. This lets Sacred know that this function contains the definition and the training of your model. Because @ex.automain starts the experiment as soon as the function is defined, it has to be the last definition in the file. To measure your success you should return the accuracy of the model (or any other meaningful metric if you are using a different dataset).

@ex.automain
def my_main():
    # ... old script ...
    return score[1]

In the next step we want to make it possible to configure the experiment. To do that we need another function, which we call my_config and decorate with @ex.config. In this function we define the parameters of the experiment; Sacred collects its local variables as config entries. For now we will only use batch_size and the number of epochs.

@ex.config
def my_config():
    batch_size = 128
    epochs = 12

To apply the parameters to our experiment they have to be added to the signature of our my_main function like this:

@ex.automain
def my_main(batch_size, epochs):

Note that you have to remove the two lines from the my_main function where the parameters were set previously.
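If the injection feels like magic, the following standard-library sketch shows the general idea: look at the function's signature and fill every parameter the caller did not pass from a config dictionary. It is a toy illustration under that assumption, not how Sacred is implemented, and the names `inject` and `CONFIG` are hypothetical.

```python
import functools
import inspect

CONFIG = {"batch_size": 128, "epochs": 12}


def inject(config):
    """Decorator that fills arguments the caller left out from `config`."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Find out which parameters the caller already supplied.
            supplied = signature.bind_partial(*args, **kwargs).arguments
            for name in signature.parameters:
                if name not in supplied and name in config:
                    kwargs[name] = config[name]
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject(CONFIG)
def my_main(batch_size, epochs):
    return f"training {epochs} epochs with batch size {batch_size}"


if __name__ == "__main__":
    print(my_main())            # both values come from CONFIG
    print(my_main(epochs=2))    # an explicit argument wins over the config
```

Sacred does considerably more than this (it records the final config with every run, for a start), but the calling convention is the same: parameters named in the signature are filled in for you.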

A first test

Now we are ready to test if sacred already works. Go to the terminal and start an experiment with two epochs.

python <your_script>.py with epochs=2

The experiment should start running with two epochs.
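Everything after `with` is a list of `key=value` updates to the config. As a rough mental model (a simplified sketch, not Sacred's real parser, with the hypothetical helper name `parse_updates`), each value can be thought of as a Python literal that falls back to a plain string when it does not parse:

```python
import ast


def parse_updates(args):
    """Turn ['epochs=2', 'batch_size=64'] into {'epochs': 2, 'batch_size': 64}."""
    updates = {}
    for arg in args:
        key, sep, raw = arg.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {arg!r}")
        try:
            updates[key] = ast.literal_eval(raw)   # numbers, lists, True/False, ...
        except (ValueError, SyntaxError):
            updates[key] = raw                     # keep anything else as a string
    return updates


if __name__ == "__main__":
    print(parse_updates(["epochs=2", "batch_size=64", "optimizer=adadelta"]))
```

The updates are then laid over the defaults from my_config, so any parameter you do not mention keeps its default value.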

Monitor your experiment with sacredboard

While your experiment is running you can boot up sacredboard. All you have to do is start it in the terminal using the MongoDB url. It will automatically open in your standard browser. You can open it with the following command:

sacredboard -mu mongodb:// sacred

The second parameter is the database name. I called it sacred. Here is a look at sacredboard:

As you can see both our config parameters appear on the screen. We can see that our experiment has finished by now and we have an accuracy of 0.9867 after two epochs. The console output can be found under "Captured output". If you click on "Metrics plots" you see that there aren't any metrics. But if we want to compare several models in the future we need some metrics. Therefore we have to add just another function to our code.

Logging Metrics

To log metrics like the loss or the accuracy over time we need to add a new function decorated with @ex.capture. I called it my_metrics:

@ex.capture
def my_metrics(_run, logs):
    _run.log_scalar("loss", float(logs.get('loss')))
    _run.log_scalar("acc", float(logs.get('acc')))
    _run.log_scalar("val_loss", float(logs.get('val_loss')))
    _run.log_scalar("val_acc", float(logs.get('val_acc')))
    _run.result = float(logs.get('val_acc'))

As you can see there are two parameters: a _run object that Sacred passes in and a logs variable that we pass ourselves. We have to adjust our model script a bit to hand the values to Sacred. First we create a subclass of keras.callbacks.Callback and then we tell the model to use it. The class has only one method, on_epoch_end, which just calls our my_metrics function. This means that each time an epoch ends Sacred will log the metrics.

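from keras.callbacks import Callback

class LogMetrics(Callback):
    def on_epoch_end(self, _, logs=None):
        # hand the metrics of the finished epoch over to Sacred
        my_metrics(logs=logs or {})

# ... no changes ...
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test),
          callbacks=[LogMetrics()])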
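One thing to watch out for: the keys in `logs` depend on the Keras version. Older releases report `acc` and `val_acc` as used in my_metrics, while newer ones report `accuracy` and `val_accuracy`, in which case `logs.get('acc')` returns `None` and `float(None)` raises a `TypeError`. Here is a defensive sketch, with the hypothetical helper names `first_present` and `log_epoch`, that works with any object offering a `log_scalar(name, value)` method like Sacred's `_run`:

```python
def first_present(logs, *keys):
    """Return the value of the first key found in `logs`, or None if none is present."""
    for key in keys:
        if key in logs:
            return logs[key]
    return None


def log_epoch(run, logs):
    """Log loss and accuracy for one epoch, tolerating both Keras naming schemes."""
    wanted = {
        "loss": ("loss",),
        "acc": ("acc", "accuracy"),
        "val_loss": ("val_loss",),
        "val_acc": ("val_acc", "val_accuracy"),
    }
    for name, keys in wanted.items():
        value = first_present(logs, *keys)
        if value is not None:          # skip metrics this Keras version did not report
            run.log_scalar(name, float(value))
```

Inside my_metrics you would then call `log_epoch(_run, logs)` in place of the four `log_scalar` lines.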

You could also store the weights of the model (or any other file) as an artifact by calling add_artifact(path) on the _run object. This will of course increase the needed size of your database. Now we are ready to call our experiment again. This time we will train the model over 10 epochs to see a meaningful plot.

python <your_script>.py with epochs=10

While the experiment is running you can check sacredboard for the updated results and watch how the metrics change with each epoch.


In this blog post I showed you how to use sacred to version your machine learning experiments. It offers you an easy way to start experiments with different parameters and observe how they change the outcome. On top of that you can come back to these results any time. The coding overhead is very low. Just add some decorators and functions here and there and everything works. I really like sacred and I will use it heavily in the future. 

The complete code