Production-ready Jupyter Notebooks

Python Jupyter notebooks are a fundamental part of data science prototyping. Deploying this code efficiently is troublesome, and it often means copy-pasting the prototype into the final product. Later, when the product requires new features, the code is copied once again, which is prone to human error.

At Kyriba we wanted consistent code between the development and production environments. For now this is limited to scheduled jobs that produce artefacts such as reports or data to ingest. If you just want to regularly materialize the output of Jupyter notebooks for human users, you may find the papermill tool (https://github.com/nteract/papermill) interesting.

Jupyter is just an IDE; store the code in Git.

You should always version control your code. The easiest way is to mount a directory that already contains a Git repository into the Jupyter instance. Once in the project's lifetime, you have to give Jupyter write access to the notebooks directory with chmod o+w notebooks.

# start_jupyter.sh
docker run --rm -p 8888:8888 \
-e JUPYTER_ENABLE_LAB=yes \
-v "$(pwd)/notebooks":/home/jovyan/work \
jupyter/pyspark-notebook:3395de4db93a

That way every file you create in Jupyter’s work directory is written straight into the mounted repository, ready to be committed to Git.

Beware of Jupyter noise

The .ipynb files contain a lot of noise that does not have to be tracked by VCS. This includes the output of notebooks and the numbers recording the order in which cells were executed. We recommend configuring pre-commit hooks (https://pre-commit.com/) to remove this noise before committing.

repos:
  -   repo: https://github.com/pre-commit/pre-commit-hooks
      rev: v3.2.0
      hooks:
        -   id: trailing-whitespace
        -   id: end-of-file-fixer
  -   repo: https://github.com/kynan/nbstripout
      rev: 0.3.9
      hooks:
        -   id: nbstripout
            # You may want to preserve the output.
            #args: ["--keep-output"]
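
To make the "noise" concrete, the following minimal sketch shows roughly what stripping a notebook means at the JSON level. It is a simplified illustration, not the actual implementation of nbstripout, which handles more cases; the function name strip_notebook_noise and the hand-written example notebook are assumptions made for this example.

```python
import copy
import json


def strip_notebook_noise(notebook: dict) -> dict:
    """Return a copy of a notebook (nbformat 4 JSON) without outputs and execution counts."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results
            cell["execution_count"] = None  # drop the execution order number
    return cleaned


# Tiny hand-written notebook, used only for illustration.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
    ],
}

stripped = strip_notebook_noise(example)
print(json.dumps(stripped["cells"][0], indent=2))
```

Only the source of each cell survives, so a diff between two commits shows changes to the code and nothing else.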

Environment aware

For local development, hardcode the configuration in the first cell; optionally you can use local .env files. This cell has to be ignored in the production environment, and there are two ways of doing that. The easiest is to tag the cell, e.g. with developement, and later filter out every cell carrying this tag (the tag has to be spelled exactly as in the filter). To add a tag, open "Property Inspector" in the top-right corner → "Add tag". The other way is to use nbconvert templates.

As Jupyter still has no managed way of setting up a Conda environment for notebooks, you can also use this cell to install the required packages.

# Cell with development configuration (tagged: developement)
# %env stores the value literally, so do not wrap it in quotes
%env DB_URL=jdbc:postgresql://my.local.postgres.server.at.kyriba/mylocaldb
%env SPARK_MASTER=local[4]
%env PREDICTION_DATE_START=2022-01-17
import sys
!{sys.executable} -m pip install --quiet lightgbm==3.1.1

In the subsequent cell you load the configuration, whether it comes from the previous cell or from production environment variables injected into your container.

import os

prediction_date_start = os.getenv("PREDICTION_DATE_START")
db_url = os.getenv("DB_URL")
spark_master = os.getenv("SPARK_MASTER")
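
Since os.getenv silently returns None for a variable that was never set, a production job can run for a while before failing in a confusing place. One possible variation, sketched below, is to fail fast when a setting is missing. The helper load_config and the REQUIRED_VARIABLES tuple are hypothetical names introduced for this example, not part of the original setup.

```python
import os
from typing import Mapping

# Names of the settings used in this article.
REQUIRED_VARIABLES = ("DB_URL", "SPARK_MASTER", "PREDICTION_DATE_START")


def load_config(environ: Mapping[str, str] = os.environ) -> dict:
    """Collect the required settings, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_VARIABLES if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name.lower(): environ[name] for name in REQUIRED_VARIABLES}


# Example with an explicit mapping instead of the real process environment.
config = load_config(
    {
        "DB_URL": "jdbc:postgresql://my.local.postgres.server.at.kyriba/mylocaldb",
        "SPARK_MASTER": "local[4]",
        "PREDICTION_DATE_START": "2022-01-17",
    }
)
print(config["prediction_date_start"])
```

In a notebook you would call load_config() without arguments, so the same cell works with values set by %env locally and with variables injected into the container in production.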

Organise the modules

One of the most common issues when prototyping with Jupyter is code duplication and a lack of unit tests. The solution is to extract functions into local modules and reuse them, for example between the backtest and the actual prediction notebooks.

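# myforecasting.py
import pandas as pd
from lightgbm import LGBMRegressor


def forecast(X_train: pd.DataFrame, Y_train: pd.Series) -> LGBMRegressor:
    # (...)
    # Fit a gradient-boosted regressor on the training data.
    model = LGBMRegressor().fit(X_train, Y_train)
    return model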

A module like this deserves proper unit tests integrated with your CI environment. If you modify the module frequently, remember to enable auto-reloading; otherwise the notebook keeps working with obsolete function definitions.

%load_ext autoreload
%autoreload 2
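
As an illustration of the unit tests recommended above for extracted modules, here is a minimal sketch in pytest style. The helper prediction_dates is hypothetical and stands in for any small, pure function you might move out of a notebook; in a real project it would live in the module and the tests in a separate test file picked up by your CI.

```python
from datetime import date, timedelta


def prediction_dates(start: str, horizon_days: int) -> list:
    """Hypothetical helper: the dates to forecast, starting at an ISO date string."""
    first_day = date.fromisoformat(start)
    return [first_day + timedelta(days=offset) for offset in range(horizon_days)]


# pytest collects functions named test_* and runs each one.
def test_starts_at_given_date():
    assert prediction_dates("2022-01-17", 1) == [date(2022, 1, 17)]


def test_crosses_month_boundary():
    assert prediction_dates("2022-01-31", 2)[-1] == date(2022, 2, 1)


def test_zero_horizon_is_empty():
    assert prediction_dates("2022-01-17", 0) == []
```

Functions that take plain inputs and return plain outputs are the easiest to test this way, which is one more reason to keep database and Spark access at the edges of the notebook.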

A side note for Spark users: get familiar with spark.sparkContext.addPyFile or with this article (https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html).

Ship it!

To generate the code from notebooks, you can have your CI run the nbconvert tool. With the --to script option you can convert your notebooks into executable .py files written to the jobs directory.

# package_notebooks.sh
jupyter nbconvert --to script --TagRemovePreprocessor.enabled=True \
 --TagRemovePreprocessor.remove_cell_tags="['developement']" \
 --TemplateExporter.exclude_markdown=True \
 --output-dir=jobs \
 notebooks/*.ipynb
# copy local modules as well
cp notebooks/*.py jobs/
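
To check which cells a tag filter like the one above would drop, it can help to look at the notebook JSON directly, because tags are stored under each cell's metadata. The sketch below is a simplified stand-in for that filtering step, not nbconvert's own code; the function drop_tagged_cells and the sample notebook are assumptions made for illustration.

```python
def drop_tagged_cells(notebook: dict, tag: str) -> dict:
    """Return a shallow copy of the notebook without cells carrying the given tag."""
    kept = [
        cell
        for cell in notebook.get("cells", [])
        if tag not in cell.get("metadata", {}).get("tags", [])
    ]
    return {**notebook, "cells": kept}


# Hand-written notebook with one tagged configuration cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["developement"]},
            "source": ["%env SPARK_MASTER=local[4]"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["import os"],
        },
    ],
}

production = drop_tagged_cells(notebook, "developement")
for cell in production["cells"]:
    print("".join(cell["source"]))
```

A tag that is spelled differently in the notebook and in the filter matches nothing, and the local configuration cell would then end up in the production script.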

I recommend not tracking the jobs directory in VCS, since it is generated. Remember to copy all custom modules as well, using the cp command. There are many ways to build a Docker image to run the jobs. The simplest option is to reuse the same Jupyter notebook image that you used for local development. You have to maintain a requirements.txt file with a global list of required dependencies.

FROM jupyter/pyspark-notebook:3395de4db93a
COPY jobs/requirements.txt jobs/
RUN conda install --quiet --yes \
    --file jobs/requirements.txt && \
    conda clean --all -f -y && \
    fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"
COPY jobs/ jobs/
ENTRYPOINT ["/usr/local/bin/start.sh"]
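CMD ["ls", "-la", "jobs/"]

# ${schedule_date+1d} below is a placeholder for the date your scheduler supplies (schedule date plus one day), not literal shell syntax
docker run \
-e DB_URL=jdbc:postgresql://my.production.db.server.at.kyriba/myproddb \
-e SPARK_MASTER=my.prod.spark.cluster.at.kyriba \
-e PREDICTION_DATE_START=${schedule_date+1d} \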
--rm my.docker.registry/forecasting:0.1.0-snap-1-g26a9ca9 \
python jobs/Spark\ example.py
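
If your scheduler can run Python, the run command above could be assembled as in the following sketch. It assumes that the date placeholder means "one day after the scheduled run date"; the function build_job_command is a hypothetical name, and the image and script values are simply the ones from the example command.

```python
import shlex
from datetime import date, timedelta


def build_job_command(schedule_date: date, image: str, script: str) -> list:
    """Build the docker run argument list for one scheduled execution."""
    # Assumption: the prediction window starts the day after the scheduled run.
    prediction_start = schedule_date + timedelta(days=1)
    return [
        "docker", "run", "--rm",
        "-e", "DB_URL=jdbc:postgresql://my.production.db.server.at.kyriba/myproddb",
        "-e", "SPARK_MASTER=my.prod.spark.cluster.at.kyriba",
        "-e", f"PREDICTION_DATE_START={prediction_start.isoformat()}",
        image,
        "python", script,
    ]


command = build_job_command(
    schedule_date=date(2022, 1, 16),
    image="my.docker.registry/forecasting:0.1.0-snap-1-g26a9ca9",
    script="jobs/Spark example.py",
)
# shlex.join quotes arguments that contain spaces, such as the script path.
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would avoid shell quoting problems with the space in the script name.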

Such a pipeline lets you ensure code consistency between your development and production environments. The code is modular, tested and easily configurable to run in any environment. You can reuse the same approach in different projects, as the boilerplate is minimal.

Even though this article mostly covers the scenario of running scheduled jobs, some of the techniques could be adapted to offline training and online model serving pipelines.