In-Depth Tutorial - Productionising a Classifier
In this section, we’re going to productionise a Random Forest classifier written with sklearn, deploy it to the cloud, and use it in a more sophisticated workflow.
By the end of the tutorial, you will have learned how to build modules with dependencies, write more sophisticated workflows, and build abstractions over data sources. Enjoy!
So far, we have built and published a Python module with a single function on it, numChars, and built a workflow which connects our function to an HTTP endpoint. This in itself isn’t particularly useful, so, now that you’ve got the gist of how NStack works, it’s time to build something more realistic!
In this tutorial, we’re going to create and productionise a simple classifier which uses the famous iris dataset. We’re going to train our classifier to classify which species an iris is, given measurements of its sepals and petals. You can find the dataset we’re using to train our model here.
First, let’s look at the format of our data to see how we should approach the problem. We see that we have five fields:
| Description | Type |
|---|---|
| The species of iris | Text |
| The width of the sepal | Double |
| The length of the sepal | Double |
| The width of the petal | Double |
| The length of the petal | Double |
Since we want to predict the species from the sepal and petal measurements, the four measurements are the input to our classifier module, and the species, as text, is the output. This means we need to write a function in Python which takes four Doubles and returns Text.
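To make that shape concrete, here is a minimal sketch that parses rows in the five-field layout and splits each one into four floats plus a text label. The column names are the ones used by the training code later in this tutorial; the two sample rows are illustrative values typed in by hand, not an excerpt of train.csv, and split_row is a hypothetical helper.

```python
import csv
import io

# Column names as used by the training code later in this tutorial.
FEATURES = ["petal_length", "petal_width", "sepal_length", "sepal_width"]
LABEL = "class"

# Two hand-typed rows in the five-field layout (illustrative values only).
SAMPLE_CSV = """petal_length,petal_width,sepal_length,sepal_width,class
1.4,0.2,5.1,3.5,Iris-setosa
4.7,1.4,6.1,2.9,Iris-versicolor
"""


def split_row(row):
    """Split one CSV row into (four Doubles, Text)."""
    measurements = tuple(float(row[name]) for name in FEATURES)
    return measurements, row[LABEL]


rows = [split_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for measurements, species in rows:
    print(measurements, "->", species)
```

The four floats are what the classifier will receive, and the label is what it should return.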
Creating your classifier module
To begin, let’s make a new directory called Iris.Classify, cd into it, and initialise a new Python module:
~/ $ mkdir Iris.Classify; cd Iris.Classify
~/Iris.Classify/ $ nstack init --language python
python module 'Iris.Classify' successfully initialised at ~/Iris.Classify
Next, let’s download our training data into this directory so we can use it in our module. We have hosted it for you as a CSV on GitHub.
~/Iris.Classify/ $ curl -O https://raw.githubusercontent.com/nstackcom/nstack-examples/master/iris/Iris.Classify/train.csv
Defining our API
As we know what the input and output of our classifier are going to look like, let’s edit module.nml to define our API (i.e. the entry-point into our module). By default, a new module contains a sample function, numChars, which we replace with our own definition. We’re going to call the function we write in Python predict, which means we write our module.nml as follows:
module Iris.Classify:0.1.0

fun predict : (Double, Double, Double, Double) -> Text
This means we want to productionise a single function, predict, which takes four Doubles (the measurements) and returns Text (the iris species).
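Seen from the Python side, this signature says that predict receives four numbers and returns a string. The sketch below mirrors that contract with type hints and a small validator. check_input and the Measurements alias are hypothetical names introduced for illustration; they are not part of NStack.

```python
from typing import Tuple

# Python-side view of: fun predict : (Double, Double, Double, Double) -> Text
Measurements = Tuple[float, float, float, float]


def check_input(values) -> Measurements:
    """Validate that `values` holds exactly four numbers and return them as floats.

    Hypothetical helper for illustration only; it is not part of NStack.
    """
    values = tuple(values)
    if len(values) != 4:
        raise ValueError(f"expected 4 measurements, got {len(values)}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in values):
        raise TypeError("every measurement must be a number")
    return tuple(float(v) for v in values)


checked = check_input([4.7, 1.4, 6.1, 2.9])
print(checked)
```

A check like this can be handy when exercising the function locally, before it is wired into a workflow.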
Writing our classifier
Now that we’ve defined our API, let’s jump into our Python module, which lives in
We see that NStack has created a class Module. This is where we add the functions for our module. Right now it also has a sample function in it, numChars, which we can remove.
Let’s import the libraries we’re using.
import nstack
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
Python modules must also import
Before we add our predict function, we’re going to add __init__, the Python constructor function which runs upon the creation of our module. It’s going to load our data from train.csv, and use it to train our Random Forest classifier:
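def __init__(self):
    # Load the training data that ships alongside the module.
    train = pd.read_csv("train.csv")
    # Feature columns, in the order predict() receives its inputs.
    self.cols = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
    # .values replaces DataFrame.as_matrix, which newer pandas releases removed.
    trainArr = train[self.cols].values
    trainRes = train['class'].values  # 1-D array of species labels
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(trainArr, trainRes)
    self.rf = rf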
Now we can write our predict function. The second argument, inputArr, is the input: in this case, our four Doubles. To return Text, we simply return a string from the Python function.
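def predict(self, inputArr):
    # Wrap the four measurements as a single-row DataFrame.
    points = [inputArr]
    df = pd.DataFrame(points, columns=self.cols)
    results = self.rf.predict(df)
    # .item() unwraps the one-element result array into a plain string.
    return results.item()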
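Before building the module, it can be useful to exercise the same train-then-predict logic locally. The following is a minimal sketch under a few assumptions: LocalClassifier is a hypothetical stand-in for the Module class with no NStack base class, the nine training points are hand-typed illustrative values rather than the contents of train.csv, and random_state is fixed only to make the run repeatable. It needs scikit-learn installed.

```python
from sklearn.ensemble import RandomForestClassifier

# Hand-typed, well-separated points (illustrative only, NOT train.csv).
# Column order: petal_length, petal_width, sepal_length, sepal_width.
TRAIN_X = [
    [1.4, 0.2, 5.1, 3.5], [1.3, 0.2, 4.7, 3.2], [1.5, 0.2, 5.0, 3.4],
    [4.7, 1.4, 6.1, 2.9], [4.5, 1.5, 6.4, 3.2], [4.9, 1.5, 6.9, 3.1],
    [6.0, 2.5, 6.3, 3.3], [5.9, 2.1, 7.1, 3.0], [5.6, 1.8, 6.3, 2.9],
]
TRAIN_Y = ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3


class LocalClassifier:
    """Hypothetical stand-in for the tutorial's Module class, without NStack."""

    def __init__(self, features, labels):
        # random_state is fixed here only so repeated runs give the same forest.
        self.rf = RandomForestClassifier(n_estimators=100, random_state=0)
        self.rf.fit(features, labels)

    def predict(self, inputArr):
        # Wrap the four measurements as a single-row batch.
        results = self.rf.predict([list(inputArr)])
        # .item() converts the one-element NumPy array into a plain Python str.
        return results.item()


clf = LocalClassifier(TRAIN_X, TRAIN_Y)
species = clf.predict((1.4, 0.2, 5.1, 3.5))
print(species)
```

Returning a plain str rather than a NumPy value keeps the result in line with the Text output type declared in module.nml.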
Configuring your module
When your module is started, it is run in a Linux container on the NStack server. Because our module uses libraries like sklearn, we have to tell NStack to install some extra operating system libraries inside your module’s container. NStack lets us specify these in our nstack.yaml configuration file in the packages section. Let’s add the following packages:
packages: ['numpy', 'python3-scikit-learn', 'scipy', 'python3-scikit-image', 'python3-pandas']
Additionally, we want to tell NStack to copy our
train.csv file into our module, so we can use it in
nstack.yaml also has a section for specifying files you’d like to include:
Publishing and starting
Now we’re ready to build and publish our classifier. Remember, even though we run this command locally, our module gets built and published on your NStack server in the cloud.
~/Iris.Classify/ $ nstack build
Building NStack Container module Iris.Classify. Please wait. This may take some time.
Module Iris.Classify built successfully. Use `nstack list functions` to see all available functions.
We can now see Iris.Classify.predict in the list of existing functions (along with previously built functions like Demo.numChars):
~/Iris.Classify/ $ nstack list functions
Iris.Classify:0.0.1-SNAPSHOT
    predict : (Double, Double, Double, Double) -> Text
Demo:0.0.1-SNAPSHOT
    numChars : Text -> Integer
Our classifier is now published, but to use it we need to connect it to an event source and sink. In the previous tutorial, we used HTTP as a source, and the NStack log as a sink. We can do the same here. This time, instead of creating a workflow module right away, we can use nstack’s notebook command to test our workflow first.
notebook opens an interactive shell where we can write our workflow. When we are finished, we can
~/Iris.Classify/ $ nstack notebook
import Iris.Classify:0.0.1-SNAPSHOT as Classifier;
Sources.http<(Double, Double, Double, Double)> | Classifier.predict | Sinks.log<Text>
[Ctrl-D]
This creates an HTTP endpoint on http://localhost:8080/irisendpoint which can receive four Doubles, and writes the results to the log as Text. Let’s check it is running as a process:
~/Iris.Classify/ $ nstack ps
1
2
In this instance, it is running as process
2. We can test our classifier by sending it some of the sample data from
~/Iris.Classify/ $ nstack send "/irisendpoint" '[4.7, 1.4, 6.1, 2.9]'
Message Accepted
~/Iris.Classify/ $ nstack log 2
Feb 17 10:32:30 nostromo nstack-server: OUTPUT: "Iris-versicolor"
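One way to picture what the workflow does with that message is as three plain functions chained left to right, mirroring source | function | sink. The sketch below is a conceptual model only, not how NStack is implemented: http_source, stub_predict, log_sink and run_workflow are hypothetical names, the JSON array body is assumed from the argument passed to nstack send above, and stub_predict is a fixed rule standing in for the trained classifier.

```python
import json


def http_source(body):
    """Decode a JSON array body into a tuple of four floats (assumed format)."""
    values = json.loads(body)
    if len(values) != 4:
        raise ValueError("expected four measurements")
    return tuple(float(v) for v in values)


def stub_predict(measurements):
    """Placeholder for Classifier.predict: a fixed rule, not the trained model."""
    petal_length = measurements[0]
    return "Iris-setosa" if petal_length < 2.5 else "Iris-versicolor"


def log_sink(text, log):
    """Append the output to a list standing in for the NStack log."""
    log.append("OUTPUT: " + json.dumps(text))


def run_workflow(body, log):
    # Source, then function, then sink: the same order as the `|` pipeline.
    log_sink(stub_predict(http_source(body)), log)


log = []
run_workflow("[4.7, 1.4, 6.1, 2.9]", log)
print(log[0])
```

Each stage only needs to agree with its neighbour on the type it hands over, which is what the type annotations on the source and sink in the workflow express.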
Our classifier is now productionised.