In this post, we will continue where our previous post left off and tackle the next step in building a machine learning project: making our first fully functional training/evaluation pipeline.
As a quick refresher, remember that our goal is to apply a data-driven solution to the problem of fake news detection, taking it from initial setup through to deployment. The phases we will conduct include the following:
3. Building and testing the pipeline with a v1 model (this post!)
This article will focus a lot on the machine learning modelling aspects but will also emphasize how to engineer a good pipeline using appropriate abstractions. Full source code is here.
Data Preprocessing
As a first step in building our model, we need to get data in the right format for ingestion into the training pipeline. Recall that in the last post, we had done some initial exploratory analysis to understand our dataset's characteristics.
Now we want to take those data insights and clean and preprocess the initial raw data into the X -> y mapping we need for supervised learning.
To do this, we will define a number of stand-alone scripts that do this cleaning and preprocessing.
We use stand-alone scripts because this makes it easier to separate out pieces of functionality that can eventually be used as executable stages in our pipeline. In practice, these stages are eventually automated as workflows with tools like Airflow.
We first define a script called normalize_and_clean_data.py (full reference here). This script first reads in our data:
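Roughly, that reading step boils down to something like the following sketch (the column names here are shorthand for the dataset's schema; the authoritative list lives in the script itself):

```python
import pandas as pd

# Shorthand column names for the raw tab-separated splits (illustrative, not exhaustive).
COLUMNS = ["id", "label", "statement", "subject", "speaker", "speaker_title",
           "state_info", "party_affiliation", "barely_true_count", "false_count",
           "half_true_count", "mostly_true_count", "pants_fire_count", "context"]

def read_data(path):
    """Read one raw data split into a list of example dicts."""
    df = pd.read_csv(path, sep="\t", names=COLUMNS)
    return df.to_dict(orient="records")
```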
and then performs the normalization and cleaning:
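In sketch form, the cleaning pass simply applies a field-specific function to each field of each example (the exact function names live in the script):

```python
def normalize_and_clean(examples, clean_functions):
    """Apply each field-specific cleaning function to its field.

    `clean_functions` maps a field name (e.g. "speaker_title") to a cleaning
    function like the normalize_and_clean_speaker_title example below.
    """
    for example in examples:
        for field, clean_function in clean_functions.items():
            example[field] = clean_function(example[field])
    return examples
```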
Each normalize_and_clean function cleans one field from the original dataset. As an example, for the speaker_title field we have:
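A minimal sketch of such a function (the canonical mapping shown is a tiny illustrative subset of the one in the repository):

```python
# Illustrative subset of the canonical mapping built by looking through the data.
CANONICAL_SPEAKER_TITLES = {
    "talks show host": "talk show host",
}

def normalize_and_clean_speaker_title(speaker_title):
    """Lowercase, strip whitespace, replace dashes, then map to a canonical form."""
    cleaned = speaker_title.lower().strip().replace("-", " ")
    return CANONICAL_SPEAKER_TITLES.get(cleaned, cleaned)
```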
To produce consistent entries, this function lowercases the field, strips off any whitespace, and then replaces certain characters like dashes.
Afterwards, to ensure that terms which are semantically equivalent but spelled differently (something that happened quite a bit in the original dataset) are mapped identically, we convert them to a canonical form through a mapping we created by looking through the data. This makes it so that terms like talks show host and talk show host both map to talk show host.
Sometimes during model training, we will want to extract some metrics from our data and use them as additional features. These could be something like historical user activity over the last year for a recommendation system.
To make this concrete, for our use case we discovered during exploratory data analysis that distributions of "credit history" counts could be useful information to provide our model. Because the spread of counts can be pretty wide, we bin the values into 10 bins.
Because our offline feature extraction could be an expensive operation (it isn't here but let's exercise good engineering practice), we separate out the bin-forming into its own script compute_credit_bins.py whose main functionality is:
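In essence, the script computes 10 field-specific bin edges from the training split and dumps them to a JSON file the featurizer can load later. A sketch, assuming the cleaned splits are stored as tab-separated files and the credit fields are named as below:

```python
import json

import numpy as np
import pandas as pd

# Assumed names for the credit-history count fields.
CREDIT_FIELDS = ["barely_true_count", "false_count", "half_true_count",
                 "mostly_true_count", "pants_fire_count"]

def compute_credit_bins(train_path, output_path, num_bins=10):
    """Compute field-specific bin edges on the training split and write them to JSON."""
    train_df = pd.read_csv(train_path, sep="\t")
    credit_bins = {
        field: np.histogram_bin_edges(train_df[field], bins=num_bins).tolist()
        for field in CREDIT_FIELDS
    }
    with open(output_path, "w") as f:
        json.dump(credit_bins, f)
```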
One engineering design decision that's worth noting here: for each of these cleaning and processing steps, we ensure that the scripts operate on input files and produce output files. This guarantees that our inputs remain immutable and our stage is easily reproducible.
With that, we are ready to get our hands dirty with the model (i.e. the "real" machine learning).
Defining the Model
When defining the model, we certainly have a number of choices around architecture and model class. The exact model we choose is a function of a number of factors such as optimal performance on downstream metrics but also practical considerations like resource constraints or latency.
In the early stages of a project, one of the most important considerations is not to focus too much on getting the best-performing model.
Rather, the goal is to get some model fully integrated into a pipeline where it is serving requests from users, so that you can actually measure the product-level metrics you care about (user engagement, click-through rate, etc.). These metrics will inform any further model development efforts.
So in the interest of getting some model up that performs reasonably well, we will choose a random forest as our baseline model. You can certainly use something even more simplistic (like a manual, rule-based system) if that allows you to get it fully integrated in your downstream application more quickly.
For our purposes, a random forest is relatively easy to get set up and it is often used as a go-to model class for any new problem. Random forests can be quite competitive and even state-of-the-art if tuned correctly.
But before we build our random forest, let's define a general model interface that we want our random forest (and any other models we build) to adhere to. This is an exercise in good software engineering.
Here is our base model class definition that all other models should inherit from. The full definition is here.
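In spirit, the interface looks something like the sketch below (the method names are illustrative; the full definition in the repository is authoritative):

```python
from abc import ABC, abstractmethod

class Model(ABC):
    """Interface every model in the project is expected to implement."""

    @abstractmethod
    def train(self, train_examples, val_examples=None, cache_featurizer=False):
        """Fit the model (and its featurizer) on the training data."""

    @abstractmethod
    def predict(self, examples):
        """Return label predictions for a batch of examples."""

    @abstractmethod
    def compute_metrics(self, examples):
        """Return a dict of evaluation metrics on the given examples."""

    @abstractmethod
    def save(self, path):
        """Serialize the model to disk."""
```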
Given this model interface combined with the fact that we are leveraging Scikit-Learn to power our model, the actual model definition is not too bad:
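A sketch of what the random forest model might look like, assuming the `Model` interface and the `Featurizer` described later in this post (the real implementation in the repo handles a bit more setup and caching):

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

class RandomForestModel(Model):
    def __init__(self, featurizer, model_params=None):
        self.featurizer = featurizer
        self.model = RandomForestClassifier(**(model_params or {}))

    def train(self, train_examples, val_examples=None, cache_featurizer=False):
        # The core model training really is just these few lines:
        # fit the featurizer, then fit the forest on the featurized data.
        self.featurizer.fit(train_examples)
        labels = [example["label"] for example in train_examples]
        self.model.fit(self.featurizer.featurize(train_examples), labels)

    def predict(self, examples):
        return self.model.predict(self.featurizer.featurize(examples))

    def predict_proba(self, examples):
        return self.model.predict_proba(self.featurizer.featurize(examples))

    def compute_metrics(self, examples):
        labels = [example["label"] for example in examples]
        predictions = self.predict(examples)
        return {"accuracy": accuracy_score(labels, predictions),
                "f1": f1_score(labels, predictions)}

    def save(self, path):
        joblib.dump(self.model, path)
```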
While this seems complex, if you look closely, the bulk of the actual model training is just a few lines. The remainder is making calls to our featurizer (details below) and performing setup/caching.
Features
Machine learning models on their own are useless without a good set of predictive features.
In our case, we want to give our random forest the right signals to learn adequate fake news discriminators.
Here we will want to leverage the insights we extracted about the data during our exploratory data analysis.
We will use two types of features for our initial model: 1) manual features and 2) ngram features.
Ngram features are often useful when dealing with text as they allow us to pick up on certain lexical and linguistic patterns in the data.
We will use tf-idf weights for these features rather than raw ngram counts, as such weights are often used in information retrieval to upweight or downweight the importance of certain words.
A lot of the work done with ngram features is handled for us by libraries (like Scikit-learn).
We will augment this initial feature set with our own manual features. This category of features is where we can really codify our specific data insights.
As an example, we choose to extract and one-hot encode various fields such as speaker, speaker_title, state_info, and party_affiliation.
In addition, one particularly interesting field that we saw during our exploratory data analysis was the "credit history."
These integer-valued counts indicate how many times the given speaker historically made statements that were barely_true, false, half-true, mostly-true, or pants-fire (completely false). We bin these counts into one of 10 intervals, with each interval's edges defined in a field-specific fashion.
Concretely, extracting the manual features looks something like this:
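(A sketch; the field names below are assumptions about the cleaned schema, not the repo's exact identifiers.)

```python
import numpy as np

# Assumed names for the categorical and credit-history fields after cleaning.
CATEGORICAL_FIELDS = ["speaker", "speaker_title", "state_info", "party_affiliation"]
CREDIT_FIELDS = ["barely_true_count", "false_count", "half_true_count",
                 "mostly_true_count", "pants_fire_count"]

def extract_manual_features(example, credit_bins):
    """Map one cleaned example (a dict) to a dict of manual features."""
    features = {}
    # One-hot encode the categorical fields, e.g. {"speaker=barack-obama": 1}.
    for field in CATEGORICAL_FIELDS:
        features[f"{field}={example[field]}"] = 1
    # Bucket each credit-history count into one of the precomputed field-specific bins.
    for field in CREDIT_FIELDS:
        bin_index = int(np.digitize(example[field], credit_bins[field]))
        features[f"{field}_bin={bin_index}"] = 1
    return features
```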
The heart of the featurization code will look something like this (defined using a bunch of Scikit-learn primitives):
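Something along these lines, give or take the exact primitives the repository uses:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def build_combined_featurizer(credit_bins):
    """Concatenate tf-idf ngram features with the manual features via a FeatureUnion."""
    return FeatureUnion([
        ("ngram_features", Pipeline([
            # Select the raw statement text for the tf-idf branch.
            ("select_statement",
             FunctionTransformer(lambda examples: [ex["statement"] for ex in examples])),
            ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ])),
        ("manual_features", Pipeline([
            # Reuse the extract_manual_features helper sketched above.
            ("select_manual",
             FunctionTransformer(lambda examples: [extract_manual_features(ex, credit_bins)
                                                   for ex in examples])),
            ("dict_vectorizer", DictVectorizer()),
        ])),
    ])
```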
One thing to note above is that once we've extracted separate ngram and manual feature vectors, we concatenate them into a single vector (via the feature union). This aggregated vector now stores all the most salient information we want to capture from our data and provide to our model.
A final comment about engineering: we will define a separate `Featurizer` class which will expose an interface for training the featurizer (necessary for the ngram-based weights) and featurizing arbitrary data:
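A minimal sketch of such a class, built on the combined featurizer sketched above:

```python
import joblib

class Featurizer:
    """Thin wrapper exposing train/featurize on top of the Scikit-learn FeatureUnion."""

    def __init__(self, credit_bins):
        self.feature_union = build_combined_featurizer(credit_bins)

    def fit(self, examples):
        # Learn the tf-idf vocabulary/weights and the manual feature space.
        self.feature_union.fit(examples)

    def featurize(self, examples):
        # Map raw examples to the concatenated feature matrix.
        return self.feature_union.transform(examples)

    def save(self, path):
        joblib.dump(self.feature_union, path)
```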
The Training Pipeline
Ok, so now we have awesome features and a cool model. Let's weave them together into a functional training/evaluation pipeline.
One important step in any good training pipeline is making it easily configurable, i.e. making it easy to plug-and-play different components/attributes of the pipeline such as model, featurization, or data parameters.
You typically make these pipelines configurable by either passing in commandline-level arguments or by providing a configuration file.
I personally prefer the configuration route because it allows the actual configuration file to be version-controlled and tracked in the code repository.
Additionally, having more than a handful of commandline arguments quickly becomes difficult to deal with.
We will configure our pipeline using a JSON-formatted file. You can, of course, use other formats like YAML, etc. but I have a personal bias for JSON because I find it easier to read.
Another file format that I personally recommend because it's like JSON++ (with support for commenting, templating, etc.) is Jsonnet, though I've stuck with JSON for simplicity's sake. For our random forest model, our config will look something like this:
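For example, a config along these lines (the exact keys in the repo may differ; these are illustrative):

```json
{
    "model": "random_forest",
    "train_data_path": "data/processed/train.tsv",
    "val_data_path": "data/processed/val.tsv",
    "credit_bins_path": "data/processed/credit_bins.json",
    "model_output_path": "models/random_forest",
    "featurizer_output_path": "models/featurizer",
    "evaluate": false,
    "params": {}
}
```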
The definition is relatively straightforward. We define our model type, where we should pull data from, where we should write the model and featurizer to, whether we are in evaluate or train mode, and any model-specific parameters.
We have kept the params empty here because we are using the Scikit-learn random forest defaults, but you can use this field to specify anything you want to play with (e.g. number of trees, splitting criterion, etc.).
We next move to our train pipeline, which will be responsible for stitching together everything into an executable series of steps. We've done the bulk of the heavy-lifting in previous sections, so the pipeline is largely boilerplate:
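A condensed sketch of what that script does, using the helpers and config keys sketched earlier (the `load_examples` function is a hypothetical stand-in for reading a cleaned split into a list of dicts, and the flag name is illustrative):

```python
import argparse
import json

import mlflow

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file", required=True)  # hypothetical flag name
    args = parser.parse_args()

    with open(args.config_file) as f:
        config = json.load(f)

    # Load the cleaned data splits produced by the preprocessing scripts.
    train_examples = load_examples(config["train_data_path"])
    val_examples = load_examples(config["val_data_path"])

    with open(config["credit_bins_path"]) as f:
        credit_bins = json.load(f)

    model = RandomForestModel(Featurizer(credit_bins), config.get("params", {}))

    # Track parameters and metrics with MLFlow.
    with mlflow.start_run():
        mlflow.log_params(config.get("params", {}))
        # (The evaluate-only branch, which loads a saved model instead of training,
        # is omitted from this sketch.)
        model.train(train_examples, val_examples)
        model.save(config["model_output_path"])
        metrics = model.compute_metrics(val_examples)
        mlflow.log_metrics(metrics)
        print(metrics)

if __name__ == "__main__":
    main()
```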
While the script seems a tad long, there's really not much going on here: read the config, load up the data, train the model, evaluate the model, and log a bunch of stuff.
Here we are using the tracking API from MLFlow which is a super handy way to monitor training parameters, metrics, etc.
I personally find it more feature-rich and general than TensorBoard, but use what you're comfortable with.
Functionality Tests
So now we've written all this stuff. How do we know any of it even works?
We can certainly run the full pipeline (and we will!) but are we really going to have to retrain a model from scratch every time we make a small change to featurization code?
This quickly becomes impractical, especially if we are working with a team on a shared codebase (i.e. does everyone need to retrain a model locally if I make a small featurization change?)
To get around that problem, we need to include functionality tests. This is a very common practice in broader software engineering but sadly not something I've seen a lot of in the machine learning community.
There are a few resources that do a good job of covering the different kinds of testing in machine learning, so check them out for further details.
For our purposes we will write a few different functionality tests.
First off we will test our featurization code. This is super important.
Incorrect featurization means a confused model means a confused human.
We will test each of our normalization functions:
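For instance, a couple of assertions against the speaker_title cleaner sketched earlier (written for pytest):

```python
def test_normalize_and_clean_speaker_title():
    # Casing, stray whitespace, and dashes are normalized away...
    assert normalize_and_clean_speaker_title("  Talk-Show Host ") == "talk show host"
    # ...and near-duplicate spellings collapse to one canonical form.
    assert normalize_and_clean_speaker_title("talks show host") == "talk show host"
```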
We will also test our modelling code. Here we will check that the outputs of our functions have the appropriate shape, that they are in the correct range (i.e. less than or equal to 1 if they are probabilities), and that we can overfit a small train set:
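A sketch of what those checks might look like, assuming pytest fixtures named `model` and `tiny_dataset` that provide an untrained binary classifier and a handful of labeled examples:

```python
import numpy as np

def test_predicted_probabilities_have_expected_shape_and_range(model, tiny_dataset):
    model.train(tiny_dataset)
    probabilities = model.predict_proba(tiny_dataset)
    # One row per example, one column per class, all values valid probabilities.
    assert probabilities.shape == (len(tiny_dataset), 2)
    assert np.all((probabilities >= 0.0) & (probabilities <= 1.0))

def test_model_can_overfit_small_train_set(model, tiny_dataset):
    model.train(tiny_dataset)
    # A model with enough capacity should (nearly) memorize a handful of examples.
    assert model.compute_metrics(tiny_dataset)["accuracy"] > 0.95
```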
One additional, very important kind of test we will employ is data integrity testing.
What we mean here is tests that check whether your data source has the expected format, types, range of values, etc.
If the data doesn't have the format you expect, the rest of your pipeline will break, since data sits at the very start of your training pipeline.
Data testing becomes especially important when you have a continuous ETL pipeline running that is ingesting, processing, and dumping new data periodically.
While here we are dealing with a static research dataset, we will go through the exercise of writing data integrity tests anyway.
To do this, we will use the library Great Expectations. It allows you to specify what you expect of your data in neat snippets of JSON.
For our purposes, we will check a few things in our data:
1. All of our data splits have the expected fields (columns)
2. The statement field is defined (has a length of at least 1)
3. Each of the "credit history" columns is greater than or equal to 0
4. The labels are booleans
In Great Expectations syntax that looks something like this:
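Roughly, an expectation suite like the following (a trimmed-down illustration; the expectation types are standard Great Expectations ones, but the column names follow this post's assumed schema):

```json
{
    "expectation_suite_name": "fake_news_data_suite",
    "expectations": [
        {
            "expectation_type": "expect_column_to_exist",
            "kwargs": {"column": "statement"}
        },
        {
            "expectation_type": "expect_column_value_lengths_to_be_between",
            "kwargs": {"column": "statement", "min_value": 1}
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "barely_true_count", "min_value": 0}
        },
        {
            "expectation_type": "expect_column_values_to_be_of_type",
            "kwargs": {"column": "label", "type_": "bool"}
        }
    ]
}
```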
Given this data suite, we will execute it programmatically with a Python script:
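Something like the sketch below, using the older PandasDataset-style API (the exact call signatures depend on your Great Expectations version, so treat this as an outline rather than a drop-in script):

```python
import json
import sys

import great_expectations as ge

def validate_split(data_path, suite_path):
    """Validate one data split against the expectation suite and return success/failure."""
    dataset = ge.read_csv(data_path, sep="\t")  # assumes tab-separated cleaned splits
    with open(suite_path) as f:
        suite = json.load(f)
    results = dataset.validate(expectation_suite=suite)
    return results["success"]

if __name__ == "__main__":
    # Usage: python run_data_validation.py <data_path> <suite_path>
    if not validate_split(sys.argv[1], sys.argv[2]):
        sys.exit(1)  # non-zero exit code so CI can fail the build
```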
Note that Great Expectations also supports running your data suite directly from the commandline, but we are using a Python script we can invoke manually to make our lives a bit easier when we build a continuous integration system (in a later post of the series).
Putting It All Together
With all this piping in place, we can finally train our system:
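For example, an invocation along these lines (the entry-point name and flag are the hypothetical ones used in the sketches above):

```bash
python train.py --config-file config/random_forest.json
```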
Our output will look something like this:
And if run on the test set:
According to the results in Table 2 of the original paper, we outperform all of the models they analyzed. Nice!
And with that, we are done with this super long post. We've come a long way, going from a handful of data insights after our exploratory data analysis to a fully-fledged pipeline that trains a very competitive model on our dataset.
Taking a step back, remember that our goal was not to build the best model we could.
Our goal was to get some reasonably good model that we could expose to users to deliver value and initiate a data flywheel.
We could certainly eke out more performance from this model by performing hyperparameter tuning, playing with more features, etc. But that is left as an exercise for the reader.
In the next post, we'll look at analyzing our model errors and building a v2 model!
Reproduced with permission from this post.