Data Science Project Lifecycle

Ever wondered what a Data Science / Deep Learning project looks like in an industrial environment? In this article, I will walk you through the typical lifecycle of such a project. This lifecycle consists of six steps:

  • Goal Setting
  • Data Exploration
  • Data Preprocessing
  • Model Prototyping
  • Model Evaluation
  • Deployment to Production

Before we focus on these six steps, let us first discuss what a project lifecycle is.

What is Project Lifecycle?

The work of a deep learning engineer is very similar to the work of a software developer. At first glance, this is not that obvious: the product of software developers is code, while the product of deep learning engineers is the insights they extract from data.

However, to arrive at these useful insights, you have to write code. And these insights might be wrapped in the code of a software product to make them easily consumable.

In software development, there is a more or less standard project lifecycle that has evolved over the decades. This lifecycle includes standard processes such as planning, development, testing, integration, and deployment.

Deep learning, as a young and emerging field, has borrowed some best practices from software developers in order to improve its own work process, including a template for the project lifecycle.

A lifecycle is the definition, execution, and automation of business processes towards the goal of coordinating tasks and information between people and systems.

A predefined lifecycle helps software developers and deep learning engineers alike keep a clear structure during their projects and achieve the project's goal.

Of course, there is no template for solving a deep learning problem, nor a perfect lifecycle. A good lifecycle for a particular team depends on the tasks, goals, and values of that team: whether they want to make their work faster, more efficient, compliant, agile, transparent, or reproducible. There is often a tradeoff between different goals and values: do I want to get something done quickly, or do I want to invest time now to make sure that it can be done quickly next time?


The roadmap changes with every new dataset and new problem. But in general, every data science or deep learning project you will encounter in your career will follow more or less the lifecycle template we are going to discuss in this article.

This project lifecycle consists of several parts. Some are executed one after the other, while others form an iterative process. In either case, you undertake the steps of this lifecycle to achieve a certain result. This result can be a service for your client or even an entire software product.


1st Step: The Objective

Goal Setting

Every deep learning project starts with an objective. When solving a specific deep learning problem, the first question you will encounter is always: what is actually the problem you are trying to solve?

You identify your problem and set a specific goal you want to achieve. Goal-setting is probably the most important step of any project. If we don't state a clear goal, co-workers in a project won't be able to collaborate, actions won't be aligned, and the goal of the project is unlikely to be achieved.

Accordingly, every deep learning project aims to fulfill an objective. The range of goals can vary from enhancing existing models to developing completely new ones. In any case, accurate goal-setting is paramount for any project.

Deep learning projects are well suited for accurate goal-setting because we evaluate the results of our model with a performance metric and can see whether the necessary performance is achieved.

A particular goal-setting method that can be used effectively in a deep learning project is called Objectives and Key Results (OKRs).


2nd Step: Data Exploration

Data exploration is usually the second step in a deep learning project. When training a neural network, you must always keep in mind that the quality of the training data determines the quality of your neural network model.

The data you will encounter in practice will, in most cases, not be clean. This means the data will contain non-uniform data formats, missing values, outliers, and features with very different ranges. In short, the data is not ready to be used as training data for your model.

For this reason, the data must be preprocessed in various ways. But before we can do that, we must first identify the issues I just mentioned. This is where the data exploration phase comes into play. During the exploration phase, you want to visualize your data in various ways and compute basic statistics on it.

For example, a simple 1D scatter plot can be used to visualize a feature from the dataset:

Scatter Plot

A scatter plot gives us a very simple visual overview of the values within a feature.
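As a quick illustration, here is a minimal sketch of such a 1D scatter plot using matplotlib; the feature values are made up for the example:

```python
# A minimal sketch: 1D scatter plot of a single feature
import numpy as np
import matplotlib.pyplot as plt

# hypothetical feature values, e.g. customer ages
feature = np.random.normal(loc=40, scale=12, size=200)

# plot every value at the same y position to get a 1D scatter plot
plt.scatter(feature, np.zeros_like(feature), alpha=0.3)
plt.xlabel("feature value")
plt.yticks([])
plt.show()
```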

We can take a more statistical approach to data visualization by using box plots. By means of a box plot, we gain knowledge about some statistical properties of the data we visualize, such as the median, quartile ranges, and outliers:

Box Plots
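A box plot of a few made-up features could be drawn like this (again just a sketch with matplotlib):

```python
# A minimal sketch: box plots of a few made-up features
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = [rng.normal(0, 1, 200),       # hypothetical feature A
        rng.exponential(1.0, 200),   # hypothetical feature B
        rng.uniform(-2, 2, 200)]     # hypothetical feature C

plt.boxplot(data)
plt.xticks([1, 2, 3], ["feature_a", "feature_b", "feature_c"])
plt.ylabel("value")
plt.show()
```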

Another common way to visualize the data is a histogram. As with a 1D scatter plot, we obtain a good overview of the range of the given data as well as potential outliers. In addition, this kind of visualization gives us an accurate representation of the distribution of the given feature values:

Histogram
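And a histogram of a single (made-up) feature is just as easy to produce:

```python
# A minimal sketch: histogram of a single feature
import numpy as np
import matplotlib.pyplot as plt

feature = np.random.normal(loc=40, scale=12, size=1000)  # hypothetical values

plt.hist(feature, bins=30)
plt.xlabel("feature value")
plt.ylabel("count")
plt.show()
```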


3rd Step: Data Preprocessing

Any problems in your dataset that you discovered in the exploration phase can be dealt with in the preprocessing phase. Here you take care of missing values, different feature ranges, and different data formats, and perform the necessary data transformations.

In short, this is the step where you make your data ready for training.

Different data types require different preprocessing methods. Take numeric data as an example.

Numeric Data

Numeric data represents data in the form of scalar values. These scalar values have a continuous range, meaning there is an infinite number of possible values. Integers and floating-point numbers are the most commonly used numeric data types.

Although numeric data can be fed directly into neural network or machine learning models, the data may require some preprocessing steps, for example, if the ranges of the input features are very different. Very different feature ranges will cause problems during training.

To prevent this problem, we must scale and standardize the ranges of these features. For this, we can use several different feature scaling methods, which will be introduced in the following articles.
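As a small sketch of what such scaling could look like, here is standardization with scikit-learn's StandardScaler; the feature values are made up:

```python
# A minimal sketch: standardizing feature ranges with scikit-learn
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical features with very different ranges (e.g. age vs. yearly income)
X = np.array([[25,  30_000.0],
              [40,  85_000.0],
              [58, 120_000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance
print(X_scaled)
```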

Categorical String Data

Another data type that requires preprocessing is categorical data. Categorical data is data that generally takes a limited number of possible values, such as age group, gender, country, and so on.

Usually, you encounter categorical data as a string data type in your dataset. Some examples would be:

  • “Country”
  • “Age”
  • “Gender”
  • “Job”

Take the feature "Country" as an example. This feature could contain string values like “Germany”, “USA”, “France”, etc. Of course, neural networks or traditional machine learning models cannot work with string data directly, so you need to apply some transformations to that data before you can train a neural network or a machine learning model on it.

In the case of string data, we must perform the so-called encoding of the data.

Encoding basically means that the string data gets transformed into a numeric data type such as integers or floating-point numbers. For example:

  • “Germany” → 1
  • “USA” → 2
  • “France” → 3
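In code, such an integer encoding could look roughly like this, for example with scikit-learn's LabelEncoder (note that the actual integers it assigns may differ from the illustration above):

```python
# A minimal sketch: encoding country strings as integers
from sklearn.preprocessing import LabelEncoder

countries = ["Germany", "USA", "France", "Germany"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(countries)  # e.g. [1, 2, 0, 1]
print(encoded)
print(encoder.classes_)  # the mapping back to the original strings
```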

Although the string data has now become numeric and could, in principle, be used by a model, the data still remains categorical. Categorical data encoded this way may cause problems during training because the model could interpret “France”, which has a categorical value of 3, as more important than “Germany”, which has a categorical value of 1.

By using these categorical values we would introduce a bias into the learning process, which could disturb the natural learning process of a neural network or machine learning model. At the end of the day, we must let the model decide for itself if and to what extent a particular data sample is important.

The issue of introducing a bias can be solved by performing so-called one-hot encoding of the data. One-hot encoding vectorizes these values so that each category has the same importance in the eyes of a neural network or ML model during training. Basically, this is what happens:

  • “Germany” → 1 → [1,0,0]
  • “USA” → 2 → [0,1,0]
  • “France” → 3 → [0,0,1]
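Here is a minimal sketch of one-hot encoding the same feature, using pandas' get_dummies:

```python
# A minimal sketch: one-hot encoding a categorical string feature
import pandas as pd

df = pd.DataFrame({"Country": ["Germany", "USA", "France"]})

# one new binary column per category
one_hot = pd.get_dummies(df, columns=["Country"])
print(one_hot)
```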

4th Step: Model Prototyping

Rapid Prototyping

After data preprocessing, the main part of your project begins, where you prototype, implement, and train your model.

In this phase, you will experiment a lot with your neural network or machine learning model. This involves trying out many different model architectures, experimenting with the number of layers, the number of neurons, values of the hyperparameters, training time, etc.
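As a rough illustration only, a first prototype might look like the following Keras-style sketch; the layer sizes, input shape, and hyperparameters are placeholders you would experiment with, and X_train / y_train stand in for your preprocessed training data:

```python
# A minimal sketch of a model prototype (binary classification assumed)
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# model.fit(X_train, y_train, epochs=10, validation_split=0.2)
```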

Since this step alone is very extensive, I will describe it in more detail in a later article.


5th Step: Model Evaluation

After you have implemented your neural network model, you can start to deal with the question of whether the goal of your project has been achieved. In other words, you check whether your model fulfills the predefined performance requirements.

The evaluation metrics that you will choose to measure the performance of your model will differ from project to project and will depend on what exactly you want to achieve.

That is the reason why you have to be specifically clear about your goal. The clearer your goal, the more clearly you can define the metrics that measure performance. Some evaluation metrics that can be used to measure performance are listed below (a small code sketch follows the list):

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • AUC Score
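Here is a minimal sketch of how these metrics could be computed with scikit-learn; the labels and scores are made up:

```python
# A minimal sketch: computing common classification metrics
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true   = [0, 0, 1, 1, 1, 0]               # hypothetical ground-truth labels
y_pred   = [0, 1, 1, 1, 0, 0]               # hypothetical predicted classes
y_scores = [0.2, 0.7, 0.9, 0.8, 0.4, 0.1]   # hypothetical predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("AUC Score:", roc_auc_score(y_true, y_scores))
```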

Be careful: some evaluation metrics might suggest your model performs very well, while others may reveal poor performance. For example, in the case of a very imbalanced dataset where one class represents 99% of the data and the other class 1%, accuracy would be the wrong metric to evaluate the performance.


If you have found a suitable neural network model with satisfactory performance that can handle the problem you are dealing with, you can proceed with the deployment step.


6th Step: Deployment to Production

Deployment means that you prepare your model for usage in real life. Up until this point, the model was only executed locally. In the deployment phase, you make your model available to the whole world.

For example, the neural network could run as a microservice in a web cloud, accessible from around the globe and easily scalable to handle a hundred or tens of thousands of simultaneous requests.

For this, the trained model must be exported in a suitable format. For example, in the case of neural network models written in TensorFlow, the format is called SavedModel.
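As a sketch, exporting a trained TensorFlow/Keras model (here assumed to be the `model` from the prototyping step) could look like this:

```python
# A minimal sketch: exporting a trained model as a SavedModel
import tensorflow as tf

tf.saved_model.save(model, "exported_model/1")

# the exported directory can later be reloaded or served, e.g.:
reloaded = tf.saved_model.load("exported_model/1")
```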

Docker Kubernetes AWS

It is common good practice in the industry to use Docker and Kubernetes as the main tools to deploy a model. Docker serves as a containerization engine that packs the model together with all necessary libraries and ships it as one package in an isolated container.

With Kubernetes, the Docker container where the model is running can be easily managed, scaled, and deployed to a web cloud such as Amazon Web Services.
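Once the model is running as a service, clients can send prediction requests over HTTP. Here is a minimal sketch of such a request, assuming the SavedModel is served with TensorFlow Serving behind its REST API; the model name, port, and input values are placeholders:

```python
# A minimal sketch: sending a prediction request to a served model
import requests

payload = {"instances": [[0.1, 0.5, 0.3]]}  # hypothetical input features

response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json=payload,
)
print(response.json())  # e.g. {"predictions": [[0.87]]}
```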

With this last step the lifecycle is complete ;)

Deep Learning Project Lifecycle