Training a custom dlib shape predictor

In this tutorial, you will learn how to train your own custom dlib shape predictor. You’ll then learn how to take your trained dlib shape predictor and use it to predict landmarks on input images and real-time video streams.

Today kicks off a brand new two-part series on training custom shape predictors with dlib:

  1. Part #1: Training a custom dlib shape predictor (today’s tutorial)
  2. Part #2: Tuning dlib shape predictor hyperparameters to balance speed, accuracy, and model size (next week’s tutorial)

Shape predictors, also called landmark predictors, are used to predict key (x, y)-coordinates of a given “shape”.

The most common, well-known shape predictor is dlib’s facial landmark predictor used to localize individual facial structures, including the:

  • Eyes
  • Eyebrows
  • Nose
  • Lips/mouth
  • Jawline

Facial landmarks are used for face alignment (a method to improve face recognition accuracy), building a “drowsiness detector” to detect tired, sleepy drivers behind the wheel, face swapping, virtual makeover applications, and much more.

However, just because facial landmarks are the most popular type of shape predictor doesn’t mean we can’t train a shape predictor to localize other shapes in an image!

For example, you could use a shape predictor to:

  • Automatically localize the four corners of a piece of paper when building a computer vision-based document scanner.
  • Detect the key, structural joints of the human body (feet, knees, elbows, etc.).
  • Localize the tips of your fingers when building an AR/VR application.

Today we’ll be exploring shape predictors in more detail, including how you can train your own custom shape predictor using the dlib library.

To learn how to train your own dlib shape predictor, just keep reading!


Training a custom dlib shape predictor

In the first part of this tutorial, we’ll briefly discuss what shape/landmark predictors are and how they can be used to predict specific locations on structural objects.

From there we’ll review the iBUG 300-W dataset, a common dataset for training shape predictors that localize specific locations on the human face (i.e., facial landmarks).

I’ll then show you how to train your own custom dlib shape predictor, resulting in a model that can balance speed, accuracy, and model size.

Finally, we’ll put our shape predictor to the test and apply it to a set of input images/video streams, demonstrating that our shape predictor is capable of running in real-time.

We’ll wrap up the tutorial with a discussion of next steps.

What are shape/landmark predictors?

Figure 1: Training a custom dlib shape predictor on facial landmarks (image source).

Shape/landmark predictors are used to localize specific (x, y)-coordinates on an input “shape”. The term “shape” is arbitrary, but it’s assumed that the shape is structural in nature.

Examples of structural shapes include:

  • Faces
  • Hands
  • Fingers
  • Toes
  • etc.

For example, faces come in all different shapes and sizes, but they all share common structural characteristics: the eyes are above the nose, the nose is above the mouth, etc.

The goal of shape/landmark predictors is to exploit this structural knowledge and, given enough training data, learn how to automatically predict the location of these structures.

How do shape/landmark predictors work?

Figure 2: How do shape/landmark predictors work? The dlib library implements a shape predictor algorithm with an ensemble of regression trees approach using the method described by Kazemi and Sullivan in their 2014 CVPR paper (image source).

There are a variety of shape predictor algorithms. Exactly which one you use depends on whether:

  • You’re working with 2D or 3D data
  • You need to utilize deep learning
  • Or, if traditional Computer Vision and Machine Learning algorithms will suffice

The shape predictor algorithm implemented in the dlib library comes from Kazemi and Sullivan’s 2014 CVPR paper, One Millisecond Face Alignment with an Ensemble of Regression Trees.

To estimate the landmark locations, the algorithm:

  • Examines a sparse set of input pixel intensities (i.e., the “features” to the input model)
  • Passes the features into an Ensemble of Regression Trees (ERT)
  • Refines the predicted locations to improve accuracy through a cascade of regressors

The end result is a shape predictor that can run in super real-time!

For more details on the inner-workings of the landmark prediction, be sure to refer to Kazemi and Sullivan’s 2014 publication.

The iBUG 300-W dataset

Figure 3: In this tutorial we will use the iBUG 300-W face landmark dataset to learn how to train a custom dlib shape predictor.

To train our custom dlib shape predictor, we’ll be utilizing the iBUG 300-W dataset (but with a twist).

The goal of iBUG-300W is to train a shape predictor capable of localizing each individual facial structure, including the eyes, eyebrows, nose, mouth, and jawline.

Each annotated face in the dataset consists of 68 pairs of integer values; these values are the (x, y)-coordinates of the facial structures depicted in Figure 3 above.

To create the iBUG-300W dataset, researchers manually and painstakingly annotated and labeled each of the 68 coordinates on a total of 7,764 images.

A model trained on iBUG-300W can predict the location of each of these 68 (x, y)-coordinate pairs and can, therefore, localize each of the locations on the face.

That’s all fine and good…

…but what if we wanted to train a shape predictor to localize just the eyes?

How might we go about doing that?

Balancing shape predictor model speed and accuracy

Figure 4: We will train a custom dlib shape/landmark predictor to recognize just eyes in this tutorial.

Let’s suppose for a second that you want to train a custom shape predictor to localize just the location of the eyes.

We would have two options to accomplish this task:

  1. Utilize dlib’s pre-trained facial landmark detector used to localize all facial structures and then discard all localizations except for the eyes.
  2. Train our own custom dlib landmark predictor that returns just the locations of the eyes.

In some cases you may be able to get away with the first option; however, there are two problems there, namely regarding your model speed and your model size.

Model speed: Even though you’re only interested in a subset of the landmark predictions, your model is still responsible for predicting the entire set of landmarks. You can’t just tell your model “Oh hey, just give me those locations, don’t bother computing the rest.” It doesn’t work like that — it’s an “all or nothing” calculation.

Model size: Since your model needs to know how to predict all landmark locations it was trained on, it therefore needs to store quantified information on how to predict each of these locations. The more information it needs to store, the larger your model size is.

Think of your shape predictor model size as a grocery list — out of a list of 20 items, you may only truly need eggs and a gallon of milk, but if you’re heading to the store, you’re going to be purchasing all the items on that list because that’s what your family expects you to do!

The model size is the same way.

Your model doesn’t “care” that you only truly “need” a subset of the landmark predictions; it was trained to predict all of them so you’re going to get all of them in return!

If you only need a subset of specific landmarks you should consider training your own custom shape predictor — you’ll end up with a model that is both smaller and faster.

In the context of today’s tutorial, we’ll be training a custom dlib shape predictor to localize just the eye locations from the iBUG 300-W dataset.

Such a model could be utilized in a virtual makeover application used to apply just eyeliner/mascara or it could be used in a drowsiness detector used to detect tired drivers behind the wheel of a car.

Configuring your dlib development environment

To follow along with today’s tutorial, you will need a virtual environment with the following packages installed:

  • dlib
  • OpenCV
  • imutils

Luckily, each of these packages is pip-installable, but there are a handful of prerequisites, including virtual environments. My dlib and OpenCV installation guides cover these in detail.

Each package is installed with pip install from inside your virtual environment.

The workon command becomes available once you install virtualenv and virtualenvwrapper per either my dlib or OpenCV installation guides.
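Once the packages are installed, a quick check from Python can confirm that they are importable. This is only a minimal sketch: the helper name missing_packages is hypothetical, and the pip package name used for OpenCV (opencv-contrib-python) is an assumption about which OpenCV build you chose.

```python
import importlib.util

# Map each importable module name to the pip package that provides it.
# The OpenCV pip name is an assumption; other OpenCV builds exist.
REQUIRED = {
    "dlib": "dlib",
    "cv2": "opencv-contrib-python",
    "imutils": "imutils",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of required modules that cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("[INFO] missing packages: pip install " + " ".join(missing))
    else:
        print("[INFO] all required packages are importable")
```

If anything is reported as missing, install it inside your virtual environment before continuing.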

Downloading the iBUG 300-W dataset

Before we get too far into this tutorial, take a second now to download the iBUG 300-W dataset (~1.7GB):

You’ll also want to use the “Downloads” section of this blog post to download the source code.

I recommend placing the iBUG 300-W dataset inside the project directory extracted from the .zip associated with the download of this tutorial.

Alternatively (i.e., rather than clicking the hyperlink above), use wget in your terminal to download the dataset directly:

From there you can follow along with the rest of the tutorial.

Project Structure

Assuming you have followed the instructions in the previous section, your project directory is now organized as follows:

The iBUG 300-W dataset is extracted in the ibug_300W_large_face_landmark_dataset/ directory. We will review the following Python scripts in this order:

  1. The XML parsing script: Parses the train/test XML dataset files for eyes-only landmark coordinates.
  2. The training script: Accepts the parsed XML files to train our shape predictor with dlib.
  3. The evaluation script: Calculates the Mean Average Error (MAE) of our custom shape predictor.
  4. The inference script: Performs shape prediction using our custom dlib shape predictor, trained to only recognize eye landmarks.

We’ll begin by inspecting our input XML files in the next section.

Understanding the iBUG-300W XML file structure

We’ll be using the iBUG-300W to train our shape predictor; however, we have a bit of a problem:

iBUG-300W supplies (x, y)-coordinate pairs for all facial structures in the dataset (i.e., eyebrows, eyes, nose, mouth, and jawline)…

…however, we want to train our shape predictor on just the eyes!

So, what are we going to do?

Are we going to find another dataset that doesn’t include the facial structures we don’t care about?

Manually open up the training file and delete the coordinate pairs for the facial structures we don’t need?

Simply give up, take our ball, and go home?

Of course not!

We’re programmers and engineers — all we need is some basic file parsing to create a new training file that includes just the eye coordinates.

To understand how we can do that, let’s first consider how facial landmarks are annotated in the iBUG-300W dataset by examining the labels_ibug_300W_train.xml training file:

All training data in the iBUG-300W dataset is represented by a structured XML file.

Each image has an image tag.

Inside the image tag is a file attribute that points to where the example image file resides on disk.

Additionally, each image has a box element associated with it.

The box element represents the bounding box coordinates of the face in the image. To understand how the box element represents the bounding box of the face, consider its four attributes:

  1. top: The starting y-coordinate of the bounding box.
  2. left: The starting x-coordinate of the bounding box.
  3. width: The width of the bounding box.
  4. height: The height of the bounding box.

Inside the box element we have a total of 68 part elements — these part elements represent the individual (x, y)-coordinates of the facial landmarks in the iBUG-300W dataset.

Notice that each part element has three attributes:

  1. name: The index/name of the specific facial landmark.
  2. x: The x-coordinate of the landmark.
  3. y: The y-coordinate of the landmark.
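To make the image, box, and part hierarchy concrete, here is a small sketch that reads it with Python’s built-in xml.etree.ElementTree module. The sample XML is a hypothetical, shortened stand-in that follows the layout described above; the file path and every coordinate value in it are made up, and read_annotations is an illustrative helper name rather than part of this tutorial’s scripts.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the layout described above.
# The file path and all coordinate values are made up.
SAMPLE_XML = """<dataset>
  <images>
    <image file='example/face_001.jpg'>
      <box top='100' left='80' width='200' height='210'>
        <part name='00' x='85' y='160'/>
        <part name='36' x='130' y='170'/>
        <part name='47' x='220' y='172'/>
      </box>
    </image>
  </images>
</dataset>"""


def read_annotations(xml_text):
    """Return a list of (file, box, parts) tuples, one per annotated face."""
    annotations = []
    root = ET.fromstring(xml_text)
    for image in root.iter("image"):
        for box in image.iter("box"):
            # bounding box attributes: top, left, width, height
            bbox = {k: int(box.get(k)) for k in ("top", "left", "width", "height")}
            # each part holds a landmark name and its (x, y)-coordinate
            parts = {
                int(part.get("name")): (int(part.get("x")), int(part.get("y")))
                for part in box.iter("part")
            }
            annotations.append((image.get("file"), bbox, parts))
    return annotations


if __name__ == "__main__":
    for path, bbox, parts in read_annotations(SAMPLE_XML):
        print(path, bbox, sorted(parts))
```

A full XML parser like this is handy for inspecting the data; for simply filtering out rows, lighter line-by-line processing is enough.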

So, how do these landmarks map to specific facial structures?

The answer lies in the following figure:

Figure 5: Visualizing the 68 facial landmark coordinates from the iBUG 300-W dataset.

The coordinates in Figure 5 are 1-indexed so to map the coordinate name to our XML file, simply subtract 1 from the value (since our XML file is 0-indexed).

Based on the visualization, we can then derive which name coordinates map to which facial structure:

  • The mouth can be accessed through points [48, 68).
  • The right eyebrow through points [17, 22).
  • The left eyebrow through points [22, 27).
  • The right eye using [36, 42).
  • The left eye with [42, 48).
  • The nose using [27, 36).
  • And the jaw via [0, 17).

Since we’re only interested in the eyes, we therefore need to parse out points [36, 48), again keeping in mind that:

  • Our coordinates are zero-indexed in the XML file
  • And the closing parenthesis “)” in [36, 48) is mathematical notation implying “non-inclusive”.
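The index bookkeeping can be sketched in a few lines of Python. The dictionary and the indices_for helper below are illustrative names, not part of dlib; every range is treated as half-open (start included, end excluded), and the nose range ends at 36 so that points 27 through 35 are all covered.

```python
# Half-open (start, end) index ranges for the 68-point annotation scheme,
# using the 0-indexed names found in the XML file.
LANDMARK_RANGES = {
    "jaw": (0, 17),
    "right_eyebrow": (17, 22),
    "left_eyebrow": (22, 27),
    "nose": (27, 36),
    "right_eye": (36, 42),
    "left_eye": (42, 48),
    "mouth": (48, 68),
}


def indices_for(*structures):
    """Return the sorted landmark indices covering the named structures."""
    indices = set()
    for name in structures:
        start, end = LANDMARK_RANGES[name]
        indices.update(range(start, end))  # end is non-inclusive
    return sorted(indices)


if __name__ == "__main__":
    # both eyes together span indices 36 through 47
    print(indices_for("right_eye", "left_eye"))
```

Python’s range uses the same half-open convention, which is why range(36, 48) yields exactly the twelve eye landmarks.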

Now that we understand the structure of the iBUG-300W training file, we can move on to parsing out only the eye coordinates.

Building an “eyes only” shape predictor dataset

Let’s create a Python script to parse the iBUG-300W XML files and extract only the eye coordinates (which we’ll then train a custom dlib shape predictor on in the following section).

Open up the file and we’ll get started:

Lines 2 and 3 import necessary packages.

We’ll use two of Python’s built-in modules: (1) argparse for parsing command line arguments, and (2) re for regular expression matching. If you ever need help developing regular expressions, an interactive online regex tester is a great tool, and many support languages other than Python as well.

Our script requires two command line arguments:

  • --input: The path to our input data split XML file (i.e., from the iBUG 300-W dataset).
  • --output: The path to our output eyes-only XML file.
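A minimal sketch of how these two arguments might be declared with argparse is shown below. The short flags (-i and -o), the help strings, and the example file names passed to parse_args are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Create a parser for the two required command line arguments."""
    parser = argparse.ArgumentParser(
        description="Keep only selected landmarks from a dlib XML file")
    parser.add_argument("-i", "--input", required=True,
                        help="path to the input data split XML file")
    parser.add_argument("-o", "--output", required=True,
                        help="path to the output (eyes-only) XML file")
    return parser


if __name__ == "__main__":
    # an explicit argument list is used here so the example runs as-is;
    # a real script would call parse_args() with no arguments
    args = vars(build_parser().parse_args(
        ["--input", "train.xml", "--output", "train_eyes.xml"]))
    print(args["input"], args["output"])
```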

Let’s go ahead and define the indices of our eye coordinates:

Our eye landmarks are specified on Line 17. Refer to Figure 5, keeping in mind that the figure is 1-indexed while Python is 0-indexed.

We’ll be training our custom shape predictor on eye locations; however, you could just as easily train an eyebrow, nose, mouth, or jawline predictor, including any combination or subset of these structures, by modifying the LANDMARKS list and including the 0-indexed names of the landmarks you want to detect.

Now let’s define our regular expression and load the original input XML file:

Our regular expression on Line 22 will soon enable extracting part elements along with their names/indexes.

Line 27 loads the contents of the input XML file.

Line 28 opens our output XML file for writing.

Now we’re ready to loop over the input XML file to find and extract the eye landmarks:

Line 31 begins a loop over the rows of the input XML file. Inside the loop, we perform the following tasks:

  • Determine if the current row contains a part element via regular expression matching (Line 34).
    • If it does not contain a part element, write the row back out to file (Lines 39 and 40).
    • If it does contain a part element, we need to parse it further (Lines 43-53).
      • Here we extract the name attribute from the part.
      • And then check to see if the name exists in the LANDMARKS we want to train a shape predictor to localize. If so, we write the row back out to disk (otherwise we ignore the particular name as it’s not a landmark we want to localize).
  • Wrap up the script by closing our output XML file (Line 56).
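The filtering steps above can be sketched as follows. This is a simplified illustration rather than the exact script from the downloads: the function names are hypothetical, and the regular expression assumes each part element sits on its own line with a single-quoted name attribute.

```python
import re

# Landmark names to keep: both eyes, i.e. indices 36 through 47.
LANDMARKS = set(range(36, 48))

# Matches a part element and captures its numeric name attribute.
PART = re.compile(r"<part name='(\d+)'")


def filter_rows(rows, landmarks=LANDMARKS):
    """Yield rows that are not part elements, plus parts named in landmarks."""
    for row in rows:
        match = PART.search(row)
        if match is None:
            # not a part element (image, box, header, ...): keep it unchanged
            yield row
        elif int(match.group(1)) in landmarks:
            # a part element for a landmark we want to keep
            yield row


def parse_file(input_path, output_path, landmarks=LANDMARKS):
    """Write a filtered copy of input_path to output_path."""
    with open(input_path) as src, open(output_path, "w") as dst:
        for row in filter_rows(src.read().strip().split("\n"), landmarks):
            dst.write(row + "\n")
```

Calling parse_file with the path to labels_ibug_300W_train.xml and an output path for labels_ibug_300W_train_eyes.xml would produce the eyes-only training file. Using with blocks means the output file is closed automatically, even if an error occurs partway through.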

Note: Most of our script was inspired by Luca Anzalone’s slice_xml function from their GitHub repo. A big thank you to Luca for putting together such a simple, concise script that is highly effective!

Creating our training and testing splits

Figure 6: Creating our “eye only” face landmark training/testing XML files for training a dlib custom shape predictor with Python.

At this point in the tutorial I assume you have both:

  1. Downloaded the iBUG-300W dataset from the “Downloading the iBUG 300-W dataset” section above
  2. Used the “Downloads” section of this tutorial to download the source code.

You can use the following command to generate our new training file by parsing only the eye landmark coordinates from the original training file:

Similarly, you can do the same to create our new testing file:

To verify that our new training/testing files have been created, check your iBUG-300W root dataset directory for the labels_ibug_300W_train_eyes.xml and labels_ibug_300W_test_eyes.xml files:

Notice that our *_eyes.xml files are highlighted. Both of these files are significantly smaller in file size than their original, non-parsed counterparts.

Implementing our custom dlib shape predictor training script

Our dlib shape predictor training script is loosely based on (1) dlib’s official example and (2) Luca Anzalone’s excellent 2018 article.

My primary contributions here are to:

  • Supply a complete end-to-end example of creating a custom dlib shape predictor, including:
    • Training the shape predictor on a training set
    • Evaluating the shape predictor on a testing set
  • Use the shape predictor to make predictions on custom images/video streams.
  • Provide additional commentary on the hyperparameters you should be tuning.
  • Demonstrate how to systematically tune your shape predictor hyperparameters to balance speed, model size, and accuracy (next week’s tutorial).

To learn how to train your own dlib shape predictor, open up the file in your project structure and insert the following code:

Lines 2-4 import our packages, namely dlib. The dlib toolkit is a package developed by PyImageConf 2018 speaker, Davis King. We will use dlib to train our shape predictor.

The multiprocessing library will be used to grab and set the number of threads/processes we will use for training our shape predictor.

Our script requires two command line arguments (Lines 7-12):

  • --training: The path to our input training XML file. We will use the eyes-only XML file generated by the previous two sections.
  • --model: The path to the serialized dlib shape predictor output file.

From here we need to set options (i.e., hyperparameters) prior to training the shape predictor.

While the following code blocks could be condensed into just 11 lines of code, the comments in both the code and in this tutorial provide additional information to help you both (1) understand the key options, and (2) configure and tune the options/hyperparameters for optimal performance.

In the remaining code blocks in this section I’ll be discussing the 7 most important hyperparameters you can tune/set when training your own custom dlib shape predictor. These values are:

  1. tree_depth
  2. nu
  3. cascade_depth
  4. feature_pool_size
  5. num_test_splits
  6. oversampling_amount
  7. oversampling_translation_jitter

We’ll begin with grabbing the default dlib shape predictor options:

From there, we’ll configure the tree_depth option:

Here we define the tree_depth, which, as the name suggests, controls the depth of each regression tree in the Ensemble of Regression Trees (ERTs). There will be 2^tree_depth leaves in each tree — you must be careful to balance depth with speed.

Smaller values of tree_depth will lead to more shallow trees that are faster, but potentially less accurate. Larger values of tree_depth will create deeper trees that are slower, but potentially more accurate.

Typical values for tree_depth are in the range [2, 8].
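As a quick worked example of the 2^tree_depth relationship, the snippet below prints the number of leaves per tree across the typical range. The helper name leaves_per_tree is just for illustration.

```python
def leaves_per_tree(tree_depth):
    """Number of leaves in a regression tree of the given depth."""
    return 2 ** tree_depth


if __name__ == "__main__":
    # walk through the typical range [2, 8]
    for depth in range(2, 9):
        print("tree_depth={} -> {} leaves".format(depth, leaves_per_tree(depth)))
```

Each extra level doubles the number of leaves, which is why depth needs to be balanced against speed.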

The next parameter we’re going to explore is nu, a regularization parameter:

The nu option is a floating-point value in the range (0, 1], i.e., greater than 0 and at most 1, used as a regularization parameter to help our model generalize.

Values closer to 1 will make our model fit the training data more closely, but could potentially lead to overfitting. Values closer to 0 will help our model generalize; however, there is a caveat to the generalization power: the closer nu is to 0, the more training data you’ll need.

Typically, for small values of nu you’ll need 1000s of training examples.

Our next parameter is the cascade_depth:

A series of cascades is used to refine and tune the initial predictions from the ERTs — the cascade_depth will have a dramatic impact on both the accuracy and the output file size of your model.

The more cascades you allow for, the larger your model will become (but potentially more accurate). The fewer cascades you allow, the smaller your model will be (but could be less accurate).

The following figure from Kazemi and Sullivan’s paper demonstrates the impact that the cascade_depth has on facial landmark alignment:

Figure 7: The cascade_depth parameter has a significant impact on the accuracy of your custom dlib shape/landmark predictor model.

Clearly you can see that the deeper the cascade, the better the facial landmark alignment.

Typically you’ll want to explore cascade_depth values in the range [6, 18], depending on your required target model size and accuracy.

Let’s now move on to the feature_pool_size:

The feature_pool_size controls the number of pixels used to generate features for the random trees in each cascade.

The more pixels you include, the slower your model will run (but could potentially be more accurate). The fewer pixels you take into account, the faster your model will run (but could also be less accurate).

My recommendation here is that you should use large values for feature_pool_size if inference speed is not a concern. Otherwise, you should use smaller values for faster prediction speed (typically for embedded/resource-constrained devices).

The next parameter we’re going to set is the num_test_splits:

The num_test_splits parameter has a dramatic impact on how long it takes your model to train (i.e., training/wall clock time, not inference speed).

The more num_test_splits you consider, the more likely you’ll have an accurate shape predictor — but again, be cautious with this parameter as it can cause training time to explode.

Let’s check out the oversampling_amount next:

The oversampling_amount controls the amount of data augmentation applied to our training data. The dlib library calls data augmentation jitter, but it is essentially the same idea as data augmentation.

Here we are telling dlib to apply a total of 5 random deformations to each input image.

You can think of the oversampling_amount as a regularization parameter as it may lower training accuracy but increase testing accuracy, thereby allowing our model to generalize better.

Typical oversampling_amount values lie in the range [0, 50] where 0 means no augmentation and 50 is a 50x increase in your training dataset.

Be careful with this parameter! Larger oversampling_amount values may seem like a good idea but they can dramatically increase your training time.
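To get a feel for how quickly the training set grows, here is a back-of-the-envelope sketch. It assumes, as dlib’s documentation describes, that training runs internally on N × oversampling_amount samples when you supply N samples; the 1,000-sample figure is purely hypothetical and the helper name is illustrative.

```python
def effective_training_samples(num_samples, oversampling_amount):
    """Samples seen by the trainer after oversampling is applied."""
    return num_samples * oversampling_amount


if __name__ == "__main__":
    num_samples = 1000  # hypothetical number of annotated faces
    for amount in (1, 5, 20, 50):
        total = effective_training_samples(num_samples, amount)
        print("oversampling_amount={} -> {} samples".format(amount, total))
```

Training time grows with the number of effective samples, so it is worth starting small and increasing the value only if testing accuracy benefits.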

Next comes the oversampling_translation_jitter option:

The oversampling_translation_jitter controls the amount of translation augmentation applied to our training dataset.

Typical values for translation jitter lie in the range [0, 0.5].

The be_verbose option simply instructs dlib to print out status messages as our shape predictor is training:

Finally, we have the num_threads parameter:

This parameter is extremely important as it can dramatically speed up the time it takes to train your model!

The more CPU threads/cores you can supply to dlib, the faster your model will train. We’ll default this value to the total number of CPUs on our system; however, you can set this value as any integer (provided it’s less-than-or-equal-to the number of CPUs on your system).
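One way to pick a sensible thread count is sketched below. The helper choose_num_threads is a hypothetical convenience function, not part of dlib: it defaults to every CPU reported by multiprocessing.cpu_count() and clamps any requested value to the range from 1 up to that count.

```python
import multiprocessing


def choose_num_threads(requested=None):
    """Clamp a requested thread count to [1, number of CPUs]."""
    available = multiprocessing.cpu_count()
    if requested is None:
        # default to every CPU on the system
        return available
    return max(1, min(requested, available))


if __name__ == "__main__":
    print("[INFO] training with {} threads".format(choose_num_threads()))
```

The returned value can then be assigned to the num_threads option.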

Now that our options are set, the final step is to simply call train_shape_predictor:

The dlib library accepts (1) the path to our training XML file, (2) the path to our output shape predictor model, and (3) our set of options.

Once trained, the shape predictor will be serialized to disk so we can later use it.
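Putting the options together, a condensed sketch of the training step might look like the following. Apart from oversampling_amount (set to 5, as discussed above), the hyperparameter values are illustrative assumptions rather than recommendations, and apply_options and train are hypothetical helper names. The dlib import lives inside train so that the helper can be tried without dlib installed.

```python
import multiprocessing

# Illustrative values only; apart from oversampling_amount they are
# assumptions and should be tuned for your own task.
OPTION_VALUES = {
    "tree_depth": 4,
    "nu": 0.1,
    "cascade_depth": 15,
    "feature_pool_size": 400,
    "num_test_splits": 50,
    "oversampling_amount": 5,
    "oversampling_translation_jitter": 0.1,
    "be_verbose": True,
    "num_threads": multiprocessing.cpu_count(),
}


def apply_options(options, values=OPTION_VALUES):
    """Copy each hyperparameter onto the options object and return it."""
    for name, value in values.items():
        setattr(options, name, value)
    return options


def train(training_xml, model_path, values=OPTION_VALUES):
    """Train a shape predictor and serialize it to model_path."""
    import dlib  # imported here so the helper above works without dlib

    # start from dlib's defaults, then override the values we care about
    options = apply_options(dlib.shape_predictor_training_options(), values)
    dlib.train_shape_predictor(training_xml, model_path, options)
```

With the eyes-only training file in place, calling train with the path to labels_ibug_300W_train_eyes.xml and an output path such as eye_predictor.dat would start training. Keeping the values in one dictionary makes it easy to swap in a different configuration when tuning.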

While this script may have appeared especially easy, be sure to spend time configuring your options/hyperparameters for optimal performance.

Training the custom dlib shape predictor

We are now ready to train our custom dlib shape predictor!

Make sure you have (1) downloaded the iBUG-300W dataset and (2) used the “Downloads” section of this tutorial to download the source code to this post.

Once you have done so, you are ready to train the shape predictor:

The entire training process took 9m11s on my 3 GHz Intel Xeon W processor.

To verify that your shape predictor has been serialized to disk, ensure that eye_predictor.dat has been created in your directory structure:

As you can see, the output model is only 18MB — that’s quite the reduction in file size compared to dlib’s standard/default facial landmark predictor which is 99.7MB!

Implementing our shape predictor evaluation script

Now that we’ve trained our dlib shape predictor, we need to evaluate its performance on both our training and testing sets to verify that it’s not overfitting and that our results will (ideally) generalize to our own images outside the training set.

Open up the file and insert the following code:

Lines 2 and 3 indicate that we need both argparse and dlib to evaluate our shape predictor.

Our command line arguments include:

  • --predictor: The path to our serialized shape predictor model that we generated via the previous two “Training” sections.
  • --xml: The path to the input training/testing XML file (i.e., our eyes-only parsed XML files).

When both of these arguments are provided via the command line, dlib will handle evaluation (Line 16). Dlib handles computing the mean average error (MAE) between the predicted landmark coordinates and the ground-truth landmark coordinates.

The smaller the MAE, the better the predictions.
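The sketch below shows two things. The first function is a simplified, pure-Python illustration of the kind of quantity being reported (the average pixel distance between predicted and ground-truth points); it is not dlib’s exact computation. The second wraps dlib.test_shape_predictor, which performs the real evaluation. Both function names are hypothetical, and dlib is imported lazily so the first helper runs on its own.

```python
import math


def mean_landmark_error(predicted, ground_truth):
    """Average Euclidean distance, in pixels, between paired (x, y) points."""
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(distances) / len(distances)


def evaluate(xml_path, predictor_path):
    """Return the error dlib reports for a predictor on an XML dataset."""
    import dlib  # imported lazily so the helper above runs without dlib

    return dlib.test_shape_predictor(xml_path, predictor_path)


if __name__ == "__main__":
    # toy example: one perfect prediction and one that is 5 pixels off
    print(mean_landmark_error([(0, 0), (3, 4)], [(0, 0), (0, 0)]))
```

Running evaluate once on the training XML file and once on the testing XML file gives the two numbers needed to check for overfitting.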

Shape prediction accuracy results

If you haven’t yet, use the “Downloads” section of this tutorial to download the source code and pre-trained shape predictor.

From there, execute the following command to evaluate our eye landmark predictor on the training set:

Here we are obtaining an MAE of ~3.63.

Let’s now run the same command on our testing set:

As you can see, the MAE is twice as large on our testing set versus our training set.

If you have any prior experience working with machine learning or deep learning algorithms you know that in most situations, your training loss will be lower than your testing loss. That doesn’t mean that your model is performing badly — instead, it simply means that your model is doing a better job modeling the training data versus the testing data.

Shape predictors are especially interesting to evaluate as it’s not just the MAE that needs to be examined!

You also need to visually validate the results and verify the shape predictor is working as expected — we’ll cover that topic in the next section.

Implementing the shape predictor inference script

Now that we have our shape predictor trained, we need to visually validate that the results look good by applying it to our own example images/video.

In this section we will:

  1. Load our trained dlib shape predictor from disk.
  2. Access our video stream.
  3. Apply the shape predictor to each individual frame.
  4. Verify that the results look good.

Let’s get started.

Open up and insert the following code:

Lines 2-8 import necessary packages. In particular we will use imutils and OpenCV (cv2) in this script. Our VideoStream class will allow us to access our webcam. The face_utils module contains a helper function used to convert dlib’s landmark predictions to a NumPy array.

The only command line argument required for this script is the path to our trained facial landmark predictor, --shape-predictor.

Let’s perform three initializations:

Our initializations include:

  • Loading the face detector (Line 19). The detector allows us to find a face in an image/video prior to localizing landmarks on the face. We’ll be using dlib’s HOG + Linear SVM face detector. Alternatively, you could use Haar cascades (great for resource-constrained, embedded devices) or a more accurate deep learning face detector.
  • Loading the facial landmark predictor (Line 20).
  • Initializing our webcam stream (Line 24).

Now we’re ready to loop over frames from our camera:

Lines 31-33 grab a frame, resize it, and convert to grayscale.

Line 36 applies face detection using dlib’s HOG + Linear SVM algorithm.

Let’s process the faces detected in the frame by predicting and drawing facial landmarks:

Line 39 begins a loop over the detected faces. Inside the loop, we:

  • Take dlib’s rectangle object and convert it to OpenCV’s standard (x, y, w, h) bounding box ordering (Line 42).
  • Draw the bounding box surrounding the face (Line 43).
  • Use our custom dlib shape predictor to predict the location of our landmarks (i.e., eyes) via Line 48.
  • Convert the returned coordinates to a NumPy array (Line 49).
  • Loop over the predicted landmark coordinates and draw them individually as small dots on the output frame (Lines 53 and 54).
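The rectangle conversion in the first step is simple enough to sketch by hand. The version below mirrors the idea behind the face_utils helper, using a stand-in rectangle class so the example runs without dlib; FakeRect is hypothetical, while a real dlib rectangle exposes the same left(), top(), right(), and bottom() methods.

```python
class FakeRect:
    """Stand-in for a dlib rectangle, exposing the same four accessors."""

    def __init__(self, left, top, right, bottom):
        self._left, self._top = left, top
        self._right, self._bottom = right, bottom

    def left(self):
        return self._left

    def top(self):
        return self._top

    def right(self):
        return self._right

    def bottom(self):
        return self._bottom


def rect_to_bb(rect):
    """Convert corner coordinates to OpenCV-style (x, y, w, h) ordering."""
    x = rect.left()
    y = rect.top()
    w = rect.right() - x
    h = rect.bottom() - y
    return (x, y, w, h)


if __name__ == "__main__":
    print(rect_to_bb(FakeRect(10, 20, 110, 220)))
```

The (x, y, w, h) form is what makes it straightforward to draw the box with (x, y) as one corner and (x + w, y + h) as the other.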

If you need a refresher on drawing rectangles and solid circles, refer to my OpenCV Tutorial.

To wrap up we’ll display the result!

Line 57 displays the frame to the screen.

If the q key is pressed at any point while we’re processing frames from our video stream, we’ll break and perform cleanup.
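For reference, the whole inference loop can be condensed into a sketch like the one below. It is an approximation of the steps described above, not the exact script from the downloads: the function names, the frame width of 400 pixels, the two-second camera warm-up, and the drawing colors are all assumptions. The heavy imports sit inside run so the small should_quit helper can be exercised on its own.

```python
import time


def should_quit(key):
    """Return True when the key code from cv2.waitKey corresponds to q."""
    return (key & 0xFF) == ord("q")


def run(predictor_path):
    """Detect faces in a webcam stream and draw the predicted landmarks."""
    # imported lazily so should_quit can be used without these packages
    import cv2
    import dlib
    import imutils
    from imutils import face_utils
    from imutils.video import VideoStream

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(predictor_path)
    vs = VideoStream(src=0).start()
    time.sleep(2.0)  # give the camera sensor time to warm up

    while True:
        # grab a frame, resize it, and convert it to grayscale
        frame = imutils.resize(vs.read(), width=400)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        for rect in detector(gray, 0):
            # draw the face bounding box
            (x, y, w, h) = face_utils.rect_to_bb(rect)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

            # predict the landmarks and draw each one as a small dot
            shape = face_utils.shape_to_np(predictor(gray, rect))
            for (sx, sy) in shape:
                cv2.circle(frame, (int(sx), int(sy)), 1, (0, 0, 255), -1)

        cv2.imshow("Frame", frame)
        if should_quit(cv2.waitKey(1)):
            break

    # cleanup
    cv2.destroyAllWindows()
    vs.stop()
```

Calling run with the path to eye_predictor.dat starts the webcam loop; pressing q in the display window ends it.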

Making predictions with our dlib shape predictor

Are you ready to see our custom shape predictor in action?

If so, make sure you use the “Downloads” section of this tutorial to download the source code and pre-trained dlib shape predictor.

From there you can execute the following command:

As you can see, our shape predictor is both:

  • Correctly localizing my eyes in the input video stream
  • Running in real-time

Again, I’d like to call your attention back to the “Balancing shape predictor model speed and accuracy” section of this tutorial — our model is not predicting all of the possible 68 landmark locations on the face!

Instead, we have trained a custom dlib shape predictor that only localizes the eye regions (i.e., our model is not trained on the other facial structures in the iBUG-300W dataset, including the eyebrows, nose, mouth, and jawline).

Our custom eye predictor can be used in situations where we don’t need the additional facial structures and only require the eyes, such as building a drowsiness detector, building a virtual makeover application for eyeliner/mascara, or creating computer-assisted software to help disabled users utilize their computers.

In next week’s tutorial, I’ll show you how to tune the hyperparameters to dlib’s shape predictor to obtain optimal performance.

How do I create my own dataset for shape predictor training?

To create your own shape predictor dataset you’ll need to use dlib’s imglab tool. Covering how to create and annotate your own dataset for shape predictor training is outside the scope of this blog post. I’ll be covering it in a future tutorial here on PyImageSearch.

What’s next?

Are you interested in learning more about Computer Vision, OpenCV, and the Dlib library?

If so, you’ll want to take a look at the PyImageSearch Gurus course.

Inside PyImageSearch Gurus, you’ll find:

  • An actionable, real-world course on Computer Vision, Deep Learning, and OpenCV. Each lesson in PyImageSearch Gurus is taught in the same hands-on, easy-to-understand PyImageSearch style that you know and love.
  • The most comprehensive computer vision education online today. The PyImageSearch Gurus course covers 13 modules broken out into 168 lessons, with over 2,161 pages of content. You won’t find a more detailed computer vision course anywhere else online, I guarantee it.
  • A community of like-minded developers, researchers, and students just like you, who are eager to learn computer vision and level-up their skills.
  • Access to private course forums which I personally participate in nearly every day. These forums are a great way to get expert advice, both from me as well as the more advanced students.



In this tutorial, you learned how to train your own custom dlib shape/landmark predictor.

To train our shape predictor we utilized the iBUG-300W dataset, but rather than training our model to recognize all facial structures (i.e., eyes, eyebrows, nose, mouth, and jawline), we trained the model to localize just the eyes.

The end result is a model that is:

  • Accurate: Our shape predictor can accurately predict/localize the location of the eyes on a face.
  • Small: Our eye landmark predictor is smaller than the pre-trained dlib face landmark predictor (18MB vs. 99.7MB, respectively).
  • Fast: Our model is faster than dlib’s pre-trained facial landmark predictor as it predicts fewer locations (the hyperparameters to the model were also chosen to improve speed).

In next week’s tutorial, I’ll teach you how to systematically tune the hyperparameters to dlib’s shape predictor training procedure to balance prediction speed, model size, and localization accuracy.

To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), just enter your email address in the form below!




38 Responses to Training a custom dlib shape predictor

  1. Risab Biswas December 16, 2019 at 11:03 am #

    Hi Adrian, thanks a lot for this tutorial. I have been looking for this for a long while, and now it’s here on PyImageSearch and it couldn’t have been any better. My question is this: if we look at the flow of the entire algorithm, it first detects the face and then detects the facial landmarks. The same is used for your drowsiness detection algorithm. But I have tested it in scenarios where the camera is closer to the face, and in those cases face detection fails and eventually the facial landmark detection also fails. Even with a high-quality image or video, the algorithm doesn’t seem to work in that situation. Creating an algorithm that detects only the eye region and then detects the eye landmark points would give the best results. Kindly let me know your thoughts. Looking forward to hearing from you. Thanks again for all the great work! 🙂

    • Adrian Rosebrock December 16, 2019 at 12:52 pm #

      I’m glad you enjoyed the tutorial, Risab.

      To address your question, you are correct that this is a two-stage process — an object must be detected before any landmarks can be computed.

      That said, in your specific example you mentioned the eyes being very close to the camera. In that case, train a very simple HOG + Linear SVM or Haar cascade to detect the eyes, then apply the eye landmark detector to those eye regions.
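
The two-stage idea in this reply could be sketched roughly as follows. This is only a sketch under assumptions: `eye_cascade` is taken to be an OpenCV `cv2.CascadeClassifier` loaded from the `haarcascade_eye.xml` file bundled with opencv-python, and `predictor` is a hypothetical `dlib.shape_predictor` that was trained with eye-sized bounding boxes. The model trained in this post uses face bounding boxes, so it would most likely need retraining before it could be used this way.

```python
def box_to_corners(x, y, w, h):
    """Convert an OpenCV-style (x, y, w, h) box to (left, top, right, bottom)."""
    return (int(x), int(y), int(x + w), int(y + h))


def eye_landmarks(gray, eye_cascade, predictor):
    """Detect eyes with a Haar cascade, then predict landmarks inside each box.

    eye_cascade: a cv2.CascadeClassifier (assumed to be an eye detector)
    predictor:   a dlib.shape_predictor trained on eye-sized boxes (hypothetical)
    """
    # dlib is imported lazily so box_to_corners() can be used without it
    import dlib

    results = []
    # detectMultiScale returns one (x, y, w, h) tuple per detected eye
    for (x, y, w, h) in eye_cascade.detectMultiScale(
            gray, scaleFactor=1.1, minNeighbors=5):
        # dlib expects a rectangle given as left, top, right, bottom
        rect = dlib.rectangle(*box_to_corners(x, y, w, h))
        shape = predictor(gray, rect)
        results.append([(p.x, p.y) for p in shape.parts()])
    return results
```

The detector parameters (`scaleFactor=1.1`, `minNeighbors=5`) are illustrative starting values rather than tuned settings.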

      Practical Python and OpenCV includes an eye detector Haar cascade + code that you can use to get yourself started.

      • Risab Biswas December 20, 2019 at 4:56 am #

        Thanks a lot Adrian for the Reply and the Insights. I will do the things accordingly.

        Also, it would be great if you could tell me how we can create our own dataset for the landmark points. In particular, can you please suggest an annotation tool for annotating the landmarks? For example, right now we are only able to detect 6 landmark points around each eye, but I would like to get more landmark points, say 10 on each eye.

        Looking forward for your valuable suggestions 🙂

        • Adrian Rosebrock December 20, 2019 at 7:05 am #

          I would suggest you use dlib’s “imglab” tool for annotation.

          • Risab Biswas December 20, 2019 at 1:20 pm #

            Great! Thanks Adrian! 🙂

          • Adrian Rosebrock December 26, 2019 at 9:40 am #

            You’re welcome!

    • Omkar Dalvi January 4, 2020 at 11:36 pm #

      Hey, we are working on a similar project in which we are trying to detect eye motion based on the landmarks. Have you tried the method suggested by Adrian of using an eye detector with dlib on close-up eye video in real time? Please let me know if this works. Thanks

  2. Pranav December 16, 2019 at 12:03 pm #

    Can it be done on a Raspberry Pi 3B+?

    • Adrian Rosebrock December 16, 2019 at 12:49 pm #

      You can deploy trained dlib shape predictors to the RPi but you cannot realistically train them on the RPi. The RPi is too limited in terms of power and resources.

  3. David December 16, 2019 at 3:51 pm #


    Very thorough guide, I’ve just been playing with similar stuff with dlib recently. Do you mind a question on detecting certain shaped objects and fitting landmarks on them?

    For example, a snail from the side view, or even a shark: when you look at it from the side, it could be facing either left or right.

    Having trained both left-facing and right-facing ones as the same object, with landmark points mirrored where necessary, the model seems to mix them up quite a bit (left/right image flips disabled, of course).

    Would you approach this by treating the left-facing set and its keypoints as one object and the right-facing ones as another, calling them snail_a and snail_b or something?


    • Adrian Rosebrock December 18, 2019 at 9:40 am #

      I would do something along the lines of the following:

      1. Train a generic “shark detector” that can detect left or right views of the shark OR train two detectors, one for left view, one for right view.
      2. If you trained two detectors, horizontally flip the left shark so it always looks like a right shark. If you used a single detector, you’ll need a classifier here to determine the view of the shark.
      3. At that point you can apply your landmark predictor.
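
One detail the steps above leave implicit is that landmarks predicted on a flipped image live in the flipped image's coordinates and need to be mirrored back. A minimal sketch of that bookkeeping, assuming `predict_right_facing` and `flip_image` are hypothetical callables you supply (for instance, `flip_image` could wrap `cv2.flip(img, 1)`):

```python
def flip_points_horizontally(points, image_width):
    """Mirror (x, y) points about the vertical centre line of an image."""
    return [(image_width - 1 - x, y) for (x, y) in points]


def predict_any_view(image, faces_left, predict_right_facing, flip_image):
    """Run a right-facing landmark pipeline on a left- or right-facing object.

    faces_left:           True if a detector/classifier decided the view is left
    predict_right_facing: hypothetical callable, image -> list of (x, y) points
    flip_image:           hypothetical callable returning a mirrored copy
    """
    if not faces_left:
        return predict_right_facing(image)

    # flip so the object looks right-facing, then predict on the flipped copy
    points = predict_right_facing(flip_image(image))

    # number of columns; works for nested lists and NumPy-style arrays
    width = len(image[0])

    # map the landmarks back into the original (unflipped) coordinates
    return flip_points_horizontally(points, width)
```

Keeping the flip and the un-flip in one helper makes it harder to forget the second step when drawing the results on the original frame.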

      • David Lipcsey December 20, 2019 at 11:42 am #

        Thanks for the clarification, appreciated!
        Keep up.

        • Adrian Rosebrock December 26, 2019 at 9:40 am #

          Thanks David!

  4. Ajinkya December 17, 2019 at 12:34 am #


    How can I train dlib shape predictor to identify shape of a credit card ?


    • Adrian Rosebrock December 18, 2019 at 9:38 am #

      First you need to create your training set. Your training set should consist of four (x, y)-coordinate pairs which would be the four corners of the credit card itself.
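
To make the shape of such a training set concrete, here is a small sketch that builds an annotation file in the XML layout dlib's imglab tool uses (`dataset` > `images` > `image` > `box` > `part`). The file name, bounding box, and corner coordinates are made-up placeholders, and the fixed corner order (top-left, top-right, bottom-right, bottom-left) is an assumption chosen for illustration; what matters is that every box uses the same part names in the same order.

```python
import xml.etree.ElementTree as ET


def build_dataset(annotations):
    """Build a dlib/imglab-style XML tree from a list of annotation dicts."""
    dataset = ET.Element("dataset")
    images = ET.SubElement(dataset, "images")
    for ann in annotations:
        image = ET.SubElement(images, "image", file=ann["file"])
        left, top, width, height = ann["box"]
        box = ET.SubElement(image, "box", top=str(top), left=str(left),
                            width=str(width), height=str(height))
        # one <part> per corner, named 00..03 in a consistent order
        for index, (x, y) in enumerate(ann["corners"]):
            ET.SubElement(box, "part", name="%02d" % index,
                          x=str(x), y=str(y))
    return ET.ElementTree(dataset)


# hypothetical example: one image with a card and its four corners
annotations = [{
    "file": "cards/card_001.jpg",
    "box": (40, 60, 320, 200),  # left, top, width, height
    "corners": [(45, 66), (352, 70), (355, 255), (42, 250)],
}]
tree = build_dataset(annotations)
xml_text = ET.tostring(tree.getroot(), encoding="unicode")
# tree.write("credit_cards_train.xml") would save it for training
```

In practice you would more likely create this file interactively with imglab; the sketch is mainly useful for seeing what the annotations contain or for converting existing labels.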

  5. Itai December 17, 2019 at 2:35 am #

    Great tutorial Adrian, thank you so much.
    If I can make a request: could you make more tutorials that also leverage 3D data (such as depth maps / point clouds)?
    Some of us work with depth cameras and would like to know more about how we can use the 3D data to make our RGB models more accurate.

    • Adrian Rosebrock December 18, 2019 at 9:37 am #

      Thanks for the suggestion, Itai.

  6. Kevin December 17, 2019 at 9:43 pm #

    Hi, Adrian:

    Thanks for the great post, again!

    I run into an error while executing

    Training with lambda_param: 0.1
    Training with 50 split tests.
    Intel MKL FATAL ERROR: Cannot load or

    It seems I missed these two supporting libraries. How to fix this?


  7. Safakat Rahman December 18, 2019 at 2:46 am #

    Hello. I am using dlib’s shape predictor to detect landmarks, but the landmarks are not stable and shake too much. How can I fix that?
    Thanks in advance

    • Adrian Rosebrock December 18, 2019 at 9:37 am #

      Take a look at optical flow to help stabilize the landmarks.
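
One possible way to apply this suggestion is to track the previous frame's landmarks into the current frame with OpenCV's pyramidal Lucas-Kanade optical flow (`cv2.calcOpticalFlowPyrLK`) and then average the tracked positions with the fresh predictions. The blending step and its `alpha` weight are assumptions made for this sketch, not something prescribed by dlib or by this post.

```python
def blend(tracked, detected, alpha=0.5):
    """Weighted average of tracked and freshly detected (x, y) points."""
    return [(alpha * tx + (1 - alpha) * dx, alpha * ty + (1 - alpha) * dy)
            for (tx, ty), (dx, dy) in zip(tracked, detected)]


def stabilize(prev_gray, gray, prev_points, detected, alpha=0.5):
    """Smooth this frame's landmarks using optical flow from the last frame.

    prev_gray, gray: previous and current grayscale frames
    prev_points:     landmarks reported for the previous frame
    detected:        raw shape predictor output for the current frame
    """
    # imported lazily so blend() has no third-party dependencies
    import cv2
    import numpy as np

    prev = np.array(prev_points, dtype=np.float32).reshape(-1, 1, 2)
    # estimate where last frame's landmarks moved to in the current frame
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, prev, None, winSize=(21, 21), maxLevel=3)
    tracked = [(float(x), float(y)) for (x, y) in nxt.reshape(-1, 2)]
    # fall back to the raw detection wherever tracking failed
    tracked = [t if ok else d
               for t, ok, d in zip(tracked, status.ravel(), detected)]
    return blend(tracked, detected, alpha)
```

A larger `alpha` trusts the tracked positions more, which gives steadier landmarks but slower reaction to fast motion; the window size and pyramid depth are likewise illustrative values.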

  8. Filip Norys December 18, 2019 at 5:29 am #

    Very nice tutorial, thanks for that! I think it would be nice to also show how to prepare your own set of training data in the dlib format.

    • Adrian Rosebrock December 18, 2019 at 9:37 am #

      Thanks for the suggestion. It’s definitely a topic I would like to cover.

  9. Thanh-Sang Nguyen December 26, 2019 at 3:22 am #

    I wonder if this technique can help with the task called “graph digitizer”, meaning the model would extract the numerical values from a plotted graph in any format. What do you think? Thank you so much!

    • Adrian Rosebrock December 26, 2019 at 9:41 am #

      In general I don’t recommend trying to use computer vision algorithms for that type of task. Try to get access to the raw data instead as it will make processing far easier than trying to use computer vision algorithms to extract the graph data.

  10. Thanh-Sang Nguyen December 27, 2019 at 1:05 am #

    Thank you for your reply. My task only has image graph instead of raw data, so that I always look for computer vision solution to extract raw data from image graph. So, you really think that this computer vision technique can not help us to extract raw data from graph, aren’t you? Or if you think it might be possible?

    I really appreciate your reply.
    Thank you!

    • Adrian Rosebrock January 2, 2020 at 8:42 am #

      If possible you should try to access the raw data used to generate the graph. Trying to extract lines/plots from a figure can be very challenging and tedious. It’s far better to go right to the source of the data.

  11. Julien December 29, 2019 at 6:52 pm #

    Hi Adrian

    First, thanks a lot for all the very nice tutorials, which were useful for my project 🙂

    Regarding training, I tried to train the shape predictor by myself, but it seems that the python script consumes too much memory and then the python process is getting killed every time after some while the training script has started. Do you know this problem and do you have any recommendations for solving it?



    • Adrian Rosebrock January 2, 2020 at 8:42 am #

      See my reply to Ahsan Raza

  12. Ahsan Raza January 2, 2020 at 7:18 am #

    Hi Adrian,
    I ran into an error

    File “”, line 78, in
    dlib.train_shape_predictor(args[“training”], args[“model”], options)
    MemoryError: bad allocation

    I have 16gb ram in my system

    how can i solve this issue

    • Adrian Rosebrock January 2, 2020 at 8:41 am #

      Your machine doesn’t have enough RAM, hence the MemoryError. Either:

      1. Add more RAM to your machine
      2. Decrease the size of the training set
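
For the second option, one straightforward approach is to keep only a random subset of the `<image>` entries in the training XML file. A minimal sketch, assuming the file follows the usual dlib layout with an `<images>` element under the root; the file names in the usage comment are hypothetical:

```python
import random
import xml.etree.ElementTree as ET


def subsample_images(root, keep_fraction=0.5, seed=42):
    """Keep a random fraction of <image> entries; returns how many were kept."""
    images = root.find("images")
    all_images = list(images)
    # keep at least one image, but never more than are available
    keep = min(len(all_images),
               max(1, int(len(all_images) * keep_fraction)))
    chosen = set(random.Random(seed).sample(range(len(all_images)), keep))
    for index, image in enumerate(all_images):
        if index not in chosen:
            images.remove(image)
    return keep


# Usage (file names are hypothetical):
# tree = ET.parse("train_eyes.xml")
# subsample_images(tree.getroot(), keep_fraction=0.5)
# tree.write("train_eyes_small.xml")
```

Fixing the seed makes the subset reproducible, so accuracy comparisons between runs are not confounded by a different sample of images.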

  13. William Stevenson January 2, 2020 at 5:06 pm #

    Hi Adrian

    Is there a way to release the camera? My Logitech C920 shows a blue ring around it when it’s in use. vs.stop() does not free up the camera. Sometimes the system hangs on to the camera and the program will not start again.

    PS Most times the code finds my eyes even though I wear glasses.


  14. Ahsan Raza January 3, 2020 at 2:51 am #

    Hi Adrian,
    Thank you for the reply. I solved the problem by closing all the other programs running on the system.

    • Adrian Rosebrock January 16, 2020 at 10:19 am #

      Congrats on resolving the issue!

  15. Mike January 4, 2020 at 7:26 pm #

    Hi! Thanks a lot for this guide! I’ve noticed that when the face is rotated towards one of the sides so it’s not in the front position, model becomes not very accurate. Eyes predictions can not really move along with the face till the end. Do you think this problem can be addressed somehow?
    And also what do you think is the minimum number of training images for the model? I’ve tried training with like 200 pictures but haven’t got good results so far.

  16. quang nhat tran January 13, 2020 at 9:33 pm #


    In my case, the built-in camera on my laptop is not working.
    It returns lines like this:

    [INFO] loading facial landmark predictor…
    [INFO] camera sensor warming up…
    [1] 9169 abort python --shape-predictor eye_predictor.dat

    Does anyone else have that issue?

    • Adrian Rosebrock January 16, 2020 at 10:19 am #

      Try checking your RAM usage. It sounds like your machine might be running out of RAM.

  17. AI_Developer_OZ January 23, 2020 at 2:53 am #

    Hi Adrian, I appreciate your work in this area. Is there any possibility of annotating the human head and combining it with dlib? That is, the detection would include the boundaries of the head along with the 68 points. Please share your views on this and suggest any alternative ideas for detecting the head along with the face. Thanks in advance.

    • Adrian Rosebrock January 23, 2020 at 9:12 am #

      I’m not sure what you mean by “boundaries of head”. Could you elaborate?
