How-To: Multi-GPU training with Keras, Python, and deep learning

Figure: Using Keras to train deep neural networks with multiple GPUs.

Keras is undoubtedly my favorite deep learning + Python framework, especially for image classification.

I use Keras in production applications, in my personal deep learning projects, and here on the PyImageSearch blog.

I’ve even based over two-thirds of my new book, Deep Learning for Computer Vision with Python, on Keras.

However, one of my biggest hangups with Keras is that it can be a pain to perform multi-GPU training. Between the boilerplate code and configuring TensorFlow, it can be a bit of a process…

…but not anymore.

With the latest commit and release of Keras (v2.0.9) it’s now extremely easy to train deep neural networks using multiple GPUs.

In fact, it’s as easy as a single function call!

To learn more about training deep neural networks using Keras, Python, and multiple GPUs, just keep reading.



When I first started using Keras I fell in love with the API. It’s simple and elegant, similar to scikit-learn. Yet it’s extremely powerful, capable of implementing and training state-of-the-art deep neural networks.

However, one of my biggest frustrations with Keras was that it could be non-trivial to use in multi-GPU environments.

If you were using Theano, forget about it — multi-GPU training wasn’t going to happen.
TensorFlow was a possibility, but it could take a lot of boilerplate code and tweaking to get your network to train using multiple GPUs.

I preferred using the mxnet backend (or even the mxnet library outright) to Keras when performing multi-GPU training, but that introduced even more configurations to handle.

All of that changed with François Chollet’s announcement that multi-GPU support using the TensorFlow backend is now baked into Keras v2.0.9. Much of the credit goes to @kuza55 and their keras-extras repo.

I’ve been using and testing this multi-GPU function for almost a year now and I’m incredibly excited to see it as part of the official Keras distribution.

In the remainder of today’s blog post I’ll be demonstrating how to train a Convolutional Neural Network for image classification using Keras, Python, and deep learning.

The MiniGoogLeNet deep learning architecture

Figure 1: The MiniGoogLeNet architecture is a small version of its bigger brother, GoogLeNet/Inception. Image credit to @ericjang11 and @pluskid.

In Figure 1 above we can see the individual convolution (left), inception (middle), and downsample (right) modules, followed by the overall MiniGoogLeNet architecture (bottom), constructed from these building blocks. We will be using the MiniGoogLeNet architecture in our multi-GPU experiments later in this post.

The Inception module in MiniGoogLeNet is a variation of the original Inception module designed by Szegedy et al.

I first became aware of this “Miniception” module from a tweet by @ericjang11 and @pluskid where they beautifully visualized the modules and associated MiniGoogLeNet architecture.

After doing a bit of research, I found that this graphic was from Zhang et al.’s 2017 publication, Understanding Deep Learning Requires Re-Thinking Generalization.

I then proceeded to implement the MiniGoogLeNet architecture in Keras + Python — I even included it as part of Deep Learning for Computer Vision with Python.

A full review of the MiniGoogLeNet Keras implementation is outside the scope of this blog post, so if you’re interested in how the network works (and how to code it), please refer to my book.

Otherwise, you can use the “Downloads” section at the bottom of this blog post to download the source code.

Training a deep neural network with Keras and multiple GPUs

Let’s go ahead and get started training a deep learning network using Keras and multiple GPUs.

To start, you’ll want to ensure that you have Keras 2.0.9 (or greater) installed and updated in your virtual environment (we use a virtual environment named dl4cv  inside my book):
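One way to confirm that the installed version meets that requirement is a small check like the one below. This is only a sketch: parse_version and version_at_least are hypothetical helper names introduced here for illustration, and the check assumes Keras reports a plain dotted version string.

```python
def parse_version(version):
    """Turn a dotted version string such as '2.0.9' into a tuple of ints."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            # Stop at the first non-numeric component (e.g. 'dev')
            break
        parts.append(int(digits))
    return tuple(parts)


def version_at_least(installed, required="2.0.9"):
    """Return True if the installed version is >= the required version."""
    return parse_version(installed) >= parse_version(required)


# Typical use (requires Keras to be installed):
#   import keras
#   print(version_at_least(keras.__version__))
```

If the check comes back False, upgrade Keras inside your virtual environment before continuing.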

From there, open up a new file for the training script and insert the following code:

If you’re using a headless server, you’ll want to configure the matplotlib backend on Lines 3 and 4 by uncommenting the lines. This will enable your matplotlib plots to be saved to disk. If you are not using a headless server (i.e., your keyboard + mouse + monitor are plugged in to your system), you can keep the lines commented out.

From there we import our required packages for this script.

Line 7 imports the MiniGoogLeNet from my pyimagesearch  module (included with the download available in the “Downloads” section).

Another notable import is on Line 13 where we import the CIFAR10 dataset. This helper function will enable us to load the CIFAR-10 dataset from disk with just a single line of code.

Now let’s parse our command line arguments:

We use argparse  to parse one required and one optional argument on Lines 20-25:

  • --output : The path to the output plot after training is complete.
  • --gpus : The number of GPUs used for training.

After loading the command line arguments, we store the number of GPUs as G  for convenience (Line 28).
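The argument parsing described above can be sketched as follows. The short flags (-o, -g), the help strings, the default of one GPU, and the example file name multi_gpu.png are assumptions made for illustration; only the --output and --gpus switches and the variable G come from the text.

```python
import argparse


def build_parser():
    """Build a parser with one required and one optional argument."""
    ap = argparse.ArgumentParser()
    ap.add_argument("-o", "--output", required=True,
                    help="path to the output plot")
    ap.add_argument("-g", "--gpus", type=int, default=1,
                    help="number of GPUs to use for training")
    return ap


# Passing an explicit list keeps the example runnable anywhere;
# a real script would call build_parser().parse_args() with no arguments.
args = vars(build_parser().parse_args(["--output", "multi_gpu.png", "--gpus", "4"]))

# Store the number of GPUs in a convenience variable
G = args["gpus"]
```

Making --gpus optional means the same script can be run unchanged on a single-GPU machine.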

From there, we initialize two important variables used to configure our training process, followed by defining poly_decay , a learning rate schedule function equivalent to Caffe’s polynomial learning rate decay:

We set  NUM_EPOCHS = 70  — this is the number of times (epochs) our training data will pass through the network (Line 32).

We also initialize the learning rate INIT_LR = 5e-3 , a value that was found experimentally in previous trials (Line 33).

From there, we define the poly_decay  function which is the equivalent of Caffe’s polynomial learning rate decay (Lines 35-46). Essentially this function updates the learning rate during training, effectively reducing it after each epoch. Setting the  power = 1.0  changes the decay from polynomial to linear.
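A minimal sketch of such a schedule is shown below, assuming the usual polynomial decay formula, base_lr * (1 - epoch / max_epochs) ** power. The keyword-only parameters are my own addition so the power can be varied; the constants match the values stated above.

```python
NUM_EPOCHS = 70
INIT_LR = 5e-3


def poly_decay(epoch, *, max_epochs=NUM_EPOCHS, base_lr=INIT_LR, power=1.0):
    """Polynomial learning rate decay.

    With power=1.0 the rate falls in a straight line from base_lr
    at epoch 0 down to zero at max_epochs.
    """
    return base_lr * (1 - epoch / float(max_epochs)) ** power


# Inspect the schedule at the start, middle, and end of training
for epoch in (0, 35, 69):
    print(epoch, poly_decay(epoch))
```

A function with this shape (epoch in, learning rate out) is what a per-epoch learning rate callback expects.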

Next we’ll load our training + testing data and convert the image data from integer to float:

From there we apply mean subtraction to the data:

On Line 56, we calculate the mean of all training images followed by Lines 57 and 58 where we subtract the mean from each image in the training and testing sets.
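Mean subtraction can be illustrated with a few lines of NumPy. The sketch below uses small random arrays as stand-ins for the CIFAR-10 images so it runs without downloading anything, and it assumes a per-pixel mean (axis=0) computed on the training set only.

```python
import numpy as np

rng = np.random.RandomState(42)

# Stand-ins for the CIFAR-10 arrays: (num_images, 32, 32, 3),
# integer pixel values converted to float before any arithmetic
trainX = rng.randint(0, 256, size=(8, 32, 32, 3)).astype("float")
testX = rng.randint(0, 256, size=(4, 32, 32, 3)).astype("float")

# Per-pixel mean over the training images only (assumption: axis=0)
mean = np.mean(trainX, axis=0)

# The same training mean is subtracted from both splits
trainX -= mean
testX -= mean
```

Note that the testing set is shifted by the training mean rather than its own, so no statistics from the testing data are used.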

Then, we perform “one-hot encoding”, an encoding scheme I discuss in more detail in my book:

One-hot encoding transforms categorical labels from a single integer to a vector so we can apply the categorical cross-entropy loss function. We’ve taken care of this on Lines 61-63.
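To make the transformation concrete, here is a small NumPy sketch. The one_hot helper is a hypothetical name used for illustration, and it is not necessarily how the downloadable script performs the encoding.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Map integer class labels to one-hot row vectors."""
    # Flatten so that labels shaped (N, 1) and (N,) are both accepted
    labels = np.asarray(labels).ravel()
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="int")[labels]


# Three example labels out of the 10 CIFAR-10 classes
encoded = one_hot([3, 0, 9], num_classes=10)
print(encoded)
```

Each row contains a single 1 at the index of the class, which is the format the categorical cross-entropy loss expects.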

Next, we create a data augmenter and set of callbacks:

On Lines 67-69 we construct the image generator for data augmentation.

Data augmentation is covered in detail inside the Practitioner Bundle of Deep Learning for Computer Vision with Python; however, for the time being understand that it’s a method used during the training process where we randomly alter the training images by applying random transformations to them.

Because of these alterations, the network is constantly seeing augmented examples, which enables the network to generalize better to the validation data while perhaps performing worse on the training set. In most situations this tradeoff is a worthwhile one.

We create a callback function on Line 70 which will allow our learning rate to decay after each epoch — notice our function name, poly_decay .

Let’s check that GPU variable next:

If the GPU count is less than or equal to one, we initialize the model  via the .build  function (Lines 73-76), otherwise we’ll parallelize the model during training:

Creating a multi-GPU model in Keras requires a bit of extra code, but not much!

To start, you’ll notice on Line 84 that we’ve specified to use the CPU (rather than the GPU) as the network context.

Why do we need the CPU?

Well, the CPU is responsible for handling any overhead (such as moving training images on and off GPU memory) while the GPU itself does the heavy lifting.

In this case, the CPU instantiates the base model.

We can then call the multi_gpu_model  on Line 90. This function replicates the model from the CPU to all of our GPUs, thereby obtaining single-machine, multi-GPU data parallelism.
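The single-GPU versus multi-GPU branching described above can be sketched roughly as follows. Treat this as a sketch under assumptions rather than the exact script from the downloads: build_training_model and build_fn are hypothetical names introduced for illustration, and the multi-GPU branch assumes a Keras release from this era (2.0.9 or shortly after) with the TensorFlow 1.x backend, where multi_gpu_model can be imported from keras.utils. As far as I know, later Keras releases removed this function.

```python
def build_training_model(build_fn, gpus):
    """Return (base_model, training_model) for the requested GPU count.

    build_fn is any zero-argument callable that constructs the network,
    for example a lambda wrapping the model's own build function.
    """
    if gpus <= 1:
        # Single GPU (or CPU): train the base model directly
        model = build_fn()
        return model, model

    # Imports are deferred so the single-GPU path has no extra requirements
    import tensorflow as tf
    from keras.utils import multi_gpu_model  # assumption: Keras 2.0.9-era API

    # Instantiate the base model under a CPU device scope so that its
    # weights live in host memory rather than on any one GPU
    with tf.device("/cpu:0"):
        base_model = build_fn()

    # Replicate the base model across the GPUs for data parallelism
    parallel_model = multi_gpu_model(base_model, gpus=gpus)
    return base_model, parallel_model
```

Keeping a reference to the base model alongside the parallel wrapper is a deliberate choice here, since the base model is the object you would normally inspect later.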

When training our network, images will be batched to each of the GPUs. The CPU will obtain the gradients from each GPU and then perform the gradient update step.
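To build intuition for this split-compute-combine pattern, here is a toy NumPy simulation. It is an illustration of the general idea of data parallelism on a simple linear model, not Keras's internal implementation; the function names and the learning rate are assumptions made for the example.

```python
import numpy as np


def gradient(w, X, y):
    """Gradient of the mean squared error for a linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(X)


def data_parallel_step(w, X, y, num_replicas, lr=0.01):
    """One simulated update with the batch split across replicas."""
    # Split the batch into equal shards, one per (simulated) GPU
    X_shards = np.split(X, num_replicas)
    y_shards = np.split(y, num_replicas)
    # Each replica computes a gradient on its shard using the same weights
    grads = [gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    # The gradients are combined in one place, then a single update is applied
    return w - lr * np.mean(grads, axis=0)


rng = np.random.RandomState(0)
X = rng.randn(256, 5)   # 64 samples per replica across 4 replicas
y = rng.randn(256)
w = np.zeros(5)

w_parallel = data_parallel_step(w, X, y, num_replicas=4)
w_single = w - 0.01 * gradient(w, X, y)
print(np.allclose(w_parallel, w_single))
```

With equally sized shards, averaging the per-replica gradients gives the same update as one large batch, which is why results from the two setups should be comparable.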

We can then compile our model and kick off the training process:

On Line 94 we build a Stochastic Gradient Descent (SGD) optimizer.

Subsequently, we compile the model with the SGD optimizer and a categorical crossentropy loss function.

We’re now ready to train the network!

To initiate the training process, we make a call to model.fit_generator  and provide the necessary arguments.

We’d like a batch size of 64 on each GPU so that is specified by  batch_size=64 * G  .
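The arithmetic behind that choice is worth spelling out. The sketch below assumes the 50,000 training images of CIFAR-10 and assumes the number of steps per epoch is the training set size divided by the global batch size, rounded down; batch_plan is a hypothetical helper name.

```python
def batch_plan(num_train_images, gpus, per_gpu_batch=64):
    """Return (global_batch_size, steps_per_epoch) for a GPU count."""
    # Each GPU should see per_gpu_batch images, so the batch handed
    # to the parallel model grows with the number of GPUs
    global_batch = per_gpu_batch * gpus
    steps_per_epoch = num_train_images // global_batch
    return global_batch, steps_per_epoch


for G in (1, 4):
    print(G, batch_plan(50000, G))
```

Scaling the global batch with G keeps the per-GPU workload constant, so each epoch needs roughly a quarter as many steps on four GPUs.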

Our training will continue for 70 epochs (which we specified previously).

The results of the gradient update will be combined on the CPU and then applied to each GPU throughout the training process.

Now that training and testing are complete, let’s plot the loss/accuracy so we can visualize the training process:

This last block simply uses matplotlib to plot training/testing loss and accuracy (Lines 112-121), and then saves the figure to disk (Line 124).

If you would like to learn more about the training process (and how it works internally), please refer to Deep Learning for Computer Vision with Python.

Keras multi-GPU results

Let’s check the results of our hard work.

To start, grab the code from this lesson using the “Downloads” section at the bottom of this post. You’ll then be able to follow along with the results.

Let’s train on a single GPU to obtain a baseline:

Figure 2: Experimental results from training and testing MiniGoogLeNet network architecture on CIFAR-10 using Keras on a single GPU.

For this experiment, I trained on a single Titan X GPU on my NVIDIA DevBox. Each epoch took ~63 seconds with a total training time of 74m10s.

I then executed the following command to train with all four of my Titan X GPUs:

Figure 3: Multi-GPU training results (4 Titan X GPUs) using Keras and MiniGoogLeNet on the CIFAR10 dataset. Training results are similar to the single GPU experiment while training time was cut by ~75%.

Here you can see the quasi-linear speed up in training: Using four GPUs, I was able to decrease each epoch to only 16 seconds. The entire network finished training in 19m3s.
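The speedup can be checked directly from the total training times reported above. The variable names are my own; the timings are the ones from the two experiments.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


single = to_seconds(74, 10)   # single Titan X: 74m10s
multi = to_seconds(19, 3)     # four Titan X GPUs: 19m3s
num_gpus = 4

speedup = single / multi              # how many times faster
efficiency = speedup / num_gpus       # fraction of the ideal 4x
time_saved = 1 - multi / single       # fraction of wall time removed

print("speedup: {:.2f}x".format(speedup))
print("scaling efficiency: {:.1%}".format(efficiency))
print("training time saved: {:.1%}".format(time_saved))
```

An efficiency close to 1.0 is what "quasi-linear" means in practice: four GPUs deliver nearly four times the throughput of one.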

As you can see, not only is training deep neural networks with Keras and multiple GPUs easy, it’s efficient as well!

Note: In this case, the single GPU experiment obtained slightly higher accuracy than the multi-GPU experiment. When training any stochastic machine learning model, there will be some variance. If you were to average these results out across hundreds of runs they would be (approximately) the same.


In today’s blog post we learned how to use multiple GPUs to train Keras-based deep neural networks.

Using multiple GPUs enables us to obtain quasi-linear speedups.

To validate this, we trained MiniGoogLeNet on the CIFAR-10 dataset.

Using a single GPU we were able to obtain 63 second epochs with a total training time of 74m10s.

However, by using multi-GPU training with Keras and Python we decreased training time to 16 second epochs with a total training time of 19m3s.

Enabling multi-GPU training with Keras is as easy as a single function call — I recommend you utilize multi-GPU training whenever possible. In the future I imagine that the multi_gpu_model  will evolve and allow us to further customize specifically which GPUs should be used for training, eventually enabling multi-system training as well.

Ready to take a deep dive into deep learning? Follow my lead.

If you’re interested in learning more about deep learning (and training state-of-the-art neural networks on multiple GPUs), be sure to take a look at my new book, Deep Learning for Computer Vision with Python.

Whether you’re just getting started with deep learning or you’re already a seasoned deep learning practitioner, my new book is guaranteed to help you reach expert status.

To learn more about Deep Learning for Computer Vision with Python (and grab your copy), click here.


If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 11-page Resource Guide on Computer Vision and Image Search Engines, including exclusive techniques that I don’t post on this blog! Sound good? If so, enter your email address and I’ll send you the code immediately!


38 Responses to How-To: Multi-GPU training with Keras, Python, and deep learning

  1. Anthony The Koala October 30, 2017 at 12:18 pm #

    Dear Dr Adrian,
    Thank you for your tutorial. I have a few questions which point to the same thing – how to construct & wire up multiple processors and how to input and output to and from multiple processors for the purposes of experimenting with deep learning and keras.

    Other questions related to the main question on multiple processors and how to input and output to and from multiple processors.

    1. What kind of processors are used if you had to rig up an array of GPUs – for example do you have to purchase a number of NVIDIA GPUs, each one connected to a PCI bus – apologies for sounding ‘naive’ here.
    2. Can multiple RPi’s be rigged up in an array – how?
    3. Are multi-core CPUs used in PCs the same as multiple processors being ‘rigged’ up for multiprocessors?

    Thank you
    Anthony of Sydney Australia

    • Adrian Rosebrock October 30, 2017 at 1:11 pm #

      Hi Anthony — Thanks for your comment.

      (1) If you take a look at my NVIDIA DIGITS DevBox, you can see the specs which include 4 Titan X GPUs connected via PCI Express.

      (2) I wouldn’t advise spending money on a bunch of Raspberry Pis and connecting them up in an array, but people have done it as in this 2013 article.

      (3) I think you’re asking if there’s a difference between multi-core CPUs and having a multiprocessor system. There is a difference — here’s a brief explanation.

      • Anthony The Koala October 31, 2017 at 5:25 am #

        Dear Dr Adrian,
        Thank you for the reply in regards the different kinds of processors and pointing me to different kinds of processors. My summary is:
        (1) NVIDIA DIGITS DevBox – it is a self-contained i7 PC with 4 x GPUs connected on a BUS costing $15000 (US). I understand that the graphics processors which were originally designed for graphics displays’ calculations lends itself also for deep learning computations.
        (2) Using an array of RPis as a supercomputer. In this example article, the computers were interconnected via network cards and were for distributive super computing. For the latter, you want your array to be arranged for parallel computing, for which GPUs (see issue (1)) are constructed and parallel computing is ideal for deep learning.

        However, this clever person in England (2017) designed a supercomputer using RPis. This still required inter-connecting the RPis via network cables and a switch. It also requires each RPi having its unique IP address and communication is via the Message Passing Interface (‘MPI’) communication protocol which has a python implementation. This article (2015) also explains the principles of MPI applied to the RPi.

        Yet to find out comparison data between parallel computers using RPis and commercially available machines (NVidia) and services (Amazon).

        (3) Difference between multicore CPUs and a multiprocessor system. The former (multicore) means two or more processors on a single die (chip) while the latter (multiprocessor) is two or more separate CPUs. Multicore allows for faster caching speeds whereas multiprocessor are independently operating on the same motherboard.

        Even though multicore and multiprocessor allow concurrent execution, it is not the same as parallel execution. I stand corrected on this.

        Thank you,
        Anthony of Sydney Australia

        • Stephen Borstelmann MD October 31, 2017 at 1:22 pm #

          Anthony et. al. –

          I’ve seen the raspberry pi supercomputer video & it’s cute and neat for a parallel computing proof of concept but unless you are a hardware and coding master, I’d forget it.

          Here’s my attempt to make something similar to a DIGITS box. It’s far less expensive – but also has some limitations in capabilities, particularly along the bus and ease of use/setup. You get what you pay for, but if you’re looking just to experiment and expand, it might fit your needs.

          It currently uses one 1080Ti GPU for running Tensorflow, Keras, and pytorch under Ubuntu 16.04LTS but can easily be expanded to 3, possibly 4 GPU’s.

          Puget Systems also builds similar & installs software for those not inclined to do-it-yourself.

  2. Bohumír Zámečník October 30, 2017 at 3:12 pm #

    Hi Adrian,
    thanks for a beautiful and very practical tutorial. Just a correction – the multi_gpu_model() function is yet to be released in 2.0.9, it was added on 11 Oct, whereas 2.0.8 was released on 25 Aug. It means until 2.0.9 we need Keras from master.

    I’d be really interested in how you achieved such a perfect speedup (more than 95% efficiency). I’ve been experimenting with multi-GPU training in Keras with TensorFlow since summer and in Keras got efficiency around 75-85% with ResNet50/imagenet-synth and much better with optimized tf_cnn_benchmark. Coincidentally, today we also published an article on our experience. We tried replication code from kuza55, avolkov1 and fchollet. Our conclusion was that Keras compared to tf_cnn_benchmark is lacking asynchronous prefetching of inputs (to TF memory on CPU and from CPU to GPU). It seems your model was on CIFAR10 with not too big batch size. What was roughly the number of parameters? Maybe in your case the inputs were small compared to computation and the machine was benefitting from 4 16-channel PCIe slots. Could you try comparing our benchmark on your machine?

    We’d also like to try new packages like Horovod and Tensorpack, but anyway I’m working on async prefetch using StagingArea.


    Kind regards,


    • Adrian Rosebrock October 31, 2017 at 7:48 am #

      Hi Bohumír! Thank you for the clarification on the Keras version numbers. I have updated the blog post to report the correct Keras v2.0.9 version number.

      I’m using my NVIDIA DevBox which has been built specifically for deep learning, optimizing across the PCIe bus, processor, GPUs, etc. NVIDIA did a really great job building the machine. I’m personally not a hardware person and I don’t particularly enjoy working with hardware. I tend to focus more on the software side of things. With the DevBox things “just work”.

      I’m very busy with other projects/blog posts right now, but please send me a message and we can chat more about the benchmark.

  3. Segovia October 30, 2017 at 4:00 pm #

    I think when tensorflow is used as backend, all GPUs will be used by default, right? Thanks for your great tutorials!

    • Adrian Rosebrock October 31, 2017 at 7:43 am #

      TensorFlow will allocate all GPUs but the network will not be trained on all GPUs.

  4. Kenny October 30, 2017 at 10:48 pm #

    Cool post yet again, Adrian 🙂

    • Adrian Rosebrock October 31, 2017 at 7:41 am #

      Thanks Kenny!

  5. CT October 31, 2017 at 3:23 am #

    Dear Dr Adrian,

    I downloaded your code and get the following error while running it.

    Using TensorFlow backend.
    Traceback (most recent call last):

    ModuleNotFoundError: No module named ‘keras.utils.training_utils’

    I checked that my keras version is 2.0.8



    • Adrian Rosebrock October 31, 2017 at 7:40 am #

      Please see my reply to “GKS”. The correct Keras version number is their development branch, 2.0.9. Install the 2.0.9 branch and it will work 🙂

      • CT October 31, 2017 at 10:26 pm #

        Hi Dr Adrian,

        Thanks, after I look around on the internet, found some discussion on this and I used the latest Keras source from the Git. Yes, it works now however when I check the version using pip freeze it still shows 2.0.8.

        Another question: I have two different Nvidia cards, a 1080 Ti and a GTX 780. When I use both GPUs for training, it is slower than using the 1080 Ti alone. Is this expected?

        Thanks again.

        • Adrian Rosebrock November 2, 2017 at 2:33 pm #

          The 2.0.9 branch hasn’t been officially released yet (it’s the development branch for the next release which will be 2.0.9).

          As for your second question, it’s entirely possible that the training process would be slower. Your 1080 Ti is significantly faster than your 780 so your Ti quickly processes the batch while your 780 is far behind. The CPU is then stuck waiting for the 780 to catch up before the weights can be updated and then the next batch sent to the two GPUs.

  6. GKS October 31, 2017 at 6:26 am #

    I believe that Keras 2.0.8 doesn’t contain multi_gpu_model as the release was on August 25th while the function was added sometime in October. A PyPI installation should cause an ImportError.

    • Adrian Rosebrock October 31, 2017 at 7:38 am #

      I misspoke, the actual Keras version number is the current development branch 2.0.9. I have updated the blog post to reflect this. A big thank you to Bohumir for clarifying this.

  7. Davide October 31, 2017 at 7:34 am #

    Keras 2.0.8 does not include in utils, sadly.

    • Adrian Rosebrock October 31, 2017 at 7:38 am #

      Please see my reply to GKS.

      • Davide October 31, 2017 at 8:51 am #

        Yeah, didn’t refresh the post when I finished reading it 🙂

  8. Sunny October 31, 2017 at 8:10 am #

    Hi, thanks for your posts. I also used this technique to speed up the training.
    However, the parallelized model cannot be saved like the original model. There is no way I can save it and I’m not able to perform reinforced training in future. Do you have any solution for this bug?

    • Adrian Rosebrock October 31, 2017 at 8:17 am #

      Hi Sunny — I’m not sure what you mean by “the parallelized model cannot be saved like the original model”. Can you please elaborate?

      • Guangzhe Cui November 2, 2017 at 3:04 am #

        I can’t using either

        • Adrian Rosebrock November 2, 2017 at 2:14 pm #

          There should be an internal model representation. I would suggest calling dir(model) and examining the output. You should be able to find the internal model object that can be serialized there. I’ll test this out myself the next time I’m at my workstation.

          • Sunny November 4, 2017 at 8:02 am #

            Hi, thanks for the suggestion. Seems like I’ve found the solution. Just compile the base model, then transfer the trained weights of GPU model back to base model itself, then it was able to be saved like usual, walla!
            >>> autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy') # the GPU model is already compiled, so now compile only the base model
            >>> output = autoencoder.predict(img) # the output will be a mess since only the GPU model is trained, not the base model
            >>> output = parallel_autoencoder.predict(img) # the output is a clear image from the well-trained GPU model
            >>> autoencoder.set_weights(parallel_autoencoder.get_weights()) # transfer the trained weights from the GPU model to the base model
            >>> output = autoencoder.predict(img) # perform the prediction again and the result is similar to the GPU model
            >>> autoencoder.save('CAE.h5') # now the model can be saved with the transferred weights from the GPU model

            Hope it helps.

          • Adrian Rosebrock November 6, 2017 at 10:41 am #

            Thank you for sharing, Sunny!

  9. Adam October 31, 2017 at 9:55 pm #

    Dear Dr Adrian

    Great post again!

    I have another problem in training parallelized model.
    The weight trained in the GPUs can not be loaded in the CPU environment.
    (The model’s summary is different from the CPU)

    Do you have any solutions for this problem?

    • Adrian Rosebrock November 2, 2017 at 2:34 pm #

      See my reply to “Sunny”. I believe this can be resolved by finding the internal model representation via dir(model) and extracting this object.

      • Adam November 5, 2017 at 7:57 pm #

        Thank you for your reply! I will try it!
        I am looking forward to reading another post of yours about this problem.

  10. yunxiao November 5, 2017 at 3:43 am #

    Hi Adrian

    Thank you for your excellent post. Training and validation using the multi_gpu_model was smooth for me. But when I try to do test (parallel_model.predict()) it throws an error:

    InternalError: CUB segmented reduce errorinvalid configuration argument [……]

    My understanding is that test is only forward pass for the network so maybe I should use model.predict() instead? But since I’m on a quite big dataset by doing model.predict() would take ~300 hours to complete…I think there must be something wrong with my doing but I can’t seem to figure it out… Do you have any suggestions?

    • Adrian Rosebrock November 6, 2017 at 10:37 am #

      Please see my reply to “Sunny” — I think there is an internal model object that exposes the .predict function. I’m not sure if you can directly use multi-GPUs for prediction in this specific instance.

  11. Simon Walsh November 9, 2017 at 7:24 am #

    Has anyone encountered this error, TypeError: can’t pickle module objects, when following this multi-gpu implementation? This is a real pain – it also occurs when trying to save best weights during training.

    • Adrian Rosebrock November 9, 2017 at 7:28 am #

      Hey Simon, please see the conversation between “Sunny” and myself. The parallel model is a wrapper around the model instance that is transferred to the GPUs so the object can’t be directly called when using or the callbacks used to save the best weights. If you wanted to save the best weights you would need to code your own custom callback to handle this use case (since you need to transfer the weights and/or access the internal model). I’m sure this will become easier to use in future versions of Keras, keep in mind this is the first time multi-GPU training is included in an official release of Keras. I’ll also try to do a blog post on how to access the internal model object as well.

      • Simon Walsh November 9, 2017 at 7:31 am #

        Thanks Adrian – if you can have a look at it and provide some guidance that would be great – I am not fully sure I understand (but that’s me, not your explanation!)

        • Adrian Rosebrock November 13, 2017 at 2:29 pm #

          I’ll be taking a look and likely doing an entirely separate blog post since this seems to be a common issue.

      • Jeff November 10, 2017 at 11:28 am #

        Hi Adrian, great post, thanks for sharing!

        I’ve been playing around with multi_gpu_model for a couple weeks now and finally got it to work recently (had to patch a bug in tensorflow for this to happen – quirk of my model).

        I’ve been using the save_weights() method on the parallel model after training and then later creating the base model and using load_weights() with the ‘by_name’ parameter equal to True. It loads fine, but the predictions I get are all roughly between 0.45 and 0.55 even though the training results would suggest pretty high accuracy. Also, each time I do this, the metrics change slightly even though it’s on the same test data. This seems to suggest to me that maybe weights aren’t being loaded but are perhaps randomly initialized?

        I tried using Sunny’s suggestion:


        But I always get a shape mismatch error. After inspecting the results of the get_weights() methods on both models, I noticed they more or less have the same internal shapes but the ordering is different. I’m somewhat new to all this stuff so I was wondering if you might have any suggestions on what might be going on and how to investigate this further?

        • Adrian Rosebrock November 13, 2017 at 2:11 pm #

          Hi Jeff — thanks for the comment. I’m not sure what the exact issue is off the top of my head, I’ll need to play with the multi-GPU training function further, in particular serializing the weights.

  12. michael reynolds November 14, 2017 at 9:13 am #

    Hi Adrian – I am looking at a project with multi V100 GPUs. Are there any compatibility issues or special setup required that you are aware of with that hardware configuration and your ImageNet Bundle?

    • Adrian Rosebrock November 15, 2017 at 1:01 pm #

      Wow, that’s awesome that you’ll be using multiple V100 GPUs! There are no compatibility issues at all. The ImageNet Bundle of Deep Learning for Computer Vision with Python will work just fine on your GPUs.
