How to create a deep learning dataset using Google Images

PyImageSearch reader José asks:

Hey Adrian, thanks for putting together Deep Learning for Computer Vision with Python. This is by far the best resource I’ve seen for deep learning.

My question is this:

I’m working on a project where I need to classify the scenes of outdoor photographs into four distinct categories: cities, beaches, mountains, and forests.

I’ve found a small dataset (~100 images per class), but my models are quick to overfit and far from accurate.

I’m confident I can solve this project, but I need more data.

What do you suggest?

José has a point — without enough training data, your deep learning and machine learning models can’t learn the underlying, discriminative patterns required to make robust classifications.

Which raises the question:

How in the world do you gather enough images when training deep learning models?

Deep learning algorithms, especially Convolutional Neural Networks, can be data-hungry beasts.

And to make matters worse, manually annotating an image dataset can be a time-consuming, tedious, and even expensive process.

So is there a way to leverage the power of Google Images to quickly gather training images and thereby cut down on the time it takes to build your dataset?

You bet there is.

In the remainder of today’s blog post I’ll be demonstrating how you can use Google Images to quickly (and easily) gather training data for your deep learning models.

Looking for the source code to this post?
Jump right to the downloads section.

Deep learning and Google Images for training data

Today’s blog post is part one of a three-part series on building a Not Santa app, inspired by the Not Hotdog app in HBO’s Silicon Valley (Season 4, Episode 4).

As a kid, Christmastime was my favorite time of the year — and even as an adult I always find myself happier when December rolls around.

Looking back on my childhood, my dad always went well out of his way to ensure Christmas was a magical time.

Without him I don’t think this time of year would mean as much to me (and I certainly wouldn’t be the person I am today).

In order to keep the magic of ole’ Saint Nicholas alive, we’re going to spend the next three blog posts building our Not Santa detector using deep learning:

  • Part #1: Gather Santa Claus training data using Google Images (this post).
  • Part #2: Train our Not Santa detector using deep learning, Python, and Keras.
  • Part #3: Deploy our trained deep learning model to the Raspberry Pi.

Let’s go ahead and get started!

Using Google Images for training data and machine learning models

The method I’m about to share with you for gathering Google Images for deep learning is from a fellow deep learning practitioner and friend of mine, Michael Sollami.

He discussed the exact same technique I’m about to share with you in a blog post of his earlier this year.

I’m going to elaborate on these steps and provide further instructions on how you can use this technique to quickly gather training data for deep learning models using Google Images, JavaScript, and a bit of Python.

The first step in using Google Images to gather training data for our Convolutional Neural Network is to head to Google Images and enter a query.

In this case we’ll be using the query term “santa clause”:

Figure 1: The first step to downloading images from Google Image Search is to enter your query and let the pictures load in your browser. Santa Claus is visiting our computer screen!

As you can see from the example image above we have our search results.

The next step is to use a tiny bit of JavaScript to gather the image URLs (which we can then download using Python later in this tutorial).

Fire up the JavaScript console (I’ll assume you are using the Chrome web browser, but you can use Firefox as well) by clicking View => Developer => JavaScript Console:

Figure 2: Opening Google Chrome’s JavaScript Console from the menu bar prior to performing the hack.

From there, click the Console tab:

Figure 3: We will enter JavaScript in the Google Chrome JavaScript Console which is displayed in this figure.

This will enable you to execute JavaScript in a REPL-like manner. The next step is to start scrolling!

Figure 4: Keep scrolling through the Google Image search results until the results are no longer relevant.

Keep scrolling until you have found all relevant images to your query. From there, we need to grab the URLs for each of these images. Switch back to the JavaScript console and then copy and paste this JavaScript snippet into the Console:
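A sketch of that snippet (the jQuery version and CDN URL here are assumptions and may need updating):

```javascript
// Inject jQuery into the current page by appending a <script> tag that
// loads it from a CDN. A small helper keeps the logic reusable.
function injectScript(doc, src) {
  var script = doc.createElement("script");
  script.src = src;
  doc.getElementsByTagName("head")[0].appendChild(script);
  return script;
}

// In the browser console you would run:
// injectScript(document,
//     "https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js");
```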

The snippet above pulls down the jQuery JavaScript library, a common library used by many JavaScript applications.

Now that jQuery is pulled down we can use a CSS selector to grab a list of URLs:
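A sketch of the URL-grabbing step; the `.rg_di .rg_meta` selector and the `ou` metadata key reflect Google Images’ markup at the time of writing and are assumptions that may break as the page changes:

```javascript
// Each Google Images result carries a JSON metadata blob; the "ou" key
// holds the URL of the original, full-resolution image.
function extractUrls(metaElements) {
  return metaElements.map(function (el) {
    return JSON.parse(el.textContent).ou;
  });
}

// In the browser console, using jQuery to select the metadata nodes:
// var urls = extractUrls($(".rg_di .rg_meta").toArray());
```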


And then finally write the URLs to file (one per line):
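A sketch of the final step: build a data URI from the collected URLs, then click a hidden anchor element to trigger the download.

```javascript
// Join the collected URLs (one per line) into a data URI; clicking a
// hidden anchor pointed at it makes the browser download "urls.txt".
function buildDataUri(urls) {
  return "data:text/plain;charset=utf-8," + encodeURIComponent(urls.join("\n"));
}

// In the browser console (with the `urls` array from the previous step):
// var hiddenElement = document.createElement("a");
// hiddenElement.href = buildDataUri(urls);
// hiddenElement.target = "_blank";
// hiddenElement.download = "urls.txt";
// hiddenElement.click();
```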

After executing the above snippet you’ll have a file named urls.txt in your default Downloads directory.

If you are having trouble following this guide, please see the video at the very top of this blog post where I provide step-by-step instructions.

Downloading Google Images using Python

Now that we have our urls.txt file, we need to download each of the individual images.

Using Python and the requests library, this is quite easy.

If you don’t already have requests installed on your machine you’ll want to install it now (taking care to use the workon command first if you are using Python virtual environments):
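The install is a single pip command (run workon first only if you use virtual environments; the environment name below is a placeholder):

```
$ workon your_env
$ pip install requests
```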

From there, open up a new file, name it download_images.py, and insert the following code:

Here we are just importing required packages. Notice requests on Line 4 — this will be the package we use for downloading the image content.

Next, we’ll parse command line arguments and load our urls from disk into memory:
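A sketch of this step; it is factored into functions here so it can be exercised directly, whereas in download_images.py the same calls run at the top level:

```python
import argparse

def parse_args(argv=None):
    # construct the argument parser: --urls and --output are both required
    ap = argparse.ArgumentParser()
    ap.add_argument("-u", "--urls", required=True,
        help="path to file containing image URLs")
    ap.add_argument("-o", "--output", required=True,
        help="path to output directory of images")
    return vars(ap.parse_args(argv))

def load_urls(path):
    # grab the list of URLs from the input file, one per line
    return open(path).read().strip().split("\n")

# in the script itself:
#   args = parse_args()
#   rows = load_urls(args["urls"])
#   total = 0  # the number of images downloaded so far
```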

Command line argument parsing is handled on Lines 9-14 — we only require two:

  • --urls: The path to the file containing image URLs generated by the JavaScript trick above.
  • --output: The path to the output directory where we’ll store our images downloaded from Google Images.

From there, we load each URL from the file into a list on Line 18. We also initialize a counter, total, to count the files we’ve downloaded.

Next we’ll loop over the URLs and attempt to download each image:
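A sketch of the loop; the helper names build_path and download_all are mine rather than from the original script, and requests must be installed:

```python
import os
import requests

def build_path(output_dir, total):
    # filenames count up incrementally from 00000000.jpg
    return os.path.sep.join([output_dir, "{}.jpg".format(str(total).zfill(8))])

def download_all(rows, output_dir):
    total = 0
    for url in rows:
        try:
            # try to download the image; r holds the response (contents
            # plus HTTP headers, etc.) in memory temporarily
            r = requests.get(url, timeout=60)
            # build the output path and write the binary contents to disk
            p = build_path(output_dir, total)
            f = open(p, "wb")
            f.write(r.content)
            f.close()
            # update the counter of downloaded images
            print("[INFO] downloaded: {}".format(p))
            total += 1
        # errors are expected when downloading unconstrained images/pages
        # from the web, so report them and keep going
        except Exception:
            print("[INFO] error downloading {}...skipping".format(url))
    return total
```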

Using requests, we just need to specify the url and a timeout for the download. We attempt to download the image file into a variable, r, which holds the binary file (along with HTTP headers, etc.) in memory temporarily (Line 25).

Let’s go ahead and save the image to disk.

The first thing we’ll need is a valid path and filename. Lines 28 and 29 generate a path + filename, p, which will count up incrementally from 00000000.jpg.

We then create a file pointer, f, specifying our path, p, and indicating that we want write mode in binary format ("wb") on Line 30.

Subsequently, we write our file’s contents (r.content) and then close the file (Lines 31 and 32).

And finally, we update our total count of downloaded images.

If any errors are encountered along the way (and there will be some errors — you should expect them whenever trying to automatically download unconstrained images/pages on the web), the exception is handled and a message is printed to the terminal (Lines 39 and 40).

Now we’ll do a step that shouldn’t be left out!

We’ll loop through all files we’ve just downloaded and try to open them with OpenCV. If the file can’t be opened with OpenCV, we delete it and move on. This is covered in our last code block:
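A sketch of the pruning logic; the image loader is a parameter here so the logic is easy to exercise, while download_images.py itself passes cv2.imread and walks the output directory with imutils.paths.list_images:

```python
import os

def prune_unreadable(image_paths, load_image, remove=os.remove):
    # delete any file the loader cannot open (it returns None or raises)
    deleted = []
    for image_path in image_paths:
        delete = False
        try:
            # try to load the image; None means it could not be read
            if load_image(image_path) is None:
                delete = True
        # an exception usually means a truncated or corrupt download
        except Exception:
            delete = True
        # check to see if the image should be deleted
        if delete:
            print("[INFO] deleting {}".format(image_path))
            remove(image_path)
            deleted.append(image_path)
    return deleted

# in the script:
#   prune_unreadable(paths.list_images(args["output"]), cv2.imread)
```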

As we loop over each file, we’ll initialize a delete flag to False (Line 45).

Then we’ll try to load the image file on Line 49.

If the image is loaded as None, or if there’s an exception, we’ll set delete = True (Lines 53 and 54 and Lines 58-60).

Common reasons for an image being unable to load include an error during the download (such as a file not downloading completely), a corrupt image, or an image file format that OpenCV cannot read.

Lastly, if the delete flag was set, we call os.remove to delete the image (Lines 63-65).

That’s all there is to the Google Images downloader script — it’s pretty self-explanatory.

To download our example images, make sure you use the “Downloads” section of this blog post to download the script and example urls.txt file.

From there, open up a terminal and execute the following command:
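The invocation passes the script’s two required arguments (the output path images/santa assumes you are gathering Santa images):

```
$ python download_images.py --urls urls.txt --output images/santa
```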

As you can see, example images from Google Images are being downloaded to my machine as training data.

The errors you see in the output are normal — you should expect them. You should also expect some images to be corrupt and unable to open — these images get deleted from our dataset.

Pruning irrelevant images from our dataset

Of course, not every image we downloaded is relevant.

To resolve this, we need to do a bit of manual inspection.

My favorite way to do this is to use the default tools on my macOS machine. I can open up Finder and browse the images in the “Cover Flow” view:

Figure 5: The macOS “Cover Flow” view allows us to quickly check each downloaded image to make sure it’s Santa. We’ll want to be sure we’re training our deep learning detector (which we’ll cover next week) with valid Santa pictures.

I can then easily scroll through my downloaded images.

Images that are not relevant can easily be moved to the Trash using <cmd> + <delete> — similar shortcuts exist on other operating systems as well. After pruning my downloaded images I have a total of 461 images as training data for our Not Santa app.

In next week’s blog post I’ll demonstrate how we can use Python and Keras to train a Convolutional Neural Network to detect if Santa Claus is in an input image.

The complete Google Images + deep learning pipeline

I have put together a step-by-step video that demonstrates me performing the above steps to gather deep learning training data using Google Images.

Be sure to take a look!

Summary

In today’s blog post you learned how to:

  1. Use Google Images to search for example images.
  2. Grab the image URLs via a small amount of JavaScript.
  3. Download the images using Python and the requests library.

Using this method we downloaded ~550 images.

We then manually inspected the images and removed non-relevant ones, trimming the dataset down to ~460 images.

In next week’s blog post we’ll learn how to train a deep learning model that will be used in our Not Santa app.

To be notified when the next post in this series goes live, be sure to enter your email address in the form below!

Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 11-page Resource Guide on Computer Vision and Image Search Engines, including exclusive techniques that I don’t post on this blog! Sound good? If so, enter your email address and I’ll send you the code immediately!


38 Responses to How to create a deep learning dataset using Google Images

  1. Anam December 4, 2017 at 12:08 pm #

    Thanks Adrian for sharing the awesome trick!

    • Adrian Rosebrock December 4, 2017 at 12:43 pm #

      It’s my pleasure to share, Anam! 🙂

  2. Pascal de Buren December 4, 2017 at 12:18 pm #

    Sweet post Adrian! In my projects I use Bing’s image search API which is less of a hack (and also less creative ;)). However, it costs you a small amount of money and you need an Azure account.

    • Adrian Rosebrock December 4, 2017 at 12:44 pm #

      I’ve actually been playing around with Bing’s image search API. It’s really good and I hope to do a dedicated blog post on it soon.

  3. Pawel December 4, 2017 at 12:28 pm #

    Selenium is also good for tricks like that. I have found a very good Google scraper, modified it and now it’s powerful!

    • Pawel December 4, 2017 at 12:32 pm #

      And one more thing. You don’t have to make url list files and also using “js” is not needed! Selenium can automatically find tags then urls on the Google image search and download a big list of photos. It can also click “Show more photos” and scroll more to have MORE! And MORE in CNNs is always better :).

      • Adrian Rosebrock December 4, 2017 at 12:46 pm #

        Selenium is fantastic for stuff like this, I totally agree. Using the tags is a great way to expand the search as well. If other readers want to try this I would suggest that you manually look at the tags (to ensure the images are relevant) before doing this. Otherwise you’ll end up downloading non-relevant images. You can of course prune them out later but one of the goals here is to reduce the human intervention.

        • Pawel December 4, 2017 at 1:58 pm #

          I assume that human intervention is always “must to happen” after downloading images from google. The case here is to make as little and as fast as possible.

          I like to make it quick and automatic. And It’s very good to make it the way you once told me in our conversation. Use https://www.pyimagesearch.com/2017/09/11/object-detection-with-deep-learning-and-opencv/ for problem which is only classnames list :(. This script can return objects coordinates. Add little modification and you can crop your previously downloaded google images and store it.

          I checked the “dogs” use case. I downloaded dogs per species. Then localize them (on 98% of images dogs’ coordinates were found correctly), crop them and save. When human review is needed, just look at cropped thumbnails to be 100% sure that one dog species is good. And that is done. You reduced human intervention to a minimum.

          The biggest drawback of this approach is that it is reduced only to classnames on caffemodel.

          Any way. Good job. I hope you don’t mind this little add on to your article.

          • Adrian Rosebrock December 4, 2017 at 6:03 pm #

            I’m always happy when readers contribute to the conversation 🙂

  4. Harald V. December 4, 2017 at 2:49 pm #

    Actually the Chrome Fatkun batch download image app is also great for downloading large numbers of images after a Google image search. First search in Google for your images and scroll down for as long as you need. You can click off the irrelevant images and preselect image size preferences (e.g. minimal size) and rename the images before download. Works generally rather well and fast.

    • Adrian Rosebrock December 4, 2017 at 6:01 pm #

      Very cool, thanks for sharing Harald!

  5. Romeo Disca December 4, 2017 at 3:58 pm #

    Essentially, this gathering technique is called ‘scraping’.

    As mentioned in a comment, Selenium can do it in a convenient manner. If you want to script from the command line with JavaScript, I can recommend you Nightwatch. Nightwatchjs.org

    • Adrian Rosebrock December 4, 2017 at 6:01 pm #

      Indeed, this is called scraping. I actually used Scrapy Python library, but it can be a real pain to automate the loading of all images (i.e., “scroll to see more”) via strict Python. As you noted, Selenium is great for those.

  6. Rob Jones December 4, 2017 at 5:26 pm #

    Thanks Adrian (and Michael) – I’ve wanted exactly this script for a while !

    One concern I had was about infringing copyright by using the images. Poking around, the consensus seems to be that downloading the images for the purpose of training a network is covered under ‘fair use’. The images cannot be reconstructed from the trained network. If I were to redistribute the images then it would be a different story.

    Now, if I were producing a commercial product then talking to an attorney would be prudent…

    • Adrian Rosebrock December 4, 2017 at 5:59 pm #

      Your understanding is correct. As long as you are not republishing the images and they cannot be reconstructed from the network (which is actually something that researchers are diving into now) then as it stands, you are okay. HOWEVER, when I say “you are okay” I’m saying as in the current interpretation of the law. I am not an attorney and you should seek proper legal counsel (I legally have to say that).

    • Matt December 4, 2017 at 7:29 pm #

      I know this may not be the most ethical thing to say, but I’ll just play devils advocate here. Let’s say the use of the images for training was not covered under ‘fair use’ and was prohibited.

      Would there be any way at all to prove that you used any particular image for training? What if any signature would be left by any individual image in the model? I could be wrong but I think that it would be next to impossible to prove.

      • Adrian Rosebrock December 5, 2017 at 7:28 am #

        It really depends on the machine learning model. With CNNs there is a concern that we can actually reconstruct training images from specific sets of nodes in the network. It’s an area of research and we’ll see if it yields anything, but yes it could be concern 5-10 years from now.

  7. Gary December 4, 2017 at 9:26 pm #

    I use Fatkun batch download as a Google Chrome extension and it works rather well. I think the only issue with both methods is removing irrelevant images. You should create a blog post on how to remove them faster.
    Sometimes one out of four images is irrelevant and it becomes quite laborious to manually delete them all.

  8. Harvey December 4, 2017 at 9:57 pm #

    I think this extension solves the problem easily !!

    https://chrome.google.com/webstore/detail/fatkun-batch-download-ima/nnjjahlikiabnchcpehcpkdeckfgnohf?hl=en

  9. Anthony The Koala December 4, 2017 at 11:54 pm #

    Dear Dr Adrian,
    Are there APIs which allow one to examine photos in: “Facebook”, “Twitter”, “SnapChat”, “ebay” and “amazon” as you demonstrated with Google?
    Thank you,
    Anthony, Sydney Australia

    • Adrian Rosebrock December 5, 2017 at 7:25 am #

      You would need to look at the APIs provided by the companies you listed. Facebook, Twitter, eBay, etc. have their own APIs.

  10. Sachin December 5, 2017 at 12:41 am #

    That’s a nifty trick Adrian. Thanks for sharing!! I will try this out in my next project.
    Btw I wonder if the precision and recall of the detector trained over this data be bounded by Google’s vision algorithms.

  11. Deekshith M R December 6, 2017 at 12:23 am #

    i’m getting like
    [INFO] error downloading {given path.jpg}…skipping

    • Adrian Rosebrock December 7, 2017 at 7:45 am #

      Are you getting that warning for every single image?

  12. Thimira Amaratunga December 6, 2017 at 10:39 am #

    Thanks Adrian!
    This trick would help me a lot in my current experiment in transfer learning.

    • Adrian Rosebrock December 7, 2017 at 7:45 am #

      It certainly would help with transfer learning, you’re absolutely correct. Best of luck with the project!

  13. Subash December 6, 2017 at 11:27 am #

    where is the imutils.py file?

    • Adrian Rosebrock December 7, 2017 at 7:45 am #

      You need to install it via pip:

      $ pip install imutils

      The above command will install the imutils package for you.

  14. Brian Rhyu December 7, 2017 at 9:16 am #

    Hello Adrian,

    Just wanted to suggest the icrawler Python library which has built-in classes to crawl Google, Bing, and Baidu images as well as aiding in creating custom crawlers.

    https://github.com/hellock/icrawler

    I’ve been using it lately to collect images for training.

    Thanks

    • Adrian Rosebrock December 8, 2017 at 4:47 pm #

      Thank you for the suggestion, Brian!

  15. Marc-Philippe Huget December 8, 2017 at 8:31 am #

    Hello Adrian,

    Thanks for this post that clarifies some of the things I was searching too. In your example, you look for Santa Claus which is a “whole” element.

    I am wondering if you have any experience or opinion on the ratio of pictures we should take where the concept is part of the picture or isolated on the picture. Let me try to explain: let us suppose you want to recognize the concept of dress, have you ever experienced some differences on the training whether the dress is on a uniform background or on a person? Should we take for instance 70% of pictures where the concept is alone and 30% of pictures in context? Thanks for your opinion

    Regards,
    mph

    • Adrian Rosebrock December 8, 2017 at 4:38 pm #

      I’m not fully sure I understand your question, but if I think you want to create an object detector that can detect a dress on a person along with a dress on a uniform background, such as in a product shot? I would suggest you gather as much training data as possible for each scenario.

      • Marc-Philippe HUGET December 9, 2017 at 6:38 am #

        That’s it. Said differently, I want to recognise the concept of a dress. My opinion is if I only have dresses on a uniform background I could have trouble recognising them when the dress is on a person. Isn’t it? I was wondering if there are some tips on deciding about the ratio between pictures for a concept on a uniform background and pictures for a concept in situation.

        Thanks for your answer
        mph

        • Adrian Rosebrock December 9, 2017 at 7:24 am #

          Potentially yes, if you train your system on standard product-shot images your system may fail to generalize to real-world images where there is a person walking on the street wearing a dress. You should include both sets of images in your training set. As for the aspect ratio, that really depends on what type of object detection framework you are using.

  16. Blue December 12, 2017 at 3:17 am #

    How do you get the not santa pics? What query do you use? Apologies if I’d missed something key in the post.

    • Adrian Rosebrock December 12, 2017 at 9:00 am #

      Please see the ‘Our “Santa” and “Not Santa” dataset’ section of the blog post. In particular this paragraph:

      “I then randomly sampled 461 images that do not contain Santa (Figure 1, right) from the UKBench dataset, a collection of ~10,000 images used for building and evaluating Content-based Image Retrieval (CBIR) systems (i.e., image search engines).”

  17. lm35 December 15, 2017 at 6:03 am #

    Hi Dr Adrian,

    When i run “python3.5 download_images.py --urls urls.txt --output images/santa”, I am getting error as “import cv2
    ImportError: No module named ‘cv2’ ”

    But if I go to “/usr/local/lib/python3.5/dist-packages”, I could see “cv2.so”. The OpenCV version I installed is 3.1.0.
    Can you please guide me.

    • Adrian Rosebrock December 15, 2017 at 8:17 am #

      I would suggest opening up a Python shell and typing “import cv2” to confirm that you have OpenCV properly installed. How did you install OpenCV on your system? Did you use one of the PyImageSearch tutorials?
