Scraping images with Python and Scrapy

Since this is a computer vision and OpenCV blog, you might be wondering: “Hey Adrian, why in the world are you talking about scraping images?”

Great question.

The reason is that image acquisition is one of the least talked about subjects in the computer vision field!

Think about it. Whether you’re leveraging machine learning to train an image classifier, building an image search engine to find relevant images in a collection of photos, or simply developing your own hobby computer vision application — it all starts with the images themselves. 

And where do these images come from?

Well, if you’re lucky, you might be utilizing an existing image dataset like CALTECH-256, ImageNet, or MNIST.

But in the cases where you can’t find a dataset that suits your needs (or when you want to create your own custom dataset), you might be left with the task of scraping and gathering your images. While scraping a website for images isn’t exactly a computer vision technique, it’s still a good skill to have in your tool belt.

In the remainder of this blog post, I’ll show you how to use the Scrapy framework and the Python programming language to scrape images from webpages.

Specifically, we’ll be scraping ALL Time.com magazine cover images. We’ll then use this dataset of magazine cover images in the next few blog posts as we apply a series of image analysis and computer vision algorithms to better explore and understand the dataset.

Looking for the source code to this post?
Jump right to the downloads section.

Installing Scrapy

I actually had a bit of a problem installing Scrapy on my OSX machine — no matter what I did, I simply could not get the dependencies installed properly (flashback to trying to install OpenCV for the first time as an undergrad in college).

After a few hours of tinkering around without success, I simply gave up and switched over to my Ubuntu system where I used Python 2.7. After that, installation was a breeze.

The first thing you’ll need to do is install a few dependencies to help Scrapy parse documents (again, keep in mind that I ran these commands on my Ubuntu system):
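$ sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

The exact package names can drift between Ubuntu releases, but the goal is the same: the libxml2/libxslt headers that Scrapy uses to parse documents, along with the OpenSSL development files.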

Note: This next step is optional, but I highly suggest you do it.

I then used virtualenv and virtualenvwrapper to create a Python virtual environment called scrapy to keep my system site-packages independent and sequestered from the new Python environment I was about to set up. Again, this is optional, but if you’re a virtualenv user, there’s no harm in doing it:
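$ pip install virtualenv virtualenvwrapper
$ mkvirtualenv scrapy

This assumes virtualenvwrapper is already wired into your shell; if mkvirtualenv isn’t found, consult the virtualenvwrapper documentation.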

In either case, now we need to install Scrapy along with Pillow, which is a requirement if you plan on scraping actual binary files (such as images):
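$ pip install pillow
$ pip install scrapy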

Scrapy should take a few minutes to pull down its dependencies, compile, and install.

You can test that Scrapy is installed correctly by opening up a shell (accessing the scrapy  virtual environment if necessary) and trying to import the scrapy  library:
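$ python
>>> import scrapy
>>>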

If you get an import error (or any other error), it’s likely that Scrapy was not linked against a particular dependency correctly. Again, I’m no Scrapy expert, so I would suggest consulting the docs or posting on the Scrapy community forums if you run into problems.

Creating the Scrapy project

If you’ve used the Django web framework before, then you should feel right at home with Scrapy — at least in terms of project structure and the Model-View-Template pattern; although, in this case it’s more of a Model-Spider pattern.

To create our Scrapy project, just execute the following command:
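$ scrapy startproject timecoverspider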

After running the command you’ll see a timecoverspider directory in your current working directory. Changing into the timecoverspider directory, you’ll see the following Scrapy project structure:
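|--- scrapy.cfg
|--- timecoverspider
|    |--- __init__.py
|    |--- items.py
|    |--- pipelines.py
|    |--- settings.py
|    |--- spiders
|         |--- __init__.py

Depending on your Scrapy version you may see an extra file or two (such as middlewares.py), but the overall layout will be the same.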

In order to develop our Time magazine cover crawler, we’ll need to edit the following two files: items.py and settings.py. We’ll also need to create our custom spider, coverspider.py, inside the spiders directory.

Let’s start with the settings.py file, which only requires two quick updates. The first is to find the ITEM_PIPELINES dictionary, uncomment it (if it’s commented out), and add in the following setting:
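ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

(If you’re on a Scrapy release older than 1.0, the pipeline lives at scrapy.contrib.pipeline.files.FilesPipeline instead, so match the path to your installed version.)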

This setting activates Scrapy’s built-in file downloading capability.

The second update can be appended to the bottom of the file. This value, FILES_STORE, is simply the path to the output directory where the downloaded images will be stored:
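FILES_STORE = '/path/to/timecoverspider/output'

The path above is just a placeholder; swap in an absolute path on your own machine.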

Again, feel free to add this setting to the bottom of the settings.py  file — it doesn’t matter where in the file you put it.

Now we can move on to items.py , which allows us to define a data object model for the webpages our spider crawls:
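# import the necessary packages
import scrapy

class MagazineCover(scrapy.Item):
    title = scrapy.Field()
    pubDate = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()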

The code here is pretty self-explanatory. On Line 2 we import our scrapy  package, followed by defining the MagazineCover  class on Line 4. This class encapsulates the data we’ll scrape from each of the time.com  magazine cover webpages. For each of these pages we’ll return a MagazineCover  object which includes:

  • title : The title of the current Time magazine issue. For example, this could be “Code Red”, “The Diploma That Works”, “The Infinity Machine”, etc.
  • pubDate : This field will store the date the issue was published, in year-month-day format.
  • file_urls : The file_urls field is a very important field that you must explicitly define in order to scrape binary files (whether images, PDFs, MP3s, etc.) from a website. You cannot name this variable differently, and it must be defined within your Item sub-class.
  • files : Similarly, the files  field is required when scraping binary data. Do not name it anything different. For more information on the structure of the Item  sub-class intended to save binary data to disk, be sure to read this thread on Scrapy Google Groups.

Now that we have our settings updated and created our data model, we can move on to the hard part — actually implementing the spider to scrape Time for cover images. Create a new file in the spiders  directory, name it coverspider.py , and we’ll get to work:
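# import the necessary packages
from timecoverspider.items import MagazineCover
import datetime
import scrapy

class CoverSpider(scrapy.Spider):
    name = "pyimagesearch-cover-spider"
    start_urls = ["http://content.time.com/time/coversearch/"]

Note: Time has reorganized its cover archive more than once since this post was first published, so treat the start URL above as a best guess and double-check that it still resolves before running the spider.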

Lines 2-4 handle importing our necessary packages. We’ll be sure to import our MagazineCover  data object, datetime  to parse dates from the Time.com website, followed by scrapy  to obtain access to our actual spidering and scraping tools.

From there, we can define the CoverSpider  class on Line 6, a sub-class of scrapy.Spider . This class needs to have two pre-defined values:

  • name : The name of our spider. The name  should be descriptive of what the spider does; however, don’t make it too long, since you’ll have to manually type it into your command line to trigger and execute it.
  • start_urls : This is a list of the seed URLs the spider will crawl first. The URL we have supplied here is the main page of the Time.com cover browser.

Every Scrapy spider is required to have (at a bare minimum) a parse  method that handles parsing the start_urls . This method can in turn yield other requests, triggering other pages to be crawled and spidered, but at the very least, we’ll need to define our parse  function:
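    def parse(self, response):
        # grab the href of the link containing the text 'TIME U.S.'
        # (this selector is a reconstruction; inspect the live page
        # and adjust it if Time's DOM has changed)
        href = response.xpath("//a[contains(., 'TIME U.S.')]/@href")

        # request the TIME U.S. page, handing the response off to
        # the parse_page method
        yield scrapy.Request(response.urljoin(href.extract_first()),
            self.parse_page)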

One of the awesome aspects of Scrapy is the ability to traverse the Document Object Model (DOM) using simple CSS and XPath selectors. Here we traverse the DOM and grab the href (i.e. URL) of the link that contains the text “TIME U.S.”. I have highlighted the “TIME U.S.” link in the screenshot below:

Figure 1: The first step in our scraper is to access the “TIME U.S.” page.

I was able to obtain this CSS selector by using the Chrome browser, right clicking on the link element, selecting “Inspect Element”, and using Chrome’s developer tools to traverse the DOM:

Figure 2: Utilizing Chrome’s Developer tools to navigate the DOM.

Now that we have the URL of the link, we yield a Request  to that page (which is essentially telling Scrapy to “click that link”), indicating that the parse_page  method should be used to handle parsing it.

A screenshot of the “TIME U.S.” page can be seen below:

Figure 3: On this page we need to extract all “Large Cover” links, then follow the “Next” link in the pagination.

We have two primary goals in parsing this page:

  • Goal #1: Grab the URLs of all links with the text “Large Cover” (highlighted in green in the figure above).
  • Goal #2: Once we have grabbed all the “Large Cover” links, we need to click the “Next” button (highlighted in purple), allowing us to follow the pagination and parse all issues of Time and grab their respective covers.

Below follows our implementation of the parse_page  method to accomplish exactly this:
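    def parse_page(self, response):
        # loop over all link elements containing the text 'Large
        # Cover' and yield a request to parse each cover page
        # (again, the selector is a best guess)
        for href in response.xpath("//a[contains(., 'Large Cover')]/@href"):
            yield scrapy.Request(response.urljoin(href.extract()),
                self.parse_covers)

        # follow the 'Next' link in the pagination, if one exists
        nextPage = response.xpath("//a[contains(., 'Next')]/@href")
        if len(nextPage) > 0:
            yield scrapy.Request(response.urljoin(nextPage.extract_first()),
                self.parse_page)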

We start off by looping over all link elements that contain the text “Large Cover”. For each of these links, we “click” it, and yield a request to that page using the parse_covers method (which we’ll define in a few minutes).

Then, once we have generated requests to all cover pages, it’s safe to click the Next  button and use the same parse_page  method to extract data from the following page as well — this process is repeated until we have exhausted the pagination of magazine issues and there are no further pages in the pagination to process.

The last step is to extract the title , pubDate , and store the Time cover image itself. An example screenshot of a cover page from Time can be seen below:

Figure 4: On the actual cover page, we need to extract the issue title, publish date, and cover image URL.

Here I have highlighted the issue name in green, the publication date in red, and the cover image itself in purple. All that’s left to do is define the parse_covers  method to extract this data:
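    def parse_covers(self, response):
        # grab the URL of the cover image (all selectors in this
        # method are reconstructions; treat them as a starting point
        # and update them to match Time's current DOM)
        imageURL = response.css("img::attr(src)").extract_first()

        # grab the title of the current issue
        title = response.css("h1::text").extract_first().strip()

        # extract the publication year and month
        year = response.css("time a::text").extract_first().strip()
        month = response.css("time::text").extract_first().strip()

        # normalize the publication date into year-month-day format,
        # assuming raw text along the lines of 'Oct. 13,' and '2015'
        # (abbreviations such as 'Sept.' may need extra handling)
        date = "{} {}".format(month, year).replace(".", "").replace(",", "")
        d = datetime.datetime.strptime(date, "%b %d %Y")
        pub = "{}-{}-{}".format(d.year, str(d.month).zfill(2),
            str(d.day).zfill(2))

        # yield the scraped MagazineCover; the files pipeline will
        # download whatever URLs appear in file_urls
        yield MagazineCover(title=title, pubDate=pub,
            file_urls=[imageURL])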

Just like the other parse methods, parse_covers is straightforward. We start by extracting the URL of the cover image.

We then grab the title of the magazine issue, followed by extracting the publication year and month.

However, the publication date could use a little formatting, so let’s normalize it to a consistent year-month-day format. While it’s not entirely obvious at this moment why this date formatting is useful, it will be very obvious in next week’s post when we actually perform a temporal image analysis on the magazine covers themselves.

Finally, we yield a MagazineCover object including the title, pubDate, and imageURL (which will be downloaded and stored on disk).

Running the spider

To run our Scrapy spider to scrape images, just execute the following command:
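$ scrapy crawl pyimagesearch-cover-spider -o output.json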

This will kick off the image scraping process, serializing each MagazineCover  item to an output file, output.json . The resulting scraped images will be stored in full , a sub-directory that Scrapy creates automatically in the output  directory that we specified via the FILES_STORE  option in settings.py  above.

Below follows a screenshot of the image scraping process running:

Figure 5: Kicking off our image scraper and letting it run.

On my system, the entire scrape to grab all Time magazine covers using Python + Scrapy took a speedy 2m 23s. Not bad for nearly 4,000 images!

Our complete set of Time magazine covers

Now that our spider has finished scraping the Time magazine covers, let’s take a look at our output.json  file:

Figure 6: A screenshot of our output.json file.

To inspect them individually, let’s fire up a Python shell and see what we’re working with:
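$ python
>>> import json
>>> data = json.load(open("output.json"))
>>> len(data)
3969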

As we can see, we have scraped a total of 3,969 images.

Each entry in the data  list is a dictionary, which essentially maps to our MagazineCover  data model:
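>>> sorted(data[0].keys())
[u'file_urls', u'files', u'pubDate', u'title']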

We can easily grab the path to the Time cover image like this:
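>>> imagePath = "output/%s" % (data[0]["files"][0]["path"])

Scrapy names each downloaded file after the SHA1 hash of the file’s URL, which explains the cryptic filenames inside the full directory.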

Inspecting the output/full  directory we can see we have our 3,969 images:

Figure 7: Our dataset of Time magazine cover images.

So now that we have all of these images, the big question is: “What are we going to do with them?!”

I’ll be answering that question over the next few blog posts. We’ll be spending some time analyzing this dataset using computer vision and image processing techniques, so don’t worry, your bandwidth wasn’t wasted!

Note: If Scrapy is not working for you (or if you don’t want to bother setting it up), no worries — I have included the output.json  and raw, scraped .jpg  images in the source code download of the post found at the bottom of this page. You’ll still be able to follow along through the upcoming PyImageSearch posts without a problem.

Summary

In this blog post we learned how to use Python to scrape all cover images of Time magazine. To accomplish this task, we utilized Scrapy, a fast and powerful web scraping framework. Overall, our entire spider file consisted of less than 44 lines of code, which really demonstrates the power and abstraction behind the Scrapy library.

So now that we have this dataset of Time magazine covers, what are we going to do with them?

Well, this is a computer vision blog after all — so next week we’ll start with a visual analytics project where we perform a temporal investigation of the cover images. This is a really cool project and I’m super excited about it!

Be sure to sign up for the PyImageSearch newsletter using the form at the bottom of this post — you won’t want to miss the follow-up posts as we analyze the Time magazine cover dataset!

Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL! Sound good? If so, enter your email address and I’ll send you the code immediately!


43 Responses to Scraping images with Python and Scrapy

  1. Guruprasad October 13, 2015 at 3:08 am #

    Compared to Scrapy, I felt the ‘Beautiful Soup’ library (along with the Requests module) is an easier tool for scraping images from websites.

    • Adrian Rosebrock October 13, 2015 at 7:05 am #

      For small projects, yes, you’re absolutely right.

      But Scrapy has a ton of extra features that you would have to manually implement when using BS4. For example, Scrapy handles multi-threading so you can have multiple requests being sent and processed at the same time. Scrapy handles all of the frustrating “connection timeouts” or when a page doesn’t load properly. Scrapy also handles serialization of the results out of the box.

      Overall, Scrapy can be overkill — or it can be just right for a large enough project.

  2. Neil Harding October 13, 2015 at 6:08 pm #

    Hi Adrian,

    Is this based on the article I mailed to you? Glad you thought it was interesting, love the site.

    • Adrian Rosebrock October 14, 2015 at 9:49 am #

      Yep! It was certainly based on the article you mailed in — thanks so much Neil! 😀

  3. PetarP October 15, 2015 at 4:10 pm #

    Hello Adrian,
    This is superb, I enjoyed building it with you. I can now play with SQLAlchemy and PostgreSQL.
    Cheers!

  4. Shelly October 16, 2015 at 2:53 pm #

    Very nice! Just in time for me to scrape 425,000 pattern images for an image processing project 😉

    • Adrian Rosebrock October 17, 2015 at 6:44 am #

      Nice! 🙂

    • Daniel February 21, 2017 at 12:46 pm #

      Woah, any chance you’ll end up releasing this dataset?

  5. Harish Kamath October 16, 2015 at 8:50 pm #

    Great post Adrian! I managed to install Scrapy on OSX by following the instructions in https://github.com/scrapy/scrapy/issues/1126 (I had previously installed libxml so had that dependency satisfied).

    Apparently Apple has discontinued the use of openssl in favor of its own TLS and crypto libraries.

    Was able to create and run the Time Magazine cover scraper on my Mac.

    • Adrian Rosebrock October 17, 2015 at 6:44 am #

      Thanks for passing along the OSX instructions Harish, I’ll be sure to give that a try 🙂

  6. Paul October 19, 2015 at 1:50 am #

    Great post Adrian!

    Completely new to Scrapy and Django, so I am not sure how
    $ scrapy crawl pyimagesearch-cover-spider -o output.json
    is able to execute the web crawler.

    For those running this on windows, it turns out you need to run pip install pywin32 first. source:
    http://stackoverflow.com/questions/3580855/where-to-find-the-win32api-module-for-python

  7. Moazzam January 1, 2016 at 12:13 am #

    Loved the beauty of this article!!! Very lucid, easy to understand, and to the point.

    • Adrian Rosebrock January 1, 2016 at 7:24 am #

      Thanks Moazzam 😀

  8. Dmitry April 7, 2016 at 4:47 pm #

    Great article.
    Great website.
    Thank you.

    • Adrian Rosebrock April 8, 2016 at 12:54 pm #

      Thanks Dmitry! 🙂

  9. Huy Tu Duy August 16, 2016 at 9:49 pm #

    Perfect tutorial, Adrian!
    I want to ask how I can grab more than one image? In this tutorial you grab only one image.
    Thank you!

    • Adrian Rosebrock August 17, 2016 at 12:02 pm #

      All you need is the yield keyword. You want to yield an object for each request or each image that is being saved. The specific logic and how the code needs to be updated is really dependent on your application.

  10. leeks August 19, 2016 at 9:52 am #

    Hi adrian, I’m getting an import error at:

    from timecoverspider.items import MagazineCover

    Is there a way to solve this? Would really love to try this out.

    • Adrian Rosebrock August 22, 2016 at 1:37 pm #

      Make sure you use the “Downloads” section of this blog post to download the source code to this post. From there, use the example command:

      $ scrapy crawl pyimagesearch-cover-spider -o output.json

      To execute the script.

  11. Jurgis November 20, 2016 at 1:13 pm #

    Hey Adrian, great post again, but I think that the url for the Time mag covers is broken. Just tried scraping it..

    • Adrian Rosebrock November 21, 2016 at 12:31 pm #

      That is certainly possible if the Time website changed their HTML structure.

  12. Cal November 26, 2016 at 4:29 pm #

    Also how did you get the output/full directory and also call up that snazzy viewer???

    • Adrian Rosebrock November 28, 2016 at 10:30 am #

      The output/full directory of images was obtained by running the scripts. I’m not sure what you mean by the “snazzy viewer”, but that is just a built-in file previewer in Mac OS.

    • San March 16, 2017 at 6:54 am #

      Same question

  13. Nico P April 20, 2017 at 11:45 am #

    Excellent post Adrian! Is there a way to capture images with a particular resolution? The idea is to capture all ecommerce product pages that don’t contain a photo. The problem is that this “no photo” image doesn’t have any pattern in CSS or XPath to capture, only the resolution.

    Is it possible to crawl only these images with Scrapy?

    Thanks in advance!

    • Adrian Rosebrock April 21, 2017 at 10:55 am #

      You would basically need to find all img tags in the HTML, loop over them, and check to see if the image meets your criteria.

      • Nico P April 21, 2017 at 6:13 pm #

        How do I do that? Could you give me a little snippet to capture the image resolution?

        Sorry, I’m new in this and I’m not an experienced developer

        Thanks so much

        • Adrian Rosebrock April 24, 2017 at 9:53 am #

          While I’m certainly happy to help point you in the right direction, I cannot write custom code snippets. If you’re new to Python or the world of computer vision and image processing, I would suggest you work through my book, Practical Python and OpenCV, to help you learn the basics.

          • Nicolás Párraga April 28, 2017 at 9:39 am #

            Thanks Adrian, I made a custom copy of this code but the image files aren’t saving to the folder that I indicated in settings. What could be the problem?

            Thanks

          • Adrian Rosebrock April 28, 2017 at 10:05 am #

            Hi Nicolás — without knowing the page structure of what you’re trying to scrape or more details on the project, it’s hard to say what the exact issue is. Also, please keep in mind that this is a computer vision blog so scraping tutorials such as these are often just supplementary to other, more advanced projects. If you are having problems with Scrapy, I would suggest posting on the Scrapy GitHub or asking a specific Scrapy-related question on StackOverflow. Sorry I couldn’t be of more help here!

  14. Nicolás Párraga April 30, 2017 at 3:52 pm #

    Excellent post Adrian! Many Scrapy conventions have changed since this post. Could you do an update with the latest Scrapy changes (and the Time site too)?

    It’s the most useful post I’ve read!

    Thanks so much!

    • Adrian Rosebrock May 1, 2017 at 1:22 pm #

      I will certainly consider it; however, keeping this tutorial updated to reflect any changes Time makes to its DOM would be quite the undertaking. If others would want to help update the code I would consider it.

  15. C-dub August 2, 2017 at 3:41 am #

    Time has updated their DOM so the spider doesn’t work anymore. I spent about 2 hours trying to tweak it to work, but I can’t, so I have to bow out from here.

    One link that works is http://content.time.com/time/coversearch/ but the XML elements are different so the code has to be re-written. Like I said, I spent about 2 hours unsuccessfully so I’m moving on to another tutorial. This one did work at one point, though.

    • Adrian Rosebrock August 4, 2017 at 7:05 am #

      I’m sorry to hear you spent the time trying to update the code, but with 200+ tutorials on the PyImageSearch blog I cannot keep every single one of them updated, especially those that require external resources that I do not control (such as the DOM of the website). In these cases I kindly ask readers to share their solutions in the comments form.

      • Lighting February 7, 2018 at 11:12 pm #

        For the “View Large Cover” link, I tried this:

        for href in response.xpath("//a[contains(., 'View Large Cover')]"):
            yield scrapy.Request(href.xpath("@href").extract_first(),
                self.parse_covers)

        Is it right or wrong? I found the link which is working, but I am confused about how to write the fetch from XPath.

  16. Mokhtar Ebrahim January 17, 2018 at 5:41 pm #

    Good post.
    Thanks.

  17. Peter July 27, 2018 at 2:08 am #

    Learning something. Recently I have used ScrapeStorm (www.scrapestorm.com) to scrape images to my PC, and it worked well. I think it will help many people.

  18. Pierre-Yves Banaszak December 19, 2018 at 3:55 am #

    Hi Adrian,

    I love your work and your blog posts.

    Here is the frame by frame movie of National Geographic Covers (using python : Scrapy + Moviepy) :
    https://youtu.be/yWXW1wV2Fq0

    Amazing how the covers have evolved through the centuries.

    Keep up the good work !
    Thanks,

  19. Pierre-Yves Banaszak December 19, 2018 at 9:00 am #

    Sorry, new youtube URL : https://youtu.be/hMuDyJ6YAPY

    • Adrian Rosebrock December 19, 2018 at 1:46 pm #

      Great job, thanks for sharing!

Trackbacks/Pingbacks

  1. Analyzing 91 years of Time magazine covers for visual trends - PyImageSearch - October 19, 2015

    […] Today’s blog post will build on what we learned from last week: how to construct an image scraper using Python + Scrapy to scrape ~4,000 Time magazine cover images. […]
