Using Tesseract OCR with Python

In last week’s blog post we learned how to install the Tesseract binary for Optical Character Recognition (OCR).

We then applied the Tesseract program to test and evaluate the performance of the OCR engine on a very small set of example images.

As our results demonstrated, Tesseract works best when there is a (very) clean segmentation of the foreground text from the background. In practice, it can be extremely challenging to guarantee these types of segmentations. Hence, we tend to train domain-specific image classifiers and detectors.

Nevertheless, it’s important that we understand how to access Tesseract OCR via the Python programming language in the case that we need to apply OCR to our own projects (provided we can obtain the nice, clean segmentations required by Tesseract).

Example projects involving OCR include building a mobile document scanner from which you wish to extract textual information, or running a service that scans paper medical records and stores the information in a HIPAA-compliant database.

In the remainder of this blog post, we’ll learn how to install the Tesseract OCR + Python “bindings” followed by writing a simple Python script to call these bindings. By the end of the tutorial, you’ll be able to convert text in an image to a Python string data type.

To learn more about using Tesseract and Python together with OCR, just keep reading.


Using Tesseract OCR with Python

This blog post is divided into three parts.

First, we’ll learn how to install the pytesseract package so that we can access Tesseract via the Python programming language.

Next, we’ll develop a simple Python script to load an image, binarize it, and pass it through the Tesseract OCR system.

Finally, we’ll test our OCR pipeline on some example images and review the results.

To download the source code + example images to this blog post, be sure to use the “Downloads” section below.

Installing the Tesseract + Python “bindings”

Let’s begin by getting pytesseract installed. To install pytesseract, we’ll take advantage of pip.

If you’re using a virtual environment (which I highly recommend so that you can separate different projects), use the workon command followed by the appropriate virtual environment name. In this case, our virtualenv is named cv.

Next, let’s install Pillow, a more Python-friendly fork of PIL (a dependency of pytesseract), followed by pytesseract itself.
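The exact commands are not reproduced in this excerpt; assuming a virtualenv named cv as described above, the installs look like:

```shell
$ workon cv
$ pip install pillow
$ pip install pytesseract
```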

Note: pytesseract does not provide true Python bindings. Rather, it simply provides an interface to the tesseract binary. If you take a look at the project on GitHub, you’ll see that the library writes the image to a temporary file on disk, calls the tesseract binary on that file, and captures the resulting output. This is definitely a bit hackish, but it gets the job done for us.

Let’s move forward by reviewing some code that segments the foreground text from the background and then makes use of our freshly installed pytesseract.

Applying OCR with Tesseract and Python

Let’s begin by creating a new file named ocr.py:
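The full listing is available via the “Downloads” section; a sketch of the script, reconstructed from the walkthrough below and laid out so its line numbers match the Line references in the text, looks like this:

```python
# import the necessary packages
from PIL import Image
import pytesseract
import argparse
import cv2
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh",
    help="type of preprocessing to be done")
args = vars(ap.parse_args())

# load the example image and convert it to grayscale
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# check to see if we should apply thresholding to preprocess the
# image
if args["preprocess"] == "thresh":
    gray = cv2.threshold(gray, 0, 255,
        cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# make a check to see if median blurring should be done to remove
# noise
elif args["preprocess"] == "blur":
    gray = cv2.medianBlur(gray, 3)

# write the grayscale image to disk as a temporary file so we can
# apply OCR to it
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# load the image as a PIL/Pillow image, apply OCR, and then delete
# the temporary file
text = pytesseract.image_to_string(Image.open(filename))
os.remove(filename)
print(text)

# show the output images
cv2.imshow("Image", image)
cv2.imshow("Output", gray)
cv2.waitKey(0)
```

If anything here differs from the downloadable version, trust the download.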

Lines 2-6 handle our imports. The Image class is required so that we can load our input image from disk in PIL format, a requirement when using pytesseract.

Our command line arguments are parsed on Lines 9-14. We have two command line arguments:

  • --image: The path to the image we’re sending through the OCR system.
  • --preprocess: The preprocessing method. This switch is optional and, for this tutorial, can accept one of two values: thresh (threshold) or blur.

Next we’ll load the image, binarize it, and write it to disk.

First, we load --image from disk into memory (Line 17), followed by converting it to grayscale (Line 18).

Next, depending on the pre-processing method specified by our command line argument, we will either threshold or blur the image. This is where you would want to add more advanced pre-processing methods (depending on your specific application of OCR) which are beyond the scope of this blog post.

The if statement and body on Lines 22-24 perform a threshold in order to segment the foreground from the background. We do this using both the cv2.THRESH_BINARY and cv2.THRESH_OTSU flags. For details on Otsu’s method, see “Otsu’s Binarization” in the official OpenCV documentation.

We will see later in the results section that this thresholding method can be useful to read dark text that is overlaid upon gray shapes.

Alternatively, a blurring method may be applied. Lines 28-29 perform a median blur when the --preprocess flag is set to blur. Applying a median blur can help reduce salt-and-pepper noise, again making it easier for Tesseract to correctly OCR the image.

After pre-processing the image, we use os.getpid to derive a temporary image filename based on the process ID of our Python script (Line 33).

The final step before using pytesseract for OCR is to write the pre-processed image, gray, to disk, saving it with the filename from above (Line 34).

We can finally apply OCR to our image using the Tesseract Python “bindings”:

Using pytesseract.image_to_string on Line 38, we convert the contents of the image into our desired string, text. Notice that we passed a reference to the temporary image file residing on disk.

This is followed by some cleanup on Line 39 where we delete the temporary file.

Line 40 is where we print text to the terminal. In your own applications, you may wish to do some additional processing here such as spellchecking for OCR errors or Natural Language Processing rather than simply printing it to the console as we’ve done in this tutorial.
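As a toy illustration of that kind of post-processing (this helper is my own sketch, not part of the script above), a minimal cleanup pass over the OCR’d string might look like:

```python
import re

def clean_ocr_text(text):
    # drop non-printable artifacts (e.g. the form feed Tesseract can
    # append as a page separator), then collapse runs of spaces/tabs
    kept = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t ")
    return re.sub(r"[ \t]+", " ", kept).strip()

print(clean_ocr_text("  Noisy\x0c  output \n"))  # -> Noisy output
```

A real application would go further, e.g. dictionary-based spellchecking of each token.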

Finally, Lines 43 and 44 handle displaying the original image and pre-processed image on the screen in separate windows. The cv2.waitKey(0) on Line 45 indicates that we should wait until a key on the keyboard is pressed before exiting the script.

Let’s see our handiwork in action.

Tesseract OCR and Python results

Now that ocr.py has been created, it’s time to apply Python + Tesseract to perform OCR on some example input images.

In this section we will try OCR’ing three sample images using the following process:

  • First, we will run each image through the Tesseract binary as-is.
  • Then we will run each image through ocr.py (which performs pre-processing before sending through Tesseract).
  • Finally, we will compare the results of both of these methods and note any errors.

Our first example is a “noisy” image. This image contains our desired foreground black text on a background that is partly white and partly scattered with artificially generated circular blobs. The blobs act as “distractors” to our simple algorithm.

Figure 1: Our first example input for Optical Character Recognition using Python.

Using the Tesseract binary, as we learned last week, we can apply OCR to the raw, unprocessed image:
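Assuming the example image filename from the downloadable bundle (the exact name may differ), the invocation is:

```shell
$ tesseract images/example_01.png stdout
```

The stdout argument tells Tesseract to print the recognized text to the terminal rather than writing it to an output file.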

Tesseract performed well with no errors in this case.

Now let’s confirm that our newly made script, ocr.py, also works:
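Again assuming the bundled example filename, we run:

```shell
$ python ocr.py --image images/example_01.png
```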

Figure 2: Applying image preprocessing for OCR with Python.

As you can see in this screenshot, the thresholded image is very clear and the background has been removed. Our script correctly prints the contents of the image to the console.

Next, let’s test Tesseract and our pre-processing script on an image with “salt and pepper” noise in the background:

Figure 3: An example input image containing noise. This image will “confuse” our OCR algorithm, leading to incorrect OCR results.

We can see the output of the tesseract binary below:

Unfortunately, Tesseract did not successfully OCR the text in the image.

However, by using the blur pre-processing method in ocr.py we can obtain better results:

Figure 4: Applying image preprocessing with Python and OpenCV to improve OCR results.

Success! Our blur pre-processing step enabled Tesseract to correctly OCR and output our desired text.

Finally, let’s try another image, this one with more text:

Figure 5: Another example input to our Tesseract + Python OCR system.

The above image is a screenshot from the “Prerequisites” section of my book, Practical Python and OpenCV — let’s see how the Tesseract binary handles this image:

Followed by testing the image with ocr.py :

Figure 6: Applying Optical Character Recognition (OCR) using Python, OpenCV, and Tesseract.

Notice misspellings in both outputs including, but not limited to, “In”, “of”, “required”, “programming”, and “follow”.

The outputs of these two methods do not match; interestingly, however, the pre-processed version has only 8 word errors whereas the non-pre-processed image has 17 word errors (over twice as many). Our pre-processing helps even on a clean background!

Python + Tesseract did a reasonable job here, but once again we have demonstrated the limitations of the library as an off-the-shelf classifier.

We may obtain good or acceptable results with Tesseract for OCR, but the best accuracy will come from training custom character classifiers on specific sets of fonts that appear in actual real-world images.

Don’t let the results of Tesseract OCR discourage you — simply manage your expectations and be realistic about Tesseract’s performance. There is no such thing as a true “off-the-shelf” OCR system that will give you perfect results (there are bound to be some errors).

Note: If your text is rotated, you may wish to do additional pre-processing as is performed in this previous blog post on correcting text skew. Otherwise, if you’re interested in building a mobile document scanner, you now have a reasonably good OCR system to integrate into it.

Summary

In today’s blog post we learned how to apply the Tesseract OCR engine with the Python programming language. This enabled us to apply OCR algorithms from within our Python script.

The biggest downside is with the limitations of Tesseract itself. Tesseract works best when there are extremely clean segmentations of the foreground text from the background.

Furthermore, these segmentations need to be at as high a resolution (DPI) as possible, and the characters in the input image cannot appear “pixelated” after segmentation. If characters do appear pixelated, Tesseract will struggle to correctly recognize the text — we found this out even when applying OCR to images captured under ideal conditions (a PDF screenshot).

OCR, while no longer a new technology, is still an active area of research in the computer vision literature especially when applying OCR to real-world, unconstrained images. Deep learning and Convolutional Neural Networks (CNNs) are certainly enabling us to obtain higher accuracy, but we are still a long way from seeing “near perfect” OCR systems. Furthermore, as OCR has many applications across many domains, some of the best algorithms used for OCR are commercial and require licensing to be used in your own projects.

My primary suggestion to readers when applying OCR to their own projects is to first try Tesseract and, if the results are undesirable, move on to the Google Vision API.

If neither Tesseract nor the Google Vision API obtain reasonable accuracy, you might want to reassess your dataset and decide if it’s worth it to train your own custom character classifier — this is especially true if your dataset is noisy and/or contains very specific fonts you wish to detect and recognize. Examples of specific fonts include the digits on a credit card, the account and routing numbers found at the bottom of checks, or stylized text used in graphic design.

I hope you are enjoying this series of blog posts on Optical Character Recognition (OCR) with Python and OpenCV!

To be notified when new blog posts are published here on PyImageSearch, be sure to enter your email address in the form below!

Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 11-page Resource Guide on Computer Vision and Image Search Engines, including exclusive techniques that I don’t post on this blog! Sound good? If so, enter your email address and I’ll send you the code immediately!


50 Responses to Using Tesseract OCR with Python

  1. Balint July 10, 2017 at 11:38 am #

    Hi Adrian! This series is super useful! I’m wondering if there’s going to be a post about training one’s own custom character classifier.

    • Adrian Rosebrock July 11, 2017 at 6:30 am #

      Hi Balint — I actually demonstrate how to train a classifier to recognize handwritten digits inside Practical Python and OpenCV. A more thorough review (with source code) of general machine learning and object detection techniques is covered inside PyImageSearch Gurus.

  2. Todor Arnaudov July 10, 2017 at 9:08 pm #

    Hi, Adrian,

    I think some of the mistakes could be corrected with a bit of NLP post-processing, too.
    For example with NLTK: http://www.nltk.org/

    For a start, it would use dictionaries and a corpus of texts with computed n-grams of words and sequences of characters and part-of-speech tagging. The unlikely sequences would be spotted, similar ones with high frequency may be used for replacement or suggested for the suspicious segments.

    I’ll be more specific if/when I try to do it myself.

    • Adrian Rosebrock July 11, 2017 at 6:29 am #

      Absolutely. Any type of natural language processing or domain-specific regex can help improve the accuracy.

  3. cam July 10, 2017 at 10:18 pm #

    Please ignore my comment, I hadn’t installed the main package: brew install tesseract, but installed tesseract-py.

  4. Neeraj Bisht July 11, 2017 at 3:09 am #

    $ python ocr.py --image Downloads/10011050/1050.jpg

    gray = cv2.threshold(gray, 0, 255,
    ^
    IndentationError: expected an indented block

    Got this error while doing. Help

    • Adrian Rosebrock July 11, 2017 at 6:25 am #

      Make sure you use the “Downloads” section at the bottom of this page to download the source code and example images used in this post. During the copy and paste of the code you introduced an indentation error to the Python script, causing the error. Again, simply download the code using the “Downloads” section to use the code I have provided for you.

  5. Anthony The Koala July 11, 2017 at 4:30 am #

    Dear Dr Jason,
    Have there been any experiments by super-imposing different kinds of noise such as Gaussian, Poisson, the level of noise and the degree of noise reduction in order to determine the Tesseract package will respond to a particular noise family (Gaussian & Poisson) and the threshold of noise reduction for the Tesseract package to process images correctly?

    To put it another way:

    That is if the particular noise cannot be completely/significantly reduced can the Tesseract package successfully decode the text with say 99% accuracy?

    Also is there a particular noise distribution that the Tesseract OCR will successfully decode text 99%.

    Thank you,
    Anthony of Sydney NSW

    • Adrian Rosebrock July 11, 2017 at 6:24 am #

      Hi Anthony — it’s Adrian actually, not Jason 😉

      Regarding your questions, I think these are better suited for the Tesseract researchers and developers. I’m sure they have a bunch of benchmark tests they run (sort of like unit tests, only for the machine learning world). This is especially true with their new v4 release of Tesseract that will use LSTMs. I would suggest asking your specific question over at the Tesseract GitHub page as I do not know the answers to these questions off the top of my head.

  6. Dave July 11, 2017 at 11:30 am #

    which python version is this for?

    when i try to run it, it says: ImportError: No module named cv2

    • Adrian Rosebrock July 11, 2017 at 2:09 pm #

      Make sure you install Tesseract into the same environment that your OpenCV bindings are installed in. Did you use one of my tutorials when installing OpenCV? If so, don’t forget to use the workon cv command to access the cv virtual environment and then install Tesseract.

  7. Dani July 11, 2017 at 3:47 pm #

    Hi Adrian,

    How can I split a text from scanned document (binarized image) into lines in order to do OCR on each line?

    • Adrian Rosebrock July 12, 2017 at 2:46 pm #

      The Tesseract binary will automatically attempt to OCR each individual line for you. Is there a particular reason you want to go line-by-line?

      • Dani July 13, 2017 at 1:59 am #

        I’ve noticed that a scanned document with different font sizes is a bit problematic (very poor OCR percentage), especially when the text is not accurately horizontal.
        I thought doing OCR line by line would solve this.

        • Adrian Rosebrock July 14, 2017 at 7:30 am #

          I can see how this might be problematic. Instead of using Tesseract, perhaps try the Google Vision API and compare results.

  8. Andrew July 11, 2017 at 3:53 pm #

    Is there a reason you wrote the image to a temp file instead of using pytesseract.image_to_string(Image.fromarray(gray))?

    • Adrian Rosebrock July 12, 2017 at 2:45 pm #

      The image is in OpenCV (NumPy array) format, so it’s written to disk first and then loaded via PIL/Pillow so it can be OCR’d by Tesseract. It’s a bit of a hack.
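For reference, a sketch of the in-memory approach Andrew describes (the image path is an assumption; this skips the temporary file entirely):

```python
from PIL import Image
import cv2
import pytesseract

# pre-process with OpenCV as before, then hand the NumPy array
# straight to pytesseract via a PIL image -- no temp file needed
gray = cv2.cvtColor(cv2.imread("example.png"), cv2.COLOR_BGR2GRAY)
text = pytesseract.image_to_string(Image.fromarray(gray))
print(text)
```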

  9. Nitish Singh July 12, 2017 at 2:37 pm #

    How do I limit the pytesseract to alphanumeric or any other custom list?

    • Adrian Rosebrock July 12, 2017 at 2:38 pm #

      The easiest method is to consult the Tesseract FAQs. The page I linked to details how to return only digits, but you can modify it to return specific characters.

  10. Fernando July 13, 2017 at 5:38 pm #

    Hello there, your code works fine in the sample test, but sometimes it doesn’t print the result on different images.
    I already tried to change the image DPI and resize them, and couldn’t solve the problem.
    I uploaded 2 different images that I’m using for testing; the first one is identified and printed correctly but the second one isn’t.

    http://imgur.com/a/OT3TX

    Any idea?

    • Adrian Rosebrock July 14, 2017 at 7:22 am #

      You need to localize the font first. Binarize (via thresholding) the image and extract the text regions. Then pass the regions through Tesseract. It’s likely that you are not applying enough pre-processing to your images. As I mentioned in the blog post, Tesseract works best when you can extract just the text regions and ignore the rest of the image.

  11. Ankur July 14, 2017 at 8:53 am #

    Hey !
    I was trying to implement your code but I am facing problem here:

    args = vars(ap.parse_args())

    While running this it gives me following error:

    pydevconsole.py: error: argument -i/--image is required

    I was thinking if you direct me in the right direction?

    • Adrian Rosebrock July 18, 2017 at 10:16 am #

      Hi Ankur — please read up on command line arguments and how they work before continuing.

    • Soham November 1, 2017 at 8:19 am #

      Hey Ankur, I am also getting the same error. How did you fix yours?
      Please help.

  12. Nipun July 27, 2017 at 7:05 am #

    Thanks Adrian for such a nice article. I am trying to achieve this on a video, which actually did work, but this slows down the whole process and I want to do this on live stream.

    Is there a way where we can avoid creating a temporary file and sending it to tesseract via Pillow ?

    Can we simply pass the matrix directly to tesseract, after doing some pre-processing on it?

    • Adrian Rosebrock July 28, 2017 at 9:50 am #

      The Python + Tesseract “bindings” require that an intermediate file be written to disk. To speed up the process, you could create a RAM filesystem, but as far as I know, you can’t pass the matrix directly into the Tesseract binary.

  13. Phill August 11, 2017 at 6:02 am #

    Using pytesseract might not be optimal due to disk I/O operations and subprocess calling of tesseract via os.syscall.
    There is another Python package that offers API access to Tesseract.
    https://pypi.python.org/pypi/tesserocr
    Its docs are very well written.
    Simple OCR of image in numpy array might be done like:
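A tesserocr call along the lines described, with an assumed input image, might look like:

```python
from PIL import Image
from tesserocr import PyTessBaseAPI

# OCR an image through the Tesseract C API, no subprocess or temp file
with PyTessBaseAPI() as api:
    api.SetImage(Image.open("example.png"))
    text = api.GetUTF8Text()
print(text)
```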

    Profiling with profilehooks showed that 99% of the time cost is due to the api.GetUTF8Text() call.

    • Adrian Rosebrock August 14, 2017 at 1:19 pm #

      Thanks for sharing Phill!

    • Sébastien VINCENT August 18, 2017 at 3:28 am #

      In the same vein as tesserocr, there is PyOCR, another Python package which offers access to a more complete Tesseract API. It can be found at https://github.com/openpaperwork/pyocr

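A typical PyOCR call, with an assumed image path, looks like:

```python
from PIL import Image
import pyocr
import pyocr.builders

# grab the first OCR tool PyOCR can find (e.g. the Tesseract binary)
tool = pyocr.get_available_tools()[0]
txt = tool.image_to_string(Image.open("example.png"),
                           builder=pyocr.builders.TextBuilder())
print(txt)
```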

  14. Augustin August 12, 2017 at 7:48 am #

    Hi Adrian,

    Thanks for your articles, very useful!
    I was wondering, I’ve seen that the next Tesseract version is going to use LSTM as a classifier. But, do you know what is implemented in the current version?

    • Adrian Rosebrock August 14, 2017 at 1:14 pm #

      The current version of Tesseract does not use the LSTM classifier by default. You would need to download the new release manually.

  15. David August 16, 2017 at 9:04 am #

    Hello,
    Can you please help me and tell me where to find the different config options in
    pytesseract.image_to_string(image, lang=None, boxes=False, config=None)? I know we can set the page segmentation mode in c++, is it possible with pytesseract?

  16. Nasarudin August 21, 2017 at 11:48 pm #

    Hi Adrian, thank you for the article. It is great as always.
    I wonder if you already tried using OCR on a screenshot. I read somewhere that screenshot only has 72 dpi which is insufficient to OCR that needs bigger dpi(300 and above if I am not mistaken).

    My approach is to take a screenshot, process it by resizing/rescale up to 300%(already done), using the blur function to reduce noise(already done), and convert the image into black and white (have not try it yet)

    I would like to know your opinion on this. Maybe you have better solution. Thank you.

    • Adrian Rosebrock August 22, 2017 at 10:45 am #

      The larger the DPI, (normally) the better when it comes to OCR. As far as what DPI you are capturing a screenshot at, you would have to consult the documentation of your operating system/library used to take the screenshot.

  17. shruthi August 26, 2017 at 12:32 am #

    GOOD MORNING., SIR Is this tesseract can support any languages

    • Adrian Rosebrock August 27, 2017 at 10:36 am #

      Tesseract supports a number of language packs.

  18. Janderson September 3, 2017 at 9:17 am #

    Thanks for your article, very useful! But I have a question. Is it possible to use your script to OCR PDF files? The official Tesseract docs explain this well in C++, but I didn’t find anything in pytesseract. Any idea?

  19. Dinesh Kumar September 18, 2017 at 4:54 am #

    Thanks for your detailed article, Adrian. I am new to OCR, Tesseract and all. This helped me a lot. Thanks, man.

    • Adrian Rosebrock September 18, 2017 at 2:00 pm #

      Awesome, I’m glad to hear it Dinesh 🙂

  20. James October 10, 2017 at 11:10 pm #

    Hi Adrian,

    I’m new to OCR, it was a great help. But I’m curious to put this in web app, can you give me guidelines…

    • Adrian Rosebrock October 13, 2017 at 9:00 am #

      I would suggest creating a REST-like API, as I do in this blog post.

  21. James October 10, 2017 at 11:11 pm #

    Thank you. Sir

  22. Ameer October 24, 2017 at 1:51 pm #

    Hi dear Adrian
    Could we use this with other languages? If yes, may you point out to the main ideas how this is possible?
    Thanks

    • Adrian Rosebrock October 24, 2017 at 2:26 pm #

      You can use Tesseract with C++ and C. See this link. You can also use the binary executable from any language where you can execute executables from within that language.

  23. Soham Khapre October 31, 2017 at 7:46 pm #

    Hi Adrian! Thank you for your code. I am new to Python and OCR so I don’t understand much about it. I gave the image’s path address in line 14 of the code but still I am getting an error saying – argument -i/-C:\Python27\Lib\site-packages\pytesseract is required. The above path is where my image is stored.
    Please help me at the earliest.

    • Adrian Rosebrock November 2, 2017 at 2:38 pm #

      You don’t actually need to modify Line 14. You need to supply it via command line argument. I would suggest reading up on command line arguments before continuing. I hope that helps!

    • Adrian Rosebrock November 2, 2017 at 2:39 pm #

      Another option would be to delete all code related to command line arguments and hard code your paths as separate variables.

  24. Lucian November 10, 2017 at 4:54 am #

    Hi

    Is there a way to see the word which is being processed with a bounding box around ?
    To make it simple, the goal I want to achieve is to create a bounding box and, given a word (by me), compare it to the one the OCR found. If the words are the same, delete/make blank the one with bounding box around (in the image).
    It’s possible ?

    • Adrian Rosebrock November 13, 2017 at 2:14 pm #

      Tesseract accepts an input image and displays the output text. Tesseract does not draw any bounding boxes. I would suggest localizing each text region using an algorithm similar to this one and from there processing each bounding box.

  25. Bill Runge November 19, 2017 at 8:24 am #

    Hi Adrian
    I have successfully installed and used tesseract-ocr and would like to experiment with the tessdata to see if I can improve the identification rate for a project that uses a relatively small number of dot matrix characters. In order to help with this I have been trying to install Qt Box Editor which appears to go well until the make command which fails because it appears that some of the tesseract and leptonica library files are not found. The install instructions suggest ensuring that the path to these is correct in the qtboxeditor.pro script, but I am not sure what the path is or where to insert it in the script. I would appreciate any insight you may have or a link to more detailed information. Thus far I have not found any useful help with this.
    Thank you for some great blogs
