Recently, while working on a problem that involved reading text from PDF files, we had to select an OCR tool that could be used from C# and build an API wrapper around it. The wrapper accepts the location of a PDF file on the server and returns the text matching specific patterns for each page.

We decided to use Google Tesseract 3.04 for this requirement. However, Tesseract is a C/C++ library, so it cannot be referenced directly from a C# project.
After some searching we found the following project, which provides a C# wrapper around Tesseract.

GitHub:
https://github.com/charlesw/tesseract

NuGet:
https://www.nuget.org/packages/Tesseract/3.0.2

Using this library was straightforward. We created an ASP.NET Web API project in Visual Studio 2017 and added the Tesseract NuGet package by running Install-Package Tesseract from the Package Manager Console.

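Since the native binaries shipped with the package were compiled with Visual Studio 2015, we also installed the Visual Studio 2015 runtime (the Visual C++ Redistributable for Visual Studio 2015).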

Tesseract needs trained language data (the tessdata files) for each language you want to run OCR on. We therefore downloaded the language data files for Tesseract 3.04 from

https://github.com/tesseract-ocr/tessdata/releases/tag/3.04.00

Next we created a folder called tessdata in our project and copied the language files downloaded in the previous step into it. Since tessdata is required to initialize the Tesseract engine, we set the Build Action of these files to None and Copy to Output Directory to Copy always, which ensures that tessdata is deployed along with the application.
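Because the folder is copied next to the compiled binaries, its location can be worked out at run time rather than typed in by hand. The following is a minimal sketch of that idea, not part of the Tesseract wrapper: TessdataLocator and Find are hypothetical names, and the two candidate locations (the base directory itself and its bin subfolder) are assumptions about a typical project layout.

```csharp
using System;
using System.IO;

public static class TessdataLocator
{
    // Hypothetical helper: returns the first existing "tessdata" folder found
    // directly under baseDir or under baseDir\bin, or null when neither exists.
    public static string Find(string baseDir)
    {
        var candidates = new[]
        {
            Path.Combine(baseDir, "tessdata"),
            Path.Combine(baseDir, "bin", "tessdata")
        };

        foreach (var candidate in candidates)
        {
            if (Directory.Exists(candidate))
            {
                return candidate;
            }
        }

        return null; // the caller decides how to report a missing folder
    }
}
```

A call such as TessdataLocator.Find(AppDomain.CurrentDomain.BaseDirectory) would then return the folder to hand to the engine, provided one of the two assumed locations exists.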

We then added the following code to a class to initialize the Tesseract engine.

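// Cached engine instance, reused so the engine is not re-initialized on every call.
private static TesseractEngine _engine;

private static TesseractEngine Engine
{
    get
    {
        // Create the engine on first use, or again if it has been disposed.
        if (_engine == null || _engine.IsDisposed)
        {
            _engine = new TesseractEngine(@"C:\Users\iMentor\Source\Repos\Tesseract\Tesseract.API\bin\tessdata", "eng", EngineMode.TesseractAndCube);
        }
        return _engine;
    }
}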

The above code creates a private static field and a private static property called Engine. Together they implement a simple singleton, so the engine is initialized once and then reused.

The TesseractEngine constructor accepts the location of the tessdata folder, the language, and an EngineMode. EngineMode.TesseractAndCube gave us the best accuracy, at the cost of somewhat slower processing.
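In a Web API several requests can ask for the engine at the same moment, and the wrapper processes only one page per engine at a time. The sketch below is one possible way to handle both points, assuming the Tesseract NuGet package is referenced: Lazy<T> creates the engine at most once, and a lock serializes the OCR calls. OcrEngineHolder, ReadText, and the tessdata location next to the application's base directory are hypothetical choices made for illustration.

```csharp
using System;
using System.IO;
using Tesseract;

public static class OcrEngineHolder
{
    // Assumption: a "tessdata" folder sits under the application's base directory.
    private static readonly string TessdataPath =
        Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "tessdata");

    // Lazy<T> runs the factory at most once, even when callers race.
    private static readonly Lazy<TesseractEngine> LazyEngine =
        new Lazy<TesseractEngine>(
            () => new TesseractEngine(TessdataPath, "eng", EngineMode.TesseractAndCube));

    // Serializes OCR calls, since the engine handles one page at a time.
    private static readonly object Gate = new object();

    public static string ReadText(string imagePath)
    {
        lock (Gate)
        {
            // Dispose the Pix and the Page before the next image is processed.
            using (var pix = Pix.LoadFromFile(imagePath))
            using (var page = LazyEngine.Value.Process(pix))
            {
                return page.GetText();
            }
        }
    }
}
```

The lock trades throughput for simplicity; a pool of engines would be another option if a single engine becomes a bottleneck.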

Once the Tesseract engine is initialized, we can load an image (for example a TIFF from a byte array, or an image file) into a Pix object and ask the engine to process it. This returns a Page object, which can then be used to retrieve the recognized text from the image, as shown below.

// Pix and Page are IDisposable; dispose both once you are finished with them.
var pix = Pix.LoadFromFile(filenameofimage);
var page = Engine.Process(pix);

You can set a region of interest on the page object to restrict the OCR operation to a specific rectangle by assigning a Rect object.

page.RegionOfInterest = new Rect(100,100,500,500);

You can use page.GetThresholdedImage() to get the image Tesseract produces after thresholding (binarizing) the input. This is the image on which the actual OCR operation is performed.

For debugging purposes you can save this image to a file and inspect it when investigating OCR issues.

page.GetThresholdedImage().Save(page.ImageName + "imageToBeOcred.png");

You can read all the text by calling page.GetText(), which returns a string containing everything Tesseract could read from the image.

Console.WriteLine(page.GetText());
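Since our goal was to return only the text matching specific patterns, the string returned by page.GetText() can be filtered with a regular expression. The following is a minimal sketch using only the .NET base class library. PatternExtractor and FindMatches are hypothetical names, and the invoice-number pattern in the usage note is an invented example, not one of the patterns from our project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternExtractor
{
    // Returns every substring of pageText that matches the given regex pattern,
    // in the order the matches appear. A null pageText yields an empty list.
    public static List<string> FindMatches(string pageText, string pattern)
    {
        var results = new List<string>();

        foreach (Match match in Regex.Matches(pageText ?? string.Empty, pattern))
        {
            results.Add(match.Value);
        }

        return results;
    }
}
```

For example, PatternExtractor.FindMatches(page.GetText(), @"\bINV-\d{6}\b") would collect strings such as INV-123456 from a page. OCR output often contains misread characters, so patterns may need to be more tolerant than they would be for clean text.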

You can read the individual blocks, paragraphs, text lines, and words recognized by Tesseract using page.GetIterator(), which returns an iterator for walking through the different parts of the page.

var iter = page.GetIterator();

With the iterator object you can move to the next element by calling iter.Next and passing the level of element you want to advance to.

iter.Next(PageIteratorLevel.Block);

Next moves to the start of the next element at the given level. PageIteratorLevel has the following options:

Block, Para, Symbol, TextLine, Word

Once the iterator is positioned, you can read the text at that position by calling its GetText method with the PageIteratorLevel you want to read.

iter.GetText(PageIteratorLevel.Block);

This makes it easy to read a document or image by quickly moving to a specific element type and then reading the text of those elements.
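Putting Next and GetText together, a loop over every text line on a page might look like the sketch below. It assumes the Tesseract NuGet package is referenced and that page is a Page returned by Engine.Process. LineDumper and PrintLines are hypothetical names used only for this example.

```csharp
using System;
using Tesseract;

public static class LineDumper
{
    // Writes each text line recognized on the page to the console.
    public static void PrintLines(Page page)
    {
        using (var iter = page.GetIterator())
        {
            iter.Begin(); // rewind to the first element on the page

            do
            {
                var line = iter.GetText(PageIteratorLevel.TextLine);

                // Skip elements for which no text was recognized.
                if (!string.IsNullOrWhiteSpace(line))
                {
                    Console.WriteLine(line.TrimEnd());
                }
            }
            while (iter.Next(PageIteratorLevel.TextLine)); // false once no lines remain
        }
    }
}
```

Swapping PageIteratorLevel.TextLine for Word or Block changes the granularity without changing the shape of the loop.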

Tesseract is a powerful and quite accurate engine that can be customized to suit your requirements. In this blog post I have shared the basic steps to get you started with it. For more information, please follow the links below:

Tesseract OCR on GitHub: https://github.com/tesseract-ocr

C# Wrapper for Tesseract: https://github.com/charlesw/tesseract