Optical Character Recognition (OCR) – How it works
OCR is a complex technology that converts images containing text into editable text formats. It lets you process scanned books, screenshots, and photos of text, and produce editable documents such as TXT, DOC, or PDF files. The technology is widely used in many areas. The most advanced OCR systems can handle almost any type of image, even complex ones such as scanned magazine pages with pictures and columns, or photos taken with a mobile phone.
How do modern OCR technologies work? The process of converting an image to an editable document is divided into several steps. Each step is a set of related algorithms that performs one part of the OCR job. The general steps are as follows:
- Loading an image as a bitmap from a given source. The source can be a file or a pointer to a memory block. A good OCR system must also understand many image formats: BMP, TIFF (both single-page and multi-page), JPEG, PNG, and so on. It must support PDF files as well, because many documents are stored as images inside PDFs, and the only way to extract text from such files is to perform OCR.
- Detecting the most important image features, such as resolution and inversion. Many OCR algorithms expect a predefined range of font sizes and foreground/background colors, so the image must be rescaled and inverted before processing if necessary.
- Deskewing and denoising. The image can be skewed or noisy, so deskewing and denoising algorithms are applied to improve its quality.
- Binarization. Many OCR algorithms can handle only bi-tonal images, so color or grayscale images must be converted to bi-tonal; this process is called "binarization." This step is very important: incorrect binarization can erase thin strokes or merge neighboring characters, and every later step inherits those errors.
- Line detection and removal. Drawn lines such as underlines and table borders are detected and removed. This step is required to improve page layout analysis, to achieve better recognition quality for underlined text, to detect tables, and so on.
- Page layout analysis (also called "zoning"). The OCR system must detect the positions and types of all important areas in the image.
- Detection of text lines and words. This is not always easy, because font sizes can vary and the spaces between words can be small.
- Analysis of broken and touching characters. Often a character is broken into several parts, or neighboring characters touch each other. Such cases must be detected so that the correct position of every character can be found.
- Recognition of characters. This is the main algorithm of OCR. The image of every character must be converted to the appropriate character code. For uncertain images, the algorithm sometimes produces several candidate codes. For example, recognizing the image of the "I" character can produce the codes for "I", "|", "1", and "l"; the final character code is selected later.
- Dictionary support. This step can improve recognition quality. Some characters, such as "1" and "I" or "C" and "G", can look very similar, and a dictionary can help decide between them.
- Saving results to the selected output format, for example searchable PDF, DOC, RTF, or TXT. It is important to preserve the original page layout: columns, fonts, colors, pictures, background, and so on.
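To make the binarization step from the list more concrete, here is a minimal sketch in Python. It uses Otsu's method, which is one common way to pick a global threshold; the article does not say which method any particular OCR system uses, so treat that choice as an assumption. The function names `otsu_threshold` and `binarize` are hypothetical, and the image is assumed to be a list of rows of 0..255 gray values with dark text on a light background.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the threshold t (0..255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    count_low = 0  # number of pixels with value <= t
    sum_low = 0    # sum of those pixel values

    for t in range(256):
        count_low += histogram[t]
        sum_low += t * histogram[t]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, so there is nothing to separate
        mean_low = sum_low / count_low
        mean_high = (total_sum - sum_low) / count_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = count_low * count_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = t
    return best_threshold


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Convert a grayscale image to bi-tonal: 1 = ink, 0 = background.

    Assumption: text is darker than the background (no inversion needed).
    """
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[1 if value <= threshold else 0 for value in row] for row in image]


if __name__ == "__main__":
    gray = [[250, 245, 30],
            [20, 240, 25]]
    print(binarize(gray))  # [[0, 0, 1], [1, 0, 1]]
```

A single global threshold like this tends to work on evenly lit scans. Photos with shadows or uneven lighting usually need a locally adaptive threshold instead, which is one reason the step is easy to get wrong.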
The list above is not complete. Many other minor algorithms must also be implemented to achieve good recognition on various image types, but in most cases they are not essential, and they vary between OCR systems.
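The interplay between character recognition and dictionary support described in the list can be sketched as follows. This is an illustration under stated assumptions, not how any specific OCR engine does it: the recognizer is assumed to return a non-empty list of candidate characters for every position, ordered best guess first, and the helper name `choose_word` and the tiny word set are hypothetical.

```python
from itertools import product
from typing import List, Set


def choose_word(candidates: List[List[str]], dictionary: Set[str]) -> str:
    """Pick the spelling of one word from per-character candidate lists.

    candidates[i] holds the possible characters for position i, best first.
    The first combination found in the dictionary wins. If nothing matches,
    the recognizer's first choice for every position is returned unchanged.
    """
    for combination in product(*candidates):
        word = "".join(combination)
        if word.lower() in dictionary:  # dictionary holds lowercase words
            return word
    return "".join(options[0] for options in candidates)


if __name__ == "__main__":
    words = {"ink", "cat"}  # hypothetical dictionary
    # The first glyph is ambiguous: it could be "l", "I", or "1".
    print(choose_word([["l", "I", "1"], ["n"], ["k"]], words))  # Ink
    # "G" and "C" look alike; the dictionary settles it.
    print(choose_word([["G", "C"], ["a"], ["t"]], words))       # Cat
    # No dictionary match: fall back to the recognizer's best guesses.
    print(choose_word([["1", "I"], ["0", "O"]], words))         # 10
```

Trying every combination is only practical for short words with few candidates per position. A real system would more likely weigh recognition confidence against dictionary evidence than accept the first match.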
Every OCR step is very important. The whole OCR process will fail if any step cannot handle the given image correctly. Every algorithm must work correctly on the widest possible range of images, which is why there are only a few good universal OCR systems on the market. On the other hand, if some features of the images are known in advance, the task becomes much easier: you can get better recognition quality when only one kind of image must be processed. To achieve the best results in such cases, a good OCR system must let you adjust the most important parameters of every algorithm. Sometimes that’s the only way to improve recognition quality. Unfortunately, even now there are no OCR systems that are as good as humans, and it looks like such systems will not be created in the near future.
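As a small illustration of an adjustable parameter, the sketch below implements a simple form of text line detection using a horizontal projection profile: it counts the ink pixels in every row and groups consecutive non-empty rows into lines. The function name `find_text_lines` and the parameter `min_ink` are hypothetical, the input is assumed to be a deskewed bi-tonal image with 1 = ink, and real systems use considerably more robust methods.

```python
from typing import List, Tuple


def find_text_lines(binary: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each run of rows that contain text.

    A row counts as text when it has at least min_ink ink pixels.
    Raising min_ink makes the detector ignore sparse noise between lines.
    """
    lines = []
    start = None  # first row of the line currently being tracked, if any
    for y, row in enumerate(binary):
        has_text = sum(row) >= min_ink
        if has_text and start is None:
            start = y
        elif not has_text and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:  # the last line touches the bottom edge
        lines.append((start, len(binary) - 1))
    return lines


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0],
        [1, 1, 0, 1, 1],  # first text line
        [1, 0, 1, 1, 0],
        [0, 0, 1, 0, 0],  # a single noise pixel between the lines
        [1, 1, 1, 0, 1],  # second text line
        [0, 0, 0, 0, 0],
    ]
    print(find_text_lines(page))             # [(1, 4)]
    print(find_text_lines(page, min_ink=2))  # [(1, 2), (4, 4)]
```

With the default setting, the noise pixel glues the two lines into one. Knowing that the input is noisy and raising `min_ink` separates them, which is the kind of per-algorithm tuning described above.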