Nicomsoft OCR: Developer's Guide


Blk_SetWordRegEx


Syntax

C++:int Blk_SetWordRegEx(HBLK BlkObj, int LineIndex, int WordIndex, UNICODECHAR* RegEx, int Flags)
C#:int Blk_SetWordRegEx(int BlkObj, int LineIndex, int WordIndex, string RegEx, int Flags)
Visual Basic:Function Blk_SetWordRegEx(ByVal BlkObj As Integer, ByVal LineIndex As Integer, ByVal WordIndex As Integer, ByVal RegEx As String, ByVal Flags As Integer) As Integer
Java:int Blk_SetWordRegEx(HBLK BlkObj, int LineIndex, int WordIndex, String RegEx, int Flags)
Delphi:function Blk_SetWordRegEx(BlkObj:HBLK; LineIndex:integer; WordIndex:integer; RegEx:PWCHAR; Flags:integer):integer


Description

Sets a regular expression for the word or line, or for all words in the block. OCR will use that expression to validate the text and improve the recognition quality. The regular expression is used as a kind of dictionary to improve the accuracy of output text. It helps if the data format is known. For example, a table that contains amounts will be recognized better if you specify a regular expression that defines the format of amounts. The DEELX library is used, which supports a Perl-compatible regular expression syntax. You can find a detailed description of the syntax here: http://www.regexlab.com/en/deelx/syntax.htm
Also see the section "How to create regular expressions" below.


Parameters

BlkObj [IN] – the Block object. You can also specify the Image object if you want to work with the global list of text lines for the entire image.
LineIndex [IN] – the index of the text line. 0 – the first line. Specify -1 to apply the regular expression to all words in the block; you cannot use -1 if ImgObj is used instead of BlkObj.
WordIndex [IN] – the index of the word in the specified text line. 0 – the first word. Specify -1 to apply the regular expression to all words in the line.
RegEx [IN] – the regular expression to apply.
Flags [IN] – the operation flags; see the REGEX_XXXXX constants for possible values.


Return value

Zero on success; otherwise, an error code.


Remarks

1. You can set a regular expression for a word (LineIndex >= 0, WordIndex >= 0), a line (LineIndex >= 0, WordIndex = -1), a row (LineIndex = -1, WordIndex >= 0; that is, the word at the given index in every line), or the entire block (LineIndex = -1, WordIndex = -1). OCR uses the most specific regular expression (regex) available: the regex for the word first; if none is set for the word, the regex for the line; if none is set for the line, the regex for the row; and if none is set for the row, the regex for the block. For example, if you have specified separate regular expressions for the entire block, for line #2, and for the word [line #2, row #3], then OCR will use the regular expression for the word and ignore the ones for the line and for the block when processing the word [#2, #3].
2. Setting a regular expression does not guarantee that the resulting text will comply with it. Regular expressions work as a kind of dictionary: if OCR cannot find a good candidate that matches the regex, it may select text that doesn’t match it.
3. Sometimes NSOCR cannot detect the letter case when it uses a regex to check a word, so it may check the text in the wrong case. Therefore, the IGNORECASE mode is enabled by default. However, since character sets are always case sensitive, it is better to always specify both uppercase and lowercase letters in character sets (for example, [A-Za-z] rather than [a-z]).
4. You can also set a regular expression for the entire image, that is, for all words in all zones: Specify the IMG object as the "BlkObj" parameter, and use "-1" for both the "LineIndex" and "WordIndex" parameters. This global regular expression will not be cleared when a new image is loaded into the IMG object.
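
The lookup order described in remark 1 can be sketched as follows. This is only an illustration of the precedence rule, not NSOCR's implementation: RegexStore, set, and resolve are hypothetical names, and plain std::string stands in for the library's Unicode strings.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical store, NOT part of the NSOCR API: maps (LineIndex, WordIndex)
// to a regex, with -1 acting as the "all" wildcard described in remark 1.
class RegexStore {
public:
    void set(int lineIndex, int wordIndex, const std::string& regex) {
        rules_[std::make_pair(lineIndex, wordIndex)] = regex;
    }

    // Returns the most specific regex for a word, or nullptr if none applies.
    const std::string* resolve(int lineIndex, int wordIndex) const {
        const std::pair<int, int> candidates[] = {
            {lineIndex, wordIndex},  // 1. the word itself
            {lineIndex, -1},         // 2. the whole line
            {-1, wordIndex},         // 3. the row (same word index in every line)
            {-1, -1},                // 4. the entire block
        };
        for (const auto& key : candidates) {
            auto it = rules_.find(key);
            if (it != rules_.end()) {
                return &it->second;
            }
        }
        return nullptr;
    }

private:
    std::map<std::pair<int, int>, std::string> rules_;
};
```

With rules set for the block (-1, -1), for line 2 (2, -1), and for the word (2, 3), resolve(2, 3) returns the word rule, resolve(2, 0) falls back to the line rule, and resolve(5, 1) falls back to the block rule.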


How to create regular expressions

The following table explains the most-used features of regular expressions:
Feature name                         Regex syntax  Examples
Character from the group             [ ]           [sw]ay – allows "say" and "way". Use "\[" and "\]" to match "[" and "]".
Character not from the group         [^]           [^b-c]ay – allows "aay", "day", "eay", and so on (for example, even "4ay" is allowed), but forbids "bay" and "cay". Use "\^" to match "^" itself.
Uppercase Latin letter               [A-Z]         -
Lowercase Latin letter               [a-z]         -
Uppercase or lowercase Latin letter  [A-Za-z]      -
Any digit                            [0-9]         -
Any character                        .             .ay – allows "aay", "bay", "cay", "8ay", and so on. Use "\." to match "." itself.
OR                                   |             b(u|o)y – allows "buy" and "boy". Use "\|" to match "|" itself.
Grouping operation                   ( )           (10)|(11) – allows "10" and "11". Use "\(" and "\)" to match "(" and ")".
0 or more matches                    *             100* – allows the numbers 10, 100, 1000, and so on. 10*5 allows the numbers 15, 105, 1005, and so on. Use "\*" to match "*" itself.
0 or 1 matches                       ?             100? – allows the numbers 10 and 100. 10?5 allows the numbers 15 and 105. Use "\?" to match "?" itself.
1 or more matches                    +             100+ – allows the numbers 100, 1000, 10000, and so on. 10+5 allows the numbers 105, 1005, and so on. Use "\+" to match "+" itself.
Reserved                             $             Use "\$" to match "$" itself.
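
To try the quantifier and character-set features outside NSOCR, a small helper can test whether a whole string matches a pattern. The sketch below uses std::regex from the C++ standard library as a stand-in for DEELX; that substitution is an assumption, and the two engines may differ on less common syntax. matchesWhole is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Returns true only if the ENTIRE text matches the pattern.
// std::regex (default ECMAScript grammar) is used here in place of DEELX.
bool matchesWhole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}
```

For example, matchesWhole("100*", "10") is true while matchesWhole("100+", "10") is false, which is the difference between "0 or more" and "1 or more" matches of the last character.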


The following table contains some examples:
Regular expressions                Examples of allowed strings         Explanations
((|0)[1-9])|([12][0-9])|(30)|(31)  "1", "01", "10", "25", "31"         Any day of the month.
((|0)[1-9])|(10)|(11)|(12)         "1", "01", "10", "12"               Any month number.
((19)|(20))[0-9][0-9]              "1900", "1953", "2001", "2099"      Any year between 1900 and 2099.
(|-)[0-9]+(\.|,)?([0-9][0-9])?\$?  "5", "-55.47$", "348,00", "100.50"  Any amount.
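
The day, month, year, and amount patterns can be checked against sample strings before passing them to Blk_SetWordRegEx. The sketch below again uses std::regex as a stand-in for DEELX (an assumption) and assumes that the whole word must match the pattern; accepts and the k... constants are hypothetical names. The day pattern uses the character set [12] for the tens digit, and the year pattern wraps the two century alternatives in one group so that the alternation is resolved before the two trailing digits.

```cpp
#include <regex>
#include <string>

// Patterns for common data formats, written as raw string literals so that
// backslashes do not need to be doubled.
const char* const kDay    = R"re(((|0)[1-9])|([12][0-9])|(30)|(31))re";
const char* const kMonth  = R"re(((|0)[1-9])|(10)|(11)|(12))re";
const char* const kYear   = R"re(((19)|(20))[0-9][0-9])re";
const char* const kAmount = R"re((|-)[0-9]+(\.|,)?([0-9][0-9])?\$?)re";

// True only if the entire text matches the pattern (no partial matches).
// std::regex (ECMAScript grammar) is used here in place of DEELX.
bool accepts(const char* pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}
```

Without the outer group, a year pattern written as (19)|(20)[0-9][0-9] would split into the alternatives "19" and "20[0-9][0-9]", so accepts would report "1953" as not matching and "19" as matching.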


Example

See the sample code for the Blk_SetRect function.