Optical character recognition

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, or a scene photo. It is widely used as a form of data entry from printed records and as a method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, and used in machine processes such as machine translation, text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

One of the primary goals of OCR is to convert the text contained in an image into its corresponding Unicode or ASCII representation. The two images shown below demonstrate the concept. In the first example, the result of an online OCR converter[1] is shown. In the second example, a PNG image containing the text "Origin" is used as input to an OCR engine. The output of the OCR engine in this case is a string of ASCII-encoded characters.

[Figure: Example output of an online OCR converter]
[Figure: Conceptual OCR output]
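To make the second example concrete, the short sketch below (illustrative only, not part of any OCR engine) prints the ASCII code of each character in the recognized string "Origin":

#include <cstdio>

int main()
{
    const char *ocrOutput = "Origin"; // the string an OCR engine would return
    // Machine-encoded text is just a sequence of character codes.
    for (const char *p = ocrOutput; *p != '\0'; ++p)
        std::printf("'%c' -> %d\n", *p, *p);
    return 0;
}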

Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of recognition accuracy for most fonts are now common. Some systems are capable of reproducing formatted output that closely approximates the original page, including images, columns, and other non-textual components.


History

Early optical character recognition may be traced to technologies involving telegraphy and the creation of reading devices for the blind.[2] In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code.[citation needed] Concurrently, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that, when moved across a printed page, produced tones that corresponded to specific letters or characters.[3]

In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called a "Statistical Machine" for searching microfilm archives using an optical code recognition system. In 1931, he was granted U.S. Patent 1,838,389 for the invention. The patent was acquired by IBM.

Blind and visually impaired users

In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc., whose omni-font OCR could recognize text printed in virtually any font, and used the technology in a reading machine for the blind announced in 1976. The company was later sold to Xerox, which eventually spun it off as Scansoft, which merged with Nuance Communications.[citation needed] The research group headed by A. G. Ramakrishnan at the Medical Intelligence and Language Engineering (MILE) Laboratory, Indian Institute of Science, has developed the PrintToBraille tool, an open-source GUI frontend[5] that can be used with any OCR engine to convert scanned images of printed books to Braille books.

In the 2000s, OCR was made available online as a service (WebOCR), in a cloud computing environment, and in mobile applications like real-time translation of foreign-language signs on a smartphone.

Various commercial and open source OCR systems are available for most common writing systems, including Latin, Cyrillic, Arabic, Hebrew, Indic, Bengali (Bangla), Devanagari, Tamil, Chinese, Japanese, and Korean characters.

Applications

OCR engines have been developed into many kinds of domain-specific OCR applications, such as receipt OCR, invoice OCR, check OCR, and legal billing document OCR.

They can be used for:

  • Data entry for business documents, e.g. check, passport, invoice, bank statement and receipt
  • Automatic number plate recognition
  • Automatic extraction of key information from insurance documents
  • Extracting business card information into a contact list
  • Making textual versions of printed documents more quickly, e.g. book scanning for Project Gutenberg
  • Making electronic images of printed documents searchable, e.g. Google Books
  • Converting handwriting in real time to control a computer (pen computing)
  • Defeating CAPTCHA anti-bot systems, though these are specifically designed to prevent OCR[6][7][8]
  • Assistive technology for blind and visually impaired users

Future applications of OCR

Although OCR-based applications have been on the market for a long time, the possibilities for future applications of OCR are still enormous. Recently, researchers[9] have proposed a way to improve distance learning by using OCR to extract text shown on overhead or computer projections so that it can be transmitted along with the compressed video, allowing more readable text to be displayed on the client side. Due to video compression, it is usually difficult to read the content of projections within the video. The proposed method aims to solve this issue by recognizing and transmitting the text shown on projections separately. The basic process of the proposed application is outlined below.

In the first step, edge detection is performed on the projection of interest to obtain an outline of both text and graphics on the projection. The goal of edge detection is to facilitate the subsequent foreground-background segmentation, in which each pixel is assigned either an intensity of black or an intensity of white. Once segmentation is complete, a modified version of OCR is used to mark each block as either text or graphics. In the final stage, an off-the-shelf OCR program recognizes the characters in each block that has been marked as text.
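As a rough illustration, the first two stages might be implemented with OpenCV as in the sketch below; the file name and Canny thresholds are assumptions for the example, not values from the cited paper.

#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>

int main()
{
    // Load a captured projection frame as a grayscale image (assumed name).
    cv::Mat frame = cv::imread("projection_frame.png", cv::IMREAD_GRAYSCALE);
    if (frame.empty())
        return 1;

    // Stage 1: edge detection outlines both text and graphics.
    cv::Mat edges;
    cv::Canny(frame, edges, 50, 150);

    // Stage 2: foreground-background segmentation; every pixel becomes
    // either black or white (Otsu's method chooses the threshold).
    cv::Mat binary;
    cv::threshold(frame, binary, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);

    cv::imwrite("edges.png", edges);
    cv::imwrite("binary.png", binary);
    return 0;
}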

In the proposed application, when a web user clicks on a block that has been marked as text, he or she would see an expanded view containing clear and sharp text corresponding to the original text shown on the projection. Since the text component would be encoded in either ASCII or Unicode independently of the video frames, it would not suffer from the loss of information due to video compression and would therefore remain sharp and clear when displayed to the web user.

Types

OCR is generally an "offline" process, which analyzes a static document. Handwriting movement analysis can be used as input to handwriting recognition.[10] Instead of merely using the shapes of glyphs and words, this technique is able to capture motions, such as the order in which segments are drawn, the direction, and the pattern of putting the pen down and lifting it. This additional information can make the end-to-end process more accurate. This technology is also known as "on-line character recognition", "dynamic character recognition", "real-time character recognition", and "intelligent character recognition".
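For illustration only, the input to an on-line recognizer can be thought of as a time-ordered stream of pen samples rather than a static bitmap; the structure below is a hypothetical sketch, not the format of any particular system.

#include <cstdio>
#include <vector>

struct PenSample {
    double t;      // time in seconds since the trace began
    double x, y;   // pen position on the tablet
    bool penDown;  // false marks the gap between strokes
};

int main()
{
    // A short two-stroke gesture; all values are made up for the example.
    std::vector<PenSample> trace = {
        {0.00, 0.0, 0.0, true},  {0.02, 0.1, 0.5, true},
        {0.04, 0.2, 1.0, true},  {0.06, 0.2, 1.0, false}, // pen lifted
        {0.10, 0.0, 0.5, true},  {0.12, 0.4, 0.5, true},  // crossbar stroke
    };

    // Stroke order and pen-up/pen-down transitions are exactly the kind of
    // information a static glyph image cannot provide.
    int strokes = 0;
    bool wasDown = false;
    for (const PenSample& s : trace) {
        if (s.penDown && !wasDown) ++strokes; // a new pen-down starts a stroke
        wasDown = s.penDown;
    }
    std::printf("Trace contains %d strokes\n", strokes);
    return 0;
}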

Techniques

Pre-processing

OCR software often "pre-processes" images to improve the chances of successful recognition. Techniques include de-skewing, despeckling, binarization, line removal, layout analysis ("zoning"), detection of lines and words, script recognition, character isolation ("segmentation"), and normalization of aspect ratio and scale.[11]

Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed because whitespace between letters can sometimes be greater than that between words, and vertical lines can intersect more than one character.[19]
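The fixed-pitch case can be illustrated with a toy sketch that scores every grid offset by how much ink its vertical cut lines would cross; the image contents and the pitch value are fabricated for the example.

#include <cstdio>
#include <vector>

int main()
{
    const int width = 120, height = 20, pitch = 12; // assumed glyph pitch
    // binary[y][x] == 1 marks an ink (black) pixel.
    std::vector<std::vector<int>> binary(height, std::vector<int>(width, 0));

    // Paint toy 8-pixel-wide glyph blobs at the assumed pitch, leaving
    // 4-pixel white gaps in which the grid's cut lines should fall.
    for (int y = 6; y < 14; ++y)
        for (int x = 0; x < width; ++x)
            if (x % pitch < 8)
                binary[y][x] = 1;

    // Ink count per column.
    std::vector<int> colInk(width, 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            colInk[x] += binary[y][x];

    // Try every grid offset; keep the one whose vertical cut lines
    // least often intersect black areas.
    int bestOffset = 0, bestCost = -1;
    for (int offset = 0; offset < pitch; ++offset) {
        int cost = 0;
        for (int x = offset; x < width; x += pitch)
            cost += colInk[x];
        if (bestCost < 0 || cost < bestCost) {
            bestCost = cost;
            bestOffset = offset;
        }
    }
    std::printf("Cut between characters at columns %d, %d, %d, ...\n",
                bestOffset, bestOffset + pitch, bestOffset + 2 * pitch);
    return 0;
}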

Character recognition

There are two basic types of core OCR algorithm, which may produce a ranked list of candidate characters.[20]

Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching", "pattern recognition", or "image correlation".[21] This relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. This technique works best with typewritten text and does not work well when new fonts are encountered. This is the technique that early physical photocell-based OCR implemented, rather directly.
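The idea can be illustrated with a toy sketch in which 5x5 bitmaps stand in for scanned glyphs and stored templates; all bitmap values are fabricated for the example.

#include <array>
#include <cstdio>

using Glyph = std::array<std::array<int, 5>, 5>;

// Count the pixels at which the two bitmaps agree.
int matchScore(const Glyph& a, const Glyph& b)
{
    int score = 0;
    for (int y = 0; y < 5; ++y)
        for (int x = 0; x < 5; ++x)
            if (a[y][x] == b[y][x]) ++score;
    return score;
}

int main()
{
    // Stored templates for 'I' and 'O', and an isolated input glyph.
    Glyph templateI = {{{0,0,1,0,0}, {0,0,1,0,0}, {0,0,1,0,0}, {0,0,1,0,0}, {0,0,1,0,0}}};
    Glyph templateO = {{{0,1,1,1,0}, {1,0,0,0,1}, {1,0,0,0,1}, {1,0,0,0,1}, {0,1,1,1,0}}};
    Glyph input     = {{{0,0,1,0,0}, {0,0,1,0,0}, {0,0,1,0,0}, {0,0,1,0,0}, {0,0,1,0,0}}};

    // Rank candidates by agreement; the highest score wins.
    std::printf("I: %d/25, O: %d/25\n",
                matchScore(input, templateI), matchScore(input, templateO));
    return 0;
}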

Feature extraction decomposes glyphs into "features" like lines, closed loops, line direction, and line intersections. These are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in "intelligent" handwriting recognition and in most modern OCR software.

Nearest neighbour classifiers such as the k-nearest neighbors algorithm are used to compare image features with stored glyph features and choose the nearest match.[22]
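A minimal sketch of the nearest-neighbour step, with k = 1 and two invented features (loop count and line-end count); the stored feature values are assumptions chosen for the example.

#include <cmath>
#include <cstdio>
#include <vector>

struct Sample { char label; double loops, ends; };

int main()
{
    // Stored glyph features, as might be extracted from training data.
    std::vector<Sample> stored = {
        {'O', 1.0, 0.0},   // one closed loop, no line ends
        {'L', 0.0, 2.0},   // no loops, two line ends
        {'B', 2.0, 0.0},   // two closed loops
    };

    // Features extracted from the unknown input glyph.
    double loops = 1.0, ends = 0.2;

    // Choose the stored glyph whose features are nearest (Euclidean distance).
    char best = '?';
    double bestDist = 1e9;
    for (const Sample& s : stored) {
        double d = std::hypot(s.loops - loops, s.ends - ends);
        if (d < bestDist) { bestDist = d; best = s.label; }
    }
    std::printf("Nearest match: %c\n", best);
    return 0;
}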

Software such as Cuneiform and Tesseract use a two-pass approach to character recognition. The second pass is known as "adaptive recognition" and uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters on the second pass. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded).[19]

Post-processing

OCR accuracy can be increased if the output is constrained by a lexicon – a list of words that are allowed to occur in a document.[11] This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. This technique can be problematic if the document contains words not in the lexicon, like proper nouns. Tesseract uses its dictionary to influence the character segmentation step, for improved accuracy.[19]
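A hedged sketch of the idea: each raw OCR token is snapped to the lexicon word at the smallest Levenshtein (edit) distance. The three-word lexicon and the misread token are assumptions for the example.

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein distance.
int editDistance(const std::string& a, const std::string& b)
{
    const int n = (int)a.size(), m = (int)b.size();
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));
    for (int i = 0; i <= n; ++i) d[i][0] = i;
    for (int j = 0; j <= m; ++j) d[0][j] = j;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j)
            d[i][j] = std::min({ d[i - 1][j] + 1,     // deletion
                                 d[i][j - 1] + 1,     // insertion
                                 d[i - 1][j - 1] + (a[i - 1] != b[j - 1]) });
    return d[n][m];
}

int main()
{
    const std::vector<std::string> lexicon = { "recognition", "region", "cognition" };
    const std::string raw = "recognltion"; // a typical OCR confusion: 'i' read as 'l'

    // Constrain the output to the nearest allowed word.
    std::string best;
    int bestDist = 1 << 30;
    for (const std::string& word : lexicon) {
        const int dist = editDistance(raw, word);
        if (dist < bestDist) { bestDist = dist; best = word; }
    }
    std::printf("%s -> %s (distance %d)\n", raw.c_str(), best.c_str(), bestDist);
    return 0;
}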

The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated PDF that includes both the original image of the page and a searchable textual representation.

"Near-neighbor analysis" can make use of co-occurrence frequencies to correct errors, by noting that certain words are often seen together.[21] For example, "Washington, D.C." is generally far more common in English than "Washington DOC".

Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun, for example, allowing greater accuracy.

Application-specific optimizations

In recent years,[when?] the major OCR technology providers began to tweak OCR systems to deal more efficiently with specific types of input. Beyond an application-specific lexicon, better performance may be obtained by taking into account business rules, standard expressions, or rich information contained in color images. This strategy is called "application-oriented OCR" or "customized OCR", and has been applied to OCR of license plates, invoices, screenshots, ID cards, driver's licenses, and automobile manufacturing.

Font-independent OCR system

Traditional OCR engines tend to perform badly when the font of a probe document is not part of the training set. To address this issue, researchers[9] proposed a method that learns the character models directly from the document it is trying to recognize. Unlike the majority of traditional OCR methods, which rely on the appearance of characters to perform recognition, the font-independent method relies on the repetition of similar symbols, along with the statistics of a language, to interpret a document. For example, to identify the character 'a', the method first finds the N shortest words. It then matches the N selected words against one another to calculate the frequency at which each word appears in the document. The symbol with a frequency close to 7% is taken to be the character 'a', because, according to the English language model, the letter 'a' accounts for about 7% of all letters.
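A toy sketch of the frequency-matching step follows; the symbol IDs and their counts are fabricated for illustration, whereas a real system would obtain them by clustering similar shapes found in the scanned document.

#include <cmath>
#include <cstdio>
#include <map>
#include <vector>

int main()
{
    // IDs assigned by shape clustering; the true letters are unknown.
    std::vector<int> symbols = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3,
                                5, 1, 4, 1, 9, 2, 6, 5, 3, 5};

    // Count how often each distinct symbol occurs in the document.
    std::map<int, int> counts;
    for (int id : symbols) counts[id]++;

    // Label as 'a' the symbol whose relative frequency best matches the
    // English language model frequency of 'a' (about 7%).
    const double target = 0.07;
    int bestId = -1;
    double bestDiff = 1.0;
    for (const auto& kv : counts) {
        double freq = double(kv.second) / symbols.size();
        double diff = std::fabs(freq - target);
        if (diff < bestDiff) { bestDiff = diff; bestId = kv.first; }
    }
    std::printf("Symbol %d is the best candidate for 'a'\n", bestId);
    return 0;
}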

Implementation

Implementing OCR is straightforward thanks to existing open-source OCR engines, among which Tesseract is one of the most accurate available. Below is a simple example usage of Tesseract.[23]

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English, without specifying tessdata path
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Open input image with leptonica library
    Pix *image = pixRead("/usr/src/tesseract-3.02/phototest.tif");
    api->SetImage(image);
    // Get OCR result
    outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    // Destroy used objects and release memory
    api->End();
    delete api;
    delete [] outText;
    pixDestroy(&image);

    return 0;
}


Source code and detailed APIs information of Tesseract can be found in Tesseract's repository on GitHub.

To facilitate the integration of text recognition with other functionality provided in OpenCV, OpenCV 3.0 includes a module called text that is specifically intended to perform text detection and recognition. Within the text module, a class named cv::text::OCRTesseract provides an interface to the Tesseract APIs in C++. Tesseract-ocr needs to be correctly installed for the code to compile successfully.

To create an instance of the OCRTesseract class, the following function needs to be called:

static Ptr< OCRTesseract > create (const char *datapath=NULL, const char *language=NULL, const char *char_whitelist=NULL, int oem=3, int psmode=3)

To recognize text, the following function can be used:

virtual void run (Mat &image, std::string &output_text, std::vector< Rect > *component_rects=NULL, std::vector< std::string > *component_texts=NULL, std::vector< float > *component_confidences=NULL, int component_level=0)
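Putting the two calls together, a minimal usage sketch might look like the following; it assumes OpenCV was built with the opencv_contrib text module, and the image file name is an illustrative assumption.

#include <opencv2/imgcodecs.hpp>
#include <opencv2/text.hpp>
#include <iostream>

int main()
{
    // Load the page to be recognized.
    cv::Mat image = cv::imread("document.png");
    if (image.empty())
        return 1;

    // Defaults: system tessdata path, English model, no character whitelist.
    cv::Ptr<cv::text::OCRTesseract> ocr = cv::text::OCRTesseract::create();

    // The optional component outputs are left at their default NULL values.
    std::string text;
    ocr->run(image, text);
    std::cout << text << std::endl;
    return 0;
}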

Workarounds

There are several techniques for solving the problem of character recognition by means other than improved OCR algorithms.

Forcing better input

Special fonts like OCR-A, OCR-B, or MICR fonts, with precisely specified sizing, spacing, and distinctive character shapes, allow a higher accuracy rate during transcription. These were often used in early matrix-matching systems.

"Comb fields" are pre-printed boxes that encourage humans to write more legibly – one glyph per box.[21] These are often printed in a "dropout color" which can be easily removed by the OCR system.[21]

Palm OS used a special set of glyphs, known as Graffiti, which are similar to printed English characters but simplified or modified for easier recognition on the platform's computationally limited hardware. Users would need to learn how to write these special glyphs.

Zone-based OCR restricts the image to a specific part of a document. This is often referred to as "Template OCR".
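As an illustrative sketch, zone-based OCR can be approximated with the Tesseract API from the Implementation section by restricting recognition to one rectangle; the file name and zone coordinates below are assumptions.

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <cstdio>

int main()
{
    tesseract::TessBaseAPI api;
    if (api.Init(NULL, "eng"))
        return 1;

    Pix *image = pixRead("form.tif");
    api.SetImage(image);
    // Only the zone at (x, y) = (100, 200) of size 400x50 is recognized,
    // e.g. a pre-printed field whose position is fixed by the template.
    api.SetRectangle(100, 200, 400, 50);

    char *zoneText = api.GetUTF8Text();
    std::printf("Zone text: %s\n", zoneText);

    delete [] zoneText;
    api.End();
    pixDestroy(&image);
    return 0;
}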

Crowdsourcing

Crowdsourcing humans to perform the character recognition can process images as quickly as computer-driven OCR, but with higher accuracy for recognizing images than is obtained with computers. Practical systems include Amazon Mechanical Turk and reCAPTCHA.

Accuracy

Commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) had the mission to foster the improvement of automated technologies for understanding machine-printed documents, and it conducted the most authoritative of the annual tests of OCR accuracy from 1992 to 1996.[24]

Recognition of Latin-script, typewritten text is still not 100% accurate even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 81% to 99%;[25] total accuracy can be achieved by human review or Data Dictionary Authentication. Other areas—including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those East Asian language characters which have many strokes for a single character)—are still the subject of active research. The MNIST database is commonly used for testing systems' ability to recognize handwritten digits.

Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, if word context (basically a lexicon of words) is not used to correct software finding non-existent words, a character error rate of 1% (99% accuracy) may result in an error rate of 5% (95% accuracy) or worse if the measurement is based on whether each whole word was recognized with no incorrect letters.[26]
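To make the arithmetic concrete: under the simplifying assumptions that character errors are independent and that words average five letters, 99% character accuracy implies a whole-word accuracy of only about 0.99^5 ≈ 0.951, i.e. roughly the 5% word error rate mentioned above.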

Web based OCR systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years[when?] (see Tablet PC history). Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved by pen computing software, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.[citation needed]

Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the Amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. The shapes of individual cursive characters themselves simply do not contain enough information to accurately (greater than 98%) recognize all handwritten cursive script.[citation needed]

Unicode

Characters to support OCR were added to the Unicode Standard in June 1993, with the release of version 1.1.

Some of these characters are mapped from fonts specific to MICR or OCR-A.

[Table: Optical Character Recognition block, official Unicode Consortium code chart (PDF), code points U+2440–U+245F. As of Unicode version 15.1; grey areas indicate non-assigned code points.]


References

  1. ^ NewOCR. https://www.newocr.com/. Retrieved 2015-12-02.
  2. ^ .
  3. .
  4. ^ "The History of OCR". Data Processing Magazine. 12: 46. 1970.
  5. ^ PrintToBraille Tool. "ocr-gui-frontend". MILE Lab, Dept of EE, IISc. Archived from the original on December 25, 2014. Retrieved 7 December 2014.
  6. ^ "How To Crack Captchas". andrewt.net. 2006-06-28. Retrieved 2013-06-16.
  7. ^ "Breaking a Visual CAPTCHA". Cs.sfu.ca. 2002-12-10. Retrieved 2013-06-16.
  8. ^ John Resig (2009-01-23). "John Resig – OCR and Neural Nets in JavaScript". Ejohn.org. Retrieved 2013-06-16.
  9. ^ a b Wallick, Lobo, Shah. "A Computer Vision Framework for Analyzing Projections from Video of Lectures" (PDF). Retrieved 2015-12-02.
  10. .
  11. ^ a b "Optical Character Recognition (OCR) – How it works". Nicomsoft.com. Retrieved 2013-06-16.
  12. ^ a b "How OCR Software Works". OCRWizard. Retrieved 2013-06-16.
  13. . Retrieved 2 May 2015.
  14. . Retrieved 2 May 2015.
  15. . Retrieved 2 May 2015.
  16. . Retrieved 2 May 2015.
  17. .
  18. ^ "Basic OCR in OpenCV | Damiles". Blog.damiles.com. Retrieved 2013-06-16.
  19. ^ a b c Ray Smith (2007). "An Overview of the Tesseract OCR Engine" (PDF). Retrieved 2013-05-23.
  20. ^ "OCR Introduction". Dataid.com. Retrieved 2013-06-16.
  21. ^ a b c d "How does OCR document scanning work?". Explain that Stuff. 2012-01-30. Retrieved 2013-06-16.
  22. ^ "The basic patter recognition and classification with openCV | Damiles". Blog.damiles.com. Retrieved 2013-06-16.
  23. ^ "TesseractOCR API Examples;". Retrieved 2015-12-02.
  24. ^ Code and Data to evaluate OCR accuracy, originally from UNLV/ISRI
  25. ^ Holley, Rose (April 2009). "How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs". D-Lib Magazine. Retrieved 5 January 2011.
  26. ^ Suen, C.Y.; Plamondon, R.; Tappert, A.; Thomassen, A.; Ward, J.R.; Yamamoto, K. (1987-05-29). Future Challenges in Handwriting and Computer Applications. 3rd International Symposium on Handwriting and Computer Applications, Montreal, May 29, 1987. Retrieved 2008-10-03.
