Document processing

Source: Wikipedia, the free encyclopedia.

Document processing is a field of research and a set of

archives and historical documents.

Background

Document processing was initially, and still is to some extent, a kind of production-line work dealing with the treatment of documents, such as letters and parcels, with the aim of sorting them or extracting data from them, sometimes at massive scale. This work could be performed in-house or through business process outsourcing.[2][3] Document processing can also involve externalized manual labor, such as Mechanical Turk.

As an example of manual document processing, as recently as 2007,[4] document processing for "millions of visa and citizenship applications" involved the use of "approximately 1,000 contract workers" working to "manage mail room and data entry."

While document processing involved data entry via keyboard well before the use of a computer mouse or a scanner, a 1990 article in The New York Times regarding what it called the "paperless office" stated that "document processing begins with the scanner".[5] In this context, a former Xerox vice-president, Paul Strassman, expressed a critical opinion, saying that computers add to rather than reduce the volume of paper in an office.[5] It was said that the engineering and maintenance documents for an airplane weigh "more than the airplane itself"[citation needed].

Automatic document processing

As the state of the art advanced, document processing transitioned to handling "document components ... as database entities."[6]
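One way to picture document components handled as database entities is a table in which each row is one component of a document. The sketch below uses Python's built-in sqlite3 module. The table layout, column names and sample rows are hypothetical illustrations and are not taken from the cited article.

```python
import sqlite3


def build_component_store():
    """Create an in-memory database that stores document components as rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE component (
            id INTEGER PRIMARY KEY,
            document_id TEXT NOT NULL,  -- hypothetical document identifier
            kind TEXT NOT NULL,         -- e.g. 'title', 'paragraph', 'table'
            position INTEGER NOT NULL,  -- reading order within the document
            content TEXT NOT NULL
        )
        """
    )
    # Made-up sample rows, for illustration only.
    rows = [
        ("doc-1", "title", 0, "Maintenance manual"),
        ("doc-1", "paragraph", 1, "Inspect the filter before each use."),
        ("doc-1", "table", 2, "part;interval\nfilter;100h"),
        ("doc-2", "title", 0, "Application form"),
    ]
    conn.executemany(
        "INSERT INTO component (document_id, kind, position, content) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def components_of(conn, document_id):
    """Return (kind, content) pairs of one document in reading order."""
    cursor = conn.execute(
        "SELECT kind, content FROM component "
        "WHERE document_id = ? ORDER BY position",
        (document_id,),
    )
    return cursor.fetchall()


def count_by_kind(conn, kind):
    """Count components of a given kind across all stored documents."""
    cursor = conn.execute(
        "SELECT COUNT(*) FROM component WHERE kind = ?", (kind,)
    )
    return cursor.fetchone()[0]
```

Once components are rows, a question such as "how many titles are stored?" becomes a query (count_by_kind(conn, "title")) rather than a pass over every file.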

A technology called automatic document processing or sometimes intelligent document processing (IDP) emerged as a specific form of

Intelligent Character Recognition (ICR) to extract data from several types of documents.[7][8]

Applications

Automatic document processing applies to a whole range of documents, whether structured or not. For instance, in the world of business and finance, technologies may be used to process paper-based invoices, forms, purchase orders, contracts, and currency bills.[9] Financial institutions use intelligent document processing to process high volumes of forms such as regulatory forms or loan documents. IDP uses AI to extract and classify data from documents, replacing manual data entry.[10]
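The classify-then-extract idea can be illustrated with a deliberately simple, rule-based stand-in. The sketch below does not use AI: it relies on keyword matching and regular expressions, and the document types, field names and invoice layout are all hypothetical. A production system would typically replace both functions with trained models.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns for a made-up invoice layout.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\bDate\s*:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"\bTotal\s*:\s*([0-9]+(?:\.[0-9]{2})?)", re.IGNORECASE),
}


def classify(text: str) -> str:
    """Assign a coarse document type using keyword rules (stand-in for a model)."""
    lowered = text.lower()
    if "purchase order" in lowered:
        return "purchase_order"
    if "invoice" in lowered:
        return "invoice"
    return "unknown"


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match of every known field, or None when it is absent."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


# Made-up sample text, as it might come out of a transcription step.
SAMPLE = "ACME Corp\nInvoice No: INV-2021-004\nDate: 2021-04-21\nTotal: 1250.00\n"
```

Calling classify(SAMPLE) and then extract_fields(SAMPLE) yields a document type and a dictionary of field values, which is the structured record that would otherwise be keyed in by hand.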

In medicine, document processing methods have been developed to facilitate patient follow-up and streamline administrative procedures, in particular by digitizing medical or laboratory analysis reports. The goal is also to standardize medical databases.[11] Algorithms are also directly used to assist physicians in medical diagnosis, e.g. by analyzing magnetic resonance images,[12][13] or microscopic images.[14]

Document processing is also widely used in the humanities and digital humanities, in order to extract historical big data from archives or heritage collections. Specific approaches were developed for various sources, including textual documents, such as newspaper archives,[15] but also images,[16] or maps.[17][18]

Technologies

From the 1980s onward, traditional computer vision algorithms were widely used to solve document processing problems.[19][20] In the 2010s, these were gradually replaced by neural network technologies.[21] However, traditional computer vision technologies are still used in some sectors, sometimes in conjunction with neural networks.
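As an illustration of what a traditional computer vision step can look like, the sketch below binarizes a grayscale page with Otsu's thresholding method, which separates dark ink from light paper by maximizing the between-class variance of the gray-level histogram. The choice of Otsu's method is an assumption made for this example; the cited sources do not single it out. The code is pure Python, expects 8-bit gray levels (0 to 255), and uses a tiny made-up image.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    weight_dark = 0  # number of pixels with value <= t
    sum_dark = 0     # sum of gray levels of those pixels

    for t in range(256):
        weight_dark += histogram[t]
        sum_dark += t * histogram[t]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = t
    return best_threshold


def binarize(image: List[List[int]], threshold: int) -> List[List[int]]:
    """Map each pixel to 1 (ink) if it is at or below the threshold, else 0."""
    return [[1 if value <= threshold else 0 for value in row] for row in image]


# Made-up 3x3 page: a dark vertical stroke on light paper.
PAGE = [
    [210, 25, 220],
    [200, 20, 210],
    [220, 30, 200],
]
```

A threshold computed with otsu_threshold([v for row in PAGE for v in row]) can be passed to binarize to obtain an ink mask, which later steps such as line segmentation or character recognition can consume.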

Many technologies support the development of document processing, in particular

semantic segmentation algorithms.

These technologies often form the core of document processing. However, other algorithms may intervene before or after these processes. Indeed, document

image classification technologies.

At the other end of the chain are various image completion, extrapolation or data cleanup algorithms. For textual documents, the interpretation can use natural language processing (NLP) technologies.
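A data cleanup step at the end of the chain can be as simple as normalizing transcribed fields before they are stored or passed to NLP tools. The sketch below is a minimal example under stated assumptions: the table of look-alike characters is hypothetical, and the rule that rejoins words split across line breaks is naive, since it would also remove hyphens that legitimately end a line.

```python
import re

# Hypothetical confusion table: letters a recognizer might emit instead of digits.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def clean_numeric_field(raw: str) -> str:
    """Normalize a field that is expected to contain only a number.

    Whitespace is dropped and look-alike letters are mapped to digits.
    Apply this only to fields known to be numeric.
    """
    compact = "".join(raw.split())
    return compact.translate(DIGIT_LOOKALIKES)


def clean_text_field(raw: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    joined = re.sub(r"-\s*\n\s*", "", raw)  # naive de-hyphenation
    return re.sub(r"\s+", " ", joined).strip()
```

For example, clean_numeric_field("1 2O5.l0") gives "1205.10", and clean_text_field("docu-\nment   processing\n") gives "document processing".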

References

  1. .
  2. .
  3. .
  4. Julia Preston (December 2, 2007). "Immigration Contractor Trims Wages". The New York Times.
  5. Lawrence M. Fisher (July 7, 1990). "Paper, Once Written Off, Keeps a Place in the Office". The New York Times.
  6. Al Young; Dayle Woolstein; Jay Johnson (February 1996). "Unknown Title". Object Magazine. p. 51.
  7. "Intelligent Document processing" (PDF). Department of Computer Science – University of Bari. 2005-04-07. Retrieved 2018-09-08.
  8. S2CID 17302169.
  9. US patent US7873576B2, John E. Jones; William J. Jones & Frank M. Csultis, "Financial document processing system", published 2011-01-18, issued 2011-01-18.
  10. Bridgwater, Adrian. "Appian Adds Google Cloud Intelligence To Low-Code Automation Mix". Forbes. Retrieved 2021-04-21.
  11. . Retrieved 31 January 2021.
  12. . Retrieved 31 January 2021.
  13. .
  14. .
  15. Ehrmann, Maud; Romanello, Matteo; Clematide, Simon; Ströbel, Phillip; Barman, Raphaël (2020). "Language Resources for Historical Newspapers: the Impresso Collection". Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France. pp. 958–968.
  16. .
  17. Ares Oliveira, Sofia; di Lenardo, Isabella; Tourenc, Bastien; Kaplan, Frédéric (11 July 2019). A deep learning approach to Cadastral Computing. Digital Humanities Conference. Utrecht, Netherlands.
  18. .
  19. . Retrieved 3 February 2021.
  20. Tang, Yuan Y.; Lee, Seong-Whan; Suen, Ching Y. (1996). "Automatic document processing: a survey". Pattern Recognition. 29 (12): 1931–1952. Retrieved 3 February 2021.
  21. .
  22. "Revolutionary Scanning Technology for Art". Artmyn. Retrieved 3 February 2021.