Table extraction
Table extraction is the process of recognizing and separating a table from a large document, possibly also recognizing individual rows, columns or elements. It may be regarded as a special form of information extraction.
Table extraction from PDFs or scanned images is more challenging than extraction from formats that carry table markup, because there is usually no table-specific machine-readable markup to rely on.[1] Systems that extract data from tables in scientific PDFs have been described.[2][3]
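To illustrate why the absence of markup makes the task harder, the following is a minimal sketch of a whitespace-alignment heuristic applied to plain text lines, such as text recovered from a PDF page. It is not the method of any system cited here; the function names `split_aligned_rows` and `looks_like_table`, the `min_gap` threshold, and the sample data are assumptions made for illustration only.

```python
import re


def split_aligned_rows(text_lines, min_gap=2):
    """Split each non-empty line into cells at runs of `min_gap` or more spaces.

    Hypothetical helper: real PDF text rarely aligns this cleanly.
    """
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text_lines:
        stripped = line.strip()
        if stripped:
            rows.append(gap.split(stripped))
    return rows


def looks_like_table(rows, min_rows=2):
    """Heuristic: at least `min_rows` rows, all with the same cell count (> 1)."""
    if len(rows) < min_rows:
        return False
    widths = {len(row) for row in rows}
    return len(widths) == 1 and widths.pop() > 1


# Made-up sample text, standing in for lines recovered from a PDF page.
sample = [
    "Tool            Year   Licence",
    "ExampleTool A   2017   proprietary",
    "ExampleTool B   2023   open source",
]
rows = split_aligned_rows(sample)
print(rows)
print(looks_like_table(rows))
```

A heuristic of this kind fails on wrapped cells, merged cells, and columns separated by a single space, which is part of the reason dedicated extraction systems exist.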
Wikipedia presents some of its information in tables; for example, 3.5 million tables can be extracted from the English Wikipedia.[4] Some of these tables have a specific format, such as the so-called infoboxes. Large-scale table extraction of Wikipedia infoboxes forms one of the sources for DBpedia.[5]
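Because rendered Wikipedia pages are HTML, their tables appear as table elements with explicit row and cell tags, so rows and cells can be recovered by following the markup. The sketch below uses only Python's standard `html.parser` module. The class name `TableExtractor`, the helper `extract_tables`, and the sample markup are assumptions for illustration; nested tables, `rowspan`, and `colspan` are deliberately not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table> seen
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(html):
    """Return all tables found in an HTML string (hypothetical helper)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up sample markup, not taken from any real page.
sample_html = (
    "<p>Intro text</p>"
    "<table>"
    "<tr><th>Name</th><th>Born</th></tr>"
    "<tr><td>Ada</td><td>1815</td></tr>"
    "</table>"
)
tables = extract_tables(sample_html)
print(tables)
```

In practice a library is usually preferable to a hand-written parser; for example, the pandas function `read_html` returns the tables of an HTML document as a list of DataFrames.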
Commercial web services for table extraction exist, e.g., Amazon Textract, Google's Document AI, IBM Watson Discovery, and Microsoft Form Recognizer.[1] Open source tools also exist, e.g., PDFFigures 2.0, which has been used in Semantic Scholar.[6] A comparison published in 2017 found that the proprietary program ABBYY FineReader yielded the best PDF table extraction performance among the six tools evaluated.[7] In a 2023 benchmark evaluation,[8] Adobe Extract,[9] a cloud-based API that employs Adobe's Sensei AI platform,[10] performed best among the five tools evaluated for table extraction.
References
- Wikidata Q108170445.
- )
- )
- Wikidata Q108215401.
- )
- Wikidata Q108172042
- Wikidata Q108173686
- ISBN 978-3-031-28031-3
- "Adobe PDF Extract API". Adobe. Retrieved 2024-03-15.
- "Experience Cloud AI Services with Adobe Sensei". Adobe. Retrieved 2024-03-15.