Data scraping

Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.

Description

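Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity. Very often, these transmissions are not human-readable at all.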

Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program. It is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.
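
As a minimal sketch of this idea, the Python snippet below pulls labelled values out of text that was laid out for a person. The screen text, the field names, and the helper functions `scrape_fields` and `parse_money` are all hypothetical and chosen only for illustration. Decorative lines and help hints are skipped simply because they do not match the "label : value" pattern.

```python
import re

# Hypothetical display output meant for a person, not for a program.
SCREEN_TEXT = """\
==== ACCOUNT SUMMARY ====
Customer name : Jane Example
Balance       : $1,234.50
(Press F1 for help)
"""

# Matches "label : value" lines; banners and hints do not match and are ignored.
FIELD_PATTERN = re.compile(r"^\s*(?P<label>[A-Za-z ]+?)\s*:\s*(?P<value>.+?)\s*$")


def scrape_fields(text: str) -> dict[str, str]:
    """Return a mapping of normalized labels to the raw values found in the text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            key = match.group("label").strip().lower().replace(" ", "_")
            fields[key] = match.group("value")
    return fields


def parse_money(value: str) -> float:
    """Strip display formatting such as the currency symbol and thousands separators."""
    return float(value.replace("$", "").replace(",", ""))


fields = scrape_fields(SCREEN_TEXT)
balance = parse_money(fields["balance"])
```

A scraper of this kind depends entirely on the layout of the display, so any change to the labels or separators would require the pattern to be updated.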

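Data scraping is most often done either to interface to a legacy system that has no other mechanism compatible with current hardware, or to interface to a third-party system that does not provide a more convenient API. In the second case, the operator of the third-party system will often see scraping as unwanted, for reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content.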

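Data scraping is generally considered an ad hoc, inelegant technique, often used only as a last resort when no other mechanism for data interchange is available. Because output intended for human consumption often changes structure, a scraper that depends on it can fail, and depending on the quality of its error handling the failure can result in error messages, corrupted output, or even program crashes.

However, setting up a data scraping pipeline nowadays is straightforward, requiring minimal programming effort to meet practical needs (especially in biomedical data integration).^[1]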

Technical variants

Screen scraping

A screen fragment and a screen-scraping interface (blue box with red arrow) used to customize the data capture process.

Although the use of physical "dumb terminal" IBM 3270s is slowly diminishing as more and more mainframe applications acquire Web interfaces, some Web applications merely continue to use the technique of screen scraping to capture old screens and transfer the data to modern front-ends.^[2]

Screen scraping is normally associated with the programmatic collection of visual data from a source, instead of parsing data as in web scraping. Originally, screen scraping referred to the practice of reading text data from a computer display terminal's screen. This was generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the bidirectional exchange of data. This could be the simple case where the controlling program navigates through the user interface, or a more complex scenario where the controlling program enters data into an interface meant to be used by a human.

As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s, the dawn of computerized data processing. Computer-to-user interfaces from that era were often simply text-based dumb terminals, not much more than virtual teleprinters. The desire to interface such a system to more modern systems is common, but a robust solution often requires things that are no longer available, such as source code, system documentation, APIs, or programmers with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise (for example change control, security, user management, data protection, operational audit, load balancing, and queue management), could be said to be an example of robotic process automation (RPA) software, or RPAAI for self-guided RPA 2.0 based on artificial intelligence.
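
The "process the resulting display output, extract the desired data" step can be sketched in Python as follows. This is only an illustration under stated assumptions: the screen contents, the 80-column width, the `Field` layout, and the function names are hypothetical, and the connection and keystroke emulation are replaced by a stand-in function that returns a captured screen. Fields are read by row and column position, which is how fixed-layout terminal screens are commonly addressed.

```python
from dataclasses import dataclass

SCREEN_WIDTH = 80  # assumed terminal width for this sketch


@dataclass(frozen=True)
class Field:
    name: str
    row: int     # zero-based row on the screen
    col: int     # zero-based starting column
    length: int  # number of characters to read


# Hypothetical layout of one legacy screen.
LAYOUT = [
    Field("part_number", row=2, col=14, length=8),
    Field("quantity", row=3, col=14, length=6),
]


def render_demo_screen() -> list[str]:
    """Stand-in for display output captured from a terminal session."""
    rows = [
        "INVENTORY INQUIRY",
        "",
        "PART NUMBER:  AB-10293",
        "QUANTITY   :  000042",
    ]
    # Pad every row to the full width, as a terminal buffer would be.
    return [row.ljust(SCREEN_WIDTH) for row in rows]


def extract(screen: list[str], layout: list[Field]) -> dict[str, str]:
    """Slice each field out of the screen buffer by position."""
    return {
        f.name: screen[f.row][f.col:f.col + f.length].strip()
        for f in layout
    }


record = extract(render_demo_screen(), LAYOUT)
quantity = int(record["quantity"])  # convert display text to a number
```

Positional extraction is brittle by design: if the legacy application moves a field by one column, the layout table has to change with it.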
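In the 1980s, financial data providers such as Reuters, Telerate, and Quotron displayed data in 24×80 format intended for a human reader. Users of this data, particularly investment banks, wrote applications to capture and convert this character data into numeric data for use in trading calculations without re-keying it. The common term for this practice, especially in the United Kingdom, was page shredding. Internally, Reuters used the term "logicized" for this conversion process, running a sophisticated computer system on VAX/VMS called the Logicizer.^[3]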

More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an OCR engine or, for some specialised automated testing systems, matching the screen's bitmap data against expected results.^[4] In the case of GUI applications, this can be combined with querying the graphical controls by programmatically obtaining references to their underlying programming objects. A sequence of screens is automatically captured and converted into a database.
Another modern adaptation to these techniques is to use, instead of a sequence of screens as input, a set of images or PDF files, so there are some overlaps with generic "document scraping" and report mining techniques.
There are many tools that can be used for screen scraping.^[5]

Web scraping

Main article: Web scraping

Amazon AWS and Google provide web scraping tools, services, and public data available free of cost to end-users. Newer forms of web scraping involve listening to data feeds from web servers. For example, JSON is commonly used as a transport format between the client and the web server. A web scraper uses a website's URL to extract data and stores this data for subsequent analysis, which allows data to be extracted efficiently and accurately.^[7]
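
As a rough illustration of working with such a feed, the Python sketch below decodes a JSON payload and keeps only the fields of interest. The payload, its keys, and the function `extract_prices` are hypothetical. In practice the text would come from the body of a server response rather than from a constant.

```python
import json

# Hypothetical payload, standing in for the body of a response from a web server.
PAYLOAD = """
{"products": [
    {"name": "Widget", "price": "19.99", "thumbnail": "w.png"},
    {"name": "Gadget", "price": "5.00", "thumbnail": "g.png"}
]}
"""


def extract_prices(payload: str) -> list[tuple[str, float]]:
    """Decode the feed and keep only the name and numeric price of each item."""
    data = json.loads(payload)
    return [(item["name"], float(item["price"])) for item in data["products"]]


prices = extract_prices(PAYLOAD)
```

Because the feed is already structured, no layout-dependent pattern matching is needed, only a selection of the relevant keys.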

Recently, companies have developed web scraping systems that rely on DOM parsing, computer vision, and natural language processing to simulate the human processing that occurs when viewing a webpage, in order to extract useful information automatically.^[8]^[9]
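
Of these techniques, only markup parsing is simple enough to sketch briefly. The Python example below uses the standard library's event-based `html.parser` module rather than a full DOM tree, and it does not attempt computer vision or natural language processing. The page content and the `TableScraper` class are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page; a real scraper would download this text first.
PAGE = """
<html><body>
<h1>Exchange rates</h1>
<table>
  <tr><th>Currency</th><th>Rate</th></tr>
  <tr><td>EUR</td><td><b>0.92</b></td></tr>
  <tr><td>JPY</td><td>151.40</td></tr>
</table>
<p>Updated hourly.</p>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


scraper = TableScraper()
scraper.feed(PAGE)
rates = {currency: float(rate) for currency, rate in scraper.rows}
```

Presentation markup such as `<b>` is ignored, while the text inside it is still captured as part of the enclosing cell.
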
Large websites usually use defensive algorithms to protect their data from web scrapers and to limit the number of requests an IP address or IP network may send. This has caused an ongoing battle between website developers and scraping developers.^[10]
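
One simple form such a request limit could take is a sliding-window counter per client address, sketched below in Python. This is an assumption made for illustration, not a description of how any particular website works. The class name, the thresholds, and the example addresses (taken from documentation-only IP ranges) are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client address -> recent request times

    def allow(self, client_ip: str, now: float) -> bool:
        hits = self._hits[client_ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject without recording the request
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
# Timestamps are passed in explicitly so the behaviour is easy to follow.
decisions = [limiter.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0, 12.5)]
other_client = limiter.allow("198.51.100.4", 3.0)
```

In this run the fourth request from the first address is refused, and requests are accepted again once the earlier ones have aged out of the window. Each address is counted separately.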

Report mining

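Report mining is the extraction of data from human-readable computer reports. Conventional data extraction requires a connection to a working source system, suitable connectivity standards or an API, and usually complex querying. By using the source system's standard reporting options and directing the output to a spool file instead of to a printer, static reports can be generated that are suitable for offline analysis via report mining.^[11] This approach can avoid intensive CPU usage during business hours, can minimise end-user licence costs for ERP customers, and can offer very rapid prototyping and development of custom reports. Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves extracting data from files in a human-readable format, such as HTML, PDF, or text. These can be easily generated from almost any system by intercepting the data feed to a printer. This approach can provide a quick and simple route to obtaining data without the need to program an API to the source system.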
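
A minimal sketch of report mining on a plain-text report is shown below in Python. The report layout, the column names, and the function `mine_report` are hypothetical. Page headers, column headings, ruler lines, and the total line are dropped because they do not match the pattern for a detail row.

```python
import re

# Hypothetical spooled report with repeated page headers and a total line.
REPORT = """\
ACME CORP                 SALES BY REGION                 PAGE 1
REGION        UNITS     REVENUE
------------  -----  ----------
North           120    1,450.00
South            85      990.50
ACME CORP                 SALES BY REGION                 PAGE 2
REGION        UNITS     REVENUE
------------  -----  ----------
West             40      512.25
TOTAL           245    2,952.75
"""

# A detail row: capitalized region name, integer units, revenue with two decimals.
ROW = re.compile(
    r"^(?P<region>[A-Z][a-z]+)\s+(?P<units>\d+)\s+(?P<revenue>[\d,]+\.\d{2})\s*$"
)


def mine_report(text: str) -> list[dict]:
    """Return one record per detail row, with numbers converted from display text."""
    records = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            records.append({
                "region": match["region"],
                "units": int(match["units"]),
                "revenue": float(match["revenue"].replace(",", "")),
            })
    return records


records = mine_report(REPORT)
total_units = sum(r["units"] for r in records)
total_revenue = sum(r["revenue"] for r in records)
```

Comparing the computed totals with the report's own total line is a cheap way to check that no detail rows were missed.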

See also

Comparison of feed aggregators

Data cleansing

Data munging

Importer (computing)

Information extraction

Mashup (web application hybrid)

Metadata

Open data

Search engine scraping

Web scraping

References

[1] PMID 23632294.

[2] "Back in the 1990s.. 2002 ... 2016 ... still, according to Chase Bank, a major issue." Ron Lieber (May 7, 2016). "Jamie Dimon Wants to Protect You From Innovative Start-Ups". The New York Times.

[3] Contributors Fret About Reuters' Plan To Switch From Monitor Network To IDN, FX Week, 02 Nov 1990.

[4] Yeh, Tom (2009). "Sikuli: Using GUI Screenshots for Search and Automation" (PDF). UIST. Archived from the original (PDF) on 2010-02-14. Retrieved 2015-02-16.

[5] "What is Screen Scraping". June 17, 2019.

[6] S2CID 237719804.

[7] ISBN 978-1-5386-8260-9.

[8] Metz, Rachel (June 1, 2012). "A Startup Hopes to Help Computers Understand Web Pages". MIT Technology Review. Retrieved 1 December 2014.

[9] VanHemert, Kyle (Mar 4, 2014). "This Simple Data-Scraping Tool Could Change How Apps Are Made". WIRED. Archived from the original on 11 May 2015. Retrieved 8 May 2015.

[10] "Unusual traffic from your computer network". Google Search Help. Retrieved 2017-04-04.

[11] Scott Steinacher, "Data Pump transforms host data", InfoWorld, 30 August 1999, p55.

Further reading

Hemenway, Kevin and Calishain, Tara. Spidering Hacks. Cambridge, Massachusetts: O'Reilly, 2003. ISBN 0-596-00577-6.

