Web archiving

Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Web archivists typically employ web crawlers for automated capture due to the massive size and amount of information on the Web. The largest web archiving organization based on a bulk crawling approach is the Wayback Machine, which strives to maintain an archive of the entire Web.

The growing portion of human culture created and recorded on the web makes it inevitable that more and more libraries and archives will have to face the challenges of web archiving.

National archives and various consortia of organizations are also involved in archiving culturally important Web content.

Commercial web archiving software and services are also available to organizations that need to archive their own web content for corporate heritage, regulatory, or legal purposes.

History and development

While curation and organization of the web has been prevalent since the mid- to late-1990s, one of the first large-scale web archiving projects was the

Pandora, Tasmanian web archives and Sweden's Kulturarw3.^[4]^[5]

From 2001 to 2010, the International Web Archiving Workshop (IWAW) provided a platform to share experiences and exchange ideas.^[6]^[7] The International Internet Preservation Consortium (IIPC), established in 2003, has facilitated international collaboration in developing standards and open source tools for the creation of web archives.^[8]

The now-defunct Internet Memory Foundation was founded in 2004 and funded by the European Commission in order to archive the web in Europe.^[2] This project developed and released many open source tools, such as "rich media capturing, temporal coherence analysis, spam assessment, and terminology evolution detection."^[2] The data from the foundation is now housed by the Internet Archive, but is not currently publicly accessible.^[9]

Despite the fact that there is no centralized responsibility for its preservation, web content is rapidly becoming the official record. For example, in 2017, the United States Department of Justice affirmed that the government treats the President's tweets as official statements.^[10]

Methods of collection

Remote harvesting

The most common web archiving technique uses web crawlers to automate the process of collecting web pages.^[5] Web crawlers typically access web pages in the same manner that users with a browser see the Web, and therefore provide a comparatively simple method of remotely harvesting web content. Examples of web crawlers used for web archiving include:

There exist various free services which may be used to archive web resources "on-demand", using web crawling techniques. These services include the Wayback Machine and WebCite.
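The crawling approach described in this section can be illustrated with a minimal sketch. It is not how any particular archive's crawler works: the function name crawl, the max_pages limit, and the caller-supplied fetch function are all assumptions made for illustration. The sketch walks links breadth-first, stays on the seed's host, and stores the HTML of each page it reaches.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first capture of pages on the seed's host.

    `fetch` is a caller-supplied function (hypothetical here) that takes a
    URL and returns the page's HTML as a string, or None on failure.
    Returns a dict mapping each captured URL to its HTML.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            target, _ = urldefrag(urljoin(url, href))
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return archive
```

Passing fetch in as a parameter keeps the traversal logic separate from network access, so the same function can be exercised against an in-memory dictionary of pages. A real harvester would also need to capture images, stylesheets, and scripts, which this sketch ignores.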

Database archiving

Database archiving refers to methods for archiving the underlying content of database-driven websites. It typically requires the extraction of the

Bibliothèque Nationale de France and the National Library of Australia respectively. DeepArc enables the structure of a relational database to be mapped to an XML schema, and the content exported into an XML document. Xinq then allows that content to be delivered online. Although the original layout and behavior of the website cannot be preserved exactly, Xinq does allow the basic querying and retrieval functionality to be replicated.
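The general idea of exporting relational content into XML can be sketched as follows. This is not DeepArc's mapping or API: the function name export_table_to_xml, the table/row element layout, and the use of SQLite are assumptions chosen only to keep the example self-contained. It also assumes that column names are valid XML element names.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_table_to_xml(conn, table):
    """Serialize every row of `table` as XML: one <row> per record,
    one child element per column."""
    # Table names cannot be bound as SQL parameters, so check the name
    # against the database's own catalogue before interpolating it.
    known = {r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table}")

    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cursor.description]
    root = ET.Element("table", name=table)
    for record in cursor:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, record):
            cell = ET.SubElement(row, column)
            cell.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")
```

The resulting document no longer depends on the database engine that produced it, which is the property that makes this kind of export useful for long-term preservation.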

Transactional archiving

Transactional archiving is an event-driven approach, which collects the actual transactions which take place between a web server and a web browser. It is primarily used as a means of preserving evidence of the content which was actually viewed on a particular website, on a given date. This may be particularly important for organizations which need to comply with legal or regulatory requirements for disclosing and retaining information.^[13]

A transactional archiving system typically operates by intercepting every HTTP request to, and response from, the web server, filtering each response to eliminate duplicate content, and permanently storing the responses as bitstreams.
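The filter-and-store step described above can be sketched in a few lines. The class name TransactionalArchive and its methods are hypothetical, and the interception of HTTP traffic itself is left out: the sketch assumes something else hands it each URL and response body. Duplicates are detected by comparing SHA-256 digests of the response bytes, which is one possible choice rather than a documented standard.

```python
import hashlib
from datetime import datetime, timezone


class TransactionalArchive:
    """Stores each distinct response body once and logs every transaction."""

    def __init__(self):
        self.bodies = {}   # SHA-256 digest -> response bytes
        self.log = []      # (timestamp, url, digest), one per transaction

    def record(self, url, body):
        """Record one request/response pair; return True if the body was new."""
        digest = hashlib.sha256(body).hexdigest()
        is_new = digest not in self.bodies
        if is_new:
            self.bodies[digest] = body
        self.log.append((datetime.now(timezone.utc), url, digest))
        return is_new

    def served(self, url):
        """Return every (timestamp, body) pair that was served for `url`."""
        return [(ts, self.bodies[d]) for ts, u, d in self.log if u == url]
```

Keeping the log separate from the stored bodies means a page served a thousand times unchanged costs one stored copy plus a thousand small log entries, while the log still shows what was served on a given date.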

Difficulties and limitations

Crawlers

Web archives which rely on web crawling as their primary means of collecting the Web are influenced by the difficulties of web crawling:

The robots exclusion protocol may request crawlers not access portions of a website. Some web archivists may ignore the request and crawl those portions anyway.
Large portions of a web site may be hidden in the Deep Web. For example, the results page behind a web form can lie in the Deep Web if crawlers cannot follow a link to the results page.
Crawler traps (e.g., calendars) may cause a crawler to download an infinite number of pages, so crawlers are usually configured to limit the number of dynamic pages they crawl.
Most archiving tools do not capture a page exactly as it appears; ad banners and images are often missed during archiving.
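Two of the points in this list, honoring the robots exclusion protocol and capping dynamic pages to escape crawler traps, can be combined in a small policy object. This is a sketch under assumptions: the class name CrawlPolicy, the max_dynamic budget, and the rule that any URL with a query string counts as dynamic are all illustrative choices. The robots.txt handling uses Python's standard urllib.robotparser module and expects the lines of a robots.txt file that has already been fetched.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class CrawlPolicy:
    """Decides whether a crawler should fetch a URL."""

    def __init__(self, robots_lines, user_agent, max_dynamic=50):
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)  # lines of an already-fetched robots.txt
        self.user_agent = user_agent
        self.max_dynamic = max_dynamic
        self.dynamic_seen = 0

    def allows(self, url):
        if not self.robots.can_fetch(self.user_agent, url):
            return False
        # Treat any URL with a query string as "dynamic" (a crude heuristic)
        # and stop following such URLs once the budget is used up.
        if urlparse(url).query:
            if self.dynamic_seen >= self.max_dynamic:
                return False
            self.dynamic_seen += 1
        return True
```

An archivist who chooses to ignore robots.txt, as the first item in the list notes some do, would simply skip the can_fetch check; the dynamic-page budget remains useful either way.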

However, a native format web archive, i.e., a fully browsable web archive with working links, media, etc., is only really possible using crawler technology.

The Web is so large that crawling a significant portion of it requires substantial technical resources. The Web also changes so quickly that portions of a website may be modified before a crawler has even finished crawling it.

General limitations

Some web servers are configured to return different pages to web archiver requests than they would in response to regular browser requests. This is typically done to fool search engines into directing more user traffic to a website, and is often done to avoid accountability, or to provide enhanced content only to those browsers that can display it.
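One way an archivist might check for this behavior is to request the same URL under several User-Agent strings and compare the responses. The sketch below assumes a caller-supplied fetch(url, user_agent) function that returns the response body as bytes; that function and the name differs_by_user_agent are hypothetical.

```python
import hashlib


def differs_by_user_agent(url, fetch, user_agents):
    """Fetch `url` once per User-Agent and report whether the bodies differ.

    `fetch(url, user_agent)` is a caller-supplied function (hypothetical)
    returning the response body as bytes.
    Returns (differs, digests), where digests maps user agent -> SHA-256.
    """
    digests = {
        ua: hashlib.sha256(fetch(url, ua)).hexdigest()
        for ua in user_agents
    }
    return len(set(digests.values())) > 1, digests
```

A difference is only a hint, not proof: pages that embed timestamps, rotating advertisements, or session tokens will differ between any two requests regardless of the User-Agent.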

Not only must web archivists deal with the technical challenges of web archiving, they must also contend with intellectual property laws. Peter Lyman^[14] states that "although the Web is popularly regarded as a public domain resource, it is copyrighted; thus, archivists have no legal right to copy the Web". However, national libraries in some countries^[15] have a legal right to copy portions of the web under an extension of legal deposit.

Some private non-profit web archives that are made publicly accessible, such as WebCite, the Internet Archive, or the Internet Memory Foundation, allow content owners to hide or remove archived content that they do not want the public to have access to. Other web archives are only accessible from certain locations or have regulated usage. WebCite cites a recent lawsuit against Google's caching, which Google won.^[16]

Laws

In 2017 the

copyright laws may inhibit Web archiving. For instance, academic archiving by Sci-Hub falls outside the bounds of contemporary copyright law. The site provides enduring access to academic works including those that do not have an open access license and thereby contributes to the archival of scientific research which may otherwise be lost.^[18]^[19]

References

Citations

^ Truman, Gail (2016). "Web Archiving Environmental Scan". Harvard Library.
^ ISSN 0018-9219.

^ "Inside Wayback Machine, the internet's time capsule". The Hustle. September 28, 2018. sec. Wayyyy back. Retrieved July 21, 2020.

S2CID 24303455.

^ ISBN 978-1-4051-8588-2.

^ "IWAW 2010: The 10th Intl Web Archiving Workshop". www.wikicfp.com. Retrieved August 19, 2019.

^ "IWAW - International Web Archiving Workshops". bibnum.bnf.fr. Archived from the original on November 20, 2012. Retrieved August 19, 2019.

^ "About the IIPC". IIPC. Retrieved April 17, 2022.

^ "Internet Memory Foundation : Free Web: Free Download, Borrow and Streaming". archive.org. Internet Archive. Retrieved July 21, 2020.

^ Regis, Camille (June 4, 2019). "Web Archiving: Think the Web is Permanent? Think Again". History Associates. Retrieved July 14, 2019.

^ "DeepArc". deeparc.sourceforge.net. Archived from the original on March 3, 2024.

^ "Xinq [Xml INQuiry]". National Library of Australia. Archived from the original on February 27, 2011.

OCLC 1064574312.

^ Lyman (2002)

^ "Legal Deposit | IIPC". netpreserve.org. Archived from the original on March 16, 2017. Retrieved January 31, 2017.

^ "WebCite FAQ". Webcitation.org. Retrieved September 20, 2018.

^ "Social Media and Digital Communications" (PDF). finra.org. FINRA.

^ Claburn, Thomas (September 10, 2020). "Open access journals are vanishing from the web, Internet Archive stands ready to fill in the gaps". The Register.

S2CID 221340749.

General bibliography

Brown, A. (2006). Archiving Websites: A Practical Guide for Information Management Professionals. London: Facet Publishing. ISBN 978-1-85604-553-7.

Brügger, N. (2005). Archiving Websites. General Considerations and Strategies. Aarhus: The Centre for Internet Research. ISBN 978-87-990507-0-3. Archived from the original on January 29, 2009.

Day, M. (2003). "Preserving the Fabric of Our Lives: A Survey of Web Preservation Initiatives" (PDF). Research and Advanced Technology for Digital Libraries. Lecture Notes in Computer Science. Vol. 2769. pp. 461–472. ISBN 978-3-540-40726-3.

Eysenbach, G. & Trudel, M. (2005). "Going, going, still there: using the WebCite service to permanently archive cited web pages". Journal of Medical Internet Research. 7 (5): e60. PMID 16403724.

Fitch, Kent (2003). "Web site archiving—an approach to recording every materially different response produced by a website". Ausweb 03. Archived from the original on July 20, 2003. Retrieved September 27, 2006.

Jacoby, Robert (August 19, 2010). "Archiving a Web Page". Archived from the original on January 3, 2011. Retrieved October 23, 2010.

Lyman, P. (2002). "Archiving the World Wide Web". Building a National Strategy for Preservation: Issues in Digital Media Archiving.

Masanès, J., ed. (2006). Web Archiving. Berlin. ISBN 978-3-540-23338-1.

Pennock, Maureen (2013). Web-Archiving. DPC Technology Watch Reports. Great Britain. ISSN 2048-7916.

Toyoda, M.; Kitsuregawa, M. (2012). "The History of Web Archiving". doi:10.1109/JPROC.2012.2189920.

External links


International Internet Preservation Consortium (IIPC)—International consortium whose mission is to acquire, preserve, and make accessible knowledge and information from the Internet for future generations

National Library of Australia, Preserving Access to Digital Information (PADI)

Library of Congress—Web Archiving

