
Source: Wikipedia, the free encyclopedia.

An Archival Resource Key (ARK) is a Uniform Resource Locator (URL) that is a multi-purpose persistent identifier for information objects of any type. An ARK contains the label ark: after the URL's hostname, which sets the expectation that, when submitted to a web browser, the URL terminated by '?' returns a brief metadata record, and the URL terminated by '??' returns metadata that includes a commitment statement from the current service provider. The ARK and its inflections ('?' and '??') provide access to three facets of a provider's ability to provide persistence.
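The inflection mechanism above can be sketched in a few lines. The hostname and ARK string below are hypothetical examples; what each inflected URL actually returns is decided by the resolver serving the ARK:

```python
# Sketch: constructing ARK "inflections" by appending '?' or '??' to a
# base ARK URL. The ARK and hostname are hypothetical examples.

def inflections(ark_url: str) -> dict:
    """Return the base ARK plus its two inflection URLs."""
    base = ark_url.rstrip("?")           # normalize: strip any trailing '?'
    return {
        "object": base,                  # the object itself (or a redirect)
        "metadata": base + "?",          # brief metadata record
        "commitment": base + "??",       # metadata plus commitment statement
    }

urls = inflections("https://example.org/ark:/12345/x54xz321")
print(urls["metadata"])    # https://example.org/ark:/12345/x54xz321?
print(urls["commitment"])  # https://example.org/ark:/12345/x54xz321??
```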

Implicit in the design of the ARK scheme is that persistence is purely a matter of service and not a property of a naming syntax. Moreover, a "persistent identifier" cannot be born persistent; an identifier from any scheme can only be proved persistent over time. The inflections provide information with which to judge an identifier's likelihood of persistence.

ARKs can be maintained and resolved locally using open source software such as Noid (Nice Opaque Identifiers) or via services such as EZID and the central N2T (Name-to-Thing) resolver.

History

The ARK identifier scheme, created by John Kunze, has roots in IETF working group discussions, starting in 1992, about the worrisome rate at which URLs were "breaking" (returning 404 Not Found errors). The IETF (which defines Internet standards such as TCP/IP) had coined the term URN and the umbrella term URI (ref), inspired by its successful use of indirection in DNS to stabilize internet host references. The URN was envisioned as an indirect identifier to be used as a "persistent identifier" (PID) for web resources. Users were meant to bookmark URNs because, unlike URLs, they would in theory never break, thanks to a resolver that would transparently forward a stable URN to whatever the current best URL was. Under time pressure and without having defined precisely what the URN was, the IETF's URI Working Group decided that persistence was a problem for the URN, not the URL (ref), which allowed the delayed URL specification to move forward as an Internet standard.

The idea of using a resolver for indirection was common to other PID schemes, and by 2000 there were several of them -- URN, PURL, Handle, and DOI -- along with several important contradictions. Their primary offering was the means to do URL resolution (indirection, forwarding), but that was already available for free to anyone with a web server. Their backers held that URLs, as well as their embedded references to server names and to "http://", were inherently unstable, but none of their identifiers were actionable unless carried inside URLs (which they eventually began promoting). Only PURL embraced the URL from the beginning. They all focussed on PIDs as largely a technical problem, but their services provided only trivial technology (URL forwarding and a GUI to maintain a two-column dataset) and no help with the large human effort required of scheme adopters to update forwarding information and, more importantly, to protect the content itself from human error, natural disaster, legal challenge, deliberate attack, social upheaval, bankruptcy, etc.

Discussing PIDs was hard because there was (and still is) no agreed terminology, even for a word like "persistence" (ref). Moreover, PIDs were abstract names, ideally conceived to survive future trends and unverifiable predictions. Specious arguments began to propagate a mystique that PID resolution was high-tech and could guarantee access. While laughable to network and legal experts, such arguments drove scheme adoption among non-technical content providers, for example, in libraries and publishing. At the request of the US National Library of Medicine (NLM), in 2000 Kunze recommended that no existing PID scheme would be more effective than simply using URLs that were carefully chosen and managed according to a list of criteria; that list later, after he transitioned to the California Digital Library (CDL) in 2001, became the basis of the ARK scheme. This recommendation was consistent with the notion of "Cool URIs".

Beyond the conceptual puzzle of persistence, there were also general concerns leading up to ARKs. Only URNs and PURLs were as free and easy to create as URLs. Each of the schemes involved a "silo" resolver using a simple database that was modified to reject competitors' identifiers, a design that was inconsistent with open architectures and standards. Finally, the widespread focus on technical indirection distracted attention and resources away from the much larger socio-organizational problems of safeguarding links. Installed indirection systems only prevent broken links when they're updated by institutions that are focussed on awareness, commitment, solvency, etc. When schemes require paying fees and maintaining extra infrastructure, as with local Handle servers, it distracts and draws resources away from these more fundamental problems.

Each scheme raised unique concerns. The PURL system relied on centralized identifier registration and routed all access through one choke point. PURLs were visually indistinguishable from ordinary URLs, so one could not easily tell whether a given URL was meant to be a PID. On the other hand, the PURL system was free to all comers, pragmatically embraced the URL, had an elegantly simple implementation of URL forwarding, and offered a "partial redirect" feature (later borrowed by the N2T.net resolver).

The Handle scheme was emphatically rejected by the IETF. It was antithetical to a core principle that Internet standards must not endorse access control by any one entity over the networked resources of another entity (ref). While the WWW had recently introduced the liberating possibility that anyone with a web server could disseminate content without fees, centralized permission, or exclusive control by an intermediary, the Handle scheme rolled back those benefits. Backers claimed that Handles were necessary for content that was worth preserving and that its high-tech resolver architecture would keep them from breaking. The Handle system was closed-source, required each provider to maintain a local Handle server, and charged an annual fee.

The DOI scheme, built on top of the Handle scheme, found a receptive audience with publishers in 2000. Dominating the DOI for the next decade, the publishers began to convince creators and providers that content could be persistent and taken seriously if and only if it had a PID issued by the traditional publishing industry, namely, a DOI. Obtaining DOIs for content required an annual subscription as well as per-identifier fees. Especially for scholarly journals in the global North, publishers were thus reasserting control over dissemination.

By 2000 the URN was stalled. While the IETF had published several URN specifications (syntax, registration, resolution), URN resolution had not been implemented other than with URL-based methods used by the other schemes (nor would any other method subsequently emerge). This raised the existential question of whether a URN inside a URL was actually a URN. If URLs were not inherently unstable, was it an error for the working group to have hastily decided in 1995 that persistence was the job of something called the URN? If so, the URN, and hence the URI, were unnecessary, and "cool URIs" were just cool URLs. But the term "URI" was then becoming a foundational piece of Internet terminology that would be hard to change. As long as it needed to distance itself from the URL, the URN was stuck. Unfortunately, also stuck was the notion of a stable reference for web resources that was free, open, and backed by Internet standards.

In 2001, the University of California was looking for a PID solution, and Kunze recommended using URLs maintained according to certain criteria and containing the internal label, "ark:", to distinguish them from ordinary URLs. The specification of this "ARK scheme" was first published as an Internet Draft in February 2001. The idea was that any organization with commitment, a carefully chosen server name, and a deliberate approach to assigning and maintaining content names could create and redirect URL-based PIDs for free with their own web server. All those organizational traits were already required to work with any other scheme, but with ARK there were fewer costs and constraints. The CDL registered to use ARKs in 2002 and the first real ARK resolver was the web server at ark.cdlib.org. The open source Noid (Nice Opaque IDentifiers) software for minting, binding, and resolving identifiers (from any scheme) was first released in 2004. That same year also saw institutions join the ARK community that would become major users: the University of California campuses at Berkeley, San Diego, and San Francisco, as well as Portico, the University of North Texas, and the Internet Archive.

By 2006, the registry of ARK assigning institutions had grown to 23. This included the National Library of France (BnF), which set a high standard of practice and implementation that influenced the strong uptake of ARKs in Francophone regions of the world. In that year the Noid software matured and became the platform under the general purpose noid.cdlib.org resolver, which was used by several groups in CDL and the Internet Archive. The resolver also started to include forwarding rules for "prefixed" identifiers of a half dozen types, such as the PubMed and Enzyme Classification databases. One principle of Noid's design was to avoid privileging any one type of identifier, with the result that Noid was installed by external organizations to maintain ARKs, Handles, DOIs, etc. With the release of CDL's EZID service in 2010, the resolver acquired the more generic name, n2t.net (Name-to-Thing), which was inspired by the URN mapping operations n2r, n2l, and n2c envisioned in 1997 (RFC2168). They in turn referenced the term "URC", coined by Kunze in 1992 (https://www.ietf.org/proceedings/25.pdf), which became the "ERC" metadata package in the ARK specification.

In 2010, n2t.net was storing DOIs and URNs along with ARKs. In that year an MOU was established between BnF and CDL in which the BnF agreed to regularly harvest a preservation replica of the NAAN registry (of ARK assigning institutions). Joined shortly afterwards by the US National Library of Medicine, both organizations continue to hold replicas of the registry maintained at CDL. In 2015 a successful trial to demonstrate resolver system backup was conducted with Crossref in which 61 million of that organization's DOIs were stored in n2t.net, the idea being to provide reciprocal backup for ARKs. By 2017 an MOU was established between CDL and EMBL-EBI in which the n2t.net resolver would provide access to over 600 identifier prefixes, including "ark:", harvested regularly from identifiers.org.

In 2018 the California Digital Library and DuraSpace announced a collaboration, called ARKs-in-the-Open, aimed at building an open, international community around ARKs and their use as persistent identifiers in the open scholarly ecosystem. By this time, over 500 institutions (research, not-for-profit, private, government) across the world had registered to use ARKs, and they had created an estimated 3.2 billion ARKs. The year 2019 saw the establishment of an advisory group and three working groups.

Structure

[http://NMAH/]ark:/NAAN/Name[Qualifier]
  • NMAH: Name Mapping Authority Host - optional and replaceable hostname of an organization that currently provides service for the object
  • NAAN: Name Assigning Authority Number - mandatory unique identifier of the organization that originally named the object
  • Name: mandatory string assigned by the naming organization to identify the object
  • Qualifier: optional string that extends the base ARK to support access to individual hierarchical subcomponents of an object,[1] and to variants (versions, languages, formats) of components.[2]
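Assuming the structure above, an ARK can be split into its parts with a simple pattern. This is an illustrative sketch, not a full validator, and the hostname and name in the example are hypothetical:

```python
import re

# Sketch: splitting an ARK into the parts named above
# ([http://NMAH/]ark:/NAAN/Name[Qualifier]). Illustrative only.
ARK_RE = re.compile(
    r"^(?:(?P<scheme>https?)://(?P<nmah>[^/]+)/)?"  # optional [http://NMAH/]
    r"ark:/?(?P<naan>\d+)"                          # NAAN (digits)
    r"/(?P<name>[^/.?]+)"                           # Name
    r"(?P<qualifier>[/.][^?]*)?"                    # optional Qualifier
)

def parse_ark(s: str) -> dict:
    m = ARK_RE.match(s)
    if not m:
        raise ValueError(f"not an ARK: {s}")
    return m.groupdict()

parts = parse_ark("https://example.org/ark:/12345/x54xz321/c2.v3")
# parts["nmah"] == "example.org", parts["naan"] == "12345",
# parts["name"] == "x54xz321", parts["qualifier"] == "/c2.v3"
```

Because the NMAH is optional and replaceable, the same parser accepts a bare `ark:/12345/x54xz321`, leaving the nmah field empty.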

Name Assigning Authority Numbers (NAANs)

A complete NAAN registry is maintained at the California Digital Library and replicated at the US National Library of Medicine.[3] In June 2018 it contained over 530 entries.

Generic Services

Three generic ARK services have been defined. They are described below in protocol-independent terms; each may be implemented through many possible methods given available technology (today's or future).

Access Service (access, location)

  • Returns (a copy of) the object or a redirect to the same, although a sensible object proxy may be substituted (for instance a table of contents instead of a large document).
  • May also return a discriminated list of alternate object locators.
  • If access is denied, returns an explanation of the object's current (perhaps permanent) inaccessibility.

Policy Service (permanence, naming, etc.)

  • Returns declarations of policy and support commitments for given ARKs.
  • Declarations are returned in either a structured metadata format or a human readable text format; sometimes one format may serve both purposes.
  • Policy subareas may be addressed in separate requests, but the following areas should be covered:
    • object permanence,
    • object naming,
    • object fragment addressing, and
    • operational service support.

Description Service

  • Returns a description of the object. Descriptions are returned in either a structured metadata format or a human readable text format; sometimes one format may serve both purposes.
  • A description must at a minimum answer the who, what, when, and where questions concerning an expression of the object.
  • Standalone descriptions should be accompanied by the modification date and source of the description itself.
  • May also return discriminated lists of ARKs that are related to the given ARK.
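A description answering the who/what/when/where questions corresponds to the ERC metadata package mentioned earlier, which uses simple label: value lines. A minimal sketch, with hypothetical field values:

```python
# Sketch: serializing a minimal who/what/when/where description as an
# ERC-style record of label: value lines. All values are hypothetical.

def erc_record(who: str, what: str, when: str, where: str) -> str:
    fields = [("erc", ""), ("who", who), ("what", what),
              ("when", when), ("where", where)]
    # Blank values (like the opening "erc" label) get no trailing space.
    return "\n".join(f"{label}: {value}".rstrip() for label, value in fields)

print(erc_record(
    who="Smith, Pat",
    what="Field notes from the 1998 survey",
    when="1998",
    where="https://example.org/ark:/12345/x54xz321",
))
```

The "where" field here points back at the ARK itself, so a standalone copy of the record still answers where an expression of the object can be found.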

Notes and references

  1. ^ Hierarchy qualifiers begin with a slash character.
  2. ^ Variant qualifiers begin with a dot character.
  3. ^ Name Assigning Authority Number registry
