URI Resource Pages
About URIs
Definition of URI
URI stands for Uniform Resource Identifier. There is
no widely accepted definition; the most general is:
A URI identifies a resource
Implicit in the term "Uniform Resource Identifier" are
three concepts:
So what is a Resource Anyway?
There has never been an accepted definition of "resource" (in
the URI context); it can
be almost anything: a web page, document, database, image, service,
(recursively) a collection of resources, a physical object (person,
book, etc.), even a concept. One proposed definition is "anything
that has identity". That is not very useful if we define URI as we
have above ("a URI identifies a resource"): it essentially
reduces the definition of "resource" to "anything a
URI identifies", and the two definitions become circular. It is
probably not useful in any case, since it is difficult to say what
does and does not have identity. Another proposed definition is "anything
that can be named", which suffers from the "name vs. location" controversy
(more below). Other proposed definitions are
no more useful than these, so "resource" is best
left undefined.
URI Schemes
URIs are distinguished by their "scheme". The following
character strings are URIs:
- http://www.loc.gov
- telnet://rs8.loc.gov
- mailto:someone@loc.gov
- ftp://ftp.loc.gov/pub/z3950/articles/kbr.ps
- z39.50s://melvyl.ucop.edu/cat
- info:srw/schema/1/mods-v3.0
- tag:hawke.org,2001-06-05:Taiko
- urn:nbn:de:gbv:089-3321752945
Their schemes, respectively, are 'http', 'telnet', 'mailto', 'ftp',
'z39.50s', 'info', 'tag' and 'urn'.
So the pattern is: a URI is a character string beginning with
a scheme name followed by a colon (':'); the remainder of the
URI is "scheme specific", that is, its interpretation depends
on the scheme.
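The pattern can be illustrated with a minimal Python sketch. The helper name split_uri is hypothetical, and the sketch assumes that splitting at the first colon is enough for illustration; it does not validate the characters allowed in a scheme name.

```python
from typing import Tuple


def split_uri(uri: str) -> Tuple[str, str]:
    """Split a URI into (scheme, scheme-specific part) at the first colon."""
    scheme, colon, rest = uri.partition(":")
    if not colon or not scheme:
        raise ValueError(f"no scheme found in {uri!r}")
    # Scheme names are compared case-insensitively, so normalize to lower case.
    return scheme.lower(), rest


# The example URIs listed above.
EXAMPLES = [
    "http://www.loc.gov",
    "telnet://rs8.loc.gov",
    "mailto:someone@loc.gov",
    "ftp://ftp.loc.gov/pub/z3950/articles/kbr.ps",
    "z39.50s://melvyl.ucop.edu/cat",
    "info:srw/schema/1/mods-v3.0",
    "tag:hawke.org,2001-06-05:Taiko",
    "urn:nbn:de:gbv:089-3321752945",
]

schemes = [split_uri(uri)[0] for uri in EXAMPLES]

for uri in EXAMPLES:
    scheme, rest = split_uri(uri)
    print(f"{scheme:8} {rest}")
```

Note that only the first colon matters: the later colons in the 'tag' and 'urn' examples belong to the scheme-specific part.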
Some URI schemes correspond to a specific protocol. "Protocol" in
this context means some rigorously defined procedure that describes
what is supposed to happen when you activate (click on) a URI of
that scheme. In that sense, in the above examples schemes 1-5
correspond to protocols, 'info' and 'tag' (6, 7) do not, and for
'urn' (8) it depends on the URN namespace (see
below).
Thus:
- For some, the scheme name is the same as the protocol name, e.g.
'http', 'telnet', 'ftp';
- for others, the scheme name is not a protocol name but
still corresponds to a protocol, e.g. 'z39.50s' and 'mailto';
- some schemes do not correspond to a protocol at all, for example,
'info' and 'tag'; and
- in some cases, the scheme may correspond to multiple protocols,
for example 'urn'.
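The four cases can be written down as a small lookup table. This is only a sketch: the names ProtocolLink, CATEGORY and protocol_link are hypothetical, and the table covers just the eight example schemes discussed here, not the IANA registry.

```python
from enum import Enum
from typing import Optional


class ProtocolLink(Enum):
    SAME_NAME = "scheme name is also the protocol name"
    OTHER_NAME = "corresponds to a protocol with a different name"
    NONE = "does not correspond to a protocol"
    MULTIPLE = "may correspond to multiple protocols"


# Classification taken from the list above; illustrative only.
CATEGORY = {
    "http": ProtocolLink.SAME_NAME,
    "telnet": ProtocolLink.SAME_NAME,
    "ftp": ProtocolLink.SAME_NAME,
    "z39.50s": ProtocolLink.OTHER_NAME,
    "mailto": ProtocolLink.OTHER_NAME,
    "info": ProtocolLink.NONE,
    "tag": ProtocolLink.NONE,
    "urn": ProtocolLink.MULTIPLE,
}


def protocol_link(uri: str) -> Optional[ProtocolLink]:
    """Return the category for a URI's scheme, or None if it is not in the table."""
    scheme = uri.partition(":")[0].lower()
    return CATEGORY.get(scheme)


for example in ("http://www.loc.gov", "info:srw/schema/1/mods-v3.0",
                "urn:nbn:de:gbv:089-3321752945"):
    link = protocol_link(example)
    print(example, "->", link.value if link else "unknown scheme")
```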
There is a list of URI schemes at http://www.iana.org/assignments/uri-schemes,
the official IANA Registry of URI Schemes.
IANA is the Internet Assigned Numbers Authority.
What Does a URI Do?
Does a URI identify, locate, retrieve, dereference, name, resolve...
or what?
- Identify or Locate? Some say a URI identifies a
resource. Some say it locates a resource.
- ... or Name? There is a distinction among
these three (locate, name, identify); some say it is too subtle
to formalize, others disagree.
- Locate or Retrieve? This has more to do with URLs in
particular than URIs in general. Some say URLs locate a
resource, that is, they identify its location. Some say they retrieve the
resource.
- Retrieve or "dereference"? See below.
- Resolve? See below.
It may simplify matters to consider that a URI does one or both
of two things: identify and dereference.
Identify
An important class of URIs simply identifies a resource,
and is not intended to retrieve (dereference) or locate it. Some
of these are pure identifiers, serving the same purpose
as ISO OIDs did before URIs came along. 'info'
URIs are in this class.
Before 'info' came along people tended to use 'http' when a pure
identifier was needed (for example, for RDF). And in fact there
are many of these legacy 'http' identifiers in use today, and even
more being assigned. Some people think this is legitimate; others
(particularly, many people in the library and publishing community)
feel that 'http' is not a good scheme to use for a pure identifier,
because an 'http' URI is a URL, and as such must be actionable.
Look at the list of SRW
Schema identifiers and note that some of these are 'http'
URIs and others, 'info'. The owner of the schema gets to decide
which URI scheme to use.
Dereference
When you "click" on a URL (see below),
something is supposed to happen; typically a web page appears.
One might say that a retrieval has occurred - your web
client has retrieved the resource (the web page) from the web server.
The information retrieval community likes to think of this as retrieval but
that term has different connotations in other communities. Some
argue that if you retrieve a resource it no longer resides at the
location from which it was retrieved - it cannot be in two places
at once (the "retrieve a book" metaphor). Sometimes the
awkward phrase "retrieve a representation" is used, but
more popular is the term dereference which means roughly
the same thing.
Google has some interesting definitions of "dereference":
- Access the value pointed to by a pointer.
- Use a reference to access a data value.
- Retrieve the value stored at the referenced address. (So
here, retrieve and dereference really are the same!)
- (and, interestingly) Resolve a reference. (So here,
dereferencing and resolution are the same!) See resolution.
URIs, URLs, and URNs
As we said above, and almost everyone agrees, when you click on
a URL something is supposed to happen. However, not all URIs are
URLs -- not every URI is actionable in this sense
(that when you click on it something happens) -- in particular
and as we noted above, an 'info' URI is (in general) not actionable.
Originally, the URL, "Uniform Resource Locator", was
conceived; the URI was a generalization of the URL concept, along
with the URN, "Uniform Resource Name", which was an attempt
to define (more or less) "persistent" identifiers. So
for a time it was believed that URIs were partitioned into two
classes, URLs and URNs. (And for a short time, three, with the
addition of the URC, "Uniform Resource Citation", but
that never went anywhere.) However, a different view held that
the important distinction was between URI and URL (URIs identify
and URLs locate) and that URNs did not even fit into this model.
Much of this was sorted out (see RFC 3305, also published as a W3C
Note, 21 September 2001) when it was agreed:
- 'urn' is simply a URI scheme, like 'http' and 'info';
- 'url' is not; but
- URL is a useful but informal concept, referring to "a
type of URI that identifies a resource via a representation of
its primary access mechanism (e.g., its network
'location'), rather than by some other attributes it may have.
Thus ..., 'http:' is a URI scheme. An http URI is a URL. The
phrase 'URL scheme' is now used infrequently, usually to refer
to some subclass of URI schemes which exclude URNs." (Quote
from the W3C Note.)
Resolution
A URN is (theoretically) a persistent identifier for a resource,
independent of location or access method.
URN Conceptual Model
Conceptually, a URN maps to one or more URLs for the resource.
When a user activates (clicks on) a URN the browser finds
the set of associated URLs, selects one (perhaps based
on location, or perhaps on access method), and then attempts
to retrieve the resource. If the attempt fails the browser
might try another URL in the list. All of this is transparent
to the user.
If the resource is replicated on an additional server,
a URL is added to the list. If the resource is removed
from a server, a URL is deleted. If there is a single copy
of the resource, and it is moved, the URL is updated. In
any case the URN never changes.
The process of finding the list of URLs corresponding to a URN,
and selecting one, as described in the model above, is called resolution.
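The conceptual model can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the table URN_TABLE, the URN and mirror URLs in it, and the stand-in fetch function are hypothetical, and no real resolution service or network access is involved.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical resolution table: one URN mapped to the URLs of its copies.
URN_TABLE: Dict[str, List[str]] = {
    "urn:example:report-42": [
        "http://mirror-a.example.org/report-42",
        "http://mirror-b.example.org/report-42",
    ],
}


def resolve(urn: str,
            fetch: Callable[[str], bytes],
            table: Dict[str, List[str]] = URN_TABLE) -> Optional[bytes]:
    """Try each URL associated with the URN until one can be retrieved."""
    for url in table.get(urn, []):
        try:
            return fetch(url)
        except OSError:
            continue  # this copy is unavailable; try the next URL in the list
    return None  # unknown URN, or every URL failed


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real retrieval; pretends that mirror-a is down."""
    if "mirror-a" in url:
        raise OSError("server unreachable")
    return b"contents from " + url.encode()


result = resolve("urn:example:report-42", fake_fetch)
print(result)
```

Adding, removing or updating a copy only changes the list of URLs in the table; the URN used as the key stays the same, which is the point of the model.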
URN Namespaces and Syntax
The universe of URNs is partitioned into namespaces. Each is assigned
a namespace identifier (NID). See the IANA
Registry of URN Namespaces. So a URN consists of:
- the scheme - 'urn'
- colon separator - ':'
- the NID, e.g. 'nbn'
- another colon separator - ':'
- a namespace specific string (NSS) e.g. 'de:gbv:089-3321752945'
So the URN for the NBN (National Bibliography
Number) 'de:gbv:089-3321752945' is:
urn:nbn:de:gbv:089-3321752945
Note that the namespace specific string in this example includes
additional colon (':') separators, as prescribed by the definition
for the specific URN namespace, in this case the National Bibliography
Number, described in RFC
3188. For each URN NID there is an NSS definition. In
the NBN case, the definition prescribes additional structuring
(including a country and subauthority); however, from the point of view
of the URN syntax, the NSS is
simply an opaque, flat string.
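A short Python sketch shows how the three parts can be separated. The names Urn and parse_urn are hypothetical, and the sketch checks only the overall shape of the string, not the character rules for NIDs or any namespace-specific NSS definition.

```python
from typing import NamedTuple


class Urn(NamedTuple):
    scheme: str
    nid: str  # namespace identifier
    nss: str  # namespace specific string, treated as opaque


def parse_urn(urn: str) -> Urn:
    """Split a URN at its first two colons only; the NSS is left untouched."""
    parts = urn.split(":", 2)
    if (len(parts) != 3 or parts[0].lower() != "urn"
            or not parts[1] or not parts[2]):
        raise ValueError(f"not a URN: {urn!r}")
    # Normalizing the NID to lower case is a choice made for this sketch.
    return Urn("urn", parts[1].lower(), parts[2])


parsed = parse_urn("urn:nbn:de:gbv:089-3321752945")
print(parsed.nid)  # nbn
print(parsed.nss)  # de:gbv:089-3321752945
```

Limiting the split to two colons is what keeps the additional colons inside the NSS; interpreting them (country, subauthority) is left to code that knows the NBN definition.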
So what happened to URNs?
URNs never caught on because they tried to be too many things
and never really nailed down which:
- A persistent URL
- Location independent
- A resolution system
- A pure identifier
Persistence and location independence came to be thought of more
as social than technical problems. Other approaches were developed
rather than formalizing the URN concept.
The proposed URN resolution system never was fully developed.
And resolution is incompatible with the role of pure identifier.
More about why URNs are not suitable as pure identifiers