URI Resource Page
Latest News
We'll try to keep you up to date here on the latest relevant
developments in URIs and identifiers.
XRI Update October 2008
XRI Ballot Fails June 2008
Uniform Access to Metadata May 2008
Identifiers for Non-information Resources April 2008
URI Template Draft Specification Revised March 2008
'info:' URI Scheme Adds 'lc' Namespace October 2007
XML Namespace URIs April 2007
W3C Publishes TAG Finding: Metadata in URIs January 2007
URI Template October 2006
Metadata in URIs August 2006
NISO Identifier Roundtable July 2006
'info:' URI Scheme Now Officially Approved December 2005
'info:' URI Scheme Approved -- and "Almost" Official November 2005
New Draft of "mailto:" URI Scheme October 2005
IRI: the Internationalized Resource Identifier September 2005
DCC Workshop on Persistent Identifiers August 2005
A URI Framework for Controlled Vocabulary Terms and Codes July 2005
Identifiers vs. Resolvable URIs June 2005
The 'tag' URI Scheme April 2005
Historical URI Schemes Draft Guidelines Version 3 March 2005
URI Generic Syntax - Revision Complete February 2005
Duplicate Scheme Names: Good or Bad? January 2005
Proposed new Registration Procedure for URI Schemes Introduces Provisional Class of Schemes December 2004
New Draft of RFC 2396 December 2004
Registration of URI Schemes August 2004
XRI Update
(October 2008)
Discussions continue between OASIS and W3C over the fate
of XRI, the Extensible Resource
Identifier, developed by OASIS. Agreement hasn't been reached, but
there is progress, and XRIs are somewhat clearer.
XRI development began in 2003 when OASIS created a Technical Committee
to develop an abstract identifier, to identify non-information
resources.
A person, for example, is a non-information resource. You
can assign it an identifier, but you cannot retrieve it.
By contrast, a document is an information resource. You can
assign it an identifier and retrieve it by the identifier. For a
non-information resource, the best you can do is retrieve a
description.
An example of an XRI is:
=drummond
The '=' symbol signifies that the string following identifies a
person. In this case it is the person registered (in the XRI registry)
as 'drummond'.
Another example:
@boeing
The '@' symbol signifies that the string following identifies
an organization or company. This example assumes that Boeing
has registered itself as a company.
Top level objects -- people, companies -- are globally
registered, but for subordinate components there is delegation
of authority. Thus consider the hypothetical XRI:
@boeing*people*smith
Boeing assigns the components: in this example, ‘people’ subordinate
to ‘boeing’, and ‘smith’ subordinate to ‘people’.
XRIs, URIs and HXRIs
Originally, OASIS wanted XRI to be a URI scheme. Thus
for example the XRI @boeing would be expressed as an XRI
URI as follows:
xri://@boeing
Attempts to create an XRI URI scheme met with strong
resistance from the W3C and have been abandoned. But there
is yet another type of identifier associated with an XRI:
the HXRI, an 'http' URI (one of perhaps several) associated with
an XRI, used to retrieve a descriptor of the resource identified
by the XRI.
For example,
is an HXRI for the XRI =drummond,
where xri.net is an XRI resolver.
In general an HXRI for a given XRI is an 'http' URI where the
authority component ends with an XRI resolver, and the path component
is the XRI. Another HXRI for the same XRI is:
Where xri.freexri.com is a different XRI resolver.
Resolution
Resolution of HXRIs is crucial to the viability of XRIs. If
every resolution must go through a central XRI resolver (or one of
a few central resolvers), resolution will be cumbersome.
It hasn't been completely settled, but resolution may work something
like this. Consider again the hypothetical XRI @boeing*people*smith. An
HXRI for this XRI might be
http://xri.boeing.com.xri.org//@boeing*people*smith
An XRI-aware client application will recognize xri.org as an XRI
resolver, strip off xri.org, and the resultant URI will be
http://xri.boeing.com//@boeing*people*smith
Thus the request goes straight to boeing.com and xri.org is
bypassed. And of course boeing.com knows what to do with it,
since it coined the HXRI to begin with. This is a (proposed) feature
of XRI resolvers: any company (Boeing for example) can coin URIs
using the domain name of the XRI resolver without registering that
URI within that domain.
Of course, a non-XRI-aware application will send the http request
to xri.org, which will strip off "xri.org" and pass on
the resulting URI to xri.boeing.com. So with this scheme,
reliance on central resolvers, though not eliminated, would be
reduced.
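The bypass step described above can be sketched in Python. This is only an illustration of the proposal as summarized here, not part of any XRI specification: the function name, the list of known resolvers, and the suffix-matching rule are all assumptions.

```python
# Hypothetical list of resolver domains that an XRI-aware client knows about.
KNOWN_RESOLVERS = ("xri.org", "xri.net")


def bypass_resolver(hxri, resolvers=KNOWN_RESOLVERS):
    """Strip a trailing resolver domain from the authority of an HXRI.

    The URI is returned unchanged when its authority does not have the
    form "<something>.<resolver>".
    """
    scheme, sep, rest = hxri.partition("://")
    if not sep:
        return hxri
    # Everything up to the first "/" is the authority; the rest is the path.
    host, slash, path = rest.partition("/")
    for resolver in resolvers:
        suffix = "." + resolver
        if host.endswith(suffix) and len(host) > len(suffix):
            return scheme + "://" + host[: -len(suffix)] + slash + path
    return hxri


print(bypass_resolver("http://xri.boeing.com.xri.org//@boeing*people*smith"))
```

A client that does not recognize the resolver simply sends the original URI to xri.org, which is the fallback case described above.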
XRI Ballot Fails
(June 2008)
XRI, the Extensible Resource Identifier,
is a new identifier type proposed by OASIS. It
is characterized as an "abstract" identifier,
independent of location and protocol.
OASIS recently balloted the
XRI syntax and resolution specifications;
both ballots failed. The balloting may have been influenced
by W3C, which has taken a position opposing
XRI. It seems that more discussion
between OASIS and W3C is needed before these specifications
can be approved.
The relationship of XRIs to URIs is somewhat unclear. An
XRI begins with an optional prefix, “xri://”; however,
'xri:' has not yet been proposed as a URI scheme. The intention is to
register the scheme with IANA if XRI becomes an OASIS Standard.
Some examples of XRIs, found in the syntax specification, are:
These examples are cited in the oreillynet.com blog posting of May 29, "XRIs
Bad, URIs Good" which points out that the spec doesn't give much of
a clue what these XRIs mean.
The W3C Technical Architecture Group has stated that they "are not satisfied
that XRIs provide functionality not readily available from http: URIs." Further
discussion between W3C and OASIS will likely occur in the coming months.
Uniform Access to Metadata
(May 2008)
Given a URI, obtain metadata for the resource it identifies.
The W3C has been discussing means of uniform access to
metadata, where "metadata" refers to bibliographic, access
control, and other types of metadata, or in general, a description of the
resource. The metadata (or description) is assumed to be logically
separate from the resource. Thus the need is to:
Develop a uniform method such that for a given URI we may obtain metadata
for the resource it identifies without necessarily accessing the resource.
The W3C has articulated the following motivations for this need:
- Uniform access to metadata is required because the specific method
for extracting metadata from content will vary wildly from one media type
to the next.
- Many media types (e.g. application/x-compressed) have
no place to put metadata at all.
- We want to be able to obtain metadata without necessarily retrieving
the content, because the resource might be something we don't want to load
(for reasons of size, license, or other kind of application suitability).
- Sometimes metadata is generated independently of content, and we don't
want to (or can't) modify existing content streams by inserting metadata
into them.
There are several proposed approaches:
- Via the HTTP Link header
- Via the HTML <link> element
- Via the HTTP 303 status code
- Via a new GET response header
- Via a new HTTP request method ('MGET' for example)
- Via HTTP content negotiation
- Via the Archival Resource Key (to create a metadata link for an ARK,
append "?" to the URI)
All of these approaches are controversial, some more than others. We
will report on further developments.
Identifiers for Non-information Resources
(April 2008)
What happens when you try to retrieve a resource that is inherently
not retrievable?
A URI, by definition, identifies a resource. The
definition of resource is “anything
that has identity” (admittedly, circular). Typically
a resource is a web page, a document, something “network
retrievable”.
But a physical object - a person for example - has identity,
is therefore a resource (by definition), and can be assigned a
URI. Often, such a URI is actually an ‘http:’ URL:
http://www.example.com/joe-smith, for example.
So what happens when that URL is seen on a web page and you click
on it? What do you want to happen?
There are also abstract resources. For example, the
Dublin Core concept of “title” is assigned
a URI: specifically, http://purl.org/dc/elements/1.1/title.
Dublin Core 'title' is assigned a URI because the concept must
be unambiguously identified, to distinguish it, not only from other
Dublin Core concepts ('contributor' for
example) but even from the concept of title within
a different metadata element set. Title, the concept,
is a resource - not a physical
resource, but you still can’t “retrieve” it.
It’s called an abstract resource. And
its URI is an ‘http:’ URL, so (as we asked above)
what happens when that URL is seen on a web page and you click
on it? What do you want to happen?
We have three types of resources, then: physical,
abstract, and network-retrievable. Web architecture
distinguishes two types: information resources and non-information
resources. (Web architecture doesn't
like the expression "retrievable resource", preferring "information
resource". Physical and abstract resources are combined into
the single category "non-information resource".)
So the questions above, combined and rephrased:
When the URI for a non-information resource is an ‘http:’ URL,
what happens when that URL is seen on a web page and you click
on it? What do you want to happen?
For context, first consider what happens when you click on a URL
for an information resource.
When you click on a URL that you see on a web page, typically
an http request goes to the server named in the URL (e.g.
for the URL http://www.loc.gov/standards/sru,
the http request is to the server www.loc.gov). The
response to that request normally is the web page (or other
type of information-resource) named by the URL. A status
code is returned, along with or in place of the resource. For
a normal completion (the resource is supplied normally) the status
code is “200 – ok”. If
the resource isn’t there, the status code returned might
be “404 – not found”, or it might be “303 – see
other”. This latter code (theoretically) indicates that the
server thinks you should be redirected to a (specific) different
URL. In that case, the server should supply the suggested URL,
and you, the user, may never see
the “303” code, because your client might perform
the redirection automatically.
Now consider these three status codes - 200, 404, 303 - in the
context of a non-information resource; is one (or more) of these
appropriate?
- Certainly not "200 - ok". That basically means "here comes the content
you requested"; when the status is '200', content is
expected to be included in the response. For a non-information
resource, there is no content.
- "404 - not found" might (on the surface) seem appropriate, but
that would cause chaos on the web. 404 statuses would be generated
by the billions; huge error reports would
be sent to web authors from network administrators telling them
to fix the broken links.
- A code of '303' would not seem to be appropriate
- "the resource isn't at this URL, try this alternative URL instead" -
it isn't going to be there either.
However, status code '303' does seem to be what web architecture
prescribes for the attempted retrieval of a non-information resource. It
is a somewhat controversial approach, and is currently the subject
of re-examination.
The '303' status itself is the subject of some confusion. The
formal name ascribed to '303' status within the http protocol standard
is "see other" and the definition is, essentially, "The response
to the request can be found under a different URI".
However - and this is the crucial point - some web architects assert that
the response available at the alternative URI is not the desired
resource itself (it couldn't be anyway, it's a non-information
resource) but rather, it is metadata about the resource.
This approach to dealing with the attempted retrieval of a non-information
resource is somewhat controversial and raises a number of questions,
even if you assume the latter-day interpretation of '303' status
(not all experts do) that the alternative URI points to metadata
about the desired resource. Before describing the controversy,
some additional background will help provide further
context. Two points:
- Some web (and semantic web) architects take the position
that URIs in general should be "actionable". One
view is that any URI, no matter what the scheme, must be actionable.
That's an extreme view, not unanimously held. But most
hold the view that an 'http:' URI should always be actionable,
even for a non-information resource. When asked what should be
retrieved for a non-information resource, the answer invariably
is "a description of the resource", i.e. metadata.
- There is increasing interest in developing a uniform method
for obtaining metadata for a resource without necessarily
having to retrieve the resource itself. This problem is characterized
by the W3C as Uniform Access to Metadata. (We plan to
explore this subject in a future report.)
These two points are obviously related, particularly in the case
of a non-information resource. In fact a current suggested method
addressing both problems is the '303' status: When an attempt is
made to retrieve a non-information resource, return http status
code '303' along with the URL of a description of the resource.
In fact, one of the key W3C architects is quoted as saying:
200 means (basically) “Here comes the content of the
document you asked for” and 303 means “Here is
the URI of document ABOUT the thing you asked for”.
This approach leaves some questions unanswered:
- Suppose you have the URL of a known information resource.
How do you explicitly request the description, rather than the
resource itself?
- Suppose you have the URL of a known resource, but you don't
know if it is an information resource or a non-information resource.
Your request to retrieve that resource results in a '303' status.
You still don't know if it is an information resource or a non-information
resource. (It could be an information resource but the server
might not have it immediately available, so it does
the next best thing and supplies a description.)
- A server gets a request for a non-information resource. The
server knows about the resource, and so it returns a '303' status.
But the server does not have (nor does it know of) a description.
A '303' status should be accompanied by a URL (for a description
of the resource). Should the server simply return a '303'
status without an accompanying URL, contrary to the prescribed
approach?
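The suggested behavior, together with the last open question, can be sketched as a small decision function in Python. This is a minimal sketch under stated assumptions, not a prescribed implementation: the resource table, its field names, and the description URL are hypothetical, and returning '303' without a Location for the undescribed case is just one possible choice.

```python
# Hypothetical catalogue: each path records whether the resource is an
# information resource and, if known, where a description of it lives.
RESOURCES = {
    "/standards/sru": {"information": True, "description": None},
    "/joe-smith": {"information": False, "description": "/about/joe-smith"},
    "/title": {"information": False, "description": None},
}


def respond(path, resources=RESOURCES):
    """Return (status, headers) for a GET on path under the '303' approach."""
    record = resources.get(path)
    if record is None:
        return 404, {}  # the server knows nothing by that name
    if record["information"]:
        return 200, {}  # the content itself would follow in the body
    if record["description"]:
        # Non-information resource: redirect to a document ABOUT it.
        return 303, {"Location": record["description"]}
    # Known non-information resource with no description available.
    # The approach does not settle this case; the sketch returns a bare 303.
    return 303, {}


print(respond("/joe-smith"))
```

Note that a client receiving the 303 still cannot tell, from the status alone, whether the original URL named an information resource.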
We'll keep an eye on this and report further.
URI Template Draft Specification Revised
(March 2008)
A new version of the Internet Draft, "URI Template", has
been released. (See URI Template, October
2006.)
A URI Template is a URI-like string that contains embedded expressions
(delimited by curly braces, '{' and '}'), called "expansions". The
template itself is not a URI; a
template processor replaces expansions with their calculated
values to produce a bona fide URI.
As a simple example, given the following URI Template:
http://www.loc.gov/standards/{standard}
And the following variable value:
standard = "mods"
The expansion of the URI Template is:
http://www.loc.gov/standards/mods
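A minimal sketch of this simple substitution in Python follows. It handles only plain "{name}" expansions, does no percent-encoding, and does not implement the operators defined in the draft; the function name and the choice to expand undefined variables to an empty string are assumptions.

```python
import re


def expand(template, variables):
    """Replace each {name} in template with the value of variables[name].

    Undefined variables expand to the empty string in this sketch.
    """
    return re.sub(
        r"\{([^{}]+)\}",
        lambda match: str(variables.get(match.group(1), "")),
        template,
    )


print(expand("http://www.loc.gov/standards/{standard}", {"standard": "mods"}))
```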
For a more complex example, look at the following template:
The part after '?' says: for each of the variables version, operation,
and query, join them in the form "variable=value", separated
by '&' (ampersand).
For the following variables:
- version: 1.1
- operation: searchRetrieve
- query: dinosaur
The expansion of the URI Template is the SRU Request:
http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&query=dinosaur
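The join behavior can be imitated in a few lines of Python. This sketch does not parse the draft's template syntax; it only reproduces the described result, and the function name and parameters are assumptions.

```python
def join_expansion(base, names, variables, separator="&"):
    """Append "name=value" pairs for the defined names, joined by separator."""
    pairs = [f"{name}={variables[name]}" for name in names if name in variables]
    return base + "?" + separator.join(pairs)


sru_request = join_expansion(
    "http://z3950.loc.gov:7090/voyager",
    ["version", "operation", "query"],
    {"version": "1.1", "operation": "searchRetrieve", "query": "dinosaur"},
)
print(sru_request)
```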
URI Template is an Internet Draft. It is available at http://www.ietf.org/internet-drafts/draft-gregorio-uritemplate-03.txt.
It is still a work in progress.
'info:' URI Scheme Adds 'lc' Namespace
(October 2007)
The 'info' URI Scheme Registry added
the namespace 'lc' on October 15. See Info
URIs for Library of Congress Identifiers.
XML Namespace URIs
(April 2007)
A new note addresses the question "what form should an XML
namespace URI take?" and compares and contrasts XML namespace
URIs with schema location URIs and schema identifiers. See XML
Namespace URIs (and schema location URIs, and schema identifiers).
W3C Publishes TAG Finding: Metadata in URIs
(January 2007)
The W3C Technical Architecture
Group (TAG) has published a TAG Finding, The
use of Metadata in URIs, January 2, 2007.
An earlier report, in August 2006, describes the finding; see Metadata in URIs. The August version was a draft finding,
not yet official, but the final published report is substantially
the same.
Comparing the earlier draft with the official publication, the
following changes are noted:
- Deleted from earlier draft:
  - The section: "Avoid Dependencies on metadata".
  - Good Practice: "Guess information from URIs
    only when the consequences of an incorrect guess are acceptable".
- Added:
  - Good Practice: "When saving to filesystems
    that use extensions to represent media types, user agents
    MUST choose an extension that is consistent with the media
    type of the representation."
  - A new section: "Confusing or malicious metadata".
URI Template
(October 2006)
A new Internet Draft describes the proposed URI Template, a string
that may be transformed into a URI by substituting values for
variables that are embedded within the string. A URI template may be thought
of as representing a class of URIs; the template representation
is useful for conveying the general structure of URIs within
the class.
The following template could represent the class of LCCN
URIs:
Substituting an LCCN for "{lccn}" produces an LCCN
URI. For example, substituting the LCCN n78089035 produces:
Template Variables
The draft also introduces template variables: the parameterized
components of a URI template. A list of values may
be input to a process representing the URI template, resulting
in the production of a URI within the class represented by the
URI template.
For example, consider the template:
http://www.knuckleball.com/{a}.{b}
If the following table of variables and corresponding values is
input to the process corresponding to this template:
Variable | Value
a | hoyt
b | wilhelm
This URI will be produced:
http://www.knuckleball.com/hoyt.wilhelm
Metadata in URIs
(August 2006)
The W3C Technical Architecture
Group (TAG) has published a draft finding, The
use of Metadata in URIs.
URI naming authorities often define structures allowing URIs to carry metadata
about identified objects. Metadata might include, for example, creation date,
MIME type, or even a digital signature to verify the integrity of the object’s
content. There are benefits to an orderly mapping from metadata to URI, and
naming authorities often use conventions that facilitate association of a URI
with its corresponding object. Conventions based on filename or customer id
are examples. But there can be drawbacks.
The TAG finding discusses the suitability of embedding
metadata in a URI, and of inferring information from URI metadata.
Among the recommendations from the report are:
- URIs intended for direct use by people (as opposed to machines)
should be easy to understand, and should be suggestive of the
resource actually named.
- People should not infer or guess information from a URI unless
the consequence of a wrong guess is acceptable.
- Software should not rely on metadata inferred from a URI, except
as formally documented in a standard or applicable specification.
To briefly illustrate these points, consider the following (hypothetical)
advertisement, perhaps on the outside of a city bus:
For the Best Chicago Weather Information
go to
www.weather.com/chicago
As a printed URI, it is intuitive, easy to remember, and suggestive
of the resource identified.
Suppose the URI were instead:
http://www.weather.com/123Hx67v4gZ5234Bq5rZ
You would certainly find this annoying if
the URI were intended for human use. On the other hand it would
be a perfectly appropriate URI if it were intended strictly for
machine use.
123Hx67v4gZ5234Bq5rZ might be based on a database key facilitating
efficient access to the weather data at the server.
You might infer from the (first) URI that you could get the
weather in Boston, if you were to try:
That might work and it might not. The advertisement doesn't take
responsibility for providing weather information for anywhere other
than Chicago, but there is little risk in trying -- little risk
for a person. Software, on the other hand, should not
make this inference.
Suppose instead the advertisement said:
For the Best Local Weather Information
go to
www.weather.com/your-zip-code-here
Then, you can reasonably assume that a weather report is available
by substituting a zip code.
The full text of the draft finding is available at: http://www.w3.org/2001/tag/doc/metaDataInURI-31-20060609.html.
NISO Identifier Roundtable
(July 2006)
NISO, the National Information
Standards Organization, held an Identifiers Roundtable, March 13-14,
at the National Library of Medicine in Bethesda, Maryland.
NISO, which has a long-held interest in identifiers (DOI, ISBN,
ISSN, SICI, 'info:', etc.), brought together experts representing
libraries, vendors,
information centers, e-learning systems, content providers and
aggregators. They discussed means to promote the long term sustainability
of identifiers: identifier-services
infrastructure, community and institutional support, business models,
registries; how to create, implement and support identifiers
and identifier systems; how to address confusion over identifiers,
confusion which for several years has driven up the cost of developing
and managing
systems. Topics also included: "what makes a good identifier?",
identifier roles, identifier attributes,
identifiers and the web, embedded identifiers, and standards needed.
Observations and Conclusions
Some observations and conclusions from the meeting:
- Identifier infrastructure must support services for creating
identifiers, binding them to objects, and resolution to obtain
the identified object or its metadata.
- Long term viability of identifiers requires viable business
models.
- Identifiers, particularly those exchanged between systems,
should be based on public standards, to prevent collisions between
identifiers developed in different contexts.
- There is less disagreement on the nature and properties of
identifiers than thought. Perceptions of disagreement arose from
differing contexts of discussion, in particular the differing
intended uses of specific identifiers.
- A registry of identifier schemes should be developed, including
associated services and policies for each scheme.
- The "info" URI registry should become a focal
point for community identifier needs.
Report
The workshop report is available at http://www.niso.org/news/events_workshops/ID-workshop-Report2006725.pdf
'Info:' URI Scheme Now Officially Approved
(December 2005)
We reported last month that the IESG had approved
the 'info:' URI scheme. It has now been listed by IANA (the Internet Assigned
Numbers Authority) at http://www.iana.org/assignments/uri-schemes.html,
their official register of URI schemes.
The register lists permanent, provisional,
and historical schemes. 'info:'
is conferred "permanent" status.
The 'info:' URI scheme is defined at http://www.ietf.org/internet-drafts/draft-vandesompel-info-uri-04.txt.
More information about this scheme is available on our 'info:'
Resource Page.
'Info:' URI Scheme Approved -- and "Almost" Official
(November 2005)
The IESG (Internet Engineering Steering Group) has approved the
document: The "info" URI
Scheme for Information Assets with Identifiers in Public
Namespaces. This effectively means that 'info:' may now be considered
an approved URI scheme.
The action was announced November 3 in a memo from the IESG,
responsible for technical management of IETF activities and the
Internet standards process, to
the IETF: Document
Action: 'The "info" URI Scheme for Information
Assets with Identifiers in Public Namespaces' to Informational
RFC.
As of November 16 "info" has not yet been added to the Official
IANA Registry of URI Schemes. This
may take several weeks because the process for maintaining the registry
is currently being revised.
New Draft of "mailto:" URI Scheme
(October 2005)
In September we reported on the IRI (Internationalized Resource
Identifier). Now, a new internet
draft proposes changes to the mailto
URI Scheme, for compatibility with IRIs.
The 'mailto' URI scheme defines the URI format for designating
an email address. In its simplest form, a 'mailto:' URI
looks like:
The URI scheme is 'mailto:' and the resource identified by the
URI is an email address; in the above example it is "someone@somewhere".
Typically, when a user clicks on a 'mailto:' URI a browser will
construct an email message with the recipient field set as indicated
and otherwise empty, leaving the user to input the subject, text,
and other fields. For example, the email address nellie.fox@59sox.com
might be coded in html as:
<a href="mailto:nellie.fox@59sox.com">Nellie Fox</a>
So that the recipient's name is visible on the web page and when
clicked the URI is activated and an email message is constructed.
Additional email parameters besides the recipient address may
be included in the URI, using the standard form for a URI query
and parameters - '?' preceding the query, and '&' separating
parameters. Thus for example the following ...
mailto:someone@somewhere?subject=RSVP%20November%201%20Meeting&body=Will%20Attend
... would generate an email message to "someone@somewhere", with
subject "RSVP November 1 Meeting" and with body "Will Attend". Note that spaces,
which are not allowed to occur in a URI, are percent encoded -- they are replaced
by '%20', which is the escape character followed by the two-digit hex ASCII code
for space.
The new internet draft proposes to extend the existing 'mailto:' scheme definition
to allow characters to be percent-encoded based on UTF-8, offering a more consistent
way of dealing with non-ASCII characters.
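As an illustration only, the percent-encoding of 'mailto:' parameters can be sketched with Python's standard urllib.parse.quote, which encodes non-ASCII characters as UTF-8 by default. The helper name mailto is an assumption, and the sketch leaves the address itself unencoded.

```python
from urllib.parse import quote


def mailto(address, **fields):
    """Build a 'mailto:' URI; field values are percent-encoded as UTF-8."""
    query = "&".join(
        f"{name}={quote(value, safe='')}" for name, value in fields.items()
    )
    return f"mailto:{address}" + (f"?{query}" if query else "")


print(mailto("someone@somewhere", subject="RSVP November 1 Meeting", body="Will Attend"))
print(mailto("someone@somewhere", subject="Culinary Café"))
```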
For example, suppose you want "Culinary Café" to be the subject.
The é (e-acute) character is encoded in UTF-8 as the two bytes C3 A9, so this
subject field would be encoded as:
&subject=Culinary%20Caf%C3%A9
IRI: the Internationalized Resource Identifier
(September 2005)
URIs have traditionally been limited to English words and
Latin characters. Many languages however are based on scripts
with alphabetic characters other than A-Z; these characters
are often transcribed into Latin letters for use in URIs.
These transcriptions introduce ambiguities.
The URI limitation owes to historical limitations of operating
systems and software. But nowadays, software
can handle a wide variety of scripts and languages, and
people want to use them in identifiers.
RFC 3987 defines the IRI -- Internationalized
Resource Identifier, a complement to the URI -- the
Uniform Resource Identifier.
The traditional URI is defined as a sequence of characters
from a limited subset of the US-ASCII character repertoire.
The permitted subset consists of uppercase
letters (A-Z), lowercase letters (a-z), decimal digits, and a few additional
characters. Some of these additional characters are reserved; most important
is the percent (%), used as an "escape" character. ‘%’ followed
by two hex digits in a URI is used to signify the ASCII character represented
by the two hex digits. For example, the space character is not allowed within
a URI string, so "%20" is used in its place. (Hex 20 is the ASCII value
for space.)
Thus the following is not a valid URI:
http://www.loc.gov/this and that
But this is:
http://www.loc.gov/this%20and%20that
This works only for characters that can be represented
by two hex digits,
i.e. the US ASCII set. Actually, the allowable URI characters -- those that
may be “percent encoded” --
are those in the range hex 20 (space) through 7F (delete) -- equivalently, decimal
32 through 127.
IRIs are defined similarly to URIs, but the set of allowed
characters is extended beyond hex 7F. The IRI definition
provides a mechanism to transform any IRI to
a canonical form which conforms to the URI syntax: each character outside
of the allowable URI set is converted to a sequence of one
or more UTF-8 octets,
each of which is then converted to ‘%xx’, where ‘xx’ is
the hex value of the octet.
This mapping from an IRI to a URI produces a syntactically valid URI, and
it is an unambiguous transformation (applying it to an existing URI has no
effect),
and so every URI is, syntactically, a valid IRI.
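The mapping can be sketched in Python as follows. This is a simplified illustration, not a full RFC 3987 implementation: it only percent-encodes non-ASCII characters via their UTF-8 bytes and ignores any special handling of host names. The function name and the example.org address are assumptions.

```python
def iri_to_uri(iri):
    """Percent-encode every non-ASCII character of iri via its UTF-8 bytes."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)  # ASCII characters pass through unchanged
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://www.example.org/caf\u00e9"))
```

Because ASCII characters pass through untouched, applying the function to a string that is already a URI returns it unchanged.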
DCC Workshop on Persistent Identifiers
(August 2005)
A Digital Curation Centre (DCC) Meeting on Persistent Identifiers
was held June 30 - July 1 at the University of Glasgow. A
meeting report is available in Ariadne.
A URI Framework for Controlled Vocabulary Terms and Codes
(July 2005)
The library community
has an interest in the development of a framework to represent
controlled vocabulary terms and codes as
URIs. The framework could extend to data/metadata elements. No
such framework yet exists; this article takes a preliminary look
at some approaches considered.
- Assign ‘http:’ URIs
The main benefit of this approach is that the DNS facilitates the decentralized
creation of ‘http:’ identifiers. Another major feature is
that all browsers recognize ‘http:’.
This approach has drawbacks though. Overuse of the ‘http:’ scheme,
out of convenience, causes considerable confusion - an ‘http:’ URI
is supposed to be resolvable, although protocol experts do point out that
according to a careful reading of the http protocol, this isn’t
strictly true. But not everyone is comfortable with that argument, as the ‘http:’ protocol
is defined by a huge, complex document that we simple folk will never read
carefully.
And the confusion caused by using an ‘http:’ URI for
a pure identifier is illustrated in the example (see Identifiers
vs. Resolvable URIs, June 2005) where an XML namespace is identified
by an 'http:' URI. People cannot resist the urge to click on http://www.loc.gov/z3950/agency/zing/srw/diagnostic/.
The maintenance agency for that XML
namespace receives regular “broken link” reports, because it
is a pure identifier and does not resolve.
Another argument for using ‘http:’ is that any URI, even
for a controlled term, though seemingly an identifier, should resolve
to something,
even if only a human-readable definition of that term, and that “something” would
likely resolve via http. There are a couple of counterarguments here. When
resolution is a secondary/incidental function, it often is neither
reliable nor predictable. And when resolution is a primary function, then conceivably
(perhaps likely), a protocol other than HTTP would be used for resolving
terms.
One of the key features of the ‘http:’ scheme, decentralization,
may be a drawback for controlled terms, where decentralization isn’t
always desirable. It might be useful to have some coordination
exercised over the authority to register identifiers, so that, for example, a given
term isn’t defined many times.
- New URI scheme
One of the alternatives to using ‘http:’ URIs is to define and
register a new URI scheme. The benefit of this approach is name recognition. For
example, consider the hypothetical URI:
This would be an identifier for the MARC organizational
code 'alaldse' (see "Sub Schemes" below), based on a URI
scheme, 'terms:'. Casting codes in this manner would give the 'terms:' scheme
much more visibility
than if it were cast within the 'http:' framework, for example, as:
http://www.loc.gov/terms/marcOrg/alaldse
The drawbacks of this approach are (1) browsers are not going
to recognize an unlimited (or even a large) set of URI schemes,
and (2) URI schemes
are difficult to register.
- Sub Schemes
An alternative to the above two approaches -- (1) ‘http:’,
and (2) new URI scheme -- is to define sub schemes: “namespaces” within
existing schemes. Schemes that provide sub schemes are ‘urn:’ and ‘info:’.
Note: 'info:' is not yet an approved URI scheme, so some may take issue with
its characterization as an "existing" scheme. We consider it
to be a de facto scheme.
Suppose (as above) we want to assign URIs for MARC
organization codes. The code 'alaldse'
(used in the above example) is used to represent: “Duck
Springs Elementary School (Attalla, AL)”. A possible URI for this
code would be:
info:terms/marcOrg/alaldse
This assumes that an info namespace, “terms”, is defined,
and also assumes a sub-authority, “marcOrg”; all
of this is hypothetical, just an example. Or (with similar
assumptions for URN), it could be represented as:
urn:terms/marcOrg/alaldse
Which should it be: 'info:' or 'urn:'? This might depend on whether
the URI is to be in the identifier or resolvable class. (see Identifiers
vs. Resolvable URIs, June 2005.)
There is talk of
a protocol function that would be defined for terms, so that, for example,
'urn:terms/marcOrg/alaldse' would actually resolve -- to the string: “Duck
Springs Elementary School (Attalla, AL)”. In that case, and
if this sort of resolution is considered a primary function of the URI, then
perhaps it should be cast as a URN. If the URI is intended primarily as
an identifier, then perhaps it should be cast as an 'info:' URI. In general,
URNs are resolvable and 'info:' URIs are not (and there are exceptions for
both).
We look forward to exploring these ideas further in subsequent
articles.
Identifiers
vs. Resolvable URIs
(June 2005)
It is useful to distinguish a URI whose primary purpose is to serve as an identifier
from one whose primary role is to access a resource. Thus we have the identifier and resolvable URI
classes.
This is a useful abstraction for modelling, not a dichotomy:
often, there isn’t a clean distinction, and some URI schemes
don’t fall neatly into either class. Identifier URIs may
also resolve (for example, to a description of the identified object),
and certainly, resolvable URIs serve as identifiers. The distinction
is by primary role.
Identifier Class
In the identifier class we have for instance XML namespace
identifiers, and protocol objects. An example of both is found
in the following XML fragment.
<diagnostic xmlns="http://www.loc.gov/zing/srw/diagnostic/">
<uri>info:srw/diagnostic/1/38</uri>
<details>10</details>
<message>Too many boolean operators, the maximum is 10.
Please try a less complex query.</message>
</diagnostic>
This is a portion of an SRW response; it returns a diagnostic
to an SRW client. The URI "http://www.loc.gov/z3950/agency/zing/srw/diagnostic/" identifies
the namespace for the XML element <diagnostic>. The
URI "info:srw/diagnostic/1/38" is an identifier for the
actual diagnostic.
"http://www.loc.gov/z3950/agency/zing/srw/diagnostic/"
is not a resolvable URI; if you click on it you're told: “Page
Not Found”. It identifies an XML namespace, which is an abstraction
(it has no physical manifestation), so it would be meaningless to "resolve
to the namespace". That's not to say it couldn't resolve
to something (for example, a human-readable description of the
namespace), but whatever it resolved to would be unpredictable and
not machine-processable. So this URI is in the identifier class:
whether it resolves or not is incidental; its primary purpose
is to identify.
Similarly, "info:srw/diagnostic/1/38" is an identifier,
in this case identifying a diagnostic condition, and presumably
the consumer of this URI (an SRW client) will look up this URI
in its local diagnostic table. This URI could resolve, for example
to the string: "Too many boolean operators, the maximum
is 10. Please try a less complex query." This would
serve no purpose in terms of protocol operation, though it might
be useful for a protocol developer, but again, that would be incidental
resolution only. Thus the primary purpose of this URI is to identify
an object and so it too is in the identifier class.
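As a minimal sketch of the local lookup described above, an SRW client might keep a small table keyed by diagnostic URI. The table, the function name describe_diagnostic, and the fallback text are assumptions made for illustration; only the URI and message for diagnostic 38 come from the example above.

```python
# Hypothetical local diagnostic table. Only the entry for diagnostic 38 is
# taken from the example in this article; a real client would carry more.
DIAGNOSTICS = {
    "info:srw/diagnostic/1/38": (
        "Too many boolean operators, the maximum is 10. "
        "Please try a less complex query."
    ),
}


def describe_diagnostic(uri):
    """Look the identifier up in the local table; nothing is fetched."""
    return DIAGNOSTICS.get(uri, "Unknown diagnostic: " + uri)


print(describe_diagnostic("info:srw/diagnostic/1/38"))
```

The URI is used only as a key: the client never tries to dereference it, which is what places it in the identifier class.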
Here's another example, an identifier for an XML schema.
info:srw/schema/1/mods-v3.0
This is an identifier URI, not actionable. It is used within the protocol
to identify a schema, in this case the MODS schema at http://www.loc.gov/standards/mods/v3/mods-3-0.xsd.
An SRW request includes a parameter allowing the client to request
that response records be returned according
to a specific schema. If
the MODS schema is requested, this URI is supplied as the value
of that parameter.
Resolvable Class
Resolvable URIs, referred to informally as URLs, retrieve an object, access
a resource – these are your basic actionable (also referred to as "dereferenceable")
URIs. When you click on http://www.loc.gov/standards/uri/news.html,
for example, your expectation is that the web page URI Resource Page:
Latest News will appear.
As we noted above, info:srw/schema/1/mods-v3.0 is an
identifier for a schema. That schema may reside in several places.
One is http://www.loc.gov/standards/mods/v3/mods-3-0.xsd and
another is http://www.loc.gov/srw/mods-3-0.xsd;
these are both resolvable URIs. A third is http://www.loc.gov/z3950/agency/zing/srw/mods-3-0.xsd, which
is a different URI but the same location as the second. These three
URIs serve well as locators, that is, for retrieving an
object, but not as identifiers, because they are neither unique
nor persistent.
A schema table lists both the identifier and a retrieval URL
for a number of schemas used by SRW. The identifier is used
within protocol exchanges. The URL would be used (for example
by developers) to retrieve the schema. For example (aside from MODS), the Dublin Core
schema, at URL http://www.loc.gov/srw/dc-schema.xsd, is
identified by the URI info:srw/schema/1/dc-v1.1.
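The relationship between one identifier and several locators can be sketched as a simple mapping. This is only an illustration: the dictionary layout and the function names locations_for and identifier_for are assumptions, while the URIs are the ones quoted in this article.

```python
# Hypothetical in-memory form of a schema table: each identifier URI maps to
# one or more retrieval URLs. The URIs are those quoted in this article.
SCHEMA_LOCATIONS = {
    "info:srw/schema/1/mods-v3.0": [
        "http://www.loc.gov/standards/mods/v3/mods-3-0.xsd",
        "http://www.loc.gov/srw/mods-3-0.xsd",
        "http://www.loc.gov/z3950/agency/zing/srw/mods-3-0.xsd",
    ],
    "info:srw/schema/1/dc-v1.1": [
        "http://www.loc.gov/srw/dc-schema.xsd",
    ],
}


def locations_for(identifier):
    """Return the known retrieval URLs for a schema identifier."""
    return list(SCHEMA_LOCATIONS.get(identifier, []))


def identifier_for(url):
    """Return the identifier whose schema can be retrieved at this URL, if any."""
    for identifier, urls in SCHEMA_LOCATIONS.items():
        if url in urls:
            return identifier
    return None
```

Several locators lead back to the same identifier, which is why the identifier, not any one URL, is what travels in protocol exchanges.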
Resolvable URIs do not need to be 'http:' URIs. For example, the
URI:
Resolves to a bibliographic
description of a doctoral thesis: Modelling
Chemical Reactions: Theoretical Investigations of Organic Rearrangement
Reactions.
And as a hypothetical example, suppose we develop a URI scheme,
'terms:', and define the URI:
for the MARC organization code 'alaldse', the code for Duck
Springs Elementary School (Attalla, AL). This could resolve
to the string: “Duck Springs Elementary School (Attalla, AL)”.
We plan to explore this further in next month's article.
The 'tag' URI Scheme
(April 2005)
The IETF recently approved the 'tag'
URI scheme (see approved
schemes), for the creation of unique identifiers. 'tag' URIs
are used purely to identify objects; there is no associated resolution
mechanism.
The ‘tag’ URI responds to a need for identifiers
that will remain unique; are easy to
create, read, type, and remember; and do not require central
registration. 'tag' proponents point out that 'tag' has advantages
over other "pure identifier" schemes:
- UUIDs are hard to read.
- OIDs, DOIs, and 'info' URIs require registration of naming
authorities.
- URLs (e.g., ‘http’) are not well-suited to be pure
identifiers because they give the illusion of
resolvability. They are after all (by definition) "resource
locators". People by habit will try to resolve an
'http' URI, even when there is no resource accessible or locatable.
This problem is compounded by nearly every editor in the world
turning any string beginning with 'http://' into a hot link.
In addition, various URI experts point out:
- URNs are not well-suited to be pure
identifiers; see, for example, Well
then, why not just use URN URIs?
The following (at http://taguri.org,
from Sandro Hawke, one of the original developers of this scheme)
is a brief explanation-by-example of how to create a 'tag' identifier:
I (Sandro) have a dog named
Taiko, which is a fairly obscure name, but I can't be sure
he's the only dog on the planet with that name. I want to
be able to talk about him using just his name (without reference
to myself, the town I live in, etc) and I want to be sure
people will not accidentally think I'm talking about some
other dog also named Taiko. So I'm going to give him a tag
URI.
Step 1. Identify myself. I have two choices: I can use
one of my e-mail addresses (sandro@hawke.org, sandro@w3.org,
sandro@world.std.com) or I can use a domain name assigned
to me (such as hawke.org). I could also use a shared
domain name (w3.org) if I had explicit permission from
the domain holder.
Step 2. Pick a date. It's possible that
in 100 years my great grandson Sandro Hawke IV will
be using "sandro@hawke.org" for
e-mail. He may even have a dog named Taiko, and I still
want my tag to name my Taiko, not his. So I pick some
date during which the address "sandro@hawke.org" was
definitely mine. I'll pick yesterday, Tuesday, June 5,
2001.
Step 3. Encode the date as characters,
using ISO 8601: "2001-06-05".
If I had picked the first day of a month, back in step
2, I would not include the day. If I had picked the first
day of a year, I would not include the month or day.
Step 4. Pick a unique name for the object.
But it only has to be unique for the already-chosen
identity and
date. "Taiko" seems like a fine choice here.
I don't want to use a name like "1", because
then I'm much more likely to get confused and accidentally
call my other dog "1". I also want to avoid
accidentally reusing a name, but by always using the
previous day's date I essentially eliminate that risk:
I only need to remember names for the rest of the day.
Step 5. Combine them like this: tag:hawke.org,2001-06-05:Taiko.
So to assign a 'tag' URI, simply: take your email
address together with a date on which you can assert that the email
address belonged to you; the combination provides
a unique namespace, and you are the authority. (The
date can simply be a year, if the email address
belonged to you on the first day of that year.) So for example,
the individual who had the email address rden@loc.gov on January
1, 2005, could (if he wanted to) assign 'tag' identifiers to his
children and cats:
tag:rden@loc.gov,2005:annie
tag:rden@loc.gov,2005:sammy
tag:rden@loc.gov,2005:pepper
tag:rden@loc.gov,2005:shadow
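The five steps above can be sketched in a few lines of code. This is an illustrative sketch only: the function name make_tag_uri is hypothetical, and it does not check which characters are legal in the authority or the name.

```python
from datetime import date


def make_tag_uri(authority, day, name):
    """Combine an authority (domain name or email address), a date on which
    the authority belonged to the assigner, and a name into a 'tag' URI."""
    if day.month == 1 and day.day == 1:
        # First day of a year: the year alone is enough (step 3).
        stamp = "%04d" % day.year
    elif day.day == 1:
        # First day of a month: omit the day (step 3).
        stamp = "%04d-%02d" % (day.year, day.month)
    else:
        stamp = day.isoformat()  # ISO 8601, e.g. 2001-06-05
    return "tag:%s,%s:%s" % (authority, stamp, name)


print(make_tag_uri("hawke.org", date(2001, 6, 5), "Taiko"))
print(make_tag_uri("rden@loc.gov", date(2005, 1, 1), "annie"))
```

The two calls reproduce the identifiers shown above: tag:hawke.org,2001-06-05:Taiko and tag:rden@loc.gov,2005:annie.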
Note that email addresses may be used in lieu of domain names.
The 'tag' creators wanted a system that does not rely on
domain names; many organizations and individuals do not have
a domain name, but almost all do have some form of unique base
identifier such as an email address. A domain name can be
used if the assigner owns it (as in the hawke.org example). In
any case, assignment of a 'tag' identifier never requires coordination
or communication with any other authority or assigner.
The base identifier might not provide sufficient qualification
forever; for example, a different person may have the email address
rden@loc.gov in the year 2105. But
when qualified by a date as in the examples, the combination of
base identifier and date should remain unique, as long as the new
owner of that base identifier conforms
to the naming algorithm.
Historical URI Schemes
Draft
Guidelines Version 3
(March 2005)
As noted last December, an Internet draft, Guidelines
and Registration Procedures for new URI Schemes, provides
guidelines for defining, registering, and evaluating proposed
URI schemes, and procedures for registering new schemes. There
were some shortcomings in that draft, primarily the provision
for duplicate scheme names. A new
version was developed in late February which, it is hoped, addresses
this problem.
A new class of schemes, provisional,
had been defined, for schemes requiring less technical review than permanent schemes,
which must undergo rigorous expert review. Provisional schemes,
according to the earlier draft, may share names with existing schemes.
That possibility of duplicate scheme names has caused considerable
controversy. (We reported in January that
there were mixed feelings about duplicate scheme names. That's changed;
duplicate scheme names seem now to be universally regarded as bad.)
The new draft proposes a way to avoid duplicate scheme names. It
defines yet a third class, historical. Thus there
would be three classes: permanent, provisional, and historical.
In defining (and justifying) this new class the document says, "In
some circumstances, it is appropriate to note a URI scheme that
was once in use or registered but for whatever reason is no longer
in common use or the use is not recommended. In this case, it is
possible for an individual to request that the URI scheme be registered
(newly, or as an update to an existing registration) as 'historical'.
Any scheme that is no longer in common use may be designated as
historical; the registration should contain some indication to
where the scheme was previously defined or documented."
So how does this new class address the problem of duplicate scheme
names? Will it work? The answers are still unclear.
The move to revise the registration procedures was motivated by the
proliferation of unregistered schemes. The burdensome
registration procedures have produced a register out of touch
with reality, as people simply define and use a scheme without
bothering to register it. Streamlined registration procedures
would not only provide incentive for a scheme developer to register
a new scheme, but also provide a means to get existing
unregistered schemes registered. But bringing these schemes out of the closet
is going to turn up a lot of duplicate names.
On the other hand many of the unregistered
schemes have been abandoned, used very little, or never used at
all. Among the unregistered schemes, some
are considered (informally) "bogus", some are inactive
but historically significant, and others are active. Those
that are active should be registered as permanent or provisional,
those that are inactive but historically significant should be
registered as historical. It is hoped that the bogus schemes
will then simply disappear.
URI Generic Syntax - Revision Complete
(February 2005)
We reported last December on work to
replace RFC 2396,
URI Generic Syntax (1998), with a more contemporary and
comprehensive document describing URIs. That work is now complete.
The resulting URI spec is now an official IETF standard, their
66th published standard: RFC
3986, Uniform
Resource Identifier (URI): Generic Syntax (January 2005);
authors: Tim Berners-Lee, Roy Fielding, Larry Masinter. It
defines a single, generic syntax for all URIs.
In addition to replacing RFC 2396, RFC 3986
incorporates (and replaces) RFC
1808, Relative Uniform Resource Locators (1995), and RFC
1738, Uniform Resource Locators (1994), though it excludes
portions of RFC
1738 that addressed specific URI schemes; those
portions will be updated as separate specs. It also obsoletes RFC
2732, which specified a format for literal IPv6 addresses in URLs.
See URI Generic Syntax for a summary of the syntax.
Duplicate Scheme Names: Good or Bad?
(January 2005)
As we reported in December there is an Internet Draft: Guidelines
and Registration Procedures for new URI Schemes.
The draft neglects to clearly state that there cannot be duplicate
URI scheme names registered, causing some controversy: Can this possibly
be proper Internet architecture? Well perhaps, if on balance it
does more good than harm.
The draft does note: “The goals for registering URI Schemes
are to avoid (when possible) duplicate use of the same URI scheme
name for different purposes, …”, apparently acknowledging
the possibility of duplicates. The proposed registration rules
are based on reality: it is possible to invent and deploy a URI
scheme without IANA and IESG approval. The goal is to avoid duplication
in the real world; assuring uniqueness in the registry doesn't
do that, and
it can result in the registry being out of touch with the real
world.
Duplication is a bigger problem in some cases than in others.
For example, suppose there are two fairly compatible
schemes with the same name -- one is a minor
(experimental) enhancement of the other, and the differences don’t
clash. That might work. But if there are two registered schemes of the same
name with
completely different syntax and behavior, a developer
writing a web browser might support one, while another developer might support
the other. That compromises URI integrity.
There appear to be three conditions contributing to the
problem of duplicate URI schemes: (1) private schemes, (2) abandoned
schemes, and (3) malicious
registration.
Private Schemes
This is the now-well-established practice of defining and deploying a URI scheme
long before it is submitted for registration. If two independent groups inadvertently
define 'widgy:' as a URI scheme, and later, both attempt registration, should
only the first be allowed into the registry? Suppose the second to attempt
registration was first to actually use the scheme name.
Abandoned Schemes
Dan Connolly of the W3C provides this scenario:
“Consider VendorCo who has just released WizBangTool that supports
wizzy: URIs. Somebody files a bug that says 'your scheme isn't registered' so
they follow their nose to the registry, only to find that some long-defunct sourceforge
project registered wizzy: 5 years ago. If unique registration is a requirement,
VendorCo's choices are to (a) change their software and register a wizzy2: uri
scheme, or (b) ignore the process.” Connolly notes that neither is a
desired outcome and that this scenario is quite likely to occur, and he concludes that
attempting to assure that all IANA-registered URI scheme names are unique is
likely to produce a useless, irrelevant registry.
A suggested approach is to provide some means whereby a defunct provisional
registration may be removed from the register, either by insisting that it
remain only so long as an up-to-date specification and owner can be identified,
or by giving some reserve power to the IESG to remove it. Removing the old
wizzy should be no problem if it really is defunct. But, some suggest, if it
turns out that people somewhere are still using it, VendorCo should be forced
to use wizzy2.
Malicious registration
This is similar to the land-grab of internet domain names. There is some sentiment
that a procedure that does not strictly enforce uniqueness will render this
practice useless.
The discussion and debate on this continues.
Proposed new Registration Procedure for URI Schemes Introduces
Provisional Class of Schemes (December 2004)
A new Internet Draft: Guidelines
and Registration Procedures for new URI Schemes proposes
procedures for new URI schemes, simplifying existing
procedures and requirements by providing
for provisional schemes that require no technical review
and may share names with existing schemes.
The draft, if approved, will replace
RFC 2717 - Registration Procedures for URL Scheme Names -- along
with RFC 2718 - Guidelines for new URL Schemes; both
are 1999 RFCs.
RFC 2717 had defined a set of registration trees; one
was the main tree
(named 'IETF', managed by IANA), and there has always been a provision
to approve additional trees. There have been two problems with
this approach: nobody wanted their scheme to be "second class",
and no such additional registration trees were ever approved.
The new system will not eliminate the first problem -- provisional
schemes may still be seen as second class -- but the trees will be
eliminated and all schemes, provisional and permanent,
will fit into a single namespace.
Provisional schemes, which may be registered
without passing any review process, will be useful for legacy
URI schemes, widely deployed without registration, for which review
would be inappropriate;
they are also useful for private or experimental use. The main requirement
for a provisional URI scheme is that there must not
already be a permanent scheme with the same name. Permanent
status will apply where there is general agreement that
the scheme meets the outlined criteria; permanent status
is intended for use by IETF standards-track protocols and requires
a substantive review and approval process.
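The naming rule for provisional schemes can be sketched as a simple check. This is an illustration only: the registry contents and the function name can_register_provisional are hypothetical, and the real procedure involves more than a name comparison.

```python
def can_register_provisional(name, registry):
    """Apply the rule described above: a provisional registration is refused
    only if a permanent scheme with the same name already exists.

    registry is an iterable of (scheme_name, status) pairs. Scheme names are
    compared without regard to case.
    """
    wanted = name.lower()
    return not any(
        existing.lower() == wanted and status == "permanent"
        for existing, status in registry
    )


# Hypothetical registry contents, for illustration only.
registry = [("http", "permanent"), ("wizzy", "provisional")]
print(can_register_provisional("http", registry))   # False
print(can_register_provisional("wizzy", registry))  # True
```

Under this rule a second provisional 'wizzy:' is accepted, which is exactly the duplication the draft tolerates.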
The primary intent of introducing provisional status is to
discourage multiple definitions of URI scheme names for different purposes,
while recognizing and accommodating this practice because
it is not going to stop. There are
cases where separate communities have already established differing
uses of the same URI scheme name.
New Draft of RFC 2396 (December 2004)
A new draft of Uniform
Resource Identifier (URI): Generic Syntax, (September, 2004)
has been released.
This is an update to RFC
2396 (1998) and also incorporates and replaces RFC
1738 "Uniform Resource Locators" (1994) and RFC
1808 "Relative Uniform Resource Locators" (1995).
RFC 2396
defines the generic syntax of a URI (which it defines as a compact
string of characters
for identifying an abstract or physical resource) and usage guidelines.
It defines a grammar
such that an implementation can parse the common components of a URI
reference without knowing scheme-specific requirements. (It does not define a
rigorous grammar to apply to every URI scheme. Each individual
scheme specification must define a specific grammar.)
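To illustrate scheme-independent parsing, the sketch below applies the regular expression given in Appendix B of RFC 2396 (the same expression appears in RFC 3986) to split a URI reference into its common components. The function name split_uri and the dictionary it returns are assumptions made for this example.

```python
import re

# Regular expression from Appendix B of RFC 2396 for breaking a URI
# reference into its generic components.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(reference):
    """Split a URI reference into its five generic components.

    Components that are absent are returned as None; the path may be empty.
    """
    match = URI_REFERENCE.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


print(split_uri("http://www.loc.gov/standards/uri/news.html"))
print(split_uri("info:srw/diagnostic/1/38"))
```

The same function handles 'http:' and 'info:' URIs alike; anything after the scheme that is specific to 'info:' is simply returned as the path, to be interpreted by that scheme's own grammar.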
Registration of URI Schemes (August 24, 2004)
At a recent meeting (see http://lists.w3.org/Archives/Public/uri/2004Aug/0007.html)
of the committee overseeing the development of URI technology (a
joint IETF/W3C group) registration of URI
schemes was discussed. There appears to be general agreement that
the process is broken: the public perception of URI scheme registration
is at odds with reality. There are many schemes whose attempted
registration has languished for years, for lack of any deterministic
process for either registering or rejecting them.
There are guidelines for URI schemes, and in general a scheme
is supposed to meet these guidelines in order to be registered
(that is, to be listed in the Official IANA Registry of URI Schemes
at http://www.iana.org/assignments/uri-schemes).
However exceptions have been made for schemes to be registered
even if they did not quite meet URI guidelines, if they were widely
deployed. As a result, people tend to create a scheme, hope it
will get widely deployed, and thus bypass guidelines and get registered.
The URI guidelines set a high bar, whose original purpose was
to control the number of registered schemes. But people just keep
inventing new schemes anyway and defer registration. Not surprisingly,
there are now conflicting schemes (schemes with the same name;
for example, 'mmms:' has different interpretations used
by 3GPP and Microsoft).
Namespace conflict is probably the most serious potential problem that
URI technology faces. It was suggested to abandon the idea that
registration will reduce the total number of schemes, and that the primary purpose
of registration should be to eliminate namespace conflicts. However, the quality control
advocates still want some barrier.
Suggestions:
- A form of registry that might set some line -- schemes
below the line are "not as good as" schemes above
the line.
- A provisional registration that provides a specification or
an implementation pointer, for six months.
- A rule that if a proposal already
has a provisional registration and a specification, it wins.
- A requirement that a proposed scheme have
two different implementations.
- Two classes of schemes: ones with a published specification,
and ones without.
- Discouragement of non-protocol schemes.
- Register of implementations of URI schemes. Rather
than setting a threshold ("must have at least 1 implementation")
just document the values in the registry, and let people reach
their own conclusions. "If barriers are established, people
will do whatever they do anyway."
There was discussion of abuse -- registering URI schemes
with other people's trade names, etc. One suggestion is that perhaps
multiple registrations for the same scheme might be allowed --
document the usage and let the antagonists fight it out. There
isn't much sentiment for that suggestion, though; it was
observed that the web simply doesn't work with conflicting
scheme names.