IFLA/CDNL Alliance for Bibliographic Standards - Library of Congress

URI Resource Page

Latest News

We'll try to keep you up to date here on the latest relevant developments concerning URIs and identifiers.

XRI Update October 2008
XRI Ballot Fails June 2008
Uniform Access to Metadata May 2008
Identifiers for Non-information Resources  April 2008
URI Template Draft Specification Revised March 2008
'info:' URI Scheme Adds 'lc' Namespace  October 2007
XML Namespace URIs April 2007
W3C Publishes TAG Finding: Metadata in URIs January 2007
URI Template October 2006
Metadata in URIs August 2006
NISO Identifier Roundtable  July 2006
'info:' URI Scheme Now Officially Approved  December 2005
'info:' URI Scheme Approved -- and "Almost" Official  November 2005
New Draft of "mailto:" URI Scheme October 2005
IRI: the Internationalized Resource Identifier September 2005
DCC Workshop on Persistent Identifiers August 2005
A URI Framework for Controlled Vocabulary Terms and Codes July 2005
Identifiers vs. Resolvable URIs  June 2005
The 'tag' URI Scheme   April 2005
Historical URI Schemes Draft Guidelines Version 3   March 2005
URI Generic Syntax - Revision Complete  February 2005
Duplicate Scheme Names: Good or Bad?  January 2005
Proposed new Registration Procedure for URI Schemes Introduces Provisional Class of Schemes  December 2004
New Draft of RFC 2396 December 2004
Registration of URI Schemes  August 2004


XRI Update
(October 2008)

Discussions continue between OASIS and W3C over the fate of XRI, the Extensible Resource Identifier, developed by OASIS. Agreement hasn't been reached, but there is progress, and what XRIs are has become somewhat clearer.

XRI development began in 2003 when OASIS created a Technical Committee to develop an abstract identifier, to identify non-information resources.

A person, for example, is a non-information resource. You can assign it an identifier, but you cannot retrieve it. By contrast a document is an information resource. You can assign it an identifier and retrieve it by the identifier. For a non-information resource the best you can do is retrieve a description.

An example of an XRI is:

=drummond

The '=' symbol signifies that the string following identifies a person. In this case it is the person registered (in the XRI registry) as 'drummond'.

 Another example:

@boeing

The '@' symbol signifies that the string following identifies an organization or company. This example assumes that Boeing has registered itself as a company.

Top level objects -- people, companies -- are globally registered, but for subordinate components there is delegation of authority. Thus consider the hypothetical XRI:

@boeing*people*smith

Boeing assigns the components, in this example, ‘people’ subordinate to ‘boeing’, and ‘smith’ subordinate to ‘people’. 
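
The syntax described so far can be illustrated with a minimal Python sketch. It assumes only the three conventions mentioned above ('=' for a person, '@' for an organization, '*' separating delegated components); the function name `parse_xri` and the dictionary keys are hypothetical and are not taken from the XRI specifications, which define a much richer syntax.

```python
# Hypothetical helper: covers only the '=', '@' and '*' conventions
# described in this article, not the full XRI syntax.
GLOBAL_CONTEXTS = {"=": "person", "@": "organization"}


def parse_xri(xri: str) -> dict:
    """Split an XRI such as '@boeing*people*smith' into its parts."""
    if not xri or xri[0] not in GLOBAL_CONTEXTS:
        raise ValueError(f"unsupported XRI: {xri!r}")
    # The first component is globally registered; any further
    # components are assigned by the registrant (delegation).
    registered, *delegated = xri[1:].split("*")
    return {
        "kind": GLOBAL_CONTEXTS[xri[0]],
        "registered": registered,
        "delegated": delegated,
    }


print(parse_xri("=drummond"))
print(parse_xri("@boeing*people*smith"))
```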

XRIs, URIs and HXRIs

Originally,  OASIS wanted XRI to be a URI scheme. Thus for example the XRI @boeing would be expressed as an XRI URI as follows:

xri://@boeing

Attempts to create an XRI URI scheme met with strong resistance from the W3C and have been abandoned.  But there is yet another type of identifier associated with an XRI: the HXRI, an 'http' URI (one of perhaps several) associated with an XRI, used to retrieve a descriptor of the resource identified by the XRI.

For example,

http://xri.net/=drummond

is an HXRI for the XRI =drummond, where xri.net is an XRI resolver.

In general an HXRI for a given XRI is an 'http' URI where the authority component ends with an XRI resolver, and the path component is the XRI. Another HXRI for the same XRI is:

http://xri.freexri.com/=drummond

Where xri.freexri.com is a different XRI resolver.
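
The relationship between an XRI and its HXRIs might be sketched as follows in Python. The function names `to_hxri` and `from_hxri` are hypothetical; the sketch assumes only what is stated above, namely that the authority is a resolver and the path is the XRI.

```python
from urllib.parse import urlsplit


def to_hxri(resolver: str, xri: str) -> str:
    """Form an 'http' URI whose authority is the resolver and whose path is the XRI."""
    return f"http://{resolver}/{xri}"


def from_hxri(hxri: str) -> tuple:
    """Recover (resolver, XRI) from an HXRI of the simple form shown above."""
    parts = urlsplit(hxri)
    return parts.netloc, parts.path.lstrip("/")


# One XRI, two HXRIs, one per resolver.
for resolver in ("xri.net", "xri.freexri.com"):
    print(to_hxri(resolver, "=drummond"))
```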

Resolution

Resolution of HXRIs is crucial to the viability of XRIs. If every resolution must go through a central XRI resolver (or one of a few central resolvers), resolution will be cumbersome.

It hasn't been completely settled but resolution may work something like this. Consider again the hypothetical XRI @boeing*people*smith.   An HXRI for this XRI might be

http://xri.boeing.com.xri.org//@boeing*people*smith

An XRI-aware client application will recognize xri.org as an XRI resolver, strip off xri.org, and the resultant URI will be

http://xri.boeing.com//@boeing*people*smith

Thus the request goes straight to boeing.com and xri.org is bypassed. And of course boeing.com knows what to do with it, since it coined the HXRI to begin with. This is a (proposed) feature of XRI resolvers: any company (Boeing for example) can coin URIs using the domain name of the XRI resolver without registering that URI within that domain.

Of course, a non-XRI-aware application will send the http request to xri.org, which will strip off "xri.org" and pass on the resulting URI to xri.boeing.com.  So with this scheme, reliance on central resolvers would be reduced, though not eliminated.
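
What an XRI-aware client would do under this proposal can be sketched in Python. This is only an illustration of the unsettled scheme described above: the list `KNOWN_RESOLVERS` and the function name `bypass_resolver` are assumptions, and the sketch ignores any query or fragment.

```python
from urllib.parse import urlsplit

# Assumption: the client keeps its own list of resolver domains it recognizes.
KNOWN_RESOLVERS = ("xri.org",)


def bypass_resolver(hxri: str) -> str:
    """Strip a trailing resolver domain from the authority, if one is present."""
    parts = urlsplit(hxri)
    host = parts.netloc
    for resolver in KNOWN_RESOLVERS:
        suffix = "." + resolver
        if host.endswith(suffix):
            host = host[: -len(suffix)]
            break
    # Rebuild by hand so the path (including its leading '//') is kept as-is.
    return f"{parts.scheme}://{host}{parts.path}"


print(bypass_resolver("http://xri.boeing.com.xri.org//@boeing*people*smith"))
```

An HXRI whose authority is the resolver itself (for example http://xri.net/=drummond) passes through unchanged, so the request would still go to the resolver.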


XRI Ballot Fails
(June 2008)

XRI, the Extensible Resource Identifier, is a new identifier type proposed by OASIS. It is characterized as an "abstract" identifier, independent of location and protocol.

OASIS recently balloted the XRI syntax and resolution specifications; both ballots failed. The balloting may have been influenced by W3C, which has taken a position opposing XRI. It seems that more discussion between OASIS and W3C is needed before these specifications can be approved.

The relationship of XRIs to URIs is somewhat unclear.  An XRI begins with an optional prefix, “xri://”; however, 'xri:' has not yet been proposed as a URI scheme. The intention is to register the scheme with IANA if XRI becomes an OASIS Standard.

Some examples of XRIs, found in the syntax specification, are:

These examples are cited in the oreillynet.com blog posting of May 29, "XRIs Bad, URIs Good" which points out that the spec doesn't give much of a clue what these XRIs mean.

The W3C Technical Architecture Group has stated that they "are not satisfied that XRIs provide functionality not readily available from http: URIs." Further discussion between W3C and OASIS will likely occur in the coming months.

Uniform Access to Metadata
(May 2008)

Given a URI, obtain metadata for the resource it identifies.

The W3C has been discussing means of Uniform access to metadata, where "metadata" refers to bibliographic, access control, and other types of metadata, or in general, a description of the resource. The metadata (or description) is assumed to be logically separate from the resource. Thus the need is to:

Develop a uniform method such that for a given URI we may obtain metadata for the resource it identifies without necessarily accessing the resource.

The W3C has articulated the following motivations for this need:
  • Uniform access to metadata is required because the specific method for extracting metadata from content will vary wildly from one media type to the next.
  • Many media types (e.g. application/x-compressed) have no place to put metadata at all.
  • We want to be able to obtain metadata without necessarily retrieving the content, because the resource might be something we don't want to load (for reasons of size, license, or other kind of application suitability).
  • Sometimes metadata is generated independently of content, and we don't want to (or can't) modify existing content streams by inserting metadata into it.

There are several proposed approaches:

  • via the HTTP Link header
  • via the HTML <link> element
  • via the HTTP 303 status code
  • via a new GET response header
  • via a new HTTP request method ('MGET' for example)
  • via HTTP content negotiation
  • via the Archival Resource Key (to create a metadata link for an ARK, append "?" to the URI)

All of these approaches are controversial, some more than others.   We will report on further developments.


Identifiers for Non-information Resources
(April 2008)

What happens when you try to retrieve a resource that is inherently not retrievable?

A URI, by definition, identifies a resource. The definition of resource is “anything that has identity” (admittedly, circular).  Typically a resource is a web page, a document, something “network retrievable”.

But a physical object - a person  for example - has identity, is therefore a resource (by definition), and can be assigned a URI.  Often, such a URI is actually an ‘http:’ URL:  http://www.example.com/joe-smith, for example.

So what happens when that URL is seen on a web page and you click on it? What do you want to happen?

There are also abstract resources.  For example, the Dublin Core concept of “title” is assigned a URI: specifically, http://purl.org/dc/elements/1.1/title. Dublin Core 'title' is assigned a URI because the concept must be unambiguously identified, to distinguish it, not only from other Dublin Core concepts ('contributor' for example) but even from the concept of title within a different metadata element set.  Title, the concept, is a resource: not a physical resource, but you still can’t “retrieve” it. It’s called an abstract resource.  And its URI is an ‘http:’ URL, so (as we asked above) what happens when that URL is seen on a web page and you click on it? What do you want to happen?

We have three types of resources, then: physical, abstract, and network-retrievable. Web architecture distinguishes two types: information resources and non-information resources.  (Web architecture doesn't like the expression "retrievable resource", preferring "information resource". Physical and abstract resources are combined into the single category "non-information resource".)

So the questions above, combined and rephrased:

When the URI for a non-information resource is an  ‘http:’ URL, what happens when that URL is seen on a web page and you click on it? What do you want to happen?

For context, first consider what happens when you click on a URL for an information resource.

When you click on a URL that you see on a web page, typically an http request goes to the server named in the URL (e.g. for the URL http://www.loc.gov/standards/sru, the http request is to the server www.loc.gov).  The response to that request normally is the web page (or other type of information-resource) named by the URL.  A status code is returned, along with or in place of the resource.  For a normal completion (the resource is supplied normally) the status code is “200 – ok”.  If the resource isn’t there, the status code returned might be “404 – not found”, or it might be “303 – see other”. This latter code (theoretically) indicates that the server thinks you should be redirected to a (specific) different URL. In that case, the server should supply the suggested URL, and you, the user, may never see the “303” code, because your client might perform the redirection automatically.

Now consider these three status codes - 200, 404, 303 - in the context of a non-information resource; is one (or more) of these appropriate?

  • Certainly not "200 - ok". That basically means "here comes the content you requested"; when the status is '200', content is expected to be included in the response. For a non-information resource, there is no content.
  • "404 - not found" might (on the surface) seem appropriate, but that would cause chaos on the web. 404 statuses would be generated by the billions; huge error reports would be sent to web authors from network administrators telling them to fix the broken links.
  • A code of '303' would not seem to be appropriate - "the resource isn't at this URL, try this alternative URL instead" - it isn't going to be there either.

However, status code '303' does seem to be what web architecture prescribes for the attempted retrieval of a non-information resource.  It is a somewhat controversial approach, and is currently the subject of re-examination. 

The '303' status itself is the subject of some confusion. The formal name ascribed to '303' status within the http protocol standard is "see other" and the definition is, essentially, "The response to the request can be found under a different URI".

However - and this is the crucial point - some web architects assert that the response available at the alternative URI is not the desired resource itself (it couldn't be anyway, it's a non-information resource) but rather, it is metadata about the resource.

This approach to dealing with the attempted retrieval of a non-information resource is somewhat controversial and raises a number of questions, even if you assume the latter-day interpretation of '303' status (not all experts do) that the alternative URI points to metadata about the desired resource. Before describing the controversy, some additional background will help provide further context. Two points:

  • Some web (and semantic web) architects take the position that URIs in general should be "actionable".  One view is that any URI, no matter what the scheme, must be actionable. That's an extreme view, not unanimously held. But most hold the view that an 'http:' URI should always be actionable, even for a non-information resource. When asked what should be retrieved for a non-information resource, the answer invariably is "a description of the resource", i.e. metadata.
  • There is increasing interest in developing a uniform method for obtaining metadata for a resource without necessarily having to retrieve the resource itself. This problem is characterized by the W3C as Uniform Access to Metadata. (We plan to explore this subject in a future report.)

These two points are obviously related, particularly in the case of a non-information resource. In fact a current suggested method addressing both problems is the '303' status: When an attempt is made to retrieve a non-information resource, return http status code '303' along with the URL of a description of the resource.

In fact, one of the key W3C architects is quoted as saying:

200 means (basically) “Here comes the content of the document you asked for” and 303 means “Here is the URI of document ABOUT the thing you asked for".
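
The interpretation in the quotation can be sketched from the client's point of view. The following Python function is illustrative only: the name `interpret`, the wording of the messages, and the description URL used in the example are all hypothetical, and the sketch performs no network access.

```python
def interpret(status: int, headers: dict) -> str:
    """Say what a response means under the '303 points to a description' reading."""
    if status == 200:
        return "content of the requested document follows"
    if status == 303:
        location = headers.get("Location")
        if location is None:
            # The case of a server that knows of no description.
            return "303 without a Location: no description offered"
        return f"description of the requested thing is at {location}"
    if status == 404:
        return "not found"
    return f"unhandled status {status}"


# Hypothetical response to a request for http://www.example.com/joe-smith
print(interpret(303, {"Location": "http://www.example.com/about/joe-smith"}))
```

Note that the function cannot tell, from a '303' alone, whether the original URL named an information resource or a non-information resource.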

This approach leaves some questions and issues unanswered.

  • Suppose you have the URL of a known information resource. How do you explicitly request the description, rather than the resource itself?
  • Suppose you have the URL of a known resource, but you don't know if it is an information resource or a non-information resource. Your request to retrieve that resource results in a '303' status. You still don't know if it is an information resource or a non-information resource. (It could be an information resource but the server might not have it immediately available, so it does the next best thing, supplies a description.)
  • A server gets a request for a non-information resource. The server knows about the resource, and so it returns a '303' status. But the server does not have (nor does it know of) a description. A '303' status should be accompanied by a URL (for a description of the resource). Should the server simply return a '303' status without an accompanying URL, contrary to the prescribed approach?

We'll keep an eye on this and report further.


URI Template Draft Specification Revised
(March 2008)

A new version of the Internet Draft, "URI Template", has been released.  (See URI Template, October 2006.)

A URI Template is a URI-like string that contains embedded expressions (delimited by curly braces, '{' and '}'), called "expansions". The template itself is not a URI; a template processor replaces expansions with their calculated value to produce a bona fide URI.

As a simple example, given the following URI Template:

http://www.loc.gov/standards/{standard}

And the following variable value:

standard = "mods"

 The expansion of the URI Template is:

http://www.loc.gov/standards/mods
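
A template processor for this simplest case can be sketched in a few lines of Python. This is not the algorithm from the draft: it handles only plain `{name}` expansions, and the choices to percent-encode values and to expand undefined variables to an empty string are assumptions of the sketch.

```python
import re
from urllib.parse import quote


def expand(template: str, variables: dict) -> str:
    """Replace each {name} with its percent-encoded value ('' if undefined)."""
    def substitute(match):
        value = variables.get(match.group(1), "")
        return quote(str(value), safe="")
    return re.sub(r"\{(\w+)\}", substitute, template)


print(expand("http://www.loc.gov/standards/{standard}", {"standard": "mods"}))
```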

For a more complex example, look at the following template:

http://z3950.loc.gov:7090/voyager?{-join|&|version,operation,query}

The part after '?' says: for each of the variables version, operation, and query; join it in the form "variable=value",  separated by '&' (ampersand).

For the following variables:

  • version: 1.1
  • operation: searchRetrieve
  • query: dinosaur

The expansion of the URI Template is the SRU Request:

http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&query=dinosaur
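
The '-join' expansion used in this example might be implemented along the following lines. This Python sketch recognizes only the `{-join|separator|variables}` form shown above; the function name `expand_join`, the percent-encoding of values, and the decision to skip undefined variables are assumptions rather than a reading of the draft.

```python
import re
from urllib.parse import quote

# Matches {-join|<separator>|<comma-separated variable names>}
JOIN = re.compile(r"\{-join\|([^|}]*)\|([^}]*)\}")


def expand_join(template: str, variables: dict) -> str:
    """Expand every -join expression into 'name=value' pairs."""
    def replace(match):
        separator = match.group(1)
        names = match.group(2).split(",")
        return separator.join(
            f"{name}={quote(str(variables[name]), safe='')}"
            for name in names
            if name in variables
        )
    return JOIN.sub(replace, template)


template = "http://z3950.loc.gov:7090/voyager?{-join|&|version,operation,query}"
variables = {"version": "1.1", "operation": "searchRetrieve", "query": "dinosaur"}
print(expand_join(template, variables))
```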

URI Template is an Internet Draft. It is available at http://www.ietf.org/internet-drafts/draft-gregorio-uritemplate-03.txt.   It is still a work in progress.


'info:' URI Scheme Adds 'lc' Namespace
(October 2007)

The 'info' URI Scheme Registry added the namespace 'lc' on October 15.  See Info URIs for Library of Congress Identifiers.


XML Namespace URIs
(April 2007)

A new note addresses the question "what form should an XML namespace URI take?" and compares and contrasts XML namespace URIs with schema location URIs and schema identifiers. See XML Namespace URIs (and schema location URIs, and schema identifiers).


W3C Publishes TAG Finding: Metadata in URIs
(January 2007)

The W3C Technical Architecture Group (TAG) has published a TAG Finding, The use of Metadata in URIs, January 2, 2007.

An earlier report, from August 2006, describes the finding; see Metadata in URIs.  The August version was a draft finding, not yet official, but the final published report is substantially the same.

Comparing the earlier draft with the official publication, the following changes are noted:

  • Deleted from earlier draft:
    • The section: " Avoid Dependencies on metadata".
    • Good Practice: "Guess information from URIs only when the consequences of an incorrect guess are acceptable".
  • Added
    • Good Practice: "When saving to filesystems that use extensions to represent media types, user agents MUST choose an extension that is consistent with the media type of the representation."
    • A new section: "Confusing or malicious metadata".

 


URI Template
(October 2006)

A new Internet Draft describes the proposed URI Template, a string that may be transformed into a URI by substituting values for variables that are embedded within the string. A URI template may be thought of as representing a class of URIs; the template representation is useful for conveying the general structure of URIs within the class.

The following template could represent the class of LCCN URIs:

info:lccn/{lccn}

Substituting an LCCN for "{lccn}" produces an LCCN URI. For example, substituting the LCCN n78089035 produces:

info:lccn/n78089035
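
The idea that a template represents a class of URIs can be made concrete by turning the template into a pattern and testing whether a given URI belongs to the class. The Python sketch below is an illustration, not part of the draft; the function name `template_to_regex` is hypothetical, and the assumption that a variable's value never contains '/' is a simplification.

```python
import re


def template_to_regex(template: str):
    """Compile a template with {name} variables into a regular expression."""
    # re.split with a capturing group alternates literal text (even
    # positions) and variable names (odd positions).
    parts = re.split(r"\{(\w+)\}", template)
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/]+)"
        for i, part in enumerate(parts)
    )
    return re.compile(pattern)


lccn_uris = template_to_regex("info:lccn/{lccn}")
match = lccn_uris.fullmatch("info:lccn/n78089035")
print(match.group("lccn"))
```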

Template Variables

The draft also introduces template variables: the parameterized components of a URI template. A list of values may be input to a process representing the URI template, resulting in the production of a URI within the class represented by the URI template.

For example consider the template:

http://www.knuckleball.com/{a}.{b}

If the following table of variables and corresponding values is input to the process corresponding to this template:

Variable   Value
a          hoyt
b          wilhelm

This URI will be produced:

http://www.knuckleball.com/hoyt.wilhelm

Metadata in URIs
(August 2006)

The W3C Technical Architecture Group (TAG) has published a draft finding, The use of Metadata in URIs.

URI naming authorities often define structures allowing URIs to carry metadata about identified objects. Metadata might include, for example, creation date, MIME type, or even a digital signature to verify the integrity of the object’s content. There are benefits to an orderly mapping from metadata to URI, and naming authorities often use conventions that facilitate association of a URI with its corresponding object. Conventions based on filename or customer id are examples. But there can be drawbacks.

The TAG finding discusses the suitability of embedding metadata in a URI, and of inferring information from URI metadata.

Among the recommendations from the report are:

  • URIs intended for direct use by people (as opposed to machines) should be easy to understand, and should be suggestive of the resource actually named.
  • People should not infer or guess information from a URI unless the consequence of a wrong guess is acceptable.
  • Software should not rely on metadata inferred from a URI, except as formally documented in a standard or applicable specification.

To briefly illustrate these points, consider the following (hypothetical) advertisement, perhaps on the outside of a city bus:

For the Best Chicago Weather Information
go to
www.weather.com/chicago

As a printed URI, it is intuitive, easy to remember, and suggestive of the resource identified.

Suppose the URI were instead:

http://www.weather.com/123Hx67v4gZ5234Bq5rZ

You would certainly find this annoying if the URI were intended for human use. On the other hand it would be a perfectly appropriate URI if it were intended strictly for machine use. 123Hx67v4gZ5234Bq5rZ might be based on a database key facilitating efficient access to the weather data at the server.

You might infer from the (first) URI that you could get the weather in Boston, if  you were to try:

www.weather.com/boston

That might work and it might not. The advertisement doesn't take responsibility for providing weather information for anywhere other than Chicago, but there is little risk in trying -- little risk for a person. Software, on the other hand, should not make this inference.

Suppose instead the advertisement said:

For the Best Local Weather Information
go to
www.weather.com/your-zip-code-here

Then, you can reasonably assume that a weather report is available by substituting a zip code.

The full text of the draft finding is available at http://www.w3.org/2001/tag/doc/metaDataInURI-31-20060609.html.


NISO Identifier Roundtable
(July 2006)

NISO, the National Information Standards Organization, held an Identifiers Roundtable, March 13-14, at the National Library of Medicine in Bethesda, Maryland. 

NISO, which has a long-held interest in identifiers (DOI, ISBN, ISSN, SICI, 'info:', etc.), brought together experts representing libraries, vendors, information centers, e-learning systems, content providers and aggregators. They discussed means to promote the long term sustainability of identifiers: identifier-services infrastructure, community and institutional support, business models, registries; how to create, implement and support identifiers and identifier systems; how to address confusion over identifiers, confusion which for several years has driven up the cost of developing and managing systems. Topics also included: "what makes a good identifier?", identifier roles, identifier attributes, identifiers and the web, embedded identifiers, and standards needed.

Observations and Conclusions

Some observations and conclusions from the meeting:

  • Identifier infrastructure must support services for creating identifiers, binding them to objects, and resolution to obtain the identified object or its metadata.
  • Long term viability of identifiers requires viable business models.
  • Identifiers, particularly those exchanged between systems, should be based on public standards, to prevent collisions between identifiers developed in different contexts.
  • There is less disagreement on the nature and properties of identifiers than thought. Perceptions of disagreement arose from differing contexts of discussion, in particular the differing intended uses of specific identifiers.
  • A registry of identifier schemes should be developed, including associated services and policies for each scheme.
  • The "info" URI registry should become a focal point for community identifier needs.

Report

The workshop report is available at http://www.niso.org/news/events_workshops/ID-workshop-Report2006725.pdf


'Info:' URI Scheme Now Officially Approved
(December 2005)

We reported last month that the IESG had approved the 'info:' URI scheme. It has now been listed by IANA (the Internet Assigned Numbers Authority) at http://www.iana.org/assignments/uri-schemes.html, their official register of URI schemes.

The register lists permanent, provisional, and historical  schemes.  'info:' is conferred "permanent" status.

The 'info:' URI scheme is defined at http://www.ietf.org/internet-drafts/draft-vandesompel-info-uri-04.txt. More information about this scheme is available on our 'info:' Resource Page.


'Info:' URI Scheme Approved -- and "Almost" Official
(November 2005)

The IESG (Internet Engineering Steering Group) has approved the document: The "info" URI Scheme for Information Assets with Identifiers in Public Namespaces. This effectively means that 'info:' may now be considered an approved URI scheme.

The action was announced November 3 in a memo from the IESG, responsible for technical management of IETF activities and the Internet standards process, to the IETF: Document Action: 'The "info" URI Scheme for Information Assets with Identifiers in Public Namespaces' to Informational RFC.

As of November 16 "info" has not yet been added to the Official IANA Registry of URI Schemes.  This may take several weeks because the process for maintaining the registry is currently being revised.


New Draft of "mailto:" URI Scheme
(October 2005)

In September we reported on the IRI (Internationalized Resource Identifier).  Now, a new internet draft proposes changes to the mailto URI Scheme, for compatibility with IRIs.

The 'mailto' URI scheme defines the URI format for designating an email address.  In its simplest form, a 'mailto:' URI looks like:

mailto:someone@somewhere

The URI scheme is 'mailto:' and the resource identified by the URI is an email address; in the above example it is "someone@somewhere".

Typically, when a user clicks on a 'mailto:' URI a browser will construct an email message with the recipient field set as indicated and otherwise empty, leaving the user to input the subject, text, and other fields. For example, the email address nellie.fox@59sox.com, might be coded in html as:

<a href="mailto:nellie.fox@59sox.com">Nellie Fox</a>

So that the recipient's name is visible on the web page and when clicked the URI is activated and an email message is constructed.

Additional email parameters besides the recipient address may be included in the URI, using the standard form for a URI query and parameters - '?' preceding the query, and '&' separating parameters.  Thus for example the following ....

mailto:someone@somewhere?subject=RSVP%20November%201%20Meeting&body=Will%20Attend


... would generate an email message to "someone@somewhere", with subject "RSVP November 1 Meeting" and with body "Will Attend". Note that spaces, which are not allowed to occur in a URI, are percent encoded -- they are replaced by '%20', which is the escape character '%' followed by the two-digit hex ASCII code for space.
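
Constructing such a URI programmatically might look like the following Python sketch. The function name `mailto` is hypothetical; `urllib.parse.quote` from the standard library performs the percent-encoding, and `safe=""` is chosen here so that characters such as '&' and '/' inside a field value are encoded as well.

```python
from urllib.parse import quote


def mailto(address: str, **fields: str) -> str:
    """Build a 'mailto:' URI with percent-encoded query parameters."""
    query = "&".join(
        f"{name}={quote(value, safe='')}" for name, value in fields.items()
    )
    return f"mailto:{address}" + (f"?{query}" if query else "")


print(mailto("someone@somewhere",
             subject="RSVP November 1 Meeting",
             body="Will Attend"))
```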

The new internet draft proposes to extend the existing 'mailto:' scheme definition to allow characters to be percent-encoded based on UTF-8, offering a more consistent way of dealing with non-ASCII characters.

For example, suppose you want "Culinary Café" to be the subject. The character é (e-acute) is encoded in UTF-8 as the two bytes C3 A9, so this subject field would be encoded as:

&subject=Culinary%20Caf%C3%A9
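
The two steps involved (UTF-8 encoding, then percent-encoding of each byte) can be checked with Python's standard library. This is only a demonstration of the encoding, not of the draft's full rules; the variable names are arbitrary.

```python
from urllib.parse import quote, unquote

subject = "Culinary Caf\u00e9"  # \u00e9 is e-acute

# Step 1: the UTF-8 bytes of e-acute, shown in hex.
print("\u00e9".encode("utf-8").hex().upper())

# Step 2: percent-encode the whole subject (quote uses UTF-8 by default).
encoded = quote(subject, safe="")
print("&subject=" + encoded)

# Decoding reverses both steps.
decoded = unquote(encoded)
```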


IRI: the Internationalized Resource Identifier
(September 2005)

URIs have traditionally been limited to English words and Latin characters. Many languages however are based on scripts with alphabetic characters other than A-Z; these characters are often transcribed into Latin letters for use in URIs. These transcriptions introduce ambiguities.

The URI limitation stems from historical limitations of operating systems and software. But nowadays, software can handle a wide variety of scripts and languages, and people want to use them in identifiers.

RFC 3987 defines the IRI -- Internationalized Resource Identifier, a complement to the URI -- the Uniform Resource Identifier.

The traditional URI is defined as a sequence of characters from a limited subset of the US-ASCII character repertoire. The permitted subset consists of uppercase letters (A-Z), lowercase letters (a-z), decimal digits, and a few additional characters. Some of these additional characters are reserved; most important is the percent (%), used as an "escape" character. ‘%’ followed by two hex digits in a URI is used to signify the ASCII character represented by the two hex digits. For example, the space character is not allowed within a URI string, so "%20" is used in its place. (Hex 20 is the ASCII value for space.)

Thus the following is not a valid URI:

http://www.loc.gov/this and that

But this is:

http://www.loc.gov/this%20and%20that


This works only for characters that can be represented by two hex digits, i.e. the US ASCII set. Actually, the allowable URI characters – those that may be “percent encoded” -- are those in the range hex 20 (space) through 7F (delete) -- equivalently, decimal 32 through 127.

IRIs are defined similarly to URIs, but the set of allowed characters is extended beyond hex 7F. The IRI definition provides a mechanism to transform any IRI to a canonical form which conforms to the URI syntax: each character outside of the allowable URI set is converted to a sequence of one or more octets using UTF-8, and each octet is then converted to ‘%xx’, where ‘xx’ is the hex value of the octet. This mapping from an IRI to a URI produces a syntactically valid URI, and it is an unambiguous transformation. Applying it to an existing URI has no effect, since every URI is, syntactically, already a valid IRI.
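
The core of this mapping can be sketched in Python. The sketch is deliberately simplified: it percent-encodes every non-ASCII character via UTF-8 and leaves ASCII characters alone, and it does not attempt the special handling of internationalized domain names that RFC 3987 also describes. The function name `iri_to_uri` and the example host are hypothetical.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode each non-ASCII character as its UTF-8 octets."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)  # already allowed in a URI, or already escaped
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


uri = iri_to_uri("http://www.example.com/caf\u00e9")
print(uri)
# Applying the mapping to something that is already a URI changes nothing.
print(iri_to_uri(uri) == uri)
```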


DCC Workshop on Persistent Identifiers
(August 2005)

A Digital Curation Centre (DCC) Meeting on Persistent Identifiers was held June 30 - July 1 at the University of Glasgow.   A meeting report is available in Ariadne.


A URI Framework for Controlled Vocabulary Terms and Codes
(July 2005)

The library community has an interest in the development of a framework to represent controlled vocabulary terms and codes as URIs. The framework could extend to data/metadata elements. No such framework yet exists; this article takes a preliminary look at some approaches considered.

  1. Assign ‘http:’ URIs
    The main benefit of this approach is that the DNS facilitates the decentralized creation of ‘http:’ identifiers. Another major feature is that all browsers recognize ‘http:’.

    This approach has drawbacks though. Overuse of the ‘http:’ scheme, out of convenience, causes considerable confusion - an ‘http:’ URI is supposed to be resolvable, although protocol experts do point out that according to a careful reading of the http protocol, this isn’t strictly true. But not everyone is comfortable with that argument, as the ‘http:’ protocol is defined by a huge, complex document that us simple folk will never read carefully. 

    And the confusion caused by using an ‘http:’ URI for a pure identifier is illustrated in the example (see Identifiers vs. Resolvable URIs, June 2005) where an XML namespace is identified by an 'http:' URI. People cannot resist the urge to click on http://www.loc.gov/z3950/agency/zing/srw/diagnostic/. The maintenance agency for that XML namespace receives regular “broken link” reports, because it is a pure identifier and does not resolve.

    Another argument for using ‘http:’ is that any URI, even for a controlled term, though seemingly an identifier, should resolve to something, even if only a human-readable definition of that term, and that “something” would likely resolve via http. There are a couple counterarguments here. When resolution is a secondary/incidental function, it often is neither reliable nor predictable. And when resolution is a primary function, then conceivably (perhaps likely), a protocol other than HTTP would be used for resolving terms.

    One of the key features of the ‘http:’ scheme, decentralization, may be a drawback for controlled terms, where decentralization isn’t always necessarily desirable. It might be useful to have some coordination exercised over the authority to register identifiers, so that, for example, a given term isn’t defined many times.
  2. New URI scheme
    One of the alternatives to using ‘http:’ URIs is to define and register a new URI scheme. The benefit of this approach is name recognition. For example, consider the hypothetical URI:

    terms:marcOrg/alaldse


    This would be an identifier for the MARC organizational code 'alaldse' (see "Sub Schemes" below), based on a URI scheme, 'terms:'. Casting codes in this manner would give the 'terms:' scheme much more visibility than if it were cast within the 'http:' framework, for example, as:

    http://www.loc.gov/terms/marcOrg/alaldse



    The drawbacks of this approach are (1) browsers are not going to recognize an unlimited (or even a large) set of URI schemes, and (2) URI schemes are difficult to register.
  3. Sub Schemes
    An alternative to the above two approaches -- (1) ‘http:’, and (2) new URI scheme -- is to define sub schemes: “namespaces” within existing schemes. Schemes that provide sub schemes are ‘urn:’ and ‘info:’.
    Note: 'info:' is not yet an approved URI scheme, so some may take issue with its characterization as an "existing" scheme. We consider it to be a de facto scheme.
    Suppose (as above) we want to assign URIs for MARC organization codes. The code 'alaldse'   (used in the above example) is used to represent: “Duck Springs Elementary School (Attalla, AL)”. A possible URI for this code would be:

    info:terms/marcOrg/alaldse


    This assumes that an info namespace, “terms” is defined, and also assumes a sub-authority “marcOrg” – all of this is hypothetical, just an example.  Or (with similar assumptions on URN), it could be represented as:

    urn:terms/marcOrg/alaldse


    Which should it be: 'info:' or 'urn:'?  This might depend on whether the URI is to be in the identifier or resolvable class. (see Identifiers vs. Resolvable URIs, June 2005.)

    There is talk of a protocol function that would be defined for terms, so that, for example, 'urn:terms/marcOrg/alaldse' would actually resolve to the string “Duck Springs Elementary School (Attalla, AL)”. In that case, and if this sort of resolution is considered a primary function of the URI, then perhaps it should be cast as a URN. If the URI is intended primarily as an identifier, then perhaps it should be cast as an 'info:' URI. In general, URNs are resolvable and 'info:' URIs are not (though there are exceptions for both).
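
As a rough illustration of the three approaches, the following Python sketch casts the same MARC organization code in each of the forms shown above and then reads back the scheme, which is the only part a generic consumer can rely on. The function names cast_term and scheme_of are invented for this example, and the 'terms' scheme and namespaces remain hypothetical, exactly as in the text.

```python
from urllib.parse import urlsplit

# Hypothetical base for the 'http:' casting, taken from the example above.
HTTP_BASE = "http://www.loc.gov/terms"


def cast_term(vocabulary, code):
    """Return the same term cast in each of the forms discussed above."""
    return {
        "http": f"{HTTP_BASE}/{vocabulary}/{code}",      # approach 1
        "new-scheme": f"terms:{vocabulary}/{code}",      # approach 2 (hypothetical scheme)
        "info": f"info:terms/{vocabulary}/{code}",       # approach 3 (hypothetical namespace)
        "urn": f"urn:terms/{vocabulary}/{code}",         # approach 3, as written in the text
    }


def scheme_of(uri):
    """Return the URI scheme without attempting any resolution."""
    return urlsplit(uri).scheme


forms = cast_term("marcOrg", "alaldse")
for label, uri in forms.items():
    print(label, scheme_of(uri), uri)
```

Nothing here resolves anything; the sketch only shows that the choice among the approaches changes the scheme a consumer sees, not the term being identified.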

We look forward to exploring these ideas further in subsequent articles.


Identifiers vs. Resolvable URIs
(June 2005)

It is useful to distinguish a URI whose primary purpose is to serve as an identifier from one whose primary role is to access a resource. Thus we have the identifier and resolvable URI classes.

This is a useful abstraction for modelling, not a dichotomy - often, there isn’t a clean distinction, and some URI schemes don’t fall neatly into either class. Identifier URIs may also resolve (for example, to a description of the identified object), and certainly, resolvable URIs serve as identifiers. The distinction is by primary role.

Identifier Class
In the identifier class we have for instance XML namespace identifiers, and protocol objects. An example of both is found in the following XML fragment.

<diagnostic xmlns="http://www.loc.gov/zing/srw/diagnostic/">
<uri>info:srw/diagnostic/1/38</uri>
<details>10</details>
<message>Too many boolean operators, the maximum is 10.
Please try a less complex query.</message>
</diagnostic>

This is a portion of an SRW response; it returns a diagnostic to an SRW client. The URI "http://www.loc.gov/z3950/agency/zing/srw/diagnostic/" identifies the namespace for the XML element <diagnostic>. The URI "info:srw/diagnostic/1/38" is an identifier for the actual diagnostic.

"http://www.loc.gov/z3950/agency/zing/srw/diagnostic/", is not a resolvable URI; if you click on it you're told: “Page Not Found”. It identifies an XML namespace, which is an abstraction (it has no physical manifestation) so it would be meaningless to "resolve to the namespace".  That's not to say it couldn't resolve to something (for example, a human-readable description of the namespace) but whatever it resolved to would be unpredictable and not machine-processible. So this URI is in the identifier class -- whether it resolves or not is incidental; its primary purpose is to identify.

Similarly, "info:srw/diagnostic/1/38" is an identifier, in this case identifying a diagnostic condition, and presumably the consumer of this URI (an SRW client) will look up this URI in its local diagnostic table. This URI could resolve, for example to the string:  "Too many boolean operators, the maximum is 10. Please try a less complex query."  This would serve no purpose in terms of protocol operation, though it might be useful for a protocol developer, but again, that would be incidental resolution only. Thus the primary purpose of this URI is to identify an object and so it too is in the identifier class.
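
To make the "identify, don't resolve" point concrete, here is a minimal Python sketch of how a client might handle the fragment above. It compares the namespace URI as a plain string (using the value exactly as it appears in the fragment) and looks the diagnostic URI up in a local table; no network access occurs. The table contents and the function name read_diagnostic are assumptions made for illustration, not part of SRW.

```python
import xml.etree.ElementTree as ET

# Namespace URI as written in the fragment; it is only compared, never fetched.
DIAG_NS = "http://www.loc.gov/zing/srw/diagnostic/"

# Hypothetical local diagnostic table held by the client.
LOCAL_DIAGNOSTICS = {
    "info:srw/diagnostic/1/38": "Too many boolean operators",
}

RESPONSE = """<diagnostic xmlns="http://www.loc.gov/zing/srw/diagnostic/">
<uri>info:srw/diagnostic/1/38</uri>
<details>10</details>
<message>Too many boolean operators, the maximum is 10.
Please try a less complex query.</message>
</diagnostic>"""


def read_diagnostic(xml_text):
    """Return (uri, details, local meaning) for a diagnostic element."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{DIAG_NS}}}diagnostic":
        raise ValueError("not a diagnostic in the expected namespace")
    uri = root.findtext(f"{{{DIAG_NS}}}uri")
    details = root.findtext(f"{{{DIAG_NS}}}details")
    # The identifier is a lookup key, not something to dereference.
    meaning = LOCAL_DIAGNOSTICS.get(uri, "unknown diagnostic")
    return uri, details, meaning


print(read_diagnostic(RESPONSE))
```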

Here's another example, an identifier for an XML schema.

info:srw/schema/1/mods-v3.0

This is an identifier URI, not actionable. It is used within protocol to identify a schema, in this case the MODS schema at http://www.loc.gov/standards/mods/v3/mods-3-0.xsd. An SRW request includes a parameter allowing the client to request that response records be returned according to a specific schema. If the MODS schema is requested, this URI is supplied as the value of that parameter.

Resolvable Class
Resolvable URIs, referred to informally as URLs, retrieve an object, access a resource – these are your basic actionable (also referred to as "dereferenceable") URIs. When you click on http://www.loc.gov/standards/uri/news.html, for example, your expectation is that the web page URI Resource Page: Latest News will appear.

As we noted above, info:srw/schema/1/mods-v3.0 is an identifier for a schema. That schema may reside in several places. One is http://www.loc.gov/standards/mods/v3/mods-3-0.xsd and another is http://www.loc.gov/srw/mods-3-0.xsd; both are resolvable URIs. A third is http://www.loc.gov/z3950/agency/zing/srw/mods-3-0.xsd, which is a different URI but the same location as the second. These three URIs serve well as locators, that is, for retrieving an object, but not as identifiers, because they are neither unique nor persistent.

A schema table lists both the identifier and a retrieval URL for a number of schemas used by SRW. The identifier is used within protocol exchanges. The URL would be used (for example by developers) to retrieve the schema. For example (aside from MODS), the Dublin Core schema, at URL http://www.loc.gov/srw/dc-schema.xsd, is identified by the URI info:srw/schema/1/dc-v1.1.
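
The relationship between one identifier and several locators can be sketched as a small lookup table. The Python below is only an illustration built from the URIs mentioned in this article; the real schema table may be organized differently, and the function names are invented here.

```python
# Identifier -> known retrieval locations. The identifier is what travels in
# protocol exchanges; the locations may change without affecting it.
SCHEMA_TABLE = {
    "info:srw/schema/1/mods-v3.0": [
        "http://www.loc.gov/standards/mods/v3/mods-3-0.xsd",
        "http://www.loc.gov/srw/mods-3-0.xsd",
        "http://www.loc.gov/z3950/agency/zing/srw/mods-3-0.xsd",
    ],
    "info:srw/schema/1/dc-v1.1": [
        "http://www.loc.gov/srw/dc-schema.xsd",
    ],
}


def locations_for(identifier):
    """Return every known retrieval URL for a schema identifier."""
    return list(SCHEMA_TABLE.get(identifier, []))


def identifier_for(url):
    """Map a retrieval URL back to the single identifier it serves, if any."""
    for identifier, urls in SCHEMA_TABLE.items():
        if url in urls:
            return identifier
    return None


print(locations_for("info:srw/schema/1/mods-v3.0"))
print(identifier_for("http://www.loc.gov/srw/dc-schema.xsd"))
```

Three different locators map back to the same MODS identifier, which is why the locators themselves make poor identifiers.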

Resolvable URIs do not need to be 'http:' URIs. For example, the URI:

urn:nbn:se:uu:diva-3475


resolves to a bibliographic description of a doctoral thesis: Modelling Chemical Reactions: Theoretical Investigations of Organic Rearrangement Reactions.

And as a hypothetical example, suppose we develop a URI scheme, 'terms:' , and define the URI:

terms:marcOrg/alaldse

for the MARC organization code 'alaldse', the code for Duck Springs Elementary School (Attalla, AL). This could resolve to the string: “Duck Springs Elementary School (Attalla, AL)”.

We plan to explore this further in next month's article.


The 'tag' URI Scheme
(April 2005)

The IETF recently approved the 'tag' URI scheme (see approved schemes), for the creation of unique identifiers. 'tag' URIs are used purely to identify objects; there is no associated resolution mechanism.

The ‘tag’ URI responds to a need for identifiers that will remain unique; are easy to create, read, type, and remember; and which do not require a central registration. 'tag' proponents point out that 'tag' has advantages over other "pure identifier" schemes:

  • UUIDs are hard to read.
  • OIDs, DOIs, and 'info' URIs require registration of naming authorities.
  • URLs (e.g. ‘http’) are not well-suited to be pure identifiers because they give the illusion of resolvability. They are after all (by definition) "resource locators". People by habit will try to resolve an 'http' URI, even when there is no resource accessible or locatable. This problem is compounded by nearly every editor in the world turning any string beginning with 'http://' into a hot link.

    In addition, various URI experts point out:
  • URNs are not well-suited to be pure identifiers; see, for example, Well then, why not just use URN URIs?

The following (at http://taguri.org, from Sandro Hawke, one of the original developers of this scheme) is a brief explanation-by-example of how to create a 'tag' identifier:

I (Sandro) have a dog named Taiko, which is a fairly obscure name, but I can't be sure he's the only dog on the planet with that name. I want to be able to talk about him using just his name (without reference to myself, the town I live in, etc) and I want to be sure people will not accidentally think I'm talking about some other dog also named Taiko. So I'm going to give him a tag URI.

Step 1. Identify myself. I have two choices: I can use one of my e-mail addresses (sandro@hawke.org, sandro@w3.org, sandro@world.std.com) or I can use a domain name assigned to me (such as hawke.org). I could also use a shared domain name (w3.org) if I had explicit permission from the domain holder.

Step 2. Pick a date. It's possible that in 100 years my great grandson Sandro Hawke IV will be using "sandro@hawke.org" for e-mail. He may even have a dog named Taiko, and I still want my tag to name my Taiko, not his. So I pick some date during which the address "sandro@hawke.org" was definitely mine. I'll pick yesterday, Tuesday, June 5, 2001.

Step 3. Encode the date as characters, using ISO 8601: "2001-06-05". If I had picked the first day of a month, back in step 2, I would not include the day. If I had picked the first day of a year, I would not include the month or day.

Step 4. Pick a unique name for the object. But it only has to be unique for the already-chosen identity and date. "Taiko" seems like a fine choice here. I don't want to use a name like "1", because then I'm much more likely to get confused and accidentally call my other dog "1". I also want to avoid accidentally reusing a name, but by always using the previous day's date I essentially eliminate that risk: I only need to remember names for the rest of the day.

Step 5. Combine them like this: tag:hawke.org,2001-06-05:Taiko.

So to assign a 'tag' URI, simply: take your email address together with a date on which you can assert that the email address belonged to you; the combination provides a unique namespace, and you are the authority.  (The date can simply be a year, if the email address belonged to you on the first day of that year.) So for example, the individual who had the email address rden@loc.gov on January 1, 2005, could (if he wanted to) assign 'tag' identifiers to his children and cats:

tag:rden@loc.gov,2005:annie
tag:rden@loc.gov,2005:sammy
tag:rden@loc.gov,2005:pepper
tag:rden@loc.gov,2005:shadow

Note that email addresses may be used in lieu of domain names. The 'tag' creators wanted a system that does not rely on domain names; many organizations and individuals do not have a domain name, but almost all do have some form of unique base identifier such as an email address.  A domain name can be used if the assigner owns it (as in the hawke.org example).  In any case, assignment of a 'tag' identifier never requires coordination or communication with any other authority or assigner.

The base identifier might not provide sufficient qualification forever; for example, a different person may have the email address rden@loc.gov in the year 2105.  But when qualified by a date as in the examples, the combination of base identifier and date should remain unique, as long as the new owner of that base identifier conforms to the naming algorithm.
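
The five steps quoted above can be sketched in a few lines of Python. This is a minimal illustration that follows only the steps as described in this article; it does not validate the authority or the name against the full 'tag' specification, and the function names are invented for the example.

```python
from datetime import date


def encode_date(day):
    """Step 3: ISO 8601, dropping the day (and month) when they are the first."""
    if day.month == 1 and day.day == 1:
        return f"{day.year:04d}"
    if day.day == 1:
        return f"{day.year:04d}-{day.month:02d}"
    return day.isoformat()


def make_tag(authority, day, name):
    """Step 5: combine authority (email address or domain), date, and name."""
    return f"tag:{authority},{encode_date(day)}:{name}"


# The two examples from the text.
print(make_tag("hawke.org", date(2001, 6, 5), "Taiko"))
print(make_tag("rden@loc.gov", date(2005, 1, 1), "annie"))
```

Note that nothing in the construction contacts a registry or any other party, which is the point of the scheme.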


Historical URI Schemes
Draft Guidelines Version 3
(March 2005)

As noted last December, an Internet draft, Guidelines and Registration Procedures for new URI Schemes, provides guidelines for defining, registering, and evaluating proposed URI schemes, and procedures for registering new schemes. That draft had some shortcomings, primarily its provision for duplicate scheme names. A new version, developed in late February, is intended to address this problem.

A new class of schemes, provisional, had been defined for schemes requiring less technical review than permanent schemes, which must undergo rigorous expert review. Provisional schemes, according to the earlier draft, may share names with existing schemes. That possibility of duplicate scheme names caused considerable controversy. (We reported in January that there were mixed feelings about duplicate scheme names. That's changed; duplicate scheme names now seem to be universally regarded as bad.) The new draft proposes a way to avoid duplicate scheme names. It defines a third class, historical. Thus there would be three classes: permanent, provisional, and historical.

In defining (and justifying) this new class the document says, "In some circumstances, it is appropriate to note a URI scheme that was once in use or registered but for whatever reason is no longer in common use or the use is not recommended. In this case, it is possible for an individual to request that the URI scheme be registered (newly, or as an update to an existing registration) as 'historical'. Any scheme that is no longer in common use may be designated as historical; the registration should contain some indication to where the scheme was previously defined or documented."

So how does this new class address the problem of duplicate scheme names? Will it work? The answers are still unclear.

The move to revise the registration procedures was motivated by the proliferation of unregistered schemes. The burdensome registration procedures have produced a register out of touch with reality, as people simply define and use a scheme without bothering to register it. Streamlined registration procedures would not only provide incentive for a scheme developer to register a new scheme, but also provide a means to get existing unregistered schemes registered. But bringing these schemes out of the closet is going to turn up a lot of duplicate names.

On the other hand, many of the unregistered schemes have been abandoned, used very little, or never used at all. Among the unregistered schemes, some are considered (informally) "bogus", some are inactive but historically significant, and others are active. Those that are active should be registered as permanent or provisional; those that are inactive but historically significant should be registered as historical. It is hoped that the bogus schemes will then simply disappear.


URI Generic Syntax - Revision Complete
(February 2005)

We reported last December  on work to replace RFC 2396, URI Generic Syntax (1998), with a more contemporary and comprehensive document describing URIs. That work is now complete.

The resulting URI spec is now an official IETF standard, their 66th published standard: RFC 3986, Uniform Resource Identifier (URI): Generic Syntax (January 2005); authors: Tim Berners-Lee, Roy Fielding, Larry Masinter. It defines a single, generic syntax for all URIs.

In addition to replacing RFC 2396, RFC 3986 incorporates (and replaces) RFC 1808, Relative URLs (1995), and RFC 1738, Uniform Resource Locators (1994), though it excludes portions of RFC 1738 that addressed specific URI schemes; those portions will be updated as separate specs. It also obsoletes RFC 2732, which specified a format for literal IPv6 addresses in URLs.

See URI Generic Syntax for a summary of the syntax.


Duplicate Scheme Names: Good or Bad?
(January 2005)

As we reported in December there is an Internet Draft: Guidelines and Registration Procedures for new URI Schemes.

The draft neglects to clearly state that there cannot be duplicate URI scheme names registered, causing some controversy: Can this possibly be proper Internet architecture? Well perhaps, if on balance it does more good than harm.

The draft does note: “The goals for registering URI Schemes are to avoid (when possible) duplicate use of the same URI scheme name for different purposes, …”, apparently acknowledging the possibility of duplicates. The proposed registration rules are based on reality: it is possible to invent and deploy a URI scheme without IANA and IESG approval. The goal is to avoid duplication in the real world; assuring uniqueness in the registry doesn't do that, and it can result in the registry being out of touch with the real world.

Duplication is a bigger problem in some cases than in others. For example, suppose there are two fairly compatible schemes with the same name -- one is a minor (experimental) enhancement of the other, and the differences don’t clash. That might work. But if there are two registered schemes of the same name with completely different syntax and behavior, a developer writing a web browser might support one, while another developer might support the other. That compromises URI integrity.

There appear to be three conditions contributing to the problem of duplicate URI schemes: (1) private schemes, (2) abandoned schemes, and (3) malicious registration.

Private Schemes
This is the now-well-established practice of defining and deploying a URI scheme long before it is submitted for registration. If two independent groups inadvertently define 'widgy:' as a URI scheme, and later, both attempt registration, should only the first be allowed into the registry? Suppose the second to attempt registration was first to actually use the scheme name.

Abandoned Schemes
Dan Connolly of the W3C provides this scenario:
"Consider VendorCo who has just released WizBangTool that supports wizzy: URIs. Somebody files a bug that says 'your scheme isn't registered' so they follow their nose to the registry, only to find that some long-defunct sourceforge project registered wizzy: 5 years ago. If unique registration is a requirement, VendorCo's choices are to (a) change their software and register a wizzy2: uri scheme, or (b) ignore the process." Connolly notes that neither is a desired outcome and that this scenario is quite likely to occur, and he concludes that attempting to assure that all IANA-registered URI scheme names are unique is likely to produce a useless, irrelevant registry.

A suggested approach is to provide some means whereby a defunct provisional registration may be removed from the register, either by insisting that it remain only so long as an up-to-date specification and owner can be identified, or by giving some reserve power to the IESG to remove it. Removing the old wizzy should be no problem if it really is defunct. But, some suggest, if it turns out that people somewhere are still using it, VendorCo should be forced to use wizzy2.

Malicious registration
This is similar to the land-grab of internet domain names. There is some sentiment that a procedure that does not strictly enforce uniqueness will render this practice useless.

The discussion and debate on this continues.


Proposed new Registration Procedure for URI Schemes Introduces Provisional Class of Schemes (December 2004)

A new Internet Draft: Guidelines and Registration Procedures for new URI Schemes proposes procedures for new URI schemes, simplifying existing procedures and requirements by providing for provisional schemes requiring no technical review and which may share names with existing schemes.

The draft, if approved, will replace RFC 2717, Registration Procedures for URL Scheme Names, along with RFC 2718, Guidelines for new URL Schemes; both are 1999 RFCs.

RFC 2717 had defined a set of registration trees; one was the main tree (named 'IETF', managed by IANA), and there has always been a provision to approve additional trees. There have been two problems with this approach: nobody wanted their scheme to be "second class", and no such additional registration trees were ever approved.

The new system will not eliminate the first problem (provisional schemes may still be seen as second class), but the trees will be eliminated and all schemes, provisional and permanent, will fit into a single namespace.

Provisional schemes, which may be registered without passing any review process, will be useful for legacy URI schemes, widely deployed without registration, for which review would be inappropriate; they will also be useful for private or experimental use. The main requirement for a provisional URI scheme is that there must not already be a permanent scheme with the same name. Permanent status will apply where there is general agreement that the scheme meets the outlined criteria; permanent status is intended for use by IETF standards-track protocols and requires a substantive review and approval process.
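
The one hard rule for provisional registration stated above (no permanent scheme may already hold the name) can be modelled with a toy registry. The Python below is purely a sketch of that rule as this article describes it, not of the draft's actual procedure or of anything IANA runs; the class and method names are invented, and the lowercasing of names is an added assumption based on URI scheme names being case-insensitive.

```python
class SchemeRegistry:
    """Toy model: permanent names are unique; provisional names may repeat."""

    def __init__(self, permanent):
        # Assumption: compare names case-insensitively.
        self.permanent = {name.lower() for name in permanent}
        self.provisional = {}  # scheme name -> list of registrants

    def register_provisional(self, name, registrant):
        """Accept a provisional scheme unless a permanent one has the name."""
        key = name.lower()
        if key in self.permanent:
            raise ValueError(f"'{key}' is already a permanent scheme")
        self.provisional.setdefault(key, []).append(registrant)

    def duplicates(self):
        """Provisional names claimed by more than one registrant."""
        return {n: who for n, who in self.provisional.items() if len(who) > 1}


registry = SchemeRegistry(["http", "urn"])
registry.register_provisional("widgy", "group A")
registry.register_provisional("widgy", "group B")  # allowed: no review, no uniqueness
print(registry.duplicates())
```

The model makes the trade-off visible: the registry records what is actually in use, at the cost of listing the same name more than once.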

The primary intent of introducing provisional status is to discourage multiple definitions of URI scheme names for different purposes, while recognizing and accommodating this practice because it is not going to stop. There are cases where separate communities have already established differing uses of the same URI scheme name for different purposes.


New Draft of RFC 2396 (December 2004)

A new draft of Uniform Resource Identifier (URI): Generic Syntax, (September, 2004) has been released.

This is an update to RFC 2396 (1998) and also incorporates and replaces RFC 1738 "Uniform Resource Locators" (1994) and RFC 1808 "Relative Uniform Resource Locators" (1995).

RFC 2396 defines the generic syntax of a URI (which it defines as a compact string of characters for identifying an abstract or physical resource) and usage guidelines. It defines a grammar such that an implementation can parse the common components of a URI reference without knowing scheme-specific requirements. (It does not define a rigorous grammar to apply to every URI scheme. Each individual scheme specification must define a specific grammar.)
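
To show what "parsing the common components without knowing scheme-specific requirements" looks like in practice, here is a short Python sketch. The regular expression is the one given in Appendix B of RFC 2396 (and carried over unchanged into RFC 3986); the Parts type and the split_uri function are names made up for this example.

```python
import re
from typing import NamedTuple, Optional

# Regular expression from Appendix B of RFC 2396 / RFC 3986.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


class Parts(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def split_uri(reference):
    """Split a URI reference into its five generic components."""
    m = URI_PATTERN.match(reference)  # every group is optional, so this always matches
    return Parts(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))


for uri in (
    "http://www.loc.gov/standards/uri/news.html",
    "info:srw/diagnostic/1/38",
    "urn:nbn:se:uu:diva-3475",
):
    print(split_uri(uri))
```

The same function handles 'http:', 'info:', and 'urn:' URIs alike; anything beyond these five components is left to each scheme's own specification.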


Registration of URI Schemes (August 24, 2004)

At a recent meeting (see http://lists.w3.org/Archives/Public/uri/2004Aug/0007.html) of the committee overseeing the development of URI technology (a joint IETF/W3C group) registration of URI schemes was discussed. There appears to be general agreement that the process is broken: The public perception of URI scheme registration is at odds with reality. There are many schemes whose attempted registration has languished for years, for lack of any deterministic process for either registering or rejecting them.

There are guidelines for URI schemes, and in general a scheme is supposed to meet these guidelines in order to be registered (that is, to be listed in the Official IANA Registry of URI Schemes at http://www.iana.org/assignments/uri-schemes). However, exceptions have been made: widely deployed schemes have been registered even when they did not quite meet the guidelines. As a result, people tend to create a scheme, hope it will get widely deployed, and thus bypass guidelines and get registered.

The URI guidelines set a high bar, whose original purpose was to control the number of registered schemes. But people just keep inventing new schemes anyway and defer registration. Not surprisingly, there are now conflicting schemes (schemes with the same name; e.g., 'mms:' has different interpretations used by 3GPP and Microsoft).

Namespace conflict is probably the most serious potential problem that URI technology faces. One suggestion was to abandon the idea that registration will reduce the total number of schemes, and to make eliminating namespace conflicts the primary purpose of registration. However, the quality control advocates still want some barrier.

Suggestions:

  • A form of registry that might set some line -- schemes below the line are "not as good as" schemes above the line.
  • A provisional registration that provides a specification or an implementation pointer, for six months.
  • A rule that if a proposal already has a provisional registration and a specification, it wins.
  • A requirement that a proposed scheme have two different implementations.
  • Two classes of schemes: those with a published specification and those without.
  • Discouragement of non-protocol schemes.
  • A register of implementations of URI schemes. Rather than setting a threshold ("must have at least 1 implementation"), just document the values in the registry and let people reach their own conclusions: "If barriers are established, people will do whatever they do anyway."

There was discussion of abuse -- registering URI schemes with other people's trade names, etc. One suggestion is that perhaps multiple registrations for the same scheme might be allowed: document the usage and let the antagonists fight it out. There isn't much sentiment for that suggestion, though; it was observed that the web simply doesn't work with conflicting namespaces.

 

 

  October 28, 2008