Federal CIO Council XML Working Group
Meeting Minutes, October 17, 2001
GSA Headquarters, Auditorium

Please send all comments on/corrections to these minutes to Laura Green.

 

Working Group co-chair Owen Ambur convened the meeting at 9:00 a.m. at GSA Headquarters. Attendees introduced themselves.

 

XML Query

Jeff Griffith made a few introductory remarks regarding the Library of Congress’ interest in XML.  Mr. Griffith told the WG that the Library of Congress’ interest in XML stems from the fact that the House and Senate have begun to prepare congressional documents in XML for both printing and publication.  The Library of Congress has a retrieval system for these documents, and they are looking to leverage XML tag data for searching.

Paul Cotton of Microsoft then delivered a presentation on XML Query.  This presentation is available on the xml.gov website in both HTML and PowerPoint formats.

Mr. Cotton began by discussing the history of XML Query.

The earliest work on querying XML dates back to querying SGML

Recently at an SGML conference there was a “bakeoff” where vendors demonstrated SGML queries

This area is not new - what is new is the standard

I first became involved in this area in W3C in 1998

At that time, there was a different query language for each different application

Many of us came from database backgrounds - thought that this did not make sense

There was no consensus in the standards committee

In February 1998, Jonathan Robie and I and some others put together an XQL proposal

This proposal took some work that had been done in the XPath area and tried to make it richer

In August 1998, a submission was made to W3C for XML-QL

Many of the authors of that submission are on the XML Query Working Group today

About the same time, there were more and more efforts by people saying that we need a generic query language

When the W3C is confronted with a new technical problem like this, rather than starting a new WG it often holds a workshop to gather input and expose the original problem statement to a larger community

In November 1999, the W3C published the XPath recommendation at the same time as the XSLT recommendation

I wrote a summary paper of all the other submissions along with someone else

The URL of the summary paper is in the presentation

You can also see the paper of David Maier

He gives 10 to 12 requirements of an XML query language

We checked our work with this paper

As a result of the workshop, there was a large amount of interest - database vendors came out in full force

This was a surprise to many people, because we went from 1998, where we said we didn’t need a query language, to this level of interest 1.5 years later

In July 1999, the WG was re-chartered as part of a re-charter of XML activity

The WG has been re-chartered twice by W3C

After the workshop, I wrote a charter for the Query WG

I’ve been spending 50 to 75% of my time with the XML Query WG

There are currently 30 W3C member companies

We have teleconferences about once a week

We publish working drafts every 3 months

We are working on a recommendation track to make XML Query into a W3C recommendation

The goal of the WG is to produce a data model for XML documents, a set of query operators for those documents and a query language based on those operators

In January 2000 we produced a requirements document

This is the first step in all WGs

You need to take the charter and then expand it into requirements statements that can be used to test the success of the product

In May 2000, the first version of the data model was produced

In May 2000, we were also confronted by XML Schema

When a WG thinks it has a functionally complete spec (by comparing it with the requirements document), that stage is called “last call”

The name “last call” is a warning to other people in the W3C and the public that this is the last time we will ask you for your comments on functional completeness

This is somewhat of a strange circumstance to find ourselves in

Those who have backgrounds in IT and OO will find it strange to have one WG working on a data model and another WG working on the operations on that data model

The XML Schema specification tells you whether or not an instance adheres to the specification

There are no operators involved

In May 2000, we tried to figure out whether the type system being defined by XML Schema was anything close to what we can build operators on

We have a very close relationship with the XML Schema WG

Since last summer we have been combining our face-to-face meetings

This set XML Query back at least 3 to 4 months in its overall schedule because of needing to examine XML Schema

On May 2, XML Schema became a recommendation

In August 2000, we revised the requirements document with a fair number of use cases

I believe very strongly in use cases - having these along with a set of sample queries is very useful

If you go to the Microsoft web site where you can execute XQuery, you will find many of these use cases

In December 2000, a query algebra document (set of operators) was published

In Feb. 2001, we re-published the query requirement - we carved out use cases into their own document because of how important they were

In June 2001, we released a complete new set of working drafts

As soon as you see this list you will realize that this is no small effort

When XML Query is done it will probably be one of the largest specifications that W3C has ever done

In printed form it is between 300 and 500 pages

XPath was published in 1999

18 months ago, you had the Query WG doing a data model, you had XPath doing a data model, and DOM has an implicit data model

In the W3C, it looked like every WG had its own data model

Now, XSL and Query WG have gotten together to create a single data model

This will be the basis for XML Query 1.0 and XPath 2.0

Some of the sample queries will look like XPath statements

In June 2001, we published the syntax for the language a second time - called XQuery

Name of WG is XQuery working group

Finding a name for a WG is tricky - we went out on the Web and found that only Software AG was using that name

Software AG agreed to let us use that name

Regarding databases of queries:  If you store your results away, you may want to take an incoming query and query your database of queries to see if that query exists already

Having an XML representation of queries that you can run XQuerys on is a very powerful use case

I’d like to touch on the data model requirements

XML 1.0 tells what is acceptable as input to an XML parser

But nowhere does it tell you what a parser should provide to the execution environment that is using that parser

W3C Infoset is the specification that gives that information

It says that if you have a conforming XML parser, you should provide the following information about the well-formedness or validity of the document

PSV - Post Schema Validation infoset - is the interface between XQuery and Schema

PSV takes the infoset that you get from parsing a document, along with additional information from the PSVI

It combines these and produces a dataset

Many vendors will give you an XML representation of what is in your database - they will have you run queries over these “XML views”

With this model, you can take the PSV and PSVI of that XML view and generate a data model that XQuery then runs over

Since you run XQuery over the data model (which doesn’t actually exist - it is an abstraction) we open up the possibility of using XQuery over the Web as long as people can provide their data in XML format

The Library of Congress gave us use cases saying these are the kinds of queries we want to do

This is the best possible way for us to understand what people need

The XQuery language is a completely functional language

It can be nested with full generality, unlike SQL

The input and output of XQuery are actually instances of the data model

This is extremely important because it allows you to process “virtual XML”

XQuery is based on predecessor query languages - OQL, SQL, XML-QL, XPath

You can also add XQL (was left off the list)

We also have XQueryX - an XML representation of an XQuery

We are continuing to ask for public feedback on what the form of the language should look like

Jonathan will now perform a demonstration

Mr. Robie asked the audience about their XML experience and gave a presentation of various XML Query operations

We will talk about how to create an XML structure, how to identify nodes in XML structure, and how to restructure data from one format to another format

For example, if you have a set of invoices you may want to generate “customers by geographic region” - this has a different format from the original document

To create structures, just type them in

That is an element (typed in an element)

Executed query and showed result: “This is an element”

Talked about white space and its significance

We can also look for structures within XML

Demonstrated the use of attributes

Next we will talk about looking for things within XML structures

To do that, I need to know where the XML document is that I will be querying

I will take a bibliography as an example - a list of books written by various authors

We have multiple books by the same author, multiple books by the same publisher

// - means “somewhere within the document”

/ - only finds things at top level

We are looking for authors in a database - we don’t want to see an author twice

To accomplish this, we use “distinct” to distinguish unique values
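The path and “distinct” ideas above can be sketched as follows. This is a minimal illustration in Python, using the standard library's ElementTree rather than XQuery, because the minutes do not record the queries typed during the demonstration. The bibliography content and all variable names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography; the actual demo data is not recorded in the minutes.
BIB = """<bib>
  <book><title>Book One</title><author>Smith</author></book>
  <book><title>Book Two</title><author>Jones</author></book>
  <book><title>Book Three</title><author>Smith</author></book>
</bib>"""

root = ET.fromstring(BIB)

# Analogous to "//author": authors anywhere below the starting point.
anywhere = [a.text for a in root.findall(".//author")]

# Analogous to "/author": only direct children of the starting point.
# Authors sit inside <book>, so nothing is found at this level.
top_level = [a.text for a in root.findall("author")]

# Analogous to "distinct": keep each value once, in first-seen order.
distinct_authors = list(dict.fromkeys(anywhere))

print(anywhere)
print(top_level)
print(distinct_authors)
```

The descendant search returns Smith twice, which is why a “distinct” step is needed before listing authors.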

We can put in predicates, etc. because we have XPath available to us

But we currently support only the “abbreviated” form of XPath

To find a node within an XML doc, we start restructuring what we have here in a couple of different ways

We will start with a FLWR expression

We say “for each author”…

What kind of a result do we want?

First, we will write an element constructor to create a “books by author” element (/booksByAuthor)

What do you think this query will give?

We need to put curly braces around it to say “execute this thing”

We want to list the books written by each author

First, we take the variable and put it in an element that will separate the books written by this author from books written by other authors

Changed “booksByAuthor” tag to “author”

Then we create the “name” element

Now, we need books written by the author - there are a couple of ways to do that

We can use the “let” clause - set $b equal to the set of books written by one author

Put predicate at end ([]) - set author to $a

Now we can put the $b variable within curly braces - but you really only want the titles of the books

This is a fairly different structure from the original structure we had - authors were at the bottom, now they are at the top
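The restructuring just described (one “author” element per distinct author, with a “name” and the titles of that author's books) can be approximated as below. This is a sketch in Python with ElementTree, not the XQuery shown in the demo. The sample data, the “results” wrapper element, and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography in which authors sit at the bottom of the structure.
BIB = """<bib>
  <book><title>Book One</title><author>Smith</author></book>
  <book><title>Book Two</title><author>Jones</author></book>
  <book><title>Book Three</title><author>Smith</author></book>
</bib>"""

def books_by_author(bib_xml):
    root = ET.fromstring(bib_xml)
    result = ET.Element("results")  # assumed wrapper element
    # Roughly "for each distinct author $a"
    names = list(dict.fromkeys(a.text for a in root.findall(".//author")))
    for name in names:
        author = ET.SubElement(result, "author")
        ET.SubElement(author, "name").text = name
        # Roughly "let $b be the books whose author equals $a"
        books = [b for b in root.findall(".//book")
                 if any(a.text == name for a in b.findall("author"))]
        # Return only the titles, not the whole book elements.
        for book in books:
            ET.SubElement(author, "title").text = book.findtext("title")
    return result

restructured = books_by_author(BIB)
print(ET.tostring(restructured, encoding="unicode"))
```

In the output the authors are at the top and the titles are nested beneath them, the inverse of the input.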

Talked about car/wheels analogy - do you want to use the entire car and put it in a garage, or use its parts?

XQuery - its approach gives best of both worlds

There is a bug in whitespace in the demo - demonstrated it by using the string function to get rid of first and last element in another kind of restructure

Next we want to take a different kind of document data - we are going to take a form of very loosely structured data that comes from early work in the medical community

We have an operating procedure - this comes out of the HL7 demos

We want to get some information out of this document

The first point I’d like to make is that although many people like to talk about documents and data, you and I read documents to get information out of them

There is a second kind of data - data easily managed by our software systems

This data can be put into rows and columns

We may want to find the average temperature of a patient over a period of time

We may want to find out what instruments are being used for surgery

That is a very data-like kind of query being done on a document

Another thing I might want to know is about incisions

Showed text with various tags interspersed

Joe Carmel (US House of Representatives) asked if you can search for elements within documents as well - we are only searching for tags now

Mr. Robie demonstrated this by updating a query to search for “electrocautery” word in “incision” element - this worked
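A search for a word inside a particular element, like the “electrocautery” example, might look like the following in Python with ElementTree. The report text and the helper name `elements_containing` are hypothetical, and the match is a simple case-insensitive substring test rather than whatever the demo engine used.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of an operative report.
REPORT = """<report>
  <section>
    <title>Procedure</title>
    <incision>The incision was made using electrocautery.</incision>
    <incision>A second incision was made with a scalpel.</incision>
  </section>
</report>"""

def elements_containing(root, tag, word):
    """Return elements named `tag` whose text content contains `word`."""
    word = word.lower()
    return [el for el in root.iter(tag)
            if word in "".join(el.itertext()).lower()]

root = ET.fromstring(REPORT)
hits = elements_containing(root, "incision", "electrocautery")
print(len(hits))
```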

In documents, the sequence is important - many times the sequence makes a difference in meaningfulness of data

To search for procedures, can look for sections whose title is “procedure”

Sections and titles are represented as “//section.title”

Query worked - returned a procedure

But we really want to set a condition

FLWR = For, Let, Where, Return

For - assigns a variable to each value returned by an expression

Can iterate over a set of procedures - i.e. for each procedure the value of “p” will be one of those nodes - get one return for each procedure

Let - binds the variable to the whole result of an expression, does not iterate over it

Where - allows you to put a condition on the for and let - if it is not satisfied the return will not be executed
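As a rough analogy for the four clauses, the loop below plays each role in Python. It is not XQuery syntax, and the section data and function name are invented for illustration.

```python
# Hypothetical sections of a report, as (title, text) records.
sections = [
    {"title": "History", "text": "Patient reports mild pain"},
    {"title": "Procedure", "text": "An incision was made"},
    {"title": "Procedure", "text": "The wound was closed with sutures"},
]

def procedure_word_counts(sections):
    results = []
    for s in sections:                  # For: bind s to each item in turn
        words = s["text"].split()       # Let: bind words to a whole value, no iteration
        if s["title"] == "Procedure":   # Where: skip the return if not satisfied
            results.append(len(words))  # Return: one result per surviving binding
    return results

counts = procedure_word_counts(sections)
print(counts)
```

The “History” section is filtered out by the condition, so only the two procedure sections contribute a result.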

Now will combine where with quantifiers - expresses a condition ranging over a collection

Demonstrated using $I for incision - for some incision satisfying a condition (use “satisfies” clause) - condition is that there is an anesthesia before an incision

Executed query

Then changed to anesthesia after incision
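The quantified condition (“for some incision, an anesthesia comes before it”) depends on document order. A minimal Python sketch follows, assuming hypothetical element names and report text; the helper `some_precedes` is not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical procedure section with interspersed tags.
REPORT = """<report>
  <section>
    <title>Procedure</title>
    <p>The patient was given <anesthesia>local anesthesia</anesthesia>.
    An <incision>incision was made using electrocautery</incision>.</p>
  </section>
</report>"""

def some_precedes(section, first_tag, second_tag):
    """True if some `second_tag` element has a `first_tag` element before it."""
    # iter() walks the tree in document order, so list position gives order.
    position = {id(node): i for i, node in enumerate(section.iter())}
    firsts = section.findall(".//" + first_tag)
    seconds = section.findall(".//" + second_tag)
    return any(
        any(position[id(f)] < position[id(s)] for f in firsts)
        for s in seconds
    )

section = ET.fromstring(REPORT).find("section")
before = some_precedes(section, "anesthesia", "incision")
after = some_precedes(section, "incision", "anesthesia")
print(before, after)
```

Swapping the two tag names corresponds to changing the query from “anesthesia before incision” to “anesthesia after incision”.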

Back to Paul Cotton

Mr. Cotton talked about current XQuery issues

In August, published the first version of functions and operators document - public says it is too big

Internationalization issues - text retrieval in English only will not be useful as an international format

Here are my and Jon’s e-mail addresses

There is a feedback email list - this is where the Library of Congress sent use cases

There is also a public feedback e-mail list

Joe Carmel (US House of Representatives) asked about the timeframe for Candidate Recommendation

Mr. Cotton stated that this is both a technical question and political question

The technical part addresses “When will it be done?”

On the political side, schedules are member-confidential

The documents are published today with an issues list - you can look at the open issues list

We will publish a few more working drafts, then it will be last call

It’s hard to say when Candidate Recommendation will happen

I should point out CR in W3C is totally optional - it was instituted about 18 months ago to force workgroups to demonstrate interoperability before going to implementation stage

Jerome Yurow (DOE) asked about the difference between PSV and PSVI

Mr. Robie responded that PSV means Post-Schema Validation, while PSVI means Post-Schema Validation Infoset - i.e. the information you add to an infoset after the validation of a document.

Mr. Cotton added that this includes type information, elements and attributes, etc.

Regarding optimization of XQuery:  There is a well-published set of literature - at Stanford - about the optimization of XPath expressions - whether you should execute the expression from top-down or bottom-up

Nested for’s in XQuery are no different than joins

Full-text queries are very appropriate to use in XQuery

ISO adopted a standard on how to perform an XML query on top of an SQL database - IBM and Oracle are co-authors

IBM and Oracle would not commit to this if they had any doubts about the ability to optimize XQuery

Optimizing XQuery is currently a very hot topic
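The remark that nested for's are no different than joins can be illustrated outside XQuery. The Python sketch below uses invented book and publisher records. It shows a nested loop with an equality condition and a hash-based lookup producing the same pairs, which is the kind of rewrite a join optimizer performs.

```python
# Hypothetical records: (title, publisher id) and (publisher id, name).
books = [("Book One", "p1"), ("Book Two", "p2"), ("Book Three", "p1")]
publishers = [("p1", "Acme Press"), ("p2", "Example House")]

# Nested "for" with a condition: compares every book with every publisher.
nested = [(title, name)
          for (title, pub_id) in books
          for (pid, name) in publishers
          if pub_id == pid]

# The same result as a hash join: index one side, probe with the other.
by_id = dict(publishers)
hashed = [(title, by_id[pub_id]) for (title, pub_id) in books if pub_id in by_id]

print(nested == hashed)
```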

Ed Luczak (DOE) asked if XQuery will allow a user or agent to perform a query over a collection of XML documents instead of a single document

Mr. Robie responded “definitely”

10-minute break

Pat Case then gave a presentation on Library of Congress Use Cases and Recommendations for Text Operators and Functions.  This presentation is available online in HTML format.

Everyone has wild expectations of what XML will do for them

The Library of Congress has another set of wild expectations - that XML will spawn one standard query language that will handle both structured text and full text

That it will let us query elements and their descendants

We would also like to see a complete robust set of text operations and functions

We would like consistent implementations by search engine and commercial database vendors

I don’t think that we are the only ones needing full text searches

We will end up with semi-structured data or, if we have money, full-structured data

Ms. Case then demonstrated use cases

We produced a number of bills in XML-tagged form and specified searches we would like to use

We specified the exact search operators: proximity operators (ordered/unordered), relevance operators, case, diacritics, etc.

Why do we need these text operators?  Here are some examples

Proximity operator - looking for bills on elementary education in the Legislative Information System (LIS)

This is similar to the Thomas system, but we have advanced search pages with Boolean operators that are not available to the general public

We cannot search within titles, because it is full text  - we are hoping to be able to do that with XQuery

Performed a search for “elementary education”

It will also allow intervening words, because you almost never see “elementary education” in a bill

For example, may see “A bill to improve elementary education”, or “A bill to improve elementary and secondary education”

Demonstrated use of “pre/3” operator - “elementary pre/3 education”
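An ordered proximity test in the spirit of “elementary pre/3 education” can be sketched in Python as below. The exact semantics of the LIS operator are not given in the minutes; this sketch assumes “pre/n” means the second word occurs after the first and at most n word positions later. The function name `pre` is only a label chosen here.

```python
import re

def pre(text, first, second, n):
    """True if `second` follows `first` within n word positions (assumed semantics)."""
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(0 < j - i <= n for i in first_positions for j in second_positions)

adjacent = pre("A bill to improve elementary education", "elementary", "education", 3)
intervening = pre("A bill to improve elementary and secondary education",
                  "elementary", "education", 3)
wrong_order = pre("Education at the elementary level", "elementary", "education", 3)
print(adjacent, intervening, wrong_order)
```

Under this assumption both bill titles from the example match, while a text with the words in the opposite order does not.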

We also need thesaurus support

For instance, with thesaurus support, you could type in “congressman” and it would automatically search on “senators” and “representatives”
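Thesaurus support of this kind amounts to expanding the query term before matching. A small Python sketch, with a one-entry thesaurus made up from the example above (a real thesaurus and its matching rules would be far richer):

```python
import re

# Hypothetical thesaurus containing only the example from the presentation.
THESAURUS = {"congressman": ["senators", "representatives"]}

def expand(term):
    """Return the term plus any related terms listed in the thesaurus."""
    return [term] + THESAURUS.get(term.lower(), [])

def matches(text, term):
    """True if the text contains the term or any of its related terms as a word."""
    words = set(re.findall(r"\w+", text.lower()))
    return any(t.lower() in words for t in expand(term))

found = matches("Several senators spoke on the floor", "congressman")
print(found)
```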

I sincerely believe that when you get more than 3 or 4 words in a tag in XML you need the full text support and proximity operators

Demonstrated truncation - “what did X say about Y”?

Used “robert pre/2 gates” example

Found string “Gate commented at length…”

We have a store of searches that are available to all reference librarians and those that serve CRS itself

This is a “search of last resort” - it is the kind of search you can build under a GUI for a novice user

We’ve asked the W3C working group to do some difficult things - if XQuery comes out without these tools we are stuck

We also had the goal to ask them for operators they have never seen before - for example, the Ignore operator

For example, you can search for estate tax but exclude (“not out”) real estate tax

This is a scary query for a librarian to execute because she will probably query on estate tax and then take the results and perform an additional query on them

But what if a <news-story> tag contains “To eliminate the estate tax” and “Subject to real estate tax” in 2 <text> tags?

A NOT operator will not include this in results (i.e. it will be incorrect), but the Ignore operator will simply ignore the second <text> tag
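The difference between NOT and the requested Ignore operator can be sketched with the news-story example. The Python below is an assumption about the intended behavior, not a definition taken from the Library of Congress proposal: Ignore is modeled as disregarding occurrences of the unwanted phrase before searching, and both function names are invented.

```python
import xml.etree.ElementTree as ET

STORY = """<news-story>
  <text>To eliminate the estate tax</text>
  <text>Subject to real estate tax</text>
</news-story>"""

def story_text(story):
    """Concatenate the lower-cased content of all <text> children."""
    return " ".join(t.text.lower() for t in story.findall("text"))

def match_with_not(story, wanted, unwanted):
    # NOT: reject the whole story if the unwanted phrase appears anywhere.
    full = story_text(story)
    return wanted in full and unwanted not in full

def match_with_ignore(story, wanted, ignored):
    # Ignore (assumed): blank out the ignored phrase, then search what remains.
    full = story_text(story)
    return wanted in full.replace(ignored, " ")

story = ET.fromstring(STORY)
not_result = match_with_not(story, "estate tax", "real estate tax")
ignore_result = match_with_ignore(story, "estate tax", "real estate tax")
print(not_result, ignore_result)
```

NOT drops the story even though it has a genuine estate tax hit, while the Ignore version keeps it.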

Demonstrated case study:  The Congressional Record from Full Text to XML

Demonstrated the search interface in LIS

You can search on any congress in the database

You can limit your search to a House or Senate session

You can search by the person (member or representative or senator)

We have automatic stemming (singular/plural); the system also offers truncation

We have connectors, functions and operators

If you know the exact title of a bill that is the best way to go - it will avoid bad hits

Otherwise, you can search on text in title

When I get the data in XML, I want control over it, and I want to let people do some fast, easy searches

There are some things I cannot do without XQuery - for example, a search across congresses

If we get enough data tagged we will have control like never before

Some people type in a bill number and get nonsense back because members often do not speak the bill number on the floor

But if we can tag within the debate tag, we can allow them to search for bill numbers

Committees and subcommittees - it is not easy to search on these right now

Word/Phrase box - demonstrated stemming, truncation

We would like thesaurus functions

We would like to let people type in FBI instead of Federal Bureau of Investigation

We would like to take advantage of the XQuery operators that are available now

To gain access to conference reports, you now have to use Search Tips

Regarding documents within the record - we would like people to get to them as quickly as possible

You cannot get to these directly now

There is also a print index to the congressional record

You can enter a specific date for which to browse the index

There is a directory of XML search engines on the Library of Congress web site

GoXMLSearch is one search engine on the list

What we don’t have on the list yet are the big folks like IBM - I hope they might offer full-text search capability and join this search

This is my wish list that I’ve delivered to the W3C working group

I think within government we have a very big need for full text

 

General Discussion

Owen – Mark raises issue regarding 4 focus areas called out in charter and need to begin addressing these.

Mark – charter identifies specific actions.  Current monthly meetings are more focused on marketing and information.  Strongly encourage the committee to hold separate working session with representation of all govt agencies to work on other areas.

Michael Jacobs – real need for design guidelines and policy.  DON is developing these.

Steve Vineski – EPA also looking for policy and design guidelines.  In the absence of federal-wide positions, EPA is being forced to develop their own.

Marion – Believe these should be addressed. Will DON and EPA share their work?

Michael Jacobs – DON will.

Steve Vineski – EPA will.

Owen – Am certainly open to having the group work on policy recommendations, as Mark has suggested.  However, folks will need to step forward to offer and contribute to such proposals for advancement through the CIO Council for consideration by OMB. Will alert the EIEITC tomorrow that interest has been expressed within the XML Working Group to advance some policy recommendations.

Dan Schneider – agencies are going through GPEA plans due next Monday.  It might be useful to keep an open mind, see what the outcome of all the agency updates to GPEA is, and maybe take some guidance from that when it is published.  We may get some revealing updates on what agencies might be putting their money into over the next year or two.

Recorded by Joe Chiusano, October 17, 2001.

 

List of Attendees:

Last Name    First Name   Organization
Ambur        Owen         Interior-FWS
Bennett      Daniel       CitizenContact.com
Cutting      Dean         State
Dalecky      Selene       GPO
Dodd         John         CSC
Douglass     Mike         Citrix Systems
Finley       Jack         GSA
Hunt         Jim          GSA
Jacobs       Michael      DON CIO
Kanaan       Muhan        DynCorp
Kern         Matt         Pci
Knight       Dolores      DTIC
LaPlant      Lisa         GPO
Luczak       Ed           CSC
Morgan       Bill         GSA
Reeves       Joel         GPO
Rice         Jim          Vitria
Schmidt      Elizabeth    Software AG
Schneider    Dan          DOJ
Shin         Dongwook     Futureexpert
Sinisgalli   Mike         Vitria
Smith        Rick         MPG
Stanco       Tony         GW CPI
Thunga       Ronjeeth     Humanmarkup.org
Turnbull     Susan        GSA
Vineski      Steve        EPA
Williams     Kevin        Blue Oxide
Yee          Theresa      LMI