Federal CIO
Council XML Working Group
Meeting Minutes, October 17, 2001
GSA Headquarters, Auditorium
Please send all comments on or corrections to these minutes to Laura Green.
Working Group co-chair Owen Ambur convened the meeting at 9:00 a.m. at GSA Headquarters. Attendees introduced themselves.
XML Query
Jeff Griffith made a few introductory remarks regarding the Library of Congress’ interest in XML. Mr. Griffith told the WG that the Library of Congress’ interest in XML stems from the fact that the House and Senate have begun to prepare congressional documents in XML for both printing and publication. The Library of Congress has a retrieval system for these documents, and they are looking to leverage XML tag data for searching.
Paul Cotton of Microsoft then delivered a presentation on XML Query. This presentation is available on the xml.gov website in both HTML and PowerPoint formats.
Mr. Cotton began by discussing the history of XML Query.
The earliest work on querying XML dates back to querying SGML
Recently at an SGML conference there was a “bakeoff” where vendors demonstrated SGML queries
This area is not new - what is new is the standard
I first became involved in this area in W3C in 1998
At that time, there was a different query language for each different application
Many of us came from database backgrounds - thought that this did not make sense
There was no consensus in the standards committee
In February 1998, Jonathan Robie and I and some others put together an XQL proposal
This proposal took some work that had been done in the XPath area and tried to make it richer
In August 1998, a submission was made to W3C for XMLQL
Many of the authors of that submission are on the XML Query Working Group today
About the same time, there were more and more efforts by people saying that we need a generic query language
W3C - when it is confronted with a new technical problem like this, rather than starting a new WG it often has a workshop to gather input and expose the original problem statement to a larger community
In November 1999, the W3C published the XPath recommendation at the same time as the XSLT recommendation
I wrote a summary paper of all the other submissions along with someone else
The URL of the summary paper is in the presentation
You can also see David Maier's paper
He gives 10 to 12 requirements of an XML query language
We checked our work with this paper
As a result of the workshop, there was a large amount of interest - database vendors came out in full force
This was a surprise to many people, because we went from saying in 1998 that we didn't need a query language to this level of interest only 1.5 years later
In July 1999, the WG was re-chartered as part of a re-charter of XML activity
The WG has been re-chartered twice by W3C
After the workshop, I wrote a charter for the Query WG
I’ve been spending 50 to 75% of my time with the XML Query WG
There are currently 30 W3C member companies in the WG
We have teleconferences about once a week
We publish working drafts every 3 months
We are working on a recommendation track to make XML Query into a W3C recommendation
The goal of the WG is to produce a data model for XML documents, a set of query operators for those documents and a query language based on those operators
In January 2000 we produced a requirements document
This is the first step in all WGs
You need to take the charter and then expand it into requirements statements that can be used to test the success of the product
In May 2000, the first version of the data model was produced
In May 2000, we were also confronted by XML Schema
When a WG thinks it has a functionally complete spec (by comparing it with the requirements document), that stage is called “last call”
The name “last call” is a warning to other people in the W3C and the public that this is the last time we will ask you for your comments on functional completeness
This is somewhat of a strange circumstance to find ourselves in
Those who have backgrounds in IT and OO will find it strange to have one WG working on a data model and another WG working on the operations on that data model
The XML Schema specification tells you whether or not an instance adheres to the specification
There are no operators involved
In May 2000, we tried to figure out whether the type system being defined by XML Schema was anything close to what we can build operators on
We have a very close relationship with the XML Schema WG
Since last summer we have been combining our face-to-face meetings
This set XML Query back at least 3 to 4 months in its overall schedule because of needing to examine XML Schema
On May 2, 2001, XML Schema became a recommendation
In August 2000, we revised the requirements document with a fair number of use cases
I believe very strongly in use cases - having these along with a set of sample queries is very useful
If you go to Microsoft web site where you can execute XQuery, you will find many of these use cases
In December 2000, a query algebra document (set of operators) was published
In Feb. 2001, we re-published the query requirements - we carved the use cases out into their own document because of how important they were
In June 2001, we released a complete new set of working drafts
As soon as you see this list you will realize that this is no small effort
When XML Query is done it will probably be one of the largest specifications that W3C has ever done
In printed form it is between 300 and 500 pages
XPath was published in 1999
18 months ago, you had the Query WG doing a data model, you had XPath doing a data model, and DOM has an implicit data model
In the W3C, it looked like every WG had its own data model
Now, XSL and Query WG have gotten together to create a single data model
This will be the basis for XML Query 1.0 and XPath 2.0
Some of the sample queries will look like XPath statements
In June 2001, we published the syntax for the language a second time - called XQuery
The name of the WG is the XQuery working group
Finding a name for a WG is tricky - we went out on the Web and found that only Software AG was using that name
Software AG agreed to let us use that name
Regarding databases of queries: If you store your results away, you may want to take an incoming query and query your database of queries to see if that query exists already
Having an XML representation of queries that you can run XQueries over is a very powerful use case
I’d like to touch on the data model requirements
XML 1.0 tells what is acceptable as input to an XML parser
But nowhere does it tell you what a parser should provide to the execution environment that is using that parser
W3C Infoset is the specification that gives that information
It says that if you have a conforming XML parser, it should provide the following information about the well-formedness or validity of the document
PSV - Post-Schema Validation - is the interface between XQuery and Schema
PSV takes the infoset that you get from parsing a document, along with additional information from the PSVI
It combines these and produces a dataset
Many vendors will give you an XML representation of what is in your database - they will have you run queries over these "XML views"
With this model, you can take the PSV and PSVI of that XML view and generate a data model that XQuery then runs over
Since you run XQuery over the data model (which doesn’t actually exist - it is an abstraction) we open up the possibility of using XQuery over the Web as long as people can provide their data in XML format
The Library of Congress gave us use cases saying these are the kinds of queries we want to do
This is the best possible way for us to understand what people need
The XQuery language is a completely functional language
It can be nested with full generality, unlike SQL
The input and output of XQuery are actually instances of the data model
This is extremely important because it allows you to process “virtual XML”
XQuery is based on predecessor query languages - OQL, SQL, XML-QL, XPath
You can also add XQL (was left off the list)
We also have XQueryX - an XML representation of an XQuery
We are continuing to ask for public feedback on what the form of the language should look like
Jonathan will now perform a demonstration
Mr. Robie asked audience about their XML experience and gave presentation of various XML Query operations
We will talk about how to create an XML structure, how to identify nodes in XML structure, and how to restructure data from one format to another format
For example, if you have a set of invoices you may want to generate "customers by geographic region" - this has a different format from the original document
To create structures, just type them in
That is an element (typed in an element)
Executed query and showed result: “This is an element”
Talked about white space and its significance
We can also look for structures within XML
Demonstrated the use of attributes
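The element-constructor steps above might be sketched as follows (XQuery syntax as of the June 2001 working draft; the element and attribute names are illustrative, not taken from the demo):

```xquery
(: Typing in an element constructs it; the query result is the element itself :)
<greeting>This is an element</greeting>

(: Attributes can be written directly in the constructor :)
<greeting lang="en">This is an element</greeting>
```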
Next we will talk about looking for things within XML structures
To do that, I need to know where the XML document is that I will be querying
I will take a bibliography as an example - a list of books written by various authors
We have multiple books by same author, multiple books by the same publisher
// - means “somewhere within the document”
/ - only finds things at top level
We are looking for authors in a database - we don’t want to see an author twice
To accomplish this, we use "distinct" to return only unique values
We can put in predicates, etc. because we have XPath available to us
But currently we only support the "abbreviated" form of XPath
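The path expressions described above might look like this (a sketch against a hypothetical "bib.xml"; note that the distinct() of the 2001 drafts became distinct-values() in the final recommendation):

```xquery
(: "//" finds authors anywhere in the document :)
document("bib.xml")//author

(: "/" only finds things at the top level :)
document("bib.xml")/bib

(: distinct() removes duplicate author values :)
distinct(document("bib.xml")//author)

(: A predicate in square brackets filters the nodes :)
document("bib.xml")//book[publisher = "Addison-Wesley"]
```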
To find a node within an XML doc, we start restructuring what we have here in a couple of different ways
We will start with a FLWR expression
We say “for each author”…
What kind of a result do we want?
First, we will write an element constructor to create a “books by author” element (/booksByAuthor)
What do you think this query will give?
We need to put curly braces around it to say “execute this thing”
We want to list the books written by each author
First, we take the variable and put it in an element that will separate the books written by this author from books written by other authors
Changed “booksByAuthor” tag to “author”
Then we create the “name” element
Now, we need books written by the author - there are a couple of ways to do that
We can use the “let” clause - set $b equal to the set of books written by one author
Put predicate at end ([]) - set author to $a
Now we can put the $b variable within { }'s - but you really only want the titles of the books
This is a fairly different structure from the original structure we had - authors were at the bottom, now they are at the top
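The restructuring walked through above might be sketched as a single FLWR expression (working-draft syntax; "bib.xml" and the element names are assumptions):

```xquery
(: Invert the bibliography: for each distinct author, collect the titles
   of the books that author wrote. :)
<booksByAuthor>
  {
    for $a in distinct(document("bib.xml")//author)
    let $b := document("bib.xml")//book[author = $a]
    return
      <author>
        <name>{ $a/text() }</name>
        { $b/title }
      </author>
  }
</booksByAuthor>
```

The let clause binds $b to the whole set of books by one author, so each author element in the result wraps all of that author's titles.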
Talked about car/wheels analogy - do you want to use the entire car and put it in a garage, or use its parts?
XQuery - its approach gives best of both worlds
There is a bug in whitespace in the demo - demonstrated it by using the string function to get rid of first and last element in another kind of restructure
Next we want to take a different kind of document data - we are going to take a form of very loosely structured data that comes from early work in the medical community
We have an operating procedure - this comes out of the HL7 demos
We want to get some information out of this document
The first point I’d like to make is that although many people like to talk about documents and data, you and I read documents to get information out of them
There is a second kind of data - data easily managed by our software systems
This data can be put into rows and columns
We may want to find the average temperature of a patient over a period of time
We may want to find out what instruments are being used for surgery
That is a very data-like kind of query being done on a document
Another thing I might want to know is about incisions
Showed text with various tags interspersed
Joe Carmel (US House of Representatives) asked if you can search for text within elements as well - we are only searching by tags now
Mr. Robie demonstrated this by updating a query to search for the word "electrocautery" in the "incision" element - this worked
In documents, the sequence is important - many times the sequence makes a difference in meaningfulness of data
To search for procedures, can look for sections whose title is “procedure”
Sections and titles are represented as “//section.title”
Query worked - returned a procedure
But we really want to set a condition
FLWR = For, Let, Where, Return
For - assigns a variable to each value returned by an expression
Can iterate over a set of procedures - i.e. for each procedure the value of “p” will be one of those nodes - get one return for each procedure
Let - just returns the variable, does not iterate over it
Where - allows you to put a condition on the for and let - if it is not satisfied the return will not be executed
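The four clauses described above fit together along these lines (a sketch in the 2001 working-draft syntax; the section and title element names follow the operative-report demo but are illustrative):

```xquery
(: FOR iterates: $p is bound to each matching section in turn :)
for $p in //section[title = "Procedure"]
(: LET binds without iterating: $t is the whole sequence at once :)
let $t := $p/title
(: WHERE gates the return clause :)
where contains(string($p), "electrocautery")
(: RETURN is evaluated once per iteration that passes the where :)
return $t
```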
Now will combine where with quantifiers - expresses a condition ranging over a collection
Demonstrated using $I for incision - for some incision satisfying a condition (use “satisfies” clause) - condition is that there is an anesthesia before an incision
Executed query
Then changed to anesthesia after incision
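A sketch of the quantified query demonstrated (the 2001 working drafts provided BEFORE and AFTER operators for document order; the exact syntax shown here is an assumption and differs from the final recommendation):

```xquery
(: For some incision, there is an anesthesia earlier in document order :)
for $p in //section[title = "Procedure"]
where some $i in $p//incision satisfies
      not(empty($p//anesthesia before $i))
return $p
```

Swapping "before" for "after" expresses the reversed condition shown at the end of the demo.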
Back to Paul Cotton
Mr. Cotton talked about current XQuery issues
In August, published the first version of functions and operators document - public says it is too big
Internationalization issues - text retrieval in English only will not be useful as an international format
Here are my and Jon's e-mail addresses
There is a feedback email list - this is where the Library of Congress sent use cases
There is also a public feedback e-mail list
Joe Carmel (US House of Representatives) asked about the timeframe for Candidate Recommendation
Mr. Cotton stated that this is both a technical question and political question
The technical part addresses “When will it be done?”
On the political side, schedules are member-confidential
The documents are published today with an issues list - you can look at the open issues list
We will publish a few more working drafts, then it will be last call
It’s hard to say when Candidate Recommendation will happen
I should point out CR in W3C is totally optional - it was instituted about 18 months ago to force workgroups to demonstrate interoperability before going to implementation stage
Jerome Yurow (DOE) asked about the difference between PSV and PSVI
Mr. Robie responded that PSV means Post-Schema Validation, while PSVI means Post-Schema Validation Infoset - i.e. the information you add to an infoset after the validation of a document.
Mr. Cotton added that this includes type information, elements and attributes, etc
Regarding optimization of XQuery: There is a well-published set of literature - at Stanford - about the optimization of XPath expressions - whether you should execute the expression from top-down or bottom-up
Nested for’s in XQuery are no different than joins
Full-text queries are very appropriate to use in XQuery
ISO adopted a standard on how to perform an XML query on top of an SQL database - IBM and Oracle are co-authors
IBM and Oracle would not commit to this if they had any doubts about the ability to optimize XQuery
Optimizing XQuery is currently a very hot topic
Ed Luczak (DOE) asked if XQuery will allow a user or agent to perform a query over a collection of XML documents instead of a single document
Mr. Robie responded “definitely”
10-minute break
Pat Case then gave a presentation on Library of Congress Use Cases and Recommendations for Text Operators and Functions. This presentation is available online in HTML format.
Everyone has wild expectations of what XML will do for them
The Library of Congress has another set of wild expectations - that XML will spawn one standard query language that will handle both structured text and full text
That it will let us query elements and their descendants
We would also like to see a complete robust set of text operations and functions
We would like consistent implementations by search engine and commercial database vendors
I don’t think that we are the only ones needing full text searches
We will end up with semi-structured data or, if we have money, full-structured data
Ms. Case then demonstrated use cases
We produced a number of bills in XML-tagged form and specified searches we would like to use
We specified the exact search operators: proximity operators (ordered/unordered), relevance operators, case, diacritics, etc.
Why do we need these text operators? Here are some examples
Proximity operator - looking for bills on elementary education in the Legislative Information System (LIS)
This is similar to the Thomas system, but we have advanced search pages with Boolean operators that are not available to the general public
We cannot search within titles, because it is full text - we are hoping to be able to do that with XQuery
Performed a search for “elementary education”
It will also allow intervening words, because you almost never see “elementary education” in a bill
For example, may see “A bill to improve elementary education”, or “A bill to improve elementary and secondary education”
Demonstrated use of “pre/3” operator - “elementary pre/3 education”
We also need thesaurus support
For instance, with thesaurus support, you could type in “congressman” and it would automatically search on “senators” and “representatives”
I sincerely believe that when you get more than 3 or 4 words in a tag in XML you need the full text support and proximity operators
Demonstrated truncation - “what did X say about Y”?
Used “robert pre/2 gates” example
Found string “Gate commented at length…”
We have a store of searches that are available to all reference librarians and those that serve CRS itself
This is a “search of last resort” - it is the kind of search you can build under a GUI for a novice user
We’ve asked the W3C working group to do some difficult things - if XQuery comes out without these tools we are stuck
We also had the goal to ask them for operators they have never seen before - for example, the Ignore operator
For example, you can search for estate tax but exclude (“not out”) real estate tax
This is a scary query for a librarian to execute because she will probably query on estate tax and then take the results and perform an additional query on them
But what if a <news-story> tag contains “To eliminate the estate tax” and “Subject to real estate tax” in 2 <text> tags?
A NOT operator would exclude this document from the results (i.e. it would be incorrect), but the Ignore operator will simply ignore the second <text> tag
Demonstrated case study: The Congressional Record from Full Text to XML
Demonstrated the search interface in LIS
You can search on any congress in the database
You can limit your search to a House or Senate session
You can search by the person (member or representative or senator)
We have automatic stemming (singular/plural), and also offer truncation
We have connectors, functions and operators
If you know the exact title of a bill, that is the best way to go - it will avoid bad hits
Otherwise, you can search on text in title
When I get the data in XML, I want control over this data and to let people do some fast, easy searches
There are some things I cannot do without XQuery - for example, a search across congresses
If we get enough data tagged we will have control like never before
Some people type in a bill number and get nonsense back because members often do not speak the bill number on the floor
But if we can tag within the debate tag, we can allow them to search for bill numbers
Committees and subcommittees - it is not easy to search on these right now
Word/Phrase box - demonstrated stemming, truncation
We would like thesaurus functions
We would like to let people type in FBI instead of Federal Bureau of Investigation
We would like to take advantage of the XQuery operators that are available now
To gain access to conference reports, you now have to use Search Tips
Regarding documents within the record - we would like people to get to them as quickly as possible
You cannot get to these directly now
There is also a print index to the congressional record
You can enter a specific date for which to browse the index
There is a directory of XML search engines on the Library of Congress web site
GoXMLSearch is one search engine on the list
What we don't have on the list yet are the big folks like IBM - I hope they might offer full-text search capability and join this effort
This is my wish list that I’ve delivered to the W3C working group
I think within government we have a very big need for full text
General Discussion
Owen – Mark raised an issue regarding the 4 focus areas called out in the charter and the need to begin addressing these.
Mark – The charter identifies specific actions. Current monthly meetings are more focused on marketing and information. Strongly encourage the committee to hold a separate working session, with representation from all government agencies, to work on the other areas.
Michael Jacobs – There is a real need for design guidelines and policy. DON is developing these.
Steve Vineski – EPA is also looking for policy and design guidelines. In the absence of federal-wide positions, EPA is being forced to develop their own.
Marion – Believes these should be addressed. Will DoN and EPA share their work?
Michael Jacobs – DON will.
Steve Vineski – EPA will.
Owen – Am certainly open to having the group work on policy recommendations, as Mark has suggested. However, folks will need to step forward to offer and contribute to such proposals for advancement through the CIO Council for consideration by OMB. Will alert the EIEITC tomorrow that interest has been expressed within the XML Working Group to advance some policy recommendations.
Dan Schneider – With agencies' GPEA plans due next Monday, it might be useful to keep an open mind, see what the outcome of all the agency GPEA updates is, and maybe take some guidance from that when it is published. We may get some revealing updates on what agencies might be putting their money into over the next year or two.
Recorded by Joe Chiusano,
October 17, 2001.
List of Attendees:
Last Name | First Name | Organization
Ambur | Owen | Interior-FWS
Bennett | Daniel | CitizenContact.com
Cutting | Dean | State
Dalecky | Selene | GPO
Dodd | John | CSC
Douglass | Mike | Citrix Systems
Finley | Jack | GSA
Hunt | Jim | GSA
Jacobs | Michael | DON CIO
Kanaan | Muhan | DynCorp
Kern | Matt | Pci
Knight | Dolores | DTIC
LaPlant | Lisa | GPO
Luczak | Ed | CSC
Morgan | Bill | GSA
Reeves | Joel | GPO
Rice | Jim | Vitria
Schmidt | Elizabeth | Software AG
Schneider | Dan | DOJ
Shin | Dongwook | Futureexpert
Sinisgalli | Mike | Vitria
Smith | Rick | MPG
Stanco | Tony | GW CPI
Thunga | Ronjeeth | Humanmarkup.org
Turnbull | Susan | GSA
Vineski | Steve | EPA
Williams | Kevin | Blue Oxide
Yee | Theresa | LMI