Federal CIO Council

XML Working Group

 

Wednesday, April 16, 2003 Meeting Minutes

 

GSA Headquarters

18th & F Streets, N.W., Room 5141

Washington DC 20405

 

Please send all comments or corrections to these minutes to Glenn Little at glittle@lmi.org.

 

Mr. Marion Royal:  I’d like to welcome everybody to the XML Working Group. We might be interrupted by a call from the front desk. We’re still having difficulty getting people up here. The minutes from the last meeting are on line. I hope you all had a chance to review them. If there are no objections, I motion that we approve them. [No objections.] I invite any announcements or updates to ongoing projects at this time. Owen [Ambur], do you have some announcements?

 

Mr. Ambur:  No, I don’t have any announcements at this point, except that I’d like to acknowledge and thank the good folks here at InduSoft in Hilton Head, South Carolina, for allowing me to use their facilities to dial and web into the meeting today.

 

[Editor’s note:  InduSoft’s Web site is at http://www.indusoft.com.]

 

Mr. Royal:  I’d like to share with this group, if you’re familiar with the three subcommittees under the [CIO Council] AIC [Architecture and Infrastructure Committee], that we’re part of the Emerging Technology Subcommittee http://www.cio.gov/documents/Leveraging_Technology_Final_Draft.pdf, and there’s Governance [Enterprise Architecture Governance Subcommittee] http://www.cio.gov/documents/Governance_Final_Draft.pdf, and the third is the Components Subcommittee http://www.cio.gov/documents/Components_Final_Draft.pdf. All three have something on their work plans to do with registry/repository, and we’ve been trying to keep their efforts in alignment.

 

Yesterday I met with the subgroup working on the registry/repository for the Components Subcommittee. They agreed that our requirements map to their needs, so they’re supporting our efforts, and I’m very pleased about that.

 

If there are no other announcements, we’ll go right into Betty Harvey’s presentation on Taxonomies. Betty?

 

Ms. Betty Harvey

Electronic Commerce Connection, Inc.

XML Taxonomies

 

Ms. Betty Harvey:  Owen asked me to give a presentation on the work the Smithsonian has been doing and their taxonomy. I thought I’d talk generically about taxonomies, because when we first decided to do a presentation, the schema for the Smithsonian hadn’t been completed. I want to talk with you about taxonomies and good practices in developing taxonomies.

 

Slide 2  [What is a Taxonomy?]:  First, what is a taxonomy? “Taxonomy” is the buzzword of the day. It’s important to find out what it is, especially when we get into XML, because the definition changes a little bit. 

 

Slide 3  [Taxonomic Vocabulary]:  The definition of taxonomy—since I just watched “My Big Fat Greek Wedding,” I like this. It has a Greek origin, with two parts:

·         Taxis, meaning “arrangement” or “division,” and

·         Nomos, meaning “law.”

A taxon is an entity or group. Flora or fauna would be a group of taxons, or taxa. This is used a lot in the scientific realm. I’ll talk a little about the difference between taxonomies and ontologies. They have differences that are delineated in the XML world.

 

Slide 4  [Taxonomies are Universal]:  Taxonomies are universal. Everyone uses them every day. Oops! I deleted a bullet. There should be a “Yahoo” bullet. Yahoo has developed a taxonomy so that when using the search function, you can find your way around. TV Guide has developed a taxonomy as well. TiVo is coming into everyone’s home; that’s another one about TV. We humans use taxonomies every day for grocery lists, to-do lists, contact lists, etc.

 

Slide 5  [Taxonomy vs. Classification vs. Ontology]:  This definition is from the ebXML Registry/Repository listserv, by Nicholas Berry:

 

·         A taxonomy is a hierarchical arrangement of topics that imposes topical structure on information in a specific body of knowledge.

 

·         Classification is the process of dividing objects or concepts into logically hierarchical classes, subclasses, and sub-subclasses based on the characteristics (attributes) they have in common and those that distinguish them. Note, this is the model upon which individual taxonomies may be built.

 

·         An ontology is a knowledge representation system, which presents the key concepts and relationships relevant to a body of knowledge.

 

If we look at it from the XML point of view, we see the same delineation. The XML schema defines the rules for the taxonomy. Classification is the information that goes into those rules. The ontology we can look at as the applications (KM [Knowledge Management] or Topic Maps), or the things you add on top of the classification and the taxonomy.

 

Mr. Royal:  To solidify my understanding, is it always hierarchical?

 

Ms. Harvey:  Yes, but there are always anomalies. I’ll talk a little about that.

 

Mr. John Kane:  Would you equate a thesaurus with an ontology?

 

Ms. Harvey:  I would equate it with an ontology.

 

Slide 6  [XML Taxonomy]:  This is a diagram I put together for today. It shows how we break down offices in the federal government. I’ll talk more about it as we go along.

 

Slide 7  [XML Classification]:  If we look at this based on the definition we gave before, this is the information that goes into the schema. We have the “National Archives,” “phone,” “position,” “name,” which is Ben Franklin (I put “Ph.D.,” though he didn’t have one), then “skills inventory.” This information can be processed into something that’s human-consumable. XML by itself is not really human-consumable.

 

Slide 8  [XML Ontology]:  An ontology is a knowledge management system based on the taxonomy/classification. A lot of people are looking at information that’s distributed, especially in the biology world. Universities are looking at ontologies. They’d like to have a distributed system and be able to search all the parts of the system to find the information they need.

 

Slide 9  [Developing an XML-based Taxonomy]:  [Introductory title slide]

 

Slide 10  [Resources Required for Developing an XML Taxonomy]:  So to develop a taxonomy, what do you need? You need human resources; you need the experts—the people who know what’s going on in that community. For instance, in biology we have full-time biologists and botanists. These people are the subject matter experts; we have information technologists—people who can look at data and put it in XML format—and we also need the users, to know how they’ll use it. We need to know who uses it, how they’ll search, and what they’ll search for.

 

Slide 11  [Understanding the Requirements]:  We need to understand the requirements—who uses the information, how they use it, what they use it for—and we need to know how granular the information needs to be. Then we need to know how the information will be searched and retrieved. In lots of the projects I’ve been involved in, the last two bullets [How the information will be searched; How the information will be retrieved] have been unknown.

 

Mr. Francis Hsu:  Lots of people don’t know how to track information, except with a catalogue online…

 

Ms. Harvey:  One of the things we talk about later is having metadata be the point of entry into the data. That’s an interesting concept as a point of entry for searching.

 

Mr. Hsu:  Do you conceptually think of metadata as related to taxonomy?

 

Ms. Harvey:  Yes. In some cases it’s the same; in some, it’s not.

 

Slide 12  [Information Analysis]:  You need to do an information analysis. You need to get it right, or the taxonomy won’t be useful. You need the relationships of the data and how it’s modeled. You can use bottom-up or top-down models.

 

Slide 13  [Develop a Common XML Architecture]:  You need a common XML architecture, for naming conventions (how you name things), abbreviations used in the taxonomy, elements versus attributes philosophy, how you incorporate standards, generated text within the data, etc. Depending upon the project, there’s a whole realm of issues.

 

Mr. Royal:  What do you mean by “generated text?”

 

Ms. Harvey:  From the BCA [Biologia Centrali-Americana] point of view, with the ranks for classification like Class, Phylum, and Kingdom included in the books, you had to decide whether to include them as information or let the labels be generated based on elements, attributes, or hierarchical context; that is, let the output system determine what’s generated. If you think of a book paradigm, the chapter label is generated by the tag “chapter,” and the word “Chapter” is not included in the data.
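
[Editor’s note:  The book paradigm Ms. Harvey describes can be sketched in a few lines of Python. This is only an illustration, not anything shown at the meeting: the sample document, the “title” element, and the function name are assumptions. The point is that the word “Chapter” is produced by the output step from the tag name and never appears in the data.]

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: note that the word "Chapter" is not stored anywhere.
SAMPLE = (
    "<book>"
    "<chapter><title>Taxonomies</title></chapter>"
    "<chapter><title>Ontologies</title></chapter>"
    "</book>"
)


def render_headings(xml_text):
    """Return one heading per <chapter>, generating the label from the tag name."""
    root = ET.fromstring(xml_text)
    headings = []
    for number, chapter in enumerate(root.iter("chapter"), start=1):
        label = chapter.tag.capitalize()  # "chapter" -> "Chapter" (generated text)
        title = chapter.findtext("title", default="")
        headings.append(f"{label} {number}: {title}")
    return headings


HEADINGS = render_headings(SAMPLE)
```

[Changing the label, for example to another language, would then be a change to the output system only; the data stays as it is.]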

 

Mr. Royal:  Does it include fields?

 

Ms. Harvey:  No. It depends on the project whether you calculate on the fly. I haven’t dealt with that.

 

Slide 14  [Consider Modularization]:  Next, you might want to consider modularization. This is the same DTD [Document Type Definition] we looked at before, but “bio” is in yellow. I wanted to show that the DTD or schema can be modularized so individual stakeholders can author the data where they have the expertise. So we can pull “bio” out of the schema and have it created by the person responsible for biographies. In the XML data, if you look at “Franklin” and at the bio, an XLink pulls it in from Ben Franklin’s personal Web space. That way, you can have a distributed system, and the people who are the stakeholders can provide the information. It provides dynamic information, distributed processing, and better security.

 

Mr. Williams:  Are you using XSLT?

 

Ms. Harvey:  Yes.

 

Mr. Royal:  Can you achieve modularity by using different schemas, for example a reusable library of schema elements or classes you can import?

 

Ms. Harvey:  ebXML is supposed to develop Core Components to be reusable, like “Name,” “Address,” etc.

 

Mr. Royal:  I’d like to mention two different things—Core Components is continuing at UN/CEFACT. It’s in final public comment. UBL has just this week announced they’ll serve as implementer of Core Components in UBL language. That work is continuing. I tend to use common components when referring to reusable elements, rather than Core Components, to get away from atomic data items.

 

Slide 16  [Good Taxonomy Practices]:  Use the vocabulary of the domain space—of the stakeholders—so they have a good understanding. An XML consultant might come into a project, and probably have a different vocabulary from the vocabulary of the stakeholders. It’s why you need the marrying of philosophies. You need to develop appropriate relationships between objects; develop reusable objects.

 

Mr. Hsu:  How do you differentiate between information and objects?

 

Ms. Harvey:  The hierarchical piece. For example, the bio is an information object. In XML, you have containers, and the content is the object at the lowest level. The information object is the markup together with the content.

 

Mr. Royal:  The information object could be the container, and could also be the leaf.

 

Ms. Harvey:  XML is a good format for taxonomies. Taxonomies are highly structured and are built on relationships, and XML is a structural format suited to building those relationships. XML also lends itself to being authored in different places, so it’s good for distributed information.

 

Slides 17-19 [A Very Good Format for Taxonomies; Taxonomy and Relationships; Taxonomy is Highly Structured Data]:  Taxonomies are highly structured, and usually hierarchical, except when there are anomalies in the data. I almost always found anomalies in the data. Relationships are usually consistent.

 

Mr. Hsu:  When you say anomalies, do you mean the way it’s applied in a particular industry, or in general?

 

Slides 20 and 21  [Taxonomies Contain Anomalies; Genealogy]:  Generally in nature, rules are usually broken. You have to look for how they’re broken. For example, I have an example of a genealogy here. The rules are broken. Every taxonomy has a rule to be broken. Here’s my example of a family tree—actually my aunt and uncle. If you think of a family tree, it should be a tree (hierarchical). In this case, we have a circle. [Ms. Harvey showed an example that caused the circle]. In nature, it’s an anomaly. Something broke down in the family tree. It’s interesting. Genealogy programs don’t handle it very well. Being from West Virginia, I’m sure I have more skeletons in my closet.

 

Slide 22  [New vs. Legacy Taxa]:  This is confusing to people. It’s basically like the BCA. What we’re doing with them is going back to legacy data. It’s over 100 years old. They had no control over how it was authored, so we have to make the model looser than usual. When you’re looking at a taxonomy, you can look at it with a strategy for moving forward, yet accommodating the legacy data. So you have to have the models described first. If you move forward with the same data model, you can tighten it a little; say, one “A element” is required, and either a “B element” or a “C element.” What happens is, data that conforms to the second, tighter model will still parse against the first.
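
[Editor’s note:  The relationship between the loose legacy model and the tighter forward model can be illustrated with a small Python sketch. The two rule functions are assumptions made for illustration: the minutes do not record the exact legacy rules or the required element order, so the legacy model is taken to allow any mix of A, B, and C, and the new model to require one A followed by one B or one C.]

```python
from itertools import product

ELEMENTS = ("A", "B", "C")


def valid_legacy(children):
    """Loose legacy model (assumed): any mix of A, B, and C, in any order."""
    return all(name in ELEMENTS for name in children)


def valid_new(children):
    """Tighter model (order assumed): exactly one A, then exactly one B or C."""
    return len(children) == 2 and children[0] == "A" and children[1] in ("B", "C")


# Every sequence of up to three child elements stands in for a candidate document.
CANDIDATES = [combo for n in range(4) for combo in product(ELEMENTS, repeat=n)]

NEW_DOCS = [doc for doc in CANDIDATES if valid_new(doc)]

# Anything accepted by the tighter model is also accepted by the looser one.
BACKWARD_COMPATIBLE = all(valid_legacy(doc) for doc in NEW_DOCS)
```

[The reverse does not hold: a legacy document such as a lone B element passes the loose model but fails the tight one, which is why the legacy data needs the looser model.]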

 

Slide 23  [Biologia Centrali-Americana]:  [Title slide]

 

Slide 24  [Biologia Centrali-Americana]:  Now I’ll talk about the Smithsonian work. The BCA was research work performed in Central America in the late 1800’s and early 1900’s. This slide shows some of the organizations that are involved in the project. The Smithsonian and the Natural History Museum of London have been the most active in the working group, and this project is funded by the National Science Foundation.

 

Slide 25  [BCA]:  The BCA has 63 volumes of data. It’s currently out of print. That’s the incentive for funding this project. The books are deteriorating. It’s the central repository for information on the study of Central America. It was compiled from the late 1800’s to the early 1900’s.

 

Slide 26  [BCA Goals]:  The first phase was the information analysis. The second phase was to create a global schema. Even though we’re working with BCA, the schema was supposed to cover all biological taxonomic literature. It was difficult because we only had the BCA to look at.

 

Mr. Ambur:  Is it intended to be established as a standard?

 

Ms. Harvey:  Yes—in fact, several museums want to use it on some of their taxonomic publications.

 

Mr. Royal:  What is the strategy for establishing the standard?

 

Ms. Harvey:  We put it out for public comment. I don’t know the strategy from there. There were lots of organizations involved up front. There’s also a taxonomy effort going on in the community for data that’s not literature.

 

Mr. Royal:  Is it still specific to museums?

 

Ms. Harvey:  No.

 

Ms. Lisa Weber:  What standards group are you looking toward? NIST?

 

Ms. Harvey:  No, it might be Global Biodiversity Information Facility (GBIF). We’ve been working with them on this. The next phase is to scan and rekey the data.

 

Mr. Kane:  Does that include OCR [Optical Character Recognition]?

 

Ms. Harvey:  The books are in such bad shape that the pages will be rekeyed. Either a PDF or TIF representation of pages will be created and sent to the converter and rekeyed. The unknown piece is what XML database they’ll use or how the XML will be stored within the Smithsonian. The options are a relational database, which is already available, or a native XML database. Storage of XML is always an interesting topic when you are working on an XML project. The other major issue is linking the BCA to other biological data sets.

 

Mr. Royal:  Let me just note that I received word last week that OpenOffice will be able to publish in PDF. It also publishes in XML, so it’s looking more and more attractive.

 

Ms. Harvey:  I don’t know whether you’re aware of what’s happening with Microsoft Word. Supposedly, Office 11 could use schemas. In the lower version, it won’t let you. You have to purchase the Enterprise version to use customized Schemas.

 

Mr. Royal:  Or Professional.

 

Slide 27  [A Few Challenges in Developing BCA Taxonomy]:  The issue is whether to do recursive or explicit structuring. We had to deal with page fidelity. There’s a lot of literature out there that references BCA by page number. We had to be able to link back to those pages. When dealing with legacy data, you always have those issues. You don’t want to lose that information. We had to integrate metadata. There are lots of issues on how to do it. And, we had the integration of TEI-Lite [Text Encoding Initiative-Lite]. I’ll talk more on how they wanted to do that.

 

Slide 28  [Global Schema]:  We tried to make a global schema for taxonomic literature. We only had access to BCA, and we relied heavily on the museum experts. The biologists know the information, and they were the driving force. The model had to be loose enough to accommodate any type of data we found, yet structured enough to support and enhance the scientific value of the information. The BCA will be the first test of this schema. We had to accommodate both scientific and library services requirements—both of those domains.

 

Slide 29  [Biology Hierarchical Classifications]:  Hierarchical classifications. These were the hierarchical components we had to deal with, starting at Kingdom and going down to Subvariety.

 

Slide 30   [Ordered vs. Recursive Model]:  We had to deal with an ordered versus recursive model. We had three options:

 

·        Explicitly name the hierarchy, such as Kingdom to Subvariety

·        Explicit Hierarchical Model Using Number Algorithm

·        Have a recursive model.

We decided on a recursive model, with the “TaxonTreatment” element being the main recursive hierarchical structure. The ranks could be in any order; for example, the BCA went “Class, Family, Subfamily, Genus.” There’s really no other way to do that than the recursive model, where the attribute describes the rank. In some cases, we didn’t put rank in. The biologist knows it’s a genus, but someone picking up the book for the first time wouldn’t know. So some of the information is implicit, but most of the information is explicit. Explicit is the default. We know that’s in the data.
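
[Editor’s note:  A minimal Python sketch of the recursive option follows. Only the element name “TaxonTreatment” comes from the presentation; the “Rank” and “Name” attribute names and the sample taxa are assumptions for illustration and are not taken from the actual schema or from the BCA. One element nests inside itself to any depth, and an attribute, when present, says which rank a given level represents.]

```python
import xml.etree.ElementTree as ET

# Illustrative data only: "Rank" and "Name" are assumed attribute names.
# The innermost treatment has no Rank, mirroring ranks left implicit in the book.
SAMPLE = """
<TaxonTreatment Rank="Class" Name="Insecta">
  <TaxonTreatment Rank="Family" Name="Carabidae">
    <TaxonTreatment Rank="Subfamily" Name="Carabinae">
      <TaxonTreatment Name="Calosoma"/>
    </TaxonTreatment>
  </TaxonTreatment>
</TaxonTreatment>
"""


def walk(node, depth=0):
    """Yield (depth, rank, name) for a treatment and all nested treatments."""
    rank = node.get("Rank", "(rank implicit)")
    yield depth, rank, node.get("Name")
    for child in node.findall("TaxonTreatment"):
        yield from walk(child, depth + 1)


ROWS = list(walk(ET.fromstring(SAMPLE)))
```

[Because the hierarchy is carried by nesting and not by element names, the same structure accepts “Class, Family, Subfamily, Genus” or any other sequence of ranks without a schema change.]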

 

Mr. Hsu:  When you say structured, an inexperienced reader wouldn’t know there’s a structure.

 

Ms. Harvey:  Right. When I looked at the data, anything above genus was explicit, and anything below it was not. By the structure of the book, the biologists know.

 

Mr. Williams:  Is the point to capture the information of the book, or to add implied information?

 

Ms. Harvey:  In the first phase, everything in the book. Anything they can glean from the book will be better. As time goes on…there’s been a lot of literature about the data. In the data itself, there’s little information about who does the collections, but historically people have collected and cataloged this type of information. It might be one of the things that gets married into the data using distributed data models. For the conversion, people who have a leaning toward biology will be involved with the conversion. Also, the museum biologists will be reviewing all converted information.

 

Mr. Hsu:  The way you structure reflects an understanding of the development at this time, so your structure is for now, and if the structure changes, how do you reflect it? It wouldn’t reflect how Darwin saw it.

 

Ms. Harvey:  I’m assuming it does reflect structure from a previous time.

 

Mr. Royal:  It’s philosophical—still a snapshot of our understanding. Do you have a way to roll back and look at earlier versions?

 

Ms. Harvey:  I don’t know whether we’ve taken that into account. I assume there are some issues there. That’s one issue that was brought up. Yes, I can show you the structure. There’s a way of identifying previous names and I can show you the new names.

 

Unidentified participant:  Do you have a change control mechanism?

 

Ms. Harvey:  Yes.

 

Slide 31  [Page Fidelity]:  There is a requirement that end users always be able to retrieve the original page. We handled it globally. There’s an element we call “PageBreak.” Inside that we have attributes for the page number, a link to the metadata, and a link to the original page, and we also have other attributes for things like the page header. We don’t know yet whether we want to capture everything on the page, so we capture that as well.
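
[Editor’s note:  The page fidelity mechanism can be sketched in Python as follows. Only the “PageBreak” element name comes from the presentation; the attribute names “PageNumber,” “PageImage,” and “Header,” the “Para” element, and the file paths are hypothetical. The sketch walks the document in order and files each paragraph under the most recent page break, so a citation by page number can be resolved to both the text and the scanned page.]

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: attribute names, Para element, and paths are assumptions.
SAMPLE = """
<TaxonTreatment>
  <PageBreak PageNumber="12" PageImage="pages/0012.tif" Header="CALOSOMA"/>
  <Para>First paragraph on page 12.</Para>
  <PageBreak PageNumber="13" PageImage="pages/0013.tif" Header="CALOSOMA"/>
  <Para>First paragraph on page 13.</Para>
</TaxonTreatment>
"""


def paragraphs_by_page(xml_text):
    """Group paragraph text under the page break that precedes it."""
    pages = {}
    current = None
    # iter() visits elements in document order, so each Para follows its PageBreak.
    for element in ET.fromstring(xml_text).iter():
        if element.tag == "PageBreak":
            current = element.get("PageNumber")
            pages[current] = {"image": element.get("PageImage"), "paragraphs": []}
        elif element.tag == "Para" and current is not None:
            pages[current]["paragraphs"].append(element.text)
    return pages


PAGES = paragraphs_by_page(SAMPLE)
```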

 

Mr. Royal:  I’m thinking of the heavy use of attributes. Why does it make sense over elements?

 

Ms. Harvey:  The physical page header does not add value to the electronic format. It does to the printed page, but not the data in electronic form. If you’re searching an electronic page, you don’t care if there’s a page header. The page header in the BCA is the taxonomic element you’re in, so if you’re in a genus, that’s the page header. In the XML version, the information is contained in a genus component.

 

Mr. Royal:  You’re using the attribute to control the structure?

 

Ms. Harvey:  In the recursive model?

 

Mr. Royal:  Yes.

 

Ms. Harvey:  The rationale was, we couldn’t control the explicit structure because of historically authored data. The taxonomy can be included, so you’re really getting structure, but using attributes to say what the structure represents. The other option was a generic taxonomy, but that’s confusing too.

 

Mr. Royal:  It’s time to ask another question. You’re talking about XML. Are you really using SGML?

 

Ms. Harvey:  No. I delivered a DTD and schema. They wanted a schema, but I delivered a DTD for cases where the software doesn’t support the schema.

 

Mr. Royal:  This is really the kind of document that SGML is all about.

 

Ms. Harvey:  There’s no software for SGML anymore.

 

Mr. Tony Byrne:  Why isn’t XML all about the document too?

 

Mr. Royal:  A lot has been stripped out of SGML, and a lot is slowly being put back in. It reminds me of other lightweight protocols that were found to be too light.

 

Ms. Harvey:  Schemas are harder to use than DTDs.

 

Mr. Royal:  They’re more flexible.

 

Ms. Harvey:  Yes and no. Going back to SGML, SGML DTDs were as flexible as schemas except for data types. For this taxonomic publication, the DTD is 800 lines. The schema is 12,000. We’re talking about an order of magnitude more complexity, which may or may not prove to be useful.

 

Mr. Hsu:  Page fidelity is only valid for a physical form.

 

Ms. Harvey:  Yes.

 

Mr. Hsu:  So the notion of page fidelity is for historical purposes, and moving forward, you’re not concerned with it?

 

Ms. Harvey:  Yes. I don’t see the need for page fidelity going away in the near future. For instance, the House of Representatives still refers to hard copy. Within legislation, the legal document is the printed page. It will take an “Act of Congress” to make the electronic version the legal document.

 

Slide 32  [Integration of Metadata]:  This is the real sticking point. There’s very little metadata in BCA, but a lot about BCA is available externally. There has been a lot of discussion about whether the metadata should be internal or external to the XML version. You’ll have this discussion in every XML project: “Should it be internal or external?” The metadata standard that will be used for BCA hasn’t been solidified yet. We looked at a lot of standards. There are a lot of metadata standards out there, and it’s difficult to determine which to select. We decided to have it external, although there’s still some inside the data itself; BCA has something called a Fascicle Footer. The Smithsonian Library decided to have the metadata be external so that they can link to other metadata that’s been established elsewhere. We did it externally because there are a lot of other libraries that have information based on BCA.

 

Slide 33  [Integration of TEI]:  TEI is an XML standard for developing literature. The Smithsonian wanted to use it for front and back matter of the volumes because it’s a light encoding system, it’s universal, and also they want to use the header information for incorporating some of the metadata. So we broke up the model so the internal literature is the taxonomic publication schema and TEI is used for the front and back matter.

 

Slide 34  [Integration of TEI Solution]:  We have a taxonomic publication here. We also have a taxonomic volume, or we go right into a publication, because not all the taxonomic literature is in volumes. BCA is in volumes. In the front matter, we have TEI Lite, then here is the TaxonomicTreatment with one or more TaxonTreatments. These are the components that are inside it. If Owen is on the line, you’ll be glad to know we have a VernacularComponent name for the common species name.

 

Slide 35  [Relationship/Linking Attributes]:  The relationship/linking attributes: it was unknown at the time how it was going to be stored—whether in an Oracle or XML database. The concern was, if we used a relational database it would “chunk” the data and the data could get disconnected, so we developed three attributes to be used globally. We can identify a ParentNode, a SiblingNodeHigher above, and a SiblingNodeNext below, so if we use Oracle or SQL [Structured Query Language], or another relational database, we can always find the correct information order.
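
[Editor’s note:  The three linking attributes can be illustrated with a Python sketch that rebuilds document order from “chunked” rows. The attribute names come from the presentation; the row layout, the “id” field, and the function name are assumptions, and no particular database is implied. Each row knows its parent, the sibling above it, and the sibling below it, which is enough to recover the original order however the rows come back from storage.]

```python
# Rows as they might come back from a relational table, in arbitrary order.
# The "id" field and the dictionary layout are assumptions for illustration.
ROWS = [
    {"id": "t3", "ParentNode": "t1", "SiblingNodeHigher": "t2", "SiblingNodeNext": None},
    {"id": "t4", "ParentNode": "t2", "SiblingNodeHigher": None, "SiblingNodeNext": None},
    {"id": "t1", "ParentNode": None, "SiblingNodeHigher": None, "SiblingNodeNext": None},
    {"id": "t2", "ParentNode": "t1", "SiblingNodeHigher": None, "SiblingNodeNext": "t3"},
]


def document_order(rows):
    """Return node ids in original document order (parents before children)."""
    by_id = {row["id"]: row for row in rows}
    # A node with no higher sibling is the first child of its parent.
    first_child = {
        row["ParentNode"]: row["id"]
        for row in rows
        if row["SiblingNodeHigher"] is None
    }
    ordered = []

    def visit(node_id):
        # Follow the sibling chain, descending into each node's children first.
        while node_id is not None:
            ordered.append(node_id)
            visit(first_child.get(node_id))
            node_id = by_id[node_id]["SiblingNodeNext"]

    visit(first_child.get(None))  # the root has no parent
    return ordered


ORDERED = document_order(ROWS)
```

[In this sketch SiblingNodeHigher is used only to find each first child; keeping both sibling links also lets the chain be walked in either direction.]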

 

Mr. Hsu:  When you say “we,” you don’t mean users; do you mean architects?

 

Ms. Harvey:  During the population of the database.

 

Mr. Williams:  Is SiblingNode as important?

 

Ms. Harvey:  It is important, because we want it exactly as it’s seen on the page.

 

Slide 36  [BCA Example]:  I’ll show you a quick example. This is taxonomic data. It’s not BCA, but Anna Weitzman was a key biologist who helped on this, and this is data she created. [Ms. Harvey clicked on the “Styled TaxonomicPublication Example” link, which linked to the following URL: http://www.eccnet.com:8080/cocoon/Freziera.xml.] Those of you in DoD who work on integrated electronic technical manuals, where I got my start, will recognize this: they have a key, or decision tree, so what you do is look at a specimen and go down the decision tree: “Does it look like this, or that?” Everything is linked. [Ms. Harvey showed several descriptions of items.] This is the styled data.

 

If we go back to the raw data, you’ll see that this is highly tagged. [Ms. Harvey navigated to the second link on Slide 36, titled “Raw TaxonomicPublication Data,” which linked to the following URL: http://www.eccnet.com/sil/examples/Freziera.xml.] There are a lot of synonyms, and very little verbiage. A lot of this is only known to the biologists.

 

Mr. Hsu:  What percentage is tagged data, and what percentage is content?

 

Ms. Harvey:  My guess is that 60 percent of it is tags.

 

Mr. Hsu:  So it’s very laborious.

 

Ms. Harvey:  Very laborious. That was where the decision was. We had librarians interested in metadata because they originally wanted TEI Lite, but biologists want it as robust as possible. It’s an issue because it’s more costly to convert into this format.

 

Mr. Bruce Cox:  It’ll pay back 100-fold in the centuries to come.

 

Ms. Harvey:  Yes—that’s always the issue with this kind of thing. There’s lots of leeway in the discussion. There are a lot of content-oriented discussion elements, like “Is this a geographic discussion, or a species discussion?” Only someone who knows the content intimately would know. Then we have a general discussion element for things where you can’t decide.

 

I think that’s it.

 

End presentation.

 

Mr. Royal:  Are there any other questions for Betty while Tony [Byrne] switches over?

 

Mr. Ambur:  If you look in the agenda I prepared for this meeting, I listed some taxonomies we might want to consider at future meetings. For example, one of the overriding objectives of the administration’s eGov initiative is to make applications citizen-centered. I’m not sure whether anyone has thought through what that means in terms of classification and XML schemas. I’d be interested if Tony has anything to say regarding FirstGov. I don’t know whether either Betty or Tony has comments in that regard. We need to figure out what those terms mean with respect to the way we implement our systems. Another highly relevant taxonomy is the FEA [Federal Enterprise Architecture] itself, particularly the Business Data Reference Model, the Data and Information Reference Model, and the Service Component Reference Model.

 

 

Mr. Tony Byrne

CMS Watch

OCSC / FirstGov Web CMS Project

 

Mr. Byrne:  Owen, this is what I’ll talk about in the next ten minutes—trying to introduce content management and implement it on FirstGov, then what that means for a citizen-centered taxonomy. This is a brief overview of a long-time-in-germination project. It’s now flowering. One purpose is to introduce you to it, and secondly to come back to the notion of a citizen-centered taxonomy. There’s a project in the Office of Citizen Services and Communication at GSA, which is the home of FirstGov. They’ve developed a Web content management system, in which other agencies can participate. Whenever you do this, immediately the issue of taxonomies and classification comes to the fore. I’ll talk about what we’ve done and what’s already happened at FirstGov. We’ve been working with a government contractor called InfoZen in this effort.

 

Slide 2  [Content Management System?]:  The purpose of a content management system [CMS] is to take content and add value by putting it through an approval process and applying business rules—who publishes, how long we archive, when we update—all the things you want to substantiate in a content management process. The system does that and outputs it in all kinds of electronic formats. This is a rough overview of what a content management system is supposed to do. The notion is, you get a higher quality, more auditable process, and you should be able to publish more content, faster, with the same or fewer resources.

 

Slide 3  [About the OCSC Shared CMS Project]:  Into the notion of “more, better, faster” with our content management system. The shared CMS project was initiated by FirstGov as part of Quicksilver. They went to OMB [Office of Management and Budget], and said, “We need content management.” They said, “Since you’re FirstGov, produce it such that other agencies can participate, and achieve economies of scale.” Some of you may know the software selection has been made, and FirstGov and another site are the first pilots. We’re getting into the requirements phase.

 

That’s the physical infrastructure. There’s also a knowledge structure. On average, our estimate shows that the typical agency volume of information going to the public is growing 60 percent a year. The ultimate goal is to save money, and help the citizen get fresher, better content. The goal is to accelerate and improve content management for agencies that it’s already in, as well as assisting others going off on their own.

 

Slide 4  [The Government Wide Content Puzzle]:  Citizens come to federal websites with particular interests. It might be a tax question, permitting, loans, retirement, jobs, passports—lots of things. When you have a tax question, you generally go to the IRS, but very often, for instance, if you want to export overseas, you don’t know where to go.

 

There are any number of agencies with cross-cutting interests, and they’re publishing content about these issues. A number of good things happened here in the last year. Many agencies have done an intentions analysis, and are reorganizing websites in a topical way. For example, Export.gov is aggregating links from 13 other agencies involved in exporting, and putting in a pipeline so I can get access to this content. Doing this across the whole government is obviously a huge classification challenge. That’s where I come to FirstGov. It’s kind of what FirstGov does. It was tasked two years ago to be the first place to come for federal information.

 

If you talk about the four groups who should be participating in the content management process, they said, “Here’s how we’re going to reorganize the links.” Then reality hit, and people said, “No, I want these different things.” They’ve gone through three rounds of testing, and they reorganize every time.

 

They’ve also been put under the Federal Citizen Information Center. There’s a lot of data in from the call center about what citizens call about that’s also incorporated. They also have feedback forms, and they’ve gotten a lot of feedback on what citizens are interested in. They’re looking actively at data on how people are using the site and changing the organization of the content accordingly.

 

Slide 5  [FirstGov: A Classification Scheme]:  There are the core breakdowns. One of the interesting things about this taxonomy is, they bubble stuff to the top based on perceived citizen interest.

 

Mr. Hsu:  How often does it take place?

 

Mr. Byrne:  Right now it’s not automated. Over time, it may be done on an ongoing basis. Another part is, sometimes the folks at 1600 Pennsylvania Avenue suggest something important as well. Lots of different facets cut across the classification scheme. What you have here is a citizen-centered taxonomy of publicly available federal content. I think it’s interesting, and potentially very usable.
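
[Editor’s note:  The following sketch is not from the meeting. It illustrates one way that topics could be "bubbled to the top" by combining interest signals such as call center volume and site usage. The function name, the topic names, and all counts are hypothetical, and nothing here describes how FirstGov actually ranks its categories.]

```python
from collections import Counter


def rank_topics(*signal_counts):
    """Combine per-topic counts from several sources and rank topics by total interest."""
    total = Counter()
    for counts in signal_counts:
        # Counter.update adds counts together instead of overwriting them.
        total.update(counts)
    # Highest total first; ties are broken alphabetically so the order is stable.
    return sorted(total, key=lambda topic: (-total[topic], topic))


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    call_center = {"passports": 40, "taxes": 90}
    site_usage = {"passports": 70, "jobs": 60}
    print(rank_topics(call_center, site_usage))
```

Because the ranking is recomputed from raw counts, it could be rerun whenever new call center or usage data arrives, which is one possible route from the manual process Mr. Byrne describes to an ongoing one.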

 

Slide 6  [Direct Content Updates Via CMS]:  Two things are going to happen in content management: first, the taxonomy will be codified, so we’ll have a schematic to look at and publish. The second is that, first on an experimental basis, then globally, content owners from agencies themselves will be able to manage, to some extent, their own content. They’re going to have to classify that content. There’ll be an agreed upon classification rule. This will spur healthy debate about the suitability of that content. I suspect there’ll be some pull and tug on how it’s used and classified, between FirstGov and the owners.

 

Mr. Royal:  There seems to be one other top-level not shown at FirstGov; that’s like “Visitor.” An international user might have a whole different set of things they’d look for. Has there been any talk on that?

 

Mr. Byrne:  I don’t know. That’s a whole different area.

 

Mr. Ambur:  There is also the thesaurus function with reference to different terminology used in regions of the U.S.  That is, there is the need for localization as well as internationalization.

 

Mr. Royal:  One of the reasons I mention it is, I spend a fair amount of time looking at what states are doing. They’re virtually all tuned in to what their visitors are doing.

 

Mr. Byrne:  I don’t know where that falls into this. It may be that that’s one of the many issues on how this is going to evolve.

 

Slide 7  [Citizen-Centered Taxonomy?]:  A taxonomy like this is really a living, breathing thing. I know other efforts are involved, like the Data Reference Model, but that’s slated for 2004. I think what you have at FirstGov is a good first cut. It can be immediately useful in other applications, where a citizen’s focus is critical.

 

I’ll close it with the quote from Joseph Busch. He worked with NASA on their taxonomy. He’ll be the person working on FirstGov. I’m hoping he can come in in a few months and brief the group. One of his lines I like is, “A taxonomy does not have to be perfect, just good enough.” I think the one at FirstGov is a good enough start. Now with the mandate to publish information across silos, it’ll need to evolve over time as other agencies participate. I’m excited about this. If you’re interested in CMS itself, the person to contact is Dana Hallman [dana.hallman@gsa.gov]. She couldn’t make it here today because she’s briefing the Governmentwide Portals group on the new announcement that the software has been selected. Feel free to contact her directly if your agency is interested in participating in Web content management.

 

Ms. Weber:  What software have they chosen?

 

Mr. Byrne:  Vignette’s.

 

Mr. David Heiser:  How does FirstGov plan to deal with language issues? I’ve gotten a requirement from my agency where we’re required to output more material in Spanish and possibly other languages.

 

Ms. Harvey:  From the Smithsonian perspective, the only language is Latin. I’ve done work with the Canadian government; they must put it in French and English. There are some XML taxonomies in French and English. XML lends itself to doing that. If you look at manufacturing—I know for Saab in Sweden, their car manuals have to be in six to eight languages. There are ways to do it in XML.

 

Mr. Byrne:  They’re doing a visual and organizational overhaul as we speak. Once they’re done and the content management is in place, they’re looking at some degree of implementing Spanish. They’re discussing it with their Canadian counterparts. There are a number of sharing exercises with Ottawa going on. One of the issues on the agenda is, we’ll be learning how to deal with language, with schema, workflow, and other issues.

 

Mr. Hsu:  An important point is to separate it into two groupings: language and functionality. The functionality should be consistent; you just have to attach it to the language. If it’s the same functionality, you should be able to point to it.
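
[Editor’s note:  The following sketch is not from the meeting. It illustrates the point made by Ms. Harvey and Mr. Hsu that one item can keep a single identifier while carrying labels in several languages, using the standard xml:lang attribute. The element names (taxonomy, term, label) and the sample terms are hypothetical and are not taken from any Canadian or FirstGov schema.]

```python
import xml.etree.ElementTree as ET

# xml:lang belongs to the reserved XML namespace; ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical bilingual taxonomy fragment (element names are assumptions).
SAMPLE = """
<taxonomy>
  <term id="passports">
    <label xml:lang="en">Passports</label>
    <label xml:lang="fr">Passeports</label>
  </term>
  <term id="taxes">
    <label xml:lang="en">Taxes</label>
    <label xml:lang="fr">Impôts</label>
  </term>
</taxonomy>
"""


def labels_for(xml_text, lang):
    """Return {term id: label} for the requested language code."""
    root = ET.fromstring(xml_text)
    result = {}
    for term in root.findall("term"):
        for label in term.findall("label"):
            if label.get(XML_LANG) == lang:
                result[term.get("id")] = label.text
    return result


if __name__ == "__main__":
    print(labels_for(SAMPLE, "en"))
    print(labels_for(SAMPLE, "fr"))
```

The term id plays the role of the consistent "functionality" and the labels are the language attached to it, so adding Spanish would mean adding labels rather than restructuring the taxonomy.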

 

Mr. Alan Kotok:  To add to this discussion, there is a bureau in the State Department for public diplomacy, headed by an undersecretary, and their interest would be beyond tourism. It’s a very good resource for finding out what federal government operations are about and making them usable for visitors (in this case Web visitors). Would you think that would be important?

 

Other participant:  Are there any federal guidelines on what federal agencies are supposed to do in this respect? If you look across agencies, they’re very inconsistent.

 

Mr. Heiser:  There’s an Executive Order signed by Clinton, and Bush followed through with it. It requires agencies to look at providing services in other languages. The Department of Justice has created regulations. My agency has been funded to try to provide a solution.

 

Mr. Hsu:  What agency?

 

Mr. Heiser:  IRS.

 

Mr. Ambur:  I want to let you know that some of the people far from the microphone are cutting off there. I’d like to pick up on Tony’s comment on the relationship between the FirstGov classification system and the FEA. If the FirstGov classification scheme is not the first cut for the Data Reference Model and Service Reference Model for citizen centricity, then it would be good to know what will be used for that purpose. It would seem logical to reuse the FirstGov classification scheme.

 

Mr. Royal:  We need to establish connectivity between the people working on the Data Reference Model and the folks at FirstGov.

 

Mr. Ambur:  The business model is already there.  It would be good to map the linkages between it and the FirstGov taxonomy.

 

Unidentified participant:  Is anyone at FirstGov working on a citizen centric taxonomy with their search engine and their new content management system?

 

Mr. Byrne:  One of FirstGov’s offerings is the FAST search engine as a hosted solution. With any engine, the quality of your metadata is a factor in the quality of the search results. One of the goals of this is to improve metadata standards across the federal space and use them to tune the search engine efforts. One of the things about setting and enforcing metadata standards is, it’s easier when you have a content management system. I guess our hope is to get together to create a set of metadata standards and offer people access to Web content management tools to help implement those standards.
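
[Editor’s note:  The following sketch is not from the meeting. It illustrates why metadata standards are easier to enforce inside a content management system: a required-field check can run before a page is published. The field names are hypothetical and do not represent any agreed federal metadata standard.]

```python
# Hypothetical required fields; a real standard would be agreed among the agencies.
REQUIRED_FIELDS = ("title", "description", "topic", "agency")


def missing_metadata(page):
    """List the required fields that are absent or blank for one page's metadata."""
    return [name for name in REQUIRED_FIELDS if not page.get(name, "").strip()]


if __name__ == "__main__":
    draft = {"title": "Export basics", "agency": "Commerce", "topic": " "}
    print(missing_metadata(draft))
```

A content management workflow could refuse to publish a page until this list is empty, which is one way to enforce the standard at the point of authoring.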

 

End presentation.

 

 

Mr. Royal:  Now I’d like to invite Theresa [Yee] to tell us about developing the X12 standards.

 

Ms. Theresa Yee

LMI

Developing the X12 XML Standard

 

Ms. Yee:  Thank you very much. This endeavor started about two years ago, all occurring around developing XML. It seemed that ebXML had not gotten there, so we at X12 tried to pick up the banner and talk about merging the efforts.

 

Slide 2  [X12 XML Invoice]:  I’m going to give you a high-level view, because even though we have the architecture, we don’t have the design rules finalized. It’s a back-and-forth effort. I’ll walk you through what we’ve accomplished and discuss the design rules that we used in creating the X12 XML invoice using the CICA architecture.

 

Slide 3  [Agenda]:  This is where we’re going today, and I’ll talk about the bullets. Are there any EDI folks in the room? If you do know some aspects of EDI, I’ll clarify where we are on the standard. We’re not copying EDI. We’re developing an entirely different standard in XML. Then I’ll talk specifically about XML and how we’re applying the standard to it.

 

Slide 4  [Players]:  We’re under the American National Standards Institute X12 group. I’ll talk about how we represent XML as we complete the standard. I’m chairing Task Group 1, Work Group 5, under the X12 Finance Subcommittee.

 

Slide 5  [Players]:  These are the different people as well. As you can see, all different areas of industry and government are working together. It’s a nice cross-mix of people.  This is the Work Group working on the invoice portion of the development. In terms of the overall architecture, it’s the C and J subcommittees.

 

Mr. Hsu:  Is LMI government or private?

 

Ms. Yee:  Non-profit. We work only with other non-profits, such as the federal government.

 

Ms. Weber:  How many subcommittees are there?

 

Ms. Yee:  The creators of the architecture came from the Communications/Control or C Subcommittee. The J Subcommittee is responsible for technical assessment. “Finance” and “Transportation” Subcommittees are developing XML transactions. Those are the key players. Eventually all the X12 subcommittees will be involved in XML.

 

Mr. Kotok:  Essentially the subcommittees are defined by industry utilization groups. There’s a Government Subcommittee.

 

Ms. Yee:  Teresa Sorrenti is Chair of the Government Subcommittee. All X12 Subcommittees represent functional areas.

 

Slide 6  [Objectives]:  What were some of the main objectives? The first was to make the standard we define easy to use. We found in the EDI standard that the greatest benefit went to large organizations because of the difficulty in implementation and cost. We wanted to avoid that, and make it easy on the small “Mom and Pops,” and all organizations across the board.

 

The second thing was, we wanted to move toward alignment with the UN/CEFACT ebXML standard [http://www.ebxml.org/]. We’re looking at the Core Component Types in ebXML. They should be equal to our Primitive Layer. I’ll go through all of our architecture with you.

 

Slide 7  [Approach]:  Our approach is based on the Reference Model for XML Design-TR1. You can find it at this URL: http://www.x12.org/x12org/index.cfm. Also—the draft design rules are out. They’re still in draft form and changing, so we continue to refine them as we develop the standard. We’re taking an overall look at this. If you think of the different types of documents—because we can all envision what a document is—we’re defining the business process, then from the business process creating the type of document we need, then from there filling in all the details.

 

That’s from the invoice side of the house. From the architecture side of the house, it’s a very modular process. Try to think of them as Lego blocks that allow the user to plug and play with various parts of the architecture and reuse what is needed. That’s what we’re doing with the standard. What we’re not doing is replicating EDI. We’re not using every single code in EDI. For people on the telephone, now we’re moving to Slide 8.

 

Slide 8  [CICA]:  The architecture is called CICA for short. It stands for “Context Inspired Component Architecture.” You can see each of the eight layers here. I’ll go into them with you. As we take on each of these, they’re stored in the CICA database. This is key. Then anyone can go in, access the document that fits their business need. When using the X12 XML document, a schema will automatically be created.

 

Slide 9  [Structure]:  From a graphical perspective, it’s easier to explain the pieces in a picture. Here’s the overall generic invoice template. From here a document, or type of invoice is created. The next piece to the architecture is a Slot, and the next is a Module. Within a Module, you have Blocks. When you group Blocks, it creates an Assembly. There are two layers under that—a Component, and a Primitive. The primitive value is equal to the ebXML Core Component Type or CCT.

 

Slide 10  [Structure: CICA Architecture Layers]:  The structure layers are on the left. I won’t read the definitions. For the example, I’m going to use my invoice. You have a template that’s an XML invoice. The document is a specific XML invoice. In this world, you have many types of invoices. For instance, if you sell pencils, you send an invoice; that’s one type. If you build a large jet fighter, you’re not going to build the entire fighter and then submit an invoice. You contractually agree that when you build this part, you bill; then you build the next part and you bill. It’s called “progress payment.” It could go on and on.

 

From the document level, we specify the type of invoice we create. For every invoice, we always require at least five different pieces of information. You always have a buyer, seller, product (what was bought), the price, and date. With that, you go into a module, specify the details that make the invoices different from one another, and they become pieces of the architecture—modules, blocks, components, and primitives. If we look at the buyer in an invoice, in the CICA architecture, it is a module, with all the component parts of the buyer. The buyer is comprised of party, address, etc. If we group the data describing the buyer within the module, where each grouping is a block, then we have an assembly. In this example, assembly is the Organization composed of the address block and organization party block. If we look at just one block, for example organization party, we can define it using a component, or ID number, or DUNS number. The last CICA layer is the primitive and that is the ID value or the actual DUNS number.
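
[Editor’s note:  The following sketch is not from the meeting. It models the lower CICA layers from Ms. Yee’s Buyer example as nested data classes: a Module holds Assemblies, an Assembly groups Blocks, a Block holds Components, and each Component carries a Primitive value. The Template, Document, and Slot layers are omitted for brevity. The class and field names are hypothetical, and the DUNS value is a placeholder, not a real number.]

```python
from dataclasses import dataclass, field


@dataclass
class Primitive:
    value: str  # the actual data, e.g. the DUNS number itself


@dataclass
class Component:
    name: str
    primitive: Primitive


@dataclass
class Block:
    name: str
    components: list = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    blocks: list = field(default_factory=list)


@dataclass
class Module:
    name: str
    assemblies: list = field(default_factory=list)


def build_buyer():
    """Build the Buyer module from the example: an Organization assembly of two blocks."""
    party = Block("OrganizationParty", [Component("DUNSNumber", Primitive("000000000"))])
    address = Block("Address", [Component("City", Primitive("Washington"))])
    return Module("Buyer", [Assembly("Organization", [address, party])])


def leaf_values(module):
    """Flatten a module into {layer path: primitive value}."""
    out = {}
    for assembly in module.assemblies:
        for block in assembly.blocks:
            for comp in block.components:
                path = f"{module.name}/{assembly.name}/{block.name}/{comp.name}"
                out[path] = comp.primitive.value
    return out


if __name__ == "__main__":
    for path, value in leaf_values(build_buyer()).items():
        print(path, "=", value)
```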

 

Slide 11  [Structure:  EDI - XML comparison]:  This chart and the next are a comparison of EDI to X12 XML. It shows that a transaction set equates to a template. This isn’t exactly a one-to-one comparison, but it provides an understanding of how the CICA architecture is structured. “Minor generic loop” maps to an assembly, then “Moderate segment” to a block, “Single data element or Composite element” to a Component, and “Single piece of data” maps to a Primitive.
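
[Editor’s note:  The following sketch is not from the meeting. It records the EDI-to-CICA correspondences named on Slide 11 as a simple lookup table. As Ms. Yee notes, the comparison is not exactly one-to-one, so the table is a reading aid rather than a conversion rule. The function name is hypothetical.]

```python
# Correspondences as stated on Slide 11; layers not mentioned there are left out.
EDI_TO_CICA = {
    "transaction set": "template",
    "minor generic loop": "assembly",
    "moderate segment": "block",
    "single data element or composite element": "component",
    "single piece of data": "primitive",
}


def cica_layer(edi_construct):
    """Return the CICA layer that roughly corresponds to an EDI construct."""
    key = edi_construct.strip().lower()
    if key not in EDI_TO_CICA:
        raise KeyError(f"no CICA correspondence recorded for {edi_construct!r}")
    return EDI_TO_CICA[key]


if __name__ == "__main__":
    for construct in EDI_TO_CICA:
        print(construct, "->", cica_layer(construct))
```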

 

Mr. Royal:  These layers—in the naming of these, is this new out of X12? Did you devise them or adopt them?

 

Ms. Yee:  They were developed by X12.

 

Mr. Royal:  Developed specifically for XML?

 

Ms. Yee:  Yes. This is a comparison. The next slide will show the differences between EDI and XML in the structure.

 

Slide 12  [Differences]:  In EDI, there is a limited number of segments and the formats are of variable length. In XML, there is no limit on the number of documents, but the file size is large. We have “REF” segments that are overused in EDI. In XML, we have tests to determine their usage. EDI requires translation software; in XML, we have parsers. EDI requires data mapping to the standard. For CICA XML, we have an “instance” schema available on the DISA website. From the XML templates, we develop documents with all the data requirements—the schema is automatically created behind the scenes and will be available on the DISA website. You and your trading partner agree on the type of invoice you will be sending/receiving, and the schema is automatically created.

 

Number 6—In EDI, we have a very restricted structure. In XML, it’s very modular, with “plug and play” flexibility that we don’t have in EDI today. We use existing structures before creating new ones. We use an existing one, or we can easily create one if we need to. If we don’t see that we have everything we need to use, we can create it. EDI provides an “if, then” capability that we are currently working to try to capture in XML.

 

Slide 13  [Process Model]:  How did we come up with what we need to do first to use the structure? We knew we had to understand the business process. You need to understand what your trading partner needs as well, because … [next slide]

 

Slide 14  [Invoice documents]:  …that’s how you develop the CICA documents. So you have invoices that are product-oriented, time-based, event-based, services-oriented, etc. This isn’t the whole set, but it’s the set we developed to apply to the X12 XML standard.

 

Slide 15  [Business Scenarios]:  How do you plug and play with these different layers? The business scenario, for example, first requires that a purchase order be sent to the vendor. After the order is filled, the vendor sends an invoice. The buying party then pays the invoice using a payment order. Now let’s look at the parts of each document and determine what can be reused. The purchase order has a “Buyer” and a “Seller.” The invoice has the same Buyer and Seller, and uses the same blocks. However, on the payment order, instead of Buyer and Seller, you have a “Payer” and a “Payee.” What are they? Payer and Payee are the same as Buyer and Seller, so the same blocks and modules that comprise Buyer and Seller are the ones that comprise Payer and Payee.
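
[Editor’s note:  The following sketch is not from the meeting. It illustrates the reuse described on Slide 15: the same party modules fill the Buyer and Seller roles on the purchase order and invoice, and the Payer and Payee roles on the payment order. The class, function, and party names are hypothetical.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartyModule:
    """A reusable party description, built once and plugged into several documents."""
    name: str
    city: str


def purchase_order(buyer, seller):
    return {"document": "PurchaseOrder", "Buyer": buyer, "Seller": seller}


def invoice(buyer, seller):
    return {"document": "Invoice", "Buyer": buyer, "Seller": seller}


def payment_order(buyer, seller):
    # The buyer pays and the seller is paid, so the same modules
    # are reused under the Payer and Payee role names.
    return {"document": "PaymentOrder", "Payer": buyer, "Payee": seller}


if __name__ == "__main__":
    agency = PartyModule("Example Agency", "Washington")
    vendor = PartyModule("Example Vendor", "Columbia")
    for doc in (purchase_order(agency, vendor), invoice(agency, vendor),
                payment_order(agency, vendor)):
        print(doc)
```

Only the role name changes between documents; the party data is defined once, which is the "plug and play" reuse the slide describes.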

 

Slide 16  [Invoice CICA]:  This is to show you, visually, how plug and play works. In Invoice 1, you may have a Buyer; in Invoice 2, you have a Buyer. In each, we have a Seller. The Seller can be the same Seller. Depending upon the size of the organization, they can have the same Buyer, or if it’s huge—say GM is buying from 3M—maybe one part of GM wants sticky notes, and the other part wants scotch tape. The Buyers may be different.

 

Slide 17  [CICA —> XML]:  What’s key in all of this? You determine the business process. Then you select the document that represents your business process from the DISA website. When you use your CICA document, a schema is built behind the scenes to allow you to send and receive automatically.

 

Mr. Royal:  I go to DISA (say I need a purchase order). So I get this template and fill in radio buttons and get the order I need to do business?

 

Ms. Yee:  There are many types there. You’ll select the one that fits your need. If you don’t see the one you need, we’ll create one. Hopefully, if you represent the federal government, there’ll be one there. So you populate the one you select, agree with your trading partner that this is what you use, and then send it off.

 

Mr. Royal:  That means a potential vendor needs to be capable of supporting all the purchase orders on your site, and would need to support all the schemas.

 

Ms. Yee:  Theoretically yes—but if you’re targeting one industry or group, the chances are you won’t have to do all of it. For example, there is a utility invoice that most vendors will never use. 

 

Mr. Williams:  You talk about variation, etc. Are you allowing for the same type of variation for buyers?

 

Ms. Yee:  Yes.

 

Mr. Williams:  How do you handle Cartesian results (5 buyers, 5 sellers)? 25 different schemas? Is that how you’re doing it?

 

Ms. Yee:  No, the examples I showed represent government, private industry, and utilities. Because we wanted to keep it generic, we’re doing it by industry right now. Hopefully we’ll satisfy those instances.

 

Mr. Mike Todd:  It sounds like you’re only interested in the ones being used.

 

Ms. Yee:  Correct.

 

Mr. Royal:  The strongest vendors say, “If you do business with me, you do it this way.” Maybe the government will do that. I hope not. That’s often how you establish the common set—the biggest bullies on the block.

 

Mr. Todd:  I look at it as, “If this is our need, this is how we establish the relationship when we do business with someone else.” There’s a negligible difference, so we’ll have separate invoices. We don’t need 50 types. It comes down to building the one that works for us. If we can borrow from another, it’s smart to do so, rather than reinvent the wheel.

 

Mr. Royal:  Every time you do business with a new party, you do work to establish the relationship, rather than, if you have a standard that everyone uses, the machine establishes it. That’s what ebXML does. The machines establish it rather than manned intervention.

 

Ms. Yee:  Even if we use ebXML, there has to be a human saying, “We use ebXML.”

 

Mr. Royal:  It’s in the trading partner profile.

 

Mr. Dyung Le:  You’ve described the mechanics of using the template, then generating the schema. Do you envision the opposite, where you start with the schema, then create the template?

 

Ms. Yee:  No, I hadn’t. Do you think there’ll be the need to go the opposite way?

 

Mr. Le:  I don’t know. People modify things based on need.

 

Mr. Royal:  There are some implementers looking at UBL and spreadsheets used to generate. They want to be able to reverse engineer to a tool they can use, so there is interest.

 

Mr. Todd:  The business architecture models—I’d like to see a toolset that allows one to take the data and the relationships and lay out the process and all the intermediate steps. We go through all this work on what to do, how to do it, formats, etc. We have to link down to the structure and content of what we exchange on the other end.

 

Mr. Royal:  That’s being done at UN/CEFACT. Define your model, push a button, and have the schema developed. That’s their goal.

 

Slide 18  [CICA Schema]:  The CICA schema is very modular. It has consistent use of layers and modules. Once the document is selected, the schema is automatically built.

 

Slide 19  [FEA and CICA]:  Owen had asked whether this ties in with our FEA. Yes, it does. Where does it? With the Data Reference Model. CICA supports program and business line operations, and data exchange between the government and its customers and partners. It also categorizes information by content and decomposes it into greater detail. That’s what the DRM addresses.

 

Unidentified participant:  I thought the Data Reference Model was not ready yet.

 

Ms. Yee:  It’s out there on the Web. I pulled it up. I’m not saying it’s incorporated into every federal agency, but it’s out there.

 

Mr. Todd:  I think you’re both right. I think there’s a version published, but it’s still in development.

 

Mr. Royal:  It’s still in development. It’s not published. There may have been some early work circulated.

 

Unidentified participant:  Can we have access?

 

Ms. Yee:  I got it on the Internet. I’d be happy to give you the URL.

 

Unidentified participant:  I got it from a link from XML.gov.

 

Mr. Roy Morgan:  It wasn’t the FEA-PMO website, was it?

 

Mr. Ambur:  On the XML.gov home page, in the “What’s important” section, there’s a link to the FEA site. When you go there, you will see links to each reference model. The Data Reference Model has not been made publicly available yet.

 

Ms. Yee:  If you want to see what I saw, go to XML.gov.

 

Mr. Williams:  They’re saying it’s in line with the goals of the FEA?

 

Ms. Yee:  Yes, thank you.

 

Slide 20  [Where Are We Now?]:  Where are we now? We’re going through and approving the CICA architecture and design rules in the X12 process. We’re also refining the invoices you saw based on the above. We’ve been working on the structure and design rules for two years. We started building the invoices based on the design rules. We said, “Hey this works OK, but it’s not great,” so we went back and scrubbed it to make sure it does work as it’s designed to. I’m happy to see it’s not being done in a vacuum.

 

Mr. Royal:  Help me go through the process X12 goes through to define a specification. How does the public come in? Take me through the X12 process.

 

Ms. Yee:  X12 creates public standards, free for anyone to use. Briefly, for any X12 standards development, each new request must be presented by one of the work groups of a subcommittee. It is then formally recognized, then reviewed and discussed within the other subcommittees. The standard then goes to the Technical Assessment Subcommittee, to ensure that the standard is syntactically correct. Once the new requirement to the standard passes the subcommittees, it is put out to the full X12 membership. It also goes through a process review board to ensure that all the requirements according to the rules of X12 have been met. Once the standard is adopted by X12, the public is able to use it. Alan, is that it?

 

Mr. Kotok:  Yes. The ANSI connection takes the draft product that’s completed by X12 and publishes it for public review, but the meetings where the standards are discussed are open meetings, and anyone can attend and participate. You don’t have to wait until it’s published as a draft. The public comment is accepted during development.
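
[Editor’s note:  The following sketch is not from the meeting. It lists the approval steps in the order Ms. Yee gives them above and shows how a request would move from one step to the next. The stage names are paraphrases, the function name is hypothetical, and the minutes do not fix exactly where the process review board falls relative to the membership vote, so the ordering is an assumption.]

```python
from typing import Optional

# Stage order follows Ms. Yee's description; the exact sequence is an assumption.
STAGES = [
    "work group request",
    "subcommittee review",
    "technical assessment",
    "full membership review",
    "process review board",
    "adopted",
]


def next_stage(current: str) -> Optional[str]:
    """Return the stage that follows `current`, or None once the standard is adopted."""
    position = STAGES.index(current)  # raises ValueError for an unknown stage
    if position + 1 < len(STAGES):
        return STAGES[position + 1]
    return None


if __name__ == "__main__":
    stage = STAGES[0]
    while stage is not None:
        print(stage)
        stage = next_stage(stage)
```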

 

Unidentified participant:  Can you describe how to change an XML document after it’s released and in use?

 

Ms. Yee:  Whether introducing a new standard or changing an existing standard, the X12 process is the same. Since we are still developing the design rules and streamlining the architecture, I can’t say what will change in the structure.

 

Same participant:  Can you theorize on the process?

 

Ms. Yee:  In terms of overall approval, if there’s a schema that needs more functionality or data requirements, it would go before the Finance subcommittee, and the committee would look at it to ensure it follows the rules for an invoice. Then it goes to technical assessment, to ensure it is syntactically or technically sound according to the standards rules. It then goes to the PRB [Process Review Board]. And, as Alan said, it’s reviewed in the X12 arena, then it goes to a vote.

 

Mr. Hsu:  Is it open in the sense of not being limited to North American trade?

 

Ms. Yee:  Yes, certainly. Anyone, anywhere will be able to use it.

 

Mr. Williams:  Have you started thinking about Web Services yet? Of course, EDI is a request/response paradigm.

 

Ms. Yee:  Yes, definitely. This is one of the major advantages of XML. 

 

Mr. Royal:  I didn’t notice in the architecture where the choreography is or the business rules so the process fits. Where does that come in? Like if I send a request, what amount of time does my system wait for a reply, where does it fall out, etc.?

 

Ms. Yee:  How individual systems process the data is a matter for each system. We’re designing a standard that will work, and not focusing on the amount of time to process a document.  We’re not there yet.

 

Mr. Kotok:  We’re talking about the message instance here. What you talked about earlier—the top-down versus bottom up, the business rules, the choreography that would define a lot of the pieces that go into the model, and the idea with this model is to have the lower-level interchangeable part that you can plug in as needed to reflect the business process.

 

Mr. Royal:  It means building all the pieces of the car except the engine.

 

Ms. Yee:  We want to equate it to the engine.

 

Mr. Royal:  The engine is the business rules. By the way, UBL is in a similar state. Trying to bring in business rules is very different. Trying to bring them into a schema is not easy.

 

Ms. Yee:  They may be working on it, but in what I’m working on you don’t see it, do you Alan?

 

Mr. Kotok:  This is where interaction between the X12 and UN/CEFACT people is vital, because these are the pieces that have to plug into the business processes. Fortunately, there’s a good deal of overlap between X12 and UN/CEFACT committees, so it is certainly the objective to fit into the ebXML structure. I think where they’ve been focusing so far is on the match up of Core Components, more than the higher level business processes. You have to start somewhere.

 

Mr. Royal:  I agree, but I’ll also point out that at various layers of your model, you’re going to have interaction between reusable components. You need to have a way of determining what part of the business process you’re in. You start to get associations, like, “If I’m a party as a buyer, then I must have these things associated with it.” My role may change. If I’m a buyer or reseller, I might have more constraints or extensions of my BIE. Given the state of where we are in the transaction, my requirements might change. That has to be built in not only at the top, but in the layers.

 

I’d really like to thank you, Theresa, for your presentation. I’d never seen the layers of the architecture, so it’s enlightenment for me. I’m looking forward to working with you in the future, so keep up the good work.

 

We’re not far off the schedule, so as David prepares for his piece, there’s an IPM at noon for namespace management. We’d like to flip it around. Now David Heiser from the IRS is here to talk about graphics and Section 508.

 

 

Mr. David Heiser

IRS

Managing Accessibility in an XML Instance

 

Mr. Heiser:  First I’ll give you a little background. I’m David Heiser with the IRS’s Tax Forms and Publications Division. My job is to manage our electronic offerings for certain types of documents you’d like to forget about today. My immediate office is responsible for creating and publishing certain forms.

 

Mr. Royal:  A long time ago, I did some work for the IRS on their computer systems, so I have an appreciation for what you’re going through now.

 

Mr. Heiser:  I’m on the document side of the house. I don’t do EDI, and we’re not into schemas. I work with documents published in various formats including paper, so our requirements are different from groups pointing to Web Services and online government. The bottom line is that [Section 508] compliance addresses the lowest common denominator, because a lot of people don’t have a computer or online access to do their taxes.

 

Slide 2  [Problem Definition]:  Our problem was, do we create documents, created for paper and Web publications and make different versions? Section 508 and accessibility came up, so I addressed those issues. It affects all government agencies, and how we approach it from our side of the house.

 

Slide 3  [A Graphical Black Hole]:  You have a document. With a piece of paper, if you’re looking at it, you can digest all the information. For certain of our customers, the information in the graphic is lost. There are ways around it. You can describe what is in the graphic. They came up with rules for Web pages. The bottom line is, text readers don’t see graphics. We had to solve it for the information we’re putting out.

 

Slide 4  [Legislative Mandates]:  Here’s a brief review of the legislation:

·         Section 508 of the Rehabilitation Amendments Act of 1998 requires government to provide access to information for persons with disabilities that is comparable to access to others.

·         The Americans with Disabilities Act prohibits discrimination and requires equal access to services for persons with disabilities.

·         The E-Government Act of 2002 requires agencies to make their websites more efficient and usable using XML.

The third one in fact encourages the use of XML. I think we’d have gone there anyway, but it encourages us. The biggest problem is that this is all unfunded.

 

Slide 5  [IRS Perspective - Some Issues]:  From our perspective, the most important issue is, my authors have a complex job as is, and we need to minimize the complexity. We did address [Section] 508 in the past. We translated all the files into Braille and large print. That’s post-production, which, in the long run, is more costly. There’s also maintenance of descriptions. Then you have a management issue of little files running around a Web page. We didn’t want to deal with that issue. We did a lot of research over the last several months.

 

Slide 6  [What the Standards Revealed]:  The bottom line was that the recommendations we saw dealt with Web pages, not the front-end authoring side. I have two objectives: paper print, and also Web. We know that XML and SGML can be defined for just about anything, but they don’t specifically define it for…we looked at SVG [Scalable Vector Graphics]. It doesn’t really address our issues. It’s too new, and the IRS doesn’t jump into new technology because we like to look at how it plays out.

 

We also looked at MathML because it deals with some issues on getting equations into XML and SGML documents. The problem is, it requires a plug-in that’s not freely accessible to the whole world. It added complexity that we didn’t want to impose on our customers. Last bullet: we also examined some proprietary stuff, but we wanted it as open as possible. We looked at JAWS [Job Access With Speech screen reader] and others.

 

The problem with all these is, they’re not all at the same level. They don’t all handle PDF or HTML files the same way. Each presented a problem we’re trying to avoid.

 

Slide 7  [Enlightenment]:  My real inspiration came from the National Braille Association. An accessible graphic is written from an aural perspective. It really is a flowchart. You have start and end points, with a decision point in the middle. From an aural perspective, if you’re describing a flowchart, you want a start and an end, not a decision tree—because you forget where you are.

 

Based on this theory, we’re proposing to train our authors, when dealing with descriptions, not to write them from scratch but to maintain them, and to read them aloud to each other to make sure they get the points across from start to finish without confusion.

 

Slide 8  [Summary of Accessibility Requirements]:  Graphics use “Alt” tags for small graphics and Description files for information objects. Tables use summaries and/or scoping. We’re still debating whether headers or access information is better for some of our customers. We decided a combination is probably the better way to go.
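
[Editor’s note:  The requirements on this slide can be checked mechanically once a document has been transformed to HTML. The following is a minimal sketch only, not the IRS tooling described in the presentation. It assumes the requirements map to the HTML "alt", "summary", and "scope" attributes; the class name, function name, and sample markup are hypothetical.]

```python
from html.parser import HTMLParser


class AccessibilityChecker(HTMLParser):
    """Collects simple warnings about images and tables (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Assumption: every graphic should carry non-empty alternative text.
        # (Purely decorative images may legitimately use alt=""; this
        # simple check does not distinguish them.)
        if tag == "img" and not attr_map.get("alt"):
            self.warnings.append("img missing alt text")
        # Assumption: every data table should carry a summary.
        elif tag == "table" and not attr_map.get("summary"):
            self.warnings.append("table missing summary")
        # Assumption: every header cell should declare its scope.
        elif tag == "th" and "scope" not in attr_map:
            self.warnings.append("th missing scope")


def check_html(markup):
    """Return the list of warnings for an HTML fragment, in document order."""
    checker = AccessibilityChecker()
    checker.feed(markup)
    checker.close()
    return checker.warnings


# Hypothetical sample: one image without alt text, one header cell without scope.
SAMPLE = """
<img src="flow.gif">
<table summary="Example table with two columns">
  <tr><th scope="col">Status</th><th>Income</th></tr>
</table>
"""

if __name__ == "__main__":
    for warning in check_html(SAMPLE):
        print(warning)
```

A check like this only confirms that the attributes are present. Whether a description actually conveys the content of the graphic still has to be judged by a person.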

 

Slide 9  [Potential Issues]:  Issues—graphic descriptions are too long for Alt tags. We wanted to package our files. The solution had to be easily transformed to HTML. The bottom line was, I didn’t want HTML Desc files. I wanted to package everything.

 

Slide 10  [IRS Structure Solution]:  Another solution: we had about 3,000 graphics and converted them to tables. We revise documents; we don’t create many new ones, so we came up with a descriptor element. It’s a container subset that’s modular: plug-and-play.

 

Slide 11  [IRS Solution (continued)]:  What we do is wrap the graphic in a container, then we provide a new description element to locate the graphic when it’s converted to HTML so it falls in the right place, and then we provide navigational linking points for the text reader.
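
[Editor’s note:  The container approach described on this slide can be illustrated with a short sketch. The actual IRS DTD subset was not reproduced in these minutes, so every element and attribute name below ("figure-container", "graphic", "descriptor", "title", "para", "href") is a hypothetical stand-in, as are the sample text and the HTML layout. The sketch shows one way a wrapped graphic and its description might be turned into HTML with linking points between them.]

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source markup: a container wrapping a graphic and its descriptor.
SOURCE = """
<figure-container id="fig1">
  <graphic href="flowchart.gif" alt="Example decision flowchart"/>
  <descriptor>
    <title>Example flowchart description</title>
    <para>Start: the first step of the process.</para>
    <para>End: the final step of the process.</para>
  </descriptor>
</figure-container>
"""


def container_to_html(xml_text):
    """Render a container as an image, a link to its description, and the description."""
    root = ET.fromstring(xml_text.strip())
    fig_id = escape(root.get("id"))
    graphic = root.find("graphic")
    descriptor = root.find("descriptor")

    parts = [
        # The graphic keeps a short alt text; the long description lives elsewhere.
        f'<img id="{fig_id}" src="{escape(graphic.get("href"))}" '
        f'alt="{escape(graphic.get("alt"))}">',
        # Navigational link from the graphic to its description.
        f'<a href="#{fig_id}-desc">Text description of this graphic</a>',
        f'<div id="{fig_id}-desc">',
    ]
    for child in descriptor:
        tag = "h3" if child.tag == "title" else "p"
        parts.append(f"<{tag}>{escape(child.text or '')}</{tag}>")
    # Navigational link back to the graphic.
    parts.append(f'<a href="#{fig_id}">Return to graphic</a>')
    parts.append("</div>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(container_to_html(SOURCE))
```

Because the description travels inside the same container as the graphic, there is no separate description file to manage, which is the packaging goal stated on Slide 9.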

 

Slide 12  [DTD Subset]:  This is so simple; this is it in a nutshell. We created a subset that works for any XML or SGML DTD, with a title, paragraphs, and lists. We felt the need to bring in a table. That’s where the issue came up. We have a form that’s illustrated. If there’s not a narrative within the form itself, we have to provide a descriptor, so that’s where we came up with the table.

 

Slide 13  [Two Table Models]:  We needed to revise DTDs to support two table models. That’s not terribly difficult, but we also had to support the software we had. Because of the amount of money invested in the software, we’re not going to procure other software. We used a CALS table for regular production and print, and an HTML table model for the descriptor subset.

 

The composition output ignores the descriptor, while the XSL stylesheet can put it in a browser window. We’re working with Accenture to come up with a solution because the table’s too big to fit into the space provided, so we’re proposing to populate graphics into a separate expandable window, where you can control the size.

 

Slide 14  [HTML Table Illustration]:  This is an illustration of what I just talked about. So we use a CGI script that’s imported to determine the behavior of the table.

 

Slide 15  [Markup Illustration With Trickery]:  Until we solve the problem of bringing two table models into the editor, we’re using table templates. We’re working with the vendor to fix this. We’re working with SGML to use the “ignore” subset. Of course, we can use exclusion, because we don’t want an entire table subset for access reasons.

 

Ms. Harvey:  Why not expand the CALS table model?

 

Mr. Heiser:  Because it’s part of the conversion issues. I’m not sure why.

 

Ms. Harvey:  As long as you modify both of the group models, the CALS model would support it.

 

Mr. Heiser:  Yes—we have to go back and look at it.

 

Slide 16  [IRS Application Solution]:  The other thing we had to do is customize the software. [Mr. Heiser demonstrated some of the customization tools.]

 

Slide 17  [Descriptor Management]:  The tool collects the descriptors, and gives them the ability to manage it. It pops up in a separate window with its own menu.

 

Slide 18  [Next Steps]:  The next step is to fine-tune the multiple table model issues. We have a contract for writing and tagging the descriptor content. We need to train the authors. That’s the hardest issue of all. Then we need to add CGI code to display the descriptor in its own window.

 

Mr. Hsu:  Are the authors all internal users?

 

Mr. Heiser:  No. We’re actually working on a solution that affects all document authoring in the IRS. There are about 1,200 users around the country. It’s not all internal. 

 

Slides 19 & 20  [Some References; Acknowledgements]:  I had a lot of help from a lot of people. If you’re interested, you can contact me. We’re open to sharing both the software and the customization. The DTD subset, once finalized, will be freely sharable. We’re asking for input, feedback, comments…are we off base? It seems like a simple solution to what at one point was considered to be a difficult problem.

 

Mr. Brand Niemann, Jr.:  Have you done any live testing with 508 access?

 

Mr. Heiser:  Yes. The Alternative Media Center has users we’re testing with. So far, the reaction is positive. We’re also running it through JAWS software, but it’s not our only test suite. JAWS is expensive; with Windows NT it’s almost $2,000. It’s not the right fit because it’s not the lowest common denominator, and we’re not looking at a proprietary solution.

 

Mr. Byrne:  So for the author’s descriptor, that was a new solution?

 

Mr. Heiser:  Yes. It wasn’t rocket science, but once it was there, it worked.

 

Mr. Royal:  Would you be comfortable if someone called your descriptor a header?

 

Mr. Heiser:  No. Why?

 

Mr. Royal:  There’s work under way with headers. Essentially, what you described is parallel. There may be other metadata on what you described that encounters those segments.

 

Mr. Heiser:  We could rename it.

 

Mr. Royal:  I’m not asking you to.

 

Mr. Niemann:  How soon do you go live with actual data?

 

Mr. Heiser:  We’re looking at September.

 

Unidentified participant:  One way of getting authors to develop descriptions is to develop tables over the telephone.

 

Mr. Heiser:  It really is a different mindset.

 

Mr. Royal:  You could use annotation. The document in the XML seems to be going more toward the XHTML movement.

 

Mr. Heiser:  Yes. Some descriptors are very lengthy. It’s bad for the reader and the author. We wanted a structure to manage and present it better.

 

End presentation.

 

 

 

Mr. Royal:  Where we stand with namespaces is, LMI made a recommendation, and I put it out as a draft to the XML Working Group. We received very few comments. Most of the people whom I consider experts said, “Yes, it makes sense.” In the last week and a half, we got some pushback from the W3C. It seems that they have some competing standards. The standard for namespace says that one is not to assume that the namespace declaration is a resolvable entity. For that reason, we felt that URNs would be the better approach for a namespace for the U.S. Government. We hedged our bet a little, because the structure would be resolved by an application. Then we found that the W3C is working on RDDL. As I understand it, they want a common directory, where they can put attributes, and refer to the attributes from the XML document. The purpose is defined as this: a machine could go to the directory, find the attribute, and find out about the thing we’re talking about. It leads to the idea of the Semantic Web. I want to get more information from them. One W3C standard says it shouldn’t be resolvable, and one says it should. I just want some more input. If you have some way you can provide input…
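
[Editor’s note:  The point that a namespace name is an identifier rather than something a parser must resolve can be shown in a few lines. This is a minimal sketch; the URN and URL below are made-up examples and do not reflect the structure in the draft recommendation. A standard XML parser treats either form as an opaque string and does not attempt to retrieve anything from it.]

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace names, one URN and one URL, on otherwise identical documents.
URN_DOC = '<form xmlns="urn:example:gov:forms">sample</form>'
URL_DOC = '<form xmlns="http://example.gov/ns/forms">sample</form>'


def namespace_of(xml_text):
    """Return the namespace name of the root element, or None if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree stores qualified names as "{namespace}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


if __name__ == "__main__":
    # Both documents parse without any network access; the namespace name
    # is only compared as a string.
    print(namespace_of(URN_DOC))
    print(namespace_of(URL_DOC))
```

Whether an application later chooses to look something up at the namespace name, as RDDL proposes for URLs, is a separate decision layered on top of parsing.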

 

Mr. Heiser:  Did you hear from CSC?

 

Mr. Royal:  No.

 

Mr. Heiser:  We may have some input for you.

 

Mr. Royal:  I welcome it. It may come down to “If you use a URN, this is the way; if you use a URL, this is the way.”

 

Ms. Harvey:  On the XML Developer’s listserv, that document has been under discussion for the last two weeks. Most of it has been very positive.

 

Mr. Royal:  I’d like to see that thread.

 

Mr. D.J. Atkinson (via telecon):  I’ve been happy to see the recommendation come out. One comment is that it doesn’t deal with interagency collaboration very well. It’s geared toward the Executive Branch of the federal government.

 

Mr. Royal:  My answer to that is lengthy. The short answer is that for collaboration between agencies, all they have to do is register at “.gov.” They only have to have a letter from the CIO of the agency to do so. It also provided state and local governments the opportunity, but we already have that under the “.us” domain.

 

Mr. Atkinson:  The way it’s laid out, it doesn’t allow for it well.

 

Mr. Royal:  I don’t agree. Why don’t you think so?

 

Mr. Atkinson:  Because it’s based on the lead agency. What if you don’t have a lead agency?

 

Mr. Royal:  It doesn’t matter when you consider that the reason is to have a unique name. The semantics are not really related.

 

Mr. Atkinson:  It’s not technical, but political.

 

Mr. Royal:  The political issue is, “Who is allowed to register at .gov? What other agencies are able to register—‘money.gov?’” So it goes back to the challenge being making sure it happens with the DNS rather than the URN.

 

Mr. Ambur:  We’re planning in next month’s agenda to focus on UBL and ebXML updates. I wonder whether this merits further discussion on that agenda and whether we need someone to talk about RDDL?

 

Mr. Royal:  Tim Bray is the author of a white paper, as is Jonathan Borden of Open Healthcare Group.

 

[Editor’s note: Mr. Bray is scheduled to brief us on RDDL at our May 21 meeting:  http://xml.gov/agenda/20030521.htm]

 

Mr. Morgan:  There’s a Registry Team meeting this afternoon. We’ll be talking about name and naming convention management.

 

Mr. Royal:  Is there any other business or updates? Then I thank you all for your time and attention. Thank you to the speakers for sharing your information with us, and thank you to the people on the phone.

 

 

End meeting.

 

Attendees:

 

Last Name    First Name    Organization
Ambur        Owen          FWS
Barr         Annie         GSA
Bellack      Dena          LMI
Brinson      Latina        State
Byrne        Tony          CMS Watch
Dodd         John          CSC
Ellis        Lee           GSA
Harvey       Betty         ECC
Heiser       David         IRS
Henry        Larry         CSC
House        Robert        State
Hsu          Francis       State
Judge        John          State
Kane         John          NARA
Kotok        Alan          DISA
Le           Dyung         NARA
Morgan       Roy           NIST
Niemann      Brand, Jr.    Tax Analysts
Paik         Young         State
Royal        Marion        GSA
Todd         Mike          OSD
Weber        Lisa          NARA
Williams     Kevin         BlueOxide
Yee          Theresa       LMI
Zuech        Al            VA