Federal CIO Council

XML Working Group

 

Wednesday, July 23, 2003 Meeting Minutes

 

State Plaza Hotel

2117 E Street, N.W.

Washington, DC 20037

 

Please send all comments or corrections to these minutes to Glenn Little at glittle@lmi.org.

 

Mr. Owen Ambur:  Shall we get started? First, I’d like to thank Annie Barr and Lee Ellis for our new accommodations. I trust our non-government folks appreciate the ease of getting into the building, compared to GSA. They’re certainly nice facilities. Hopefully they’re the right size, and so forth.

 

I have a couple of announcements. First, I’ve been contacted by a fellow representing OASIS about serving as the content manager for the OASIS EGov focus area. I’m certainly inclined to take him up on his offer in some fashion. The fashion I prefer is syndicating content between the XML.gov and XML.org sites, not duplicating the same content. I just wanted to mention that in the form of a “heads-up” for now. I’m looking for advice on how best to do that.

Second—I don’t see Ken Sall here this morning; he’s been one of our most active participants. He’s also working on EGov projects under contract to GSA. He made a proposal that’s been accepted for the XML 2003 conference [XML Conference & Exposition 2003, http://www.xmlconference.org/xmlusa/] in December, to talk about EGov’s activities involving XML.

The third thing is, Betty Harvey of the Electronic Commerce Connection [http://www.eccnet.com/] (who heads up the DC XML User Group) and I have been talking about a day-long forum for tool vendors to come in and talk about how they’re trying to make it as easy as possible for folks to author and edit well-formed and valid XML instance documents. We’re thinking in terms of a September time frame. We’ll see how it works out.

With that, at these meetings we usually start by introducing ourselves and telling a bit about our interest in XML. I’m Owen Ambur, co-chair of this Working Group. I understand Marion Royal of GSA, our other co-chair, may be joining us later.

 

[Introductions]

 

Mr. Owen Ambur:  Alright, let’s get started.

 


Brett Stein, XAware, Inc.

Iqbal Talib, i411, Inc.

XML, XFML, and the Other “Xs”—From Data Aggregation to Faceted Search-and-Discovery to Business Value

 

[Editor’s note: For consistency’s sake, slides are referenced by numerical slide number within the overall presentation, rather than by the number appearing in the title area for each.]

 

Mr. Brett Stein:  Good morning. Welcome to today’s joint presentation with i411 [i411 Inc., http://www.i411.com/] on XML, XFML, and the Other “Xs”—From Data Aggregation to Faceted Search-and-Discovery to Business Value. We’re going to describe these Xs as vehicles for faceted search-and-discovery, then move into demonstrations using OMB Exhibit 53 data as a use case. I’m Brett Stein, and I’d like to introduce Iqbal Talib, who’s doing the second part of the presentation.

 

Slide 2  [Road Map]:  As a means for setting the agenda for our presentation, I’ll use this as a road map. I’ll start by discussing the objectives and rationale for this presentation, moving into a rapid discussion of EII [Enterprise Information Integration], specifically the XML-driven approach to EII, then move into concepts related to faceted search-and-discovery of data.

 

There are two focuses: one is the formation of an XML file from an [Office of Management and Budget (OMB), http://www.whitehouse.gov/omb/] Exhibit 53 spreadsheet; the other is search and discovery using i411 technology.

 

Slide 3  [Objectives (what? bottom line?)]:  Our first objective is to demystify some of the concepts that are germane to XML as it relates to data integration and to aggregation of data, along with some of the other Xs on the screen. Then I’ll do a couple demonstrations. The first demo will show the real power of creating unified XML views of disparate, dispersed, or unstructured data, in interoperable XML across agencies. Then Iqbal will tie it all together in a second demo. Together, we will show the benefits from the federal government perspective, and from the perspective of individual agencies. That is, how do you unlock the social, economic, and intelligence value of the information assets of a government agency? How do you bring them together, and what do you get out of it?

 

Slide 4  [Rationale (basis for joint presentation?)]:  That said, there’s a vital and urgent need for data integration and improved search and discovery among federal agencies, both to provide innovation as a means of building “better, faster mousetraps” and to attain technology and business goals in government. XML plays a key role as the underlying data-exchange format for faceted searching, and XFML provides for the sharing of hierarchical faceted metadata and indexing. From the purpose perspective, then, this presentation serves as education and outreach, using demonstrations to drive home the point of applying XML-related technologies in real use cases and seeing their benefits.

 

Mr. Ambur:  I’d like to mention that Ken Sall has joined us. Ken—I mentioned that you’re going to speak at XML 2003 on the EGov aspects of XML.

 

Slide 5  [Context (big picture? external forces at play?)]:  If you visualize the federal government by using this diagram, you could certainly spend all your time on the bubbles within the diagram. You could probably spend a whole career in government focusing on the bubbles. I’ll just spend a couple minutes. For example, in the context of Homeland Security, in 2001 [the Department of] Homeland Security was formed to bring together 22 agencies that were independent silos of information. There was a great need to bring their information together—to share and integrate all the information. If you think about the Data Reference Model [DRM—Federal Enterprise Architecture (FEA) Data Reference Model, http://www.feapmo.gov/] and E-records management, there are a number of agencies that have not thought a whole lot about sharing data across agencies before.

 

Electronic Records Management [ERM] is one of the 24 EGov initiatives defined by the administration, so there’s a lot of visibility on the sharing of information. As for the FEA, there’s a lot of activity to enable the government to provide citizen-centered, results-oriented sharing of information.

 

So with business models and sharing, in the context of the FEA, the foundation must be services for sharing and integration of information across different agencies.

 

Slide 6  [Impetus (example issues?)]:  So it’s no secret that there are great challenges in sharing and integration of information. The first challenge is redundancy across agencies. We can see some discussion from OMB: for instance, of the 28 lines of business across the federal government, 19 are redundant. Then, there is a lot of expenditure on ongoing maintenance of redundant systems. If you couple that with industry analyst studies of development costs to build applications, estimates say that about 20% of the government’s proposed Fiscal Year 2004 IT expenditures is for redundant systems. There are a lot of issues related to that. Think about typhoons of data, silos of information, stovepiped information. It all ties back to interoperability and getting systems to talk to each other and share information.

 

Slide 7  [XML—and the Other “Xs” (brief definitions?)]:  That leads us into XML technologies. XML and XFML: XML is a standard, simple, self-describing way of encoding text and data. It’s a standard way for communicating, it’s human- and machine-readable, and it allows you to apply context to information.

 

XFML is a model for expressing topics and their definitions: what XFML is all about is describing the indexing of information. We’ll show a demo in a while; it’s a huge spreadsheet of information, and XFML allows us to put it in context and make it presentable.

Working through this list, you can see other relevant technologies: XTM [XML Topic Maps, http://www.topicmaps.org/], from which XFML is derived; XQuery [XML Query, http://www.w3.org/XML/Query], which allows querying into XML structure; then XSLT [XSL Transformations, http://www.w3.org/TR/xslt], which transforms XML documents into other presentations. ebXML [Electronic Business using eXtensible Markup Language, http://www.ebxml.org/] is an exchange standard for e-business—so we see a lot of technologies coming to the forefront to allow us to attack these problems.

 

Mr. Ambur:  Brett, you make a point that’s of interest to me. You said XFML is derived from XTM. Will you be saying more about it, Iqbal?

 

Mr. Iqbal Talib:  Yes, I’ll discuss it a little.

 

Mr. Stein:  XFML is a subset. XFML is specialized, where XTM is more general.

 

Mr. Joel Patterson:  Who is running the standards?

 

Mr. Stein:  XFML.org has a lot of information about the standard.

 

Slide 8  [Standards-Driven Enterprise Info. Integration]:  Now you have the basis for a discussion on standards-driven Enterprise Information Integration. First, it provides a virtual view into multiple databases, to make a single view of customer information and other relevant information. The application can access the view as if it’s stored in a single data source, even if the data come from many sources that are geographically dispersed. When it’s accessed, it transparently handles the connectivity into back-end data sources and other functions (for example, security, integrity, query optimization). All these things could be from a single source. They could also be spread across multiple sources.
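[Editor’s note: The “virtual view” idea described above can be sketched in a few lines of Python. The class and source names below are hypothetical, purely for illustration; they are not XAware’s actual API.]

```python
# Minimal sketch of an EII-style virtual view: the application queries one
# facade, which transparently fans out to several back-end sources and
# merges the results into a single record.

class VirtualView:
    """Presents several data sources as if they were a single one."""

    def __init__(self, *sources):
        # Each source is any callable: key -> dict of fields (or None).
        self.sources = sources

    def lookup(self, key):
        record = {}
        for source in self.sources:   # connectivity handled here,
            part = source(key)        # not in the application
            if part:
                record.update(part)
        return record or None

# Two hypothetical back-end sources, e.g. one per agency database.
hr = {"E123": {"name": "J. Smith"}}.get
payroll = {"E123": {"grade": "GS-12"}}.get

view = VirtualView(hr, payroll)
merged = view.lookup("E123")  # single merged record from two sources
```

In a real EII platform the sources would be ODBC connections, Web Services, and the like, with security and query optimization layered in, but the application-facing contract is the same: one view, many sources.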

 

These views do a couple things for you: they leverage existing Information Technology (IT) expenditure in infrastructure; they leverage applications and other IT investments that have already been made. They also provide a path for migration from legacy systems to new-world technologies, such as client/server and Web applications. They give you a path to intelligently migrate from one to another.

 

Slide 9  [Single View of Information]:  So with the notion of a single view, these on-demand views apply to all levels and functions within the government, integrating information from cities, counties, states, and federal agencies for a single view. The basis of Homeland Security is rapid sharing of information on a first-response basis. To be effective, information must be shared at all levels.

 

Mr. Ambur:  On this slide—I mentioned earlier that OASIS had invited me to contribute content to their E-government focus area. You’re talking here about an international proposition. Obviously, this can extend internationally as well.

 

Mr. Stein:  Absolutely. The next logical step is outside our borders.

 

Mr. Ambur:  I suggested to the OASIS e-Gov TC folks that a high-value focus would be on one, or two, or a small number of forms with international use. My concern is that they may remain focused on having a philosophical discussion rather than on creating something of real value. To the extent that anyone has any influence on their process, I throw this out for your consideration.

 

Slide 10  [Solving the Integration Challenge]:  Let’s take a step back for a moment. As a means of solving integration challenges, what else is out there? The first option is to build custom applications. This has been the historical approach. We can talk about the baggage with this approach: you need skill sets in-house to build the application; you’re talking about time to market; time to connect point-to-point; then ongoing maintenance to keep it up and running. At the other end of the spectrum are applications that do enterprise application integration. Here, it’s a different ballgame in terms of the baggage that goes along with it: extremely complex, expensive, heavyweight platforms; tremendous complexity for installation and maintenance. Also, in the government space, the cost is typically outside the budgets that are allocated to solve these problems. Again, in the analyst community, the estimate is that you would need to bring together hundreds of applications to break even. It’s hard to justify getting agencies to pool money to invest, and it doesn’t happen easily.

 

Let’s go back to the enterprise integration platform. We’re talking about a lightweight, componentized application. It would allow a positive ROI [Return On Investment] on the first or second implementation you do. It’s no longer stovepiped. We talked earlier about aggregating information; that’s the basis of an EII platform. It allows us to deploy on existing infrastructure. Finally, it involves some sort of visual development environment; reduced cost, time-to-market, and complexity make it easier to use.

 

Mr. Ambur:  Is it component-based?

 

Mr. Stein:  Yes, component-based.

 

Mr. Ken Gill:  Can you give us examples, to make it easier?

 

Mr. Stein:  We’re XAware. There are others out there—competitors like MetaMatrix [http://www.metamatrix.com/], Data Junction [http://www.datajunction.com/], and others. IBM has a product as well. What you find in the products from the bigger players is, you lose some independence. For example, IBM’s product has to run on IBM’s platform.

 

Slide 11  [Applications]:  So let’s talk about the makeup and structure of a typical EII platform as it relates to solving these problems. In the context of this picture, applications need to come in with a data source. We see myriad solutions. Typically, they’re stovepiped. With a typical EII platform, you get a couple things: some sort of server or engine allows you to invoke real-time use of information. I’m talking about a contract between partners to share information based on an XML schema or instance that we’ve agreed upon on how to share. It can be deployed on the Web, or use application servers, or be embedded natively into applications.

 

Application interfaces are provided to allow communication with the server. It can be Web-based, like SOAP; it can be Java-based (EJB, JMS), or CGI, or others for communicating with the server. On the back end, you need adapters to communicate with data sources. There are myriad sources—like databases, enterprise systems, Web Services, Java files, email, FTP, hardware, etc. There are a number of ways to bring data into this consolidated environment to share. There are also a number of ways to let you see your metadata and map to XML structures.

 

Slide 12  [Information On-Demand]:  One slide about XAware: it goes into what XAware provides from a technical perspective. It brings several unique features into creating a virtual view of information, to connect information to data sources. It uses data chaining, which allows us to share information across systems. It’s bi-directional, so you can deal with inbound XML and decompose it as well. So to share, I need to get information into data sources, to break it up into its native format. You can think about synchronization or migration of data sources; as you migrate, there’s a new path to put in place. With our technology, you can communicate from multiple sources to multiple targets. Finally, it offers the ability to transform data, using our scripting, or tying into JavaScript to transform the data. So you and I share; you expect a different structure. It can do those transformations on the fly. So it’s a unique, standards-based approach to solving the problem.

 

Slide 13  [Using XML for Federal Enterprise Architecture]:  So that brings us to the FEA. Here I’ll discuss how EII fits into this. Let’s take it from a model perspective, on how the FEA is broken down. It starts with the Business Reference Model [BRM, http://feapmo.gov/feaBrm2.asp], which is a way of describing business functions of the government, independent of agencies. So a lot of things about sharing both ways, many-to-many relationships, etc. apply. Then there’s the Data Reference Model [DRM, http://feapmo.gov/feaDrm.asp], where you can talk about a standards-based approach for cross-Agency exchanges of information in real-time, leveraging your IT investments.

 

Mr. Ambur:  With respect to the DRM, the FEA Program Management Office is now focusing on the data model underlying the Business Reference Model (BRM). It’s the last of the models to be issued. They’re seeking input on how best to put the DRM together. They’re looking at an October 1 deadline for issuing it. At this point, it’s limited to internal review in government agencies, but if you have any suggestions in that regard, I’ll be happy to have them.

 

Slide 14  [Using XML for FEA (cont’d)]:  Moving on—the duplication of effort that the FEA is intended to resolve—we talked about it earlier in terms of redundancy. The EII platform allows many agencies to appear as a single source of data. Finally, we talked about the Service Component Reference Model [SRM, http://feapmo.gov/feaSrm2.asp]. We saw data management and data integration challenges as defined within the back-office services domain, so this resolves the very problems the FEA is addressing.

 

Slide 15  [Integrated Information Sharing]:  So this shows us a single XML view across all levels and functions of government, in horizontal and vertical aspects.

 

Slide 16  [XAware: Example Applications]:  Here are three examples of where we’ve worked. We’re fortunate to have Dennis Burling on the phone. He did a lot of work in Nebraska, helping on an exchange network of the EPA [Environmental Protection Agency, http://www.epa.gov/]. They had a series of Web Services for sharing information between states and EPA network nodes, so using an EII platform, with XAware, we were able to quickly solve the need to put a node in place to share information across agencies with Web Services technologies. You can see some other projects we’ve done; for the Bureau of Land Management, we did work with oil and gas wells. It allows inspectors to enter information into their tablets, which is then fed back to their agencies.

 

Mr. Ambur:  Dennis, this is Owen. Can you hear the presentation?

 

Mr. Dennis Burling:  Yes.

 

Mr. Ambur:  Has there been any thought in the Nebraska State government to apply the technology horizontally across the government, rather than just vertically with the EPA?

 

Mr. Burling:  We’re getting ready to step into that piece. The drinking water project with EPA was done with HHS [Department of Health and Human Services, http://www.hhs.gov/]. We’re getting ready with XAware to install a node there to get their drinking water information and give them some of our information.

 

Mr. Ambur:  That might prove to be a worthwhile presentation for this Working Group, so it’s something to keep in mind.

 

Mr. Burling:  Another piece is the State’s counterpart to FEMA [Federal Emergency Management Agency, http://www.fema.gov/], to have our information available in XML format for the emergency management group to use any time they want.

 

Mr. Ambur:  That sounds good. Please keep us in mind for reporting on any successes or failures you might experience.

 

Slide 17 [Context for Demos 1 + 2: OMB and the Federal Budget]: Let me set the context for the demonstration. The demo deals with OMB Exhibit 53 budget data for IT, which is a subset of OMB Exhibit 300.

 

Slide 18  [Context for Demos 1 + 2: The Federal Budget Process]:  [Editor’s note: The demo is not included as part of the PowerPoint slide presentation.]  Specifically, Exhibit 53 deals with budget information for fiscal years 02, 03, and 04 for IT projects across agencies. As most of you are aware, these budgets are year-long multi-step processes, performed by every office of every Agency, developed for every fiscal year. It’s a lot of work, starting at the lowest level and moving up to the highest level.

 

Slide 19  [Demo 1 + Demo 2: Flowchart of Steps]:  Let’s talk about today’s demos. It’s a joint demonstration, starting with an Exhibit 53 spreadsheet. Using the XAware EII platform, we extract information from the spreadsheet and convert it into an XFML format, which is then handed over to the i411 software for faceted searching.

 

Slide 20  [Demo 1: Conversion of OMB Exhibit 53 to XFML]:  So let’s get into what we’re doing. Exhibit 53 is a [Microsoft] Excel spreadsheet containing IT investment details for all projects in every Agency of the federal government. The demo is going to use our XA-Suite to convert the spreadsheet to XFML. I’ll talk a little about that representation. For those of you who know about XFML, it’s a hierarchical structure that gets you to a page, which contains occurrences of the topics defined within. Each entry in our XFML file is a page. Each has a title, which uniquely identifies a row in the spreadsheet. Each page has occurrences, pegged to a topic. Each topic relates to a facet. There are five facets: Department, Investment Type, Budget Entry Year, Project Type, and Investment. So you can slice-and-dice across the facets. Later, we’ll see how to index and search the data.
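[Editor’s note: A simplified XFML-like fragment for one Exhibit 53 row may help make the page/occurrence/topic/facet structure concrete. The element and attribute names below are approximations for illustration; consult xfml.org for the exact XFML vocabulary, and the Project ID shown is invented.]

```python
# Parse a small XFML-style document: each page (a spreadsheet row) has a
# title and occurrences pegged to topics; each topic belongs to a facet.
import xml.etree.ElementTree as ET

doc = """
<xfml>
  <facet id="dept"><name>Department</name></facet>
  <facet id="year"><name>Budget Entry Year</name></facet>
  <topic id="doi" facetid="dept"><name>Interior</name></topic>
  <topic id="fy2003" facetid="year"><name>2003</name></topic>
  <page url="row-010-000001">
    <title>010-000001-Land records project</title>
    <occurrence topicid="doi"/>
    <occurrence topicid="fy2003"/>
  </page>
</xfml>
"""

root = ET.fromstring(doc)
page = root.find("page")
# The list of topic ids this row is classified under.
topics = [o.get("topicid") for o in page.findall("occurrence")]
```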

 

Slide 21  [Demo 1: Conversion of OMB Exhibit 53 to XFML (continued)]:  Within facets, there are a number of topics. Under Investment Type, you have Projects by Mission Area, Office Automation and Infrastructure, etc. Under Budget Entry Year, you have 2000, 2001, 2002, and 2003. The “24” represents the EGov initiatives. Then you set up dollar-amount ranges for each year within the budget.

 

I’m going to jump back and forth from the presentation to the demo. First I’ll bring up the spreadsheet. So I look at IT Investment details; the information is laid out by Investment Type; Unique ID; Type of Project; Description; then Budget Information. You can see in Project ID that there’s some encoding going on. As defined by Mr. McVey, the first three digits define the Agency, so that’s how we get Agency information to the topic.

 

We have to work through the transformation to get the spreadsheet into the topics we need in the XML, so we bring up the XAware designer—our visual development environment. We’ll look at this format first. It creates an XML file, which is a template for the sharing of information. This is where we populate the information from the spreadsheet. Each row is a page. There’s a set of the facets, then topics within the facets. This is where we set up the index for the metadata from below.

 

At the bottom is a page. It contains a title, then different occurrences. It contains topics for the page, or for rows of the spreadsheet. Given these topics, we go back to the development environment and open the XML file. It’s brought into our environment. I’m going to save this document as a metafile that XAware calls a business document. It’s a schema or static instance that converts it to a format for our server. I’m not going to convert the entire spreadsheet; just one particular Project ID. I add an input parameter to the document, which allows you to input a Project ID from the spreadsheet.

 

Now I’m going to go down to that page element. This is where we do the mapping into our Exhibit 53 spreadsheet. It creates a business component that allows us to connect data sources, and map to XML structures, as well as any other mapping. In this case, we deal with a spreadsheet, and treat it as an ODBC [Open Database Connectivity] source.

 

We need to define the name of the spreadsheet; in this case, it’s IT Spending (IT). We define an input parameter, which is Project ID. I use it in this component, then I go into the spreadsheet, so you see all the columns in the spreadsheet. You can then select the columns that you want to pull into the XML structure. You can see, at the bottom of the window, that you have a list of columns. Since it’s a selective query on Project ID, I can set up my criteria. You can see that I’ve constructed a query; given that it’s an Excel spreadsheet queried in generic SQL [Structured Query Language] format, I just need to enclose the table name in brackets, and then move on.
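[Editor’s note: The bracketed-table-name point can be illustrated with the kind of SQL string involved. The sheet and column names below are illustrative, and the query is shown as a string only; executing it would require an ODBC connection to the spreadsheet.]

```python
# Sketch of a selective query against an Excel sheet exposed through ODBC:
# generic SQL, with the table (sheet) name enclosed in brackets so the
# driver accepts a name containing a space.

def exhibit53_query(table_name: str) -> str:
    return f"SELECT * FROM [{table_name}] WHERE [Project ID] = ?"

sql = exhibit53_query("IT Spending")
# With a driver in place this might be run as, e.g.:
#   cursor.execute(sql, ("010-000001",))   # hypothetical Project ID
```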

 

You can map information from the database into an XML structure, into a page element with occurrences of topics. You do it through drag-and-drop from the source (Excel spreadsheet columns). We drag to our topic, the output structure. I’ve got these expanded out. It’s just a matter of dragging from one side and dropping onto the other. You’ll notice I use “Project ID” on a couple of different ones here. I talked about it in the spreadsheet.

 

For example, the Agency and Budget Year come from that Project ID itself. You just map down the line. In the title, say I uniquely identify each row in the spreadsheet by Project ID and description, so I concatenate the Project ID with the original description element, so it returns Project ID-desc[ription] as the identifier for this element.

 

I also want to talk about transforming from the form in the spreadsheet into the form we need for the map. For example, take “Agency.” It’s the first three digits of the Project ID. I do this with a concept called a functoid. I select a Java process for the transformation and do it online, so it’s an online process for that function. I apply a couple of these functoids…[Mr. Stein applied several functoids].
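[Editor’s note: The two transformations described above (deriving the Agency from the Project ID, and concatenating Project ID with description for the title) amount to the following logic, written here in plain Python rather than as XAware functoids. The sample values are invented.]

```python
# Equivalent logic for the two mappings described in the demo.

def agency_code(project_id: str) -> str:
    """The first three digits of the Project ID identify the agency."""
    return project_id[:3]

def page_title(project_id: str, description: str) -> str:
    """Concatenate Project ID and description to uniquely identify a row."""
    return f"{project_id}-{description}"

agency = agency_code("010-000123")                # '010'
title = page_title("010-000123", "Land records")  # '010-000123-Land records'
```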

 

At this point, we’ve done our mapping from the Excel spreadsheet. I finish up and associate the Project ID input variables together, so what is passed to the document is passed to the component; then you can save the document. At this point you can execute, so we go to the spreadsheet, grab a Project ID, and use it as input. So I execute the document, plug in the Project ID, and execute.

 

We’ll see now that we’re connected to the spreadsheet. It mapped in the page information, where the title is the Project ID. You can see that Budget Entry Year is 2000. It’s a major project, with a total fiscal budget between $1 million and $2 million. Had I applied all the information, we would have seen a similar result for the topics for each of the others. We end up with a full-blown XML file comprising (in this case) 5,147 rows from the spreadsheet, with 12 topics for each page (about 60,000 elements). It shows total 2003 and 2004, and steady-state 2003 and 2004.

 

The time it took me to put this together in preparing this demo was less than one hour. It gives you a feel for the power of these tools—the scale and magnitude.

 

I’m going to turn it over to Iqbal, for faceted search-and-discovery of data.

 

Mr. Talib:  What Brett just did reminds me of those cooking shows on TV; the chef starts to make something, and suddenly it’s all ready. Just to address XTM, XFML, and facet maps—there’s always confusion about these. From the XTM point of view, if we look at a book index in the front of a book, that’s a topic map. Let’s say the book’s not written yet: we have a map, and when we find stories, we make the topics real. We see facet maps in the opposite direction. We have a set of documents that need classification; those are facet maps. XFML is nothing more than a method to take them (facets and topics) and associate them with the resources themselves.

 

Slide 22  [The Big Issue with Search and Discovery (S&D)]:  We talk about searching as an application. In the final analysis, there are only data and applications. What are the issues with respect to “search?” I’m talking about the user (of the data) —not IT folks or other stakeholders such as owners of the data. I’m talking about the users. What are their concerns?

 

If someone hands you a book with no index, it makes no sense unless you know the title and chapter headings to know what it’s about, and look through the book. We find that in the industry today. Google [http://www.google.com/] can be forgiven because the universe it searches within is so large that we can forgive certain results not showing up. But when we’re talking about the government, where it’s all important, we can’t forgive search results that may be important but don’t show up.

 

Therefore, having an aerial view of topics, like the index of the book, is an important way to understand what is in the book. We want it presented in hierarchical categories (since at most, we can keep seven or eight topics in mind at one time). We want a reasonable number of topics to show up at any one time. If we have the ability to see these in multiple views, it becomes even more interesting. We can have documents classified by organization on one side, and by products and services on the other side. It’ll give us two ways to look at information.

 

Slide 23  [Evolution of Apps./User Interfaces and Data Rel’ns]:  We see from this slide that we have come a long way from the 1960s. On the applications and UI side, we have gone from single-application machines with a single user to: multi-applications; multi-computers; Web-based applications with no-name users; fast computers; cheap storage; and high-speed WAN.

 

On the data relationships side, we have progressed from data semantics provided by the user and the application, all the way to assigning meaning to data elements, relationships among them, and agreed-on external concepts (topics and facets).

 

Slide 24 [Reqs. of Data Stakeholders that are Driving Standards]:  In general, we should be clear that there are particular stakeholders who are shaping standards: information end-users; information creators (authors); IT people; and data owners. Each of these stakeholders has particular needs and wants.

 

Slide 25  [Specific Problems with S&D—w.r.t. Stakeholders]:  As I mentioned, when we conduct a search using Google, we could get a million results. If we try to examine the results page-by-page, the most we would be able to get to is 1,000 results anyway. Any results beyond 1,000, for all intents and purposes, will be invisible. The ability to have the complete result set classified into meaningful categories each time is vitally useful, because it gives the user control over how to further drill down—and derive value from the data. Now we can talk about search as an application.

 

It’s useful to step back to the previous two slides and look at the origin of how all this came about. So let’s talk about the evolution of how data used to be represented and why XML came about. Looking at the old way where data had context in the application only, it was the application and the user that determined what the data meant. A piece of data could be interpreted to mean different things by different applications. Over time, with faster computers, with the ability to parse information, we had multi-application systems, with many users creating and changing applications and data. It became more important to give meaning to data, and XML is a basis for that.

 

I’m not going to touch on all the topics on this slide, but if you look at the bottom right, we’re moving toward assigning meaning to data elements, and agreeing on labels for information. The notion of relationships among data, and what any application does with them, must satisfy the stakeholders of the database. As we assign meaning to data using technology standards like XML, those standards begin to play roles previously played by applications. I maintain that any standard and, certainly, any application must satisfy the major stakeholders of the data it represents. Typically, these may be users of the data, creators of the data, maintainers of the data (usually the IT folks), and owners of the data. XAware, for example, is a middleware company that helps IT folks integrate information from disparate sources (making their life easier).

 

We talk about four main stakeholders: the users, the creators, the IT people, and the owners. There are others.  For example, in a situation where salespeople try to sell an advertisement in a directory, they better have a technology to satisfy the end-user’s concerns. Otherwise they’ll never sell an ad. So let us focus on the end-user within the context of search technology.

 

So what does the first stakeholder (i.e., the user) look for? Using a Google-like search, you don’t want too many results. If you can’t find something good in the first few results, you need options, and it would be nice to have the ability to browse a taxonomy of the results set. We can assume that all information in data repositories in the government space, where the data is well–defined, is important. Really, you want to find the data you’re looking for easily and quickly. For example, if you type “heuristic” and get a million results, you can’t use them. But if the results are returned to you organized neatly in category order, then that could prompt you where to go, and then you can drill down that path. If you can search and browse simultaneously and iteratively within the results set, you can control the search path and end up discovering information you did not know existed. We call this serendipitous discovery. That in turn enhances the value of the data.
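[Editor’s note: A minimal sketch of the faceted drill-down described above. The records and facet names are illustrative, not the actual CRISP or i411 data; the point is that each query returns both the matching subset and fresh category counts, so the user can search and browse iteratively.]

```python
# Illustrative faceted drill-down: filter by the facets chosen so far,
# then recount every facet over the remaining records so the user can
# see where to drill next. (Toy data; not i411's implementation.)
from collections import Counter

records = [
    {"title": "Grant A", "state": "MD", "year": 2002},
    {"title": "Grant B", "state": "MD", "year": 2003},
    {"title": "Grant C", "state": "VA", "year": 2003},
]

def drill_down(records, **filters):
    """Return the matching subset plus category counts for each facet."""
    subset = [r for r in records
              if all(r[f] == v for f, v in filters.items())]
    facets = {f: Counter(r[f] for r in subset) for f in ("state", "year")}
    return subset, facets

subset, facets = drill_down(records, state="MD")
# Two Maryland grants remain; the "year" counts show where to drill next.
```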

 

The second stakeholder (authors), on the other hand, wants to control the visibility of their documents. Authors place a document in multiple categories to increase the chances of people finding it. For example, with a document containing financial information, you want bankers to find it. You’d place it in that category, as well as in categories where real estate people would be likely to look for it. This is one of the shortcomings of the XFML standard. It doesn’t allow resources to be assigned to more than one category.

 

The third stakeholder, database manager, is obviously concerned about maintaining control over the data repository for which he or she is responsible. When the problem is not searching across just one database, but multiple databases that are disparate (e.g., in format, platforms) and dispersed (e.g., geographically, organizationally), the managers of these databases are leery of outsiders sticking their hands in their databases and disturbing their maintenance infrastructure. Thus, database managers want minimal intrusion.

 

And finally, the fourth stakeholder is the repository manager. Here, the manager may want a single point of search across multiple repositories, and simultaneous update, so that if one repository is updated, everyone sees it instantly. The repository manager may also want to synchronize data across all levels of access classifications. If the same piece of information is accessible at two different security access levels, a change in that information must be automatically and simultaneously seen by users at both levels. Most important, the application needs to address the difficulty that data repository managers are faced with in getting the consensus of all managers for access to their data. It took a federal law to say, “You’re all going to share.” There are still problems. They say, “What should we do? Build a different database, or normalize fields?”

 

Slide 26  [Specific Problems with S&D—w.r.t. Business Value]:  The big, important opportunity (and challenge) is to unlock the full social, economic, and intelligence value of disparate, dispersed databases. To achieve this vital goal, “better, faster” interaction—ideally in real time—between end-users and textual information is critical. We contend that there are five key pillars that can form the foundation to get to maximal social, economic, and intelligence value.

 

For example, with respect to search technology in general, you want to provide the notion of virtual aggregation (Pillar 1) to search across multiple databases—not change anything in any database. Not everyone who searches is an expert. One way to help is to bring back search results in categories to enable faceted search and browse (Pillar 2). Greater data visibility (Pillar 3) is important, because if all we get is “good” precision and recall, the presumption is that all that you want is within the first few items of the results list. It means any documents that are not visible have no relevance. That is a problem in the government space, where all information in the results list could have relevance. In that case, as you search within a progressively smaller subset of data, the specific document you’re looking for becomes increasingly visible.

 

Virtual syndication (Pillar 4), on the other hand, is where you take any subset of data and provide it to another Agency. You say, “Here is an information subset that is fully searchable within the subset and that is accessible by your agency.” And you provide them the exact same information—just its subset—without needing a different structure or database infrastructure. It’s very helpful. There is no additional cost for hardware or RDBMS software license, and there is no intrusion on the database.

 

Finally, during the search, as users declare their search intentions more clearly step-by-step, it gives the ability to the repository owner to communicate relevant information or warnings. We call this real-time customer response (Pillar 5). If an expert user is drilling down, say, a chemical database, you could flash him or her a message, “There’s an epidemic caused by this chemical in this area. Please contact us if you are knowledgeable about its effect.” Targeted messages like that may have better effect than a general announcement posted by a general banner ad.

 

Slide 27  [Solving the S&D Problems—What Do We Need?]:  So to what Brett said earlier, we need certain tools to solve the search and discovery problems. XML is the language; XFML is its instance; and XTM is where you define a space with topics.

 

Slide 28  [Toward a Better Mousetrap for Search and Discovery]:  This slide highlights why faceted search-and-discovery technology is important—as I have said previously. We’re going to skip this pretty much.

 

Slide 29  [Search, Browse, Slice, Dice, Discover—by Facets]:  Now we’re going to show what happens with a visual aid. What you’re looking at on the left side is View 1 into a database with a complete picture of all categories and facets. We take one facet in View 2, and drill down along that axis. We can then look at the data from a different angle—from View 3 along another axis. Finally, in View 4, where we search by a keyword, say the block in the lower right corner is what you get to by taking isolated subsets of data. You’ll see all this better when I show you the real-time demo.

 

Slide 30  [Conventional vs. Faceted Search & Discovery]:  I’m not going to go over this chart that shows the differentiation between conventional search engines and faceted search-and-discovery technology. The ideal desire is to have all these capabilities: simultaneous searching and browsing; multi-category browsing; real-time categorization; unlimited n-dimensional data cells; many-to-many records and axes; virtual aggregation of disparate datasets; virtual syndication of customized datasets; and unlimited scaling. We need the flexibility of unlimited data cells. It’s an issue with all search engines. That is, as soon as we go beyond several million cells in the “database cube,” it gets difficult to handle the search and management of the data.

 

In the old days, a relational database management system thought of data as spreadsheets. It had advantages. The major disadvantage was, you couldn’t see the hierarchy. Hierarchy has been used for a long time, but with respect to data, it was modeled around spreadsheets. Only recently, with the XFML standard, do we agree on how applications will process information that is organized in hierarchical taxonomies. And these are useful.

 

Take the example of walking into a flea market. There are things all over the place. To find what you’re looking for may force you to walk from one end of the flea market to the other. You can spend your whole day there (i.e., lost productivity and business deals for both the flea marketers and you as the buyer). In Wal-Mart, however, all of the merchandise is organized neatly into aisles, shelves within aisles, and bins within shelves. Here, it’s more natural and easy for us to find the things we are looking for. We have now generated social and economic value for both the merchandiser and the buyer. This many-to-many relationship, with a single topic belonging to many parents, is what XTM (Topic Maps) takes care of but Facet Maps do not.
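[Editor’s note: The many-to-many relationship described here, a single topic with many parents, can be sketched as a graph rather than a tree. The topic names below are hypothetical.]

```python
# Illustrative topic graph: unlike a strict tree, "mortgage rates" sits
# under both "banking" and "real estate" at the same time.
parents = {
    "mortgage rates": {"banking", "real estate"},
    "banking": {"finance"},
    "real estate": {"finance"},
}

def ancestors(topic):
    """Collect every category a topic belongs to, directly or indirectly."""
    seen, stack = set(), [topic]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

# A document filed under "mortgage rates" is discoverable from both the
# banking side and the real estate side of the hierarchy.
```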

 

Mr. Ambur:  In that sense, in response to Joel’s question earlier, XFML is a subset of the XTM standard.

 

Mr. Walt Houser:  We need a tutorial on the distinction between RDF [Resource Description Framework, http://www.w3.org/RDF/] and XTM, because RDF is the other school of thinking.

 

Mr. Talib:  I’m not familiar with it.

 

Mr. Walt Houser:  These two groups don’t talk together well.

 

Unidentified participant:  Devin posed that to Steve Newcomb. The answer is long and complicated. He sent me 20 codicils?

 

Mr. Marion Royal:  I think Walt’s answer is more concise.

 

Mr. Houser:  One characteristic of RDF is it’s focused on metadata tagging of documents. I know it’s labor-intensive. If you assign it to someone who’s not adept at it, you get redundant information out of documents, because they’ll cut and paste. That’s the extent of my knowledge of it. I’m inclined toward a metadata approach that relies on a document approach. I’m also suspicious of things that look like magic.

 

Slide 31  [Search—via Google]:  This slide shows what I’m talking about. It’s a Google query. The large number of results is ranked by relevance, but the results are not organized by categories. You can’t drill down or backtrack. You can’t expect a user, when drilling down, to start over again, to backtrack.

 

Slide 32  [Search—via Faceted Search & Discovery]:  Here’s an example of an application done for the CRISP [Computer Retrieval of Information on Scientific Projects] database of the Office of Extramural Research at the NIH [National Institutes of Health, http://www.nih.gov/]. You can see at the very beginning of your search what the CRISP grants database is—and how it is organized cleanly into hierarchical categories. This is the aerial view of the entire grants database. There’s no mystery. It tells how it’s organized by CRISP hierarchy, by fiscal year in which the grant was made, by state in which the grant was executed, and by principal investigator who led the grant research. In this way, you get relevant views of the data—and you can drill up, drill down, and slice and dice the data very easily and very quickly (in less than 200 milliseconds).

 

Mr. Houser:  So it’s comparable to the Yahoo presentation?

 

Mr. Talib:  Not quite. Yahoo has the beginning part, i.e., data organized by categories. But the moment you drill down, you lose all semblance of the relation to the categories. Say, for the NIH’s CRISP grants database, you’re going to get a long list of search results with Yahoo. If NIH grants are in any of the categories in the beginning, you don’t have that relationship anymore. Instead, what you’ll have is that the first category will show some records, then another category with some records, then maybe it’ll show the first category again. So Yahoo does the aerial view fine, but when you drill down, you don’t have that same capability.

 

Slide 33  [i411 Discovery Engine™: Example Applications]:  This slide shows some applications we worked on using i411’s faceted search-and-discovery technology. Besides the CRISP database, we also are powering the AIDS research database for the NIH Office of AIDS Research. Another application is the multifaceted search of the export, import, and economic archives of STAT-USA, a branch of the U.S. Department of Commerce.

 

Slide 34  [Demo 2: Multifaceted S&D of OMB Exhibit 53]:  I’ll jump now into the second demo, picking up from Demo 1 which Brett showed us. [Editor’s note: The demo is not included as part of the slide presentation.]

 

Mr. Ambur:  I suggested that XAware and i411 use the OMB Exhibit 53 data for their demos, in part, because I went through the painful process of paring the Excel spreadsheet down to list only the IT projects that included the term “XML” in their brief descriptions. It would have been easier just to cut and paste, because I eliminated almost all of the rows, then posted on the XML.gov site the remaining listing of the five projects that included reference to XML in their descriptions.

 

Mr. Talib:  Since the Exhibit 53 database is in spreadsheet format, you’ll get a sense of how the faceting works as I show you Demo 2. You can see that right off the bat, you have an aerial view of 5,147 documents (i.e., agency IT projects). We can also see that the projects are clearly organized into five categories: by department, by project type, by investment type, by investment FY, and by budget entry year.

 

Then, for example, you can drill down to an agency that is of interest to you, say, the Department of Agriculture. You can see that you have a search list within Agriculture. This list breaks the USDA’s projects down by project type, by investment type, by investment FY, and by budget entry year. You can also “switch views” from one axis to another. It tells the spreadsheet columns to organize “this way.”

 

We also get an exact count of the documents found, so it’s not as though a topic appears with no records associated with it—i.e., there are no dead ends. We know that filtering to a dead end is very frustrating and takes your time.

 

So, in this way, I can select a major topic here, drilling down each time—and slicing and dicing down to the projects that are of interest to me or to those that I am discovering serendipitously as I search and browse. You can also see the actual project document at any time. So if I click on the details, I can easily see all the information related to a particular project.

Another example is a keyword search for “Web” or “portal.” Now, going back to a new search of the database, I’m doing a free-text search, like in Yahoo. I click on “Find,” and, instantly, I found 267 documents among the 5,000+ documents. But unlike Yahoo, which would have given me a long list of 267 results, my faceted search gave me the results organized by facets and topics within them, e.g., by agency, by project type, etc. If I click on the details of any document, I see that all the textual search words (“Web” and “portal”) in the documents are identified in bold.

 

Mr. Ambur:  HHS has over 100 projects with “Web” or “portal” in their name?

 

Mr. Talib:  Right. If I do a keyword search for “office automation,” for example, there are 51 documents. This is how they’re spread out. Since we’re used to working with the Boolean search, it searches for either. You can see the blue and yellow highlights. You have the ability to drill down…if you take the total of investments between $1 and $210 million dollars, there are only 18 items that qualify. You can backtrack by saying you want to consider, for example, investment type. Click here, and select a different investment type. It has the ability to drill up and down. You can take off the free-text searched term. I’m still left with 87 documents under “Enterprise Architecture” with investment of $1-$10 million dollars.

 

Mr. Ambur:  There’s a link to this in the HTML version of Brett’s and Iqbal’s presentation. I didn’t have time to put it in the PowerPoint version. We’re up to break time. Are there some questions for Brett or Iqbal?

 

Incidentally, I did a search of their demo application using the term “records” and turned up 98 investments that are addressing records management in one way or another.  Likewise, a search on the term “forms” turned up a fair number of projects addressing forms automation, including quite a bunch at HHS.

 

Slide 35  [Conclusions—and Q&A]:  I don’t want to hammer this any more. There are lots of opportunities with faceted search and discovery—and there are appropriate technologies (XML, XFML, etc.) to help unlock the full value of the databases of different agencies. It’s all there.

 

Slide 36  [Contact Information]:  [Not discussed.]

 

Mr. Keith MacKenzie:  I’d like to know about DOM [Document Object Model] versus SAX [Simple API for XML, http://www.saxproject.org/] parsing and the performance implications of the two choices in your technology. Can you comment on that?

 

Mr. Talib:  i411 search technology is not dependent upon the XML technology. It turns out that XML is good for it. I could have gone to any database. You could do this exact search if the database was in SQL, etc.

 

Mr. MacKenzie:  So parsing a known format, going to an object model, and searching?

 

Mr. Talib:  Yes.

 

Mr. Ambur:  OK, let’s take a break and come back in about 10 or 15 minutes.

 

Break

 

 

Mr. Ambur:  We’ve had some people come in since we went through our introductions earlier. If you’d like to introduce yourself and tell us a bit about your interest in XML, we’d like to hear from you.

 

[Introductions]

 

Mr. Ambur:  Is anyone on the phone anymore?

 

[Marion Royal on phone]

 

Mr. Ambur:  Our next presentation was originally scheduled in April [2003], but some of the participants were unable to make it due to the snow storm.  When I first learned about Extensible Forms Description Language a number of years ago, it was because I was looking for something like it. I didn’t discover it because it was proposed by UWI.com, which has since become PureEdge. I discovered it because it addressed a business requirement of concern to me, a business requirement that was being ignored by most technology vendors. That’s how I became aware of XFDL. We had scheduled a presentation on the Air Force’s implementation of a PureEdge product in April, but we didn’t get into the innards of the proposed standard itself. I have a personal interest in this. I hope it’s of interest to you as well. So with that, we’ll turn it over to you, Keith.

 

 

Keith MacKenzie

PureEdge

XFDL Overview

 

Slide 1  [Title Slide]:  My name is Keith MacKenzie, of PureEdge. I’m going to talk about what the business challenge is. We have a lot of paper forms. We need to replicate them in the electronic world. We worry about things inherent to paper: security, accountability, manageability, etc. I’ll talk about paper—what’s good about it—can we surmount the challenges of moving it online? Then, we’ll talk about the advantages of having the information online.

 

Slide 2  [XFDL: Extensible Forms Description Language]:  Here is the XFDL history. It was developed by PureEdge and Tim Bray in 1998. It was submitted to the W3C as a note. The W3C acknowledged the note, and XFDL gained use in government and billing. It’s well established for electronic forms. It’s a way to remove paper from the process. It’s primarily designed to deal with auditability and securing of transactions. It evolved to become a dynamic presentation of data with auditability, so it’s a dynamic interface for security and transport of data.

 

Slide 3  [Understanding the Need for Forms]:  Business forms represent 80 percent of the data within government, so why are they so pervasive?

 

Slide 4  [Understanding the Need for Forms (continued)]:  Paper has structure naturally associated with it. There’s a strong association for the user to put the right kind of information into each box, so it naturally does that.

 

Slide 5  [Understanding the Need for Forms (continued)]:  They’re also user interfaces. They offer a simple and direct way to capture the information. They’re also an efficient way to archive things. Some of these things have been around for hundreds of years—things associated with case law and archiving, so we have a long paper history.

 

Slide 6  [Understanding the Need for Forms (continued)]:  They also capture the fixed state of the transaction when it’s performed. That’s very important. Electronic representation tends to be malleable, and changes as things change. We need to be able to lock down the world. Paper does that well. It’s more difficult in the electronic world.

 

Mr. Houser:  Another important dimension is provenance, i.e. proofs of authenticity or past ownership. Provenance relates to the concept of malleability. With paper documents, one can examine the stock, the letterhead, the ink, any initialing, marginal notations, and other artifacts typically lost when documents are converted to electronic form. See http://db.cis.upenn.edu/Research/provenance.html.

 

Mr. Mackenzie:  All these are challenges as we take it online. If you have questions as we go, stop me.

 

Slide 7  [Electronic Forms]:  To be better user interfaces, electronic forms need to be valid records. That means everything on the paper needs to be the same online.
How can we be sure that everything that represents the transaction was saved? It also needs to be accessible. It’s nice if it’s accurate, but what do we do about integrating it with other systems in terms of interoperability? Finally, because it’s online, people think it needs to guide you. There are things you can do online that you can’t do on paper: validation, etc. This all needs to be addressed in terms of online transactions.

 

Slide 8  [Valid Transaction Records]:  The legal requirements for paper records have been established over many years. XFDL replicates many of those requirements. The requirements have to do with security, non-repudiation, and auditability. Security has to do with the data stream. I’ll talk less about that. There are lots of ways to secure data now—bit streams using standard protocols—but what’s not addressed is non-repudiation and auditability. Non-repudiation is the ability of someone to attach an indication of the integrity of the data.

 

Security and auditability build up to non-repudiation, with non-repudiation coming from the form signer: “Was that person authorized to sign? Who was it? How are we sure it was the right person?” Basically, the kind of thing addressed by PKI infrastructure, or other authentication techniques. There is an e-authentication effort to do that. And how do we know the document’s not altered? The final thing, which people don’t often take into account, is that the context in which the data are served is an important piece of the transaction record. You can’t separate context from data and have a secure and non-repudiated aspect of the transaction. There has to be context.

 

Slide 10  [Valid Transaction Records (continued)]:  Non-repudiation we talked about already. It’s very difficult to tackle in the computer world. A lot of it is the philosophy on separation of data and presentation. One of the things that’s been overlooked is that the context must be bound together with it. Then you can have auditability. The transaction has to have enough information in it to reproduce it later if there is some dispute around what the transaction was about.

 

Slide 11  [Valid Transaction Records (continued)]:  Most e-forms technology doesn’t provide this information. With the separation of data and presentation…I’m not saying it’s impossible, but the administrative overhead of templates and data is difficult, so we need a series of infrastructures, so that the two pieces, if physically separated, can be brought together logically. It’s difficult, but possible.

 

Mr. Ambur:  John Kane and NARA [National Archives and Records Administration] are worried about storing permanent records in perpetuity. It’s difficult now, and they must think about preserving them 100 years from now.

 

Mr. MacKenzie:  Sure. I go into it a little on how to resolve that.

 

Slide 12 [Valid Transaction Records (continued)]: A little on HTML: HTML is really just tagged data. No presentation information is sent back and forth between the end user and the server, and the presentation information is not captured. We could apply a signature, but it’s meaningless, because we’re just signing the data, not critical information about the presentation, like font size, placement of labels, the order in which questions were asked, etc. (important constituents of the non-repudiation argument). So the basic principle is that everything is retained, so the whole record can be brought up later as one unit.
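[Editor’s note: The point about signing data together with its presentation can be illustrated with a toy example. SHA-256 hashing stands in for a real digital signature, and the markup is invented.]

```python
# Toy illustration: a "signature" over the data alone says nothing about
# how the question was presented; a signature over presentation + data
# binds the whole record together. Hashing stands in for real signing.
import hashlib

data = "<answer>yes</answer>"
presentation = "<label>Do you waive your right of appeal?</label>"

sig_data_only = hashlib.sha256(data.encode()).hexdigest()
sig_full = hashlib.sha256((presentation + data).encode()).hexdigest()

# Swap in a different label: the data-only signature still verifies,
# but the full-record signature does not.
tampered = "<label>Do you accept a free gift?</label>"
assert hashlib.sha256(data.encode()).hexdigest() == sig_data_only
assert hashlib.sha256((tampered + data).encode()).hexdigest() != sig_full
```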

 

Mr. Ambur:  An important aspect that Mathew McKennirey—as a security expert—brought up is, how do you know what you are signing if the presentation is separated from the data?

 

Mr. MacKenzie:  Right.

 

Slide 13  [Valid Transaction Records (continued)]:  So which of these would you prefer to sign? You wouldn’t want to sign the first one. The data are stripped out. You could have a separate template file. Say they get out of synch; later on, you bring out a different version. Even the questions and answers don’t match. The entire transaction could change its meaning, so the bottom one is what I prefer to sign. It’s a replica of the transaction. You’ll see later how it’s all contained in a single XFDL file—secured, and locked together.

 

Slide 14  [Making Data Accessible]:  The other thing we have to do is make data accessible. We have to be able to take the data and use them in another context as well. What XFDL attempts to do is make a secure method to execute transactions, but also make a way to extract the data to populate other applications. There are many ways systems can do that.

 

Mr. Ambur:  The point I make is that most vendors propose a false choice by saying you have to separate the presentation from the data in order to meet the needs for displaying information on different devices. The answer is, you have to have both – the ability to tie data to its legal presentation while at the same time being able to display the data on other devices.  XML and particularly XFDL facilitates both purposes.

 

Mr. MacKenzie:  You don’t have to choose. That’s what we provide in XFDL. Version 6 XFDL directly supports schema instances and namespaces, so with the XFDL namespace, you can have others in the document also. A good example is Form SF424 (Grants Cover Page Form). The government is prescribing an XML schema for this form. XFDL supports the data in their own namespace. It also has an XFDL presentation layer on top. It presents the data in a prescribed format. The data can be signed, as well as the document.
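[Editor’s note: The layering described, a data instance in its own namespace inside a form wrapper, can be sketched with standard XML tooling. The namespace URIs and element names below are invented for illustration; they are not the actual XFDL or SF-424 schemas.]

```python
# Illustrative two-namespace document: a form wrapper in one namespace
# carrying a data element in another. URIs and names are hypothetical.
import xml.etree.ElementTree as ET

FORM = "http://example.org/forms"          # hypothetical wrapper namespace
SF424 = "http://example.org/grants/sf424"  # hypothetical data namespace
ET.register_namespace("form", FORM)
ET.register_namespace("sf424", SF424)

root = ET.Element(f"{{{FORM}}}document")
ET.SubElement(root, f"{{{FORM}}}page")
applicant = ET.SubElement(root, f"{{{SF424}}}applicant")
applicant.text = "Example University"

xml_text = ET.tostring(root, encoding="unicode")
# The data element keeps its own namespace inside the wrapper document,
# so it can be extracted and validated on its own, or signed in place.
```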

 

Mr. Will Gorman:  The 424 is something proposed in the XML Web Services Working Group. It’s not official. It’s being used to demonstrate the ability to use forms.

 

Mr. Ambur:  The E-Grant initiative is specifying the data elements for the SF 424, which is an official, standard form (SF).

 

Mr. Gorman:  Anything we used came out of the Web Services Working Group.

 

Mr. Brand Niemann:  The Web Services Working Group has been provided with an official version of the E-Grants schema, and we encourage pilots to use those now.

 

Slide 15  [Making Data Accessible (continued)]:  This is the general idea—one file with layers. The first is presentation, the next is business logic, then there’s the data instance itself. It directly adheres to a definition like SF424. So you might have a [Microsoft] Word document, or a PDF [Adobe Portable Document Format], or an Excel spreadsheet to attach—the ability to attach any binary format. So the whole XFDL is a security and presentation format for all those components.

 

Mr. Ambur:  I’d like to make a point about this slide.  There has been litigation about [NARA] General Records Schedule 20 [GRS 20, http://ardor.nara.gov/grs/grs20.html]. The only way agencies can legally destroy records, once they have been received or created by the agency, is under a schedule by NARA.  So they’re supposed to think through the business logic and then get NARA’s permission to destroy them after certain periods of time. Some records may be permanent and eventually go to NARA. To avoid requiring each individual agency to redundantly schedule common types of records, NARA issues General Records Schedules. Under GRS 20, they’re issued the authority to destroy any electronic record once it’s converted to paper. That’s been litigated.  The plaintiffs argued that significant values intrinsic to electronic records are lost when they are converted to paper. I hope this group doesn’t need to debate that point. The District Court agreed and ruled GRS 20 invalid. They said you have to keep the original electronic version.  On appeal, the Circuit Court overturned the decision, and said it was not making a judgment as to whether NARA was right but merely whether NARA was within its rights to issue the schedule. The key point is, even the Circuit Court said that GRS 20 only applies to electronic records that are fully and completely copied to paper, including the formulas and associated information etc. behind spreadsheets, i.e., all the values. Nobody does that and I don’t know if it is even possible to do with spreadsheets, for example.  Thus, in my personal opinion, GRS 20 does not literally grant authority to any agency to destroy any electronic records.

 

XFDL provides the ability to do as the Circuit Court said agencies must do to comply with GRS 20. The point is, GRS 20 does not give the authority to destroy any record that’s not fully and completely copied. If they’re not printing the formula behind the spreadsheet, they’re violating the law.

 

Mr. MacKenzie:  You could print the XFDL, and it contains the formulas.

 

Mr. Ambur:  Please note that I am not in favor of converting electronic records to paper, because values are lost.  The point is that XFDL could enable agencies to do so lawfully.

 

Slide 16  [Untitled]:  The point here is to directly support these schemas.

 

Mr. Houser:  Do you use the XML Digital Signature [XML DSIG] standard?

 

Mr. MacKenzie:  No, we don’t. There are about five signing technologies. We’re moving toward XML DSIG. We’ll support it in a subsequent product.

 

Mr. Ambur:  I encourage that type of question.  Vendors need to hear them from us to know that they are important.

 

Slide 17  [Simplifying Transaction Process]:  The idea is to simplify. Traditional paper forms are difficult to fill out, difficult to navigate and complete. The idea is of a rich front end; an alternate experience that allows one to fill out the form. But to arrive at a traditional format that represents the paper is important. XFDL does it very well. Then you want it to prompt the user for a predetermined selection of uses, enforce proper data formats, etc. Forms are notorious for round-tripping if they’re filled out incorrectly. A lot of these can be caught by embedding business rules up front at data capture time. Then being state-aware is key—so the idea is that a paper transaction or record has multiple states: it goes through an approval process; there are four or five approvals for each section. Someone looks over the form; verifies it; signs it; and routes it to the next party. It’s important for the form to know what state it’s in.
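[Editor’s note: Catching errors up front at data capture time, as described above, can be sketched with simple per-field format rules. The field names and patterns are illustrative.]

```python
# Illustrative capture-time validation: format rules attached to fields,
# checked before submission, so badly filled forms never round-trip.
import re

rules = {
    "ssn":  re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def invalid_fields(form):
    """Return the names of fields that fail their format rule."""
    return [f for f, pat in rules.items() if not pat.match(form.get(f, ""))]

assert invalid_fields({"ssn": "123-45-6789", "date": "2003-07-23"}) == []
assert invalid_fields({"ssn": "123456789", "date": "2003-07-23"}) == ["ssn"]
```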

 

Mr. Houser:  Is the digital signature comprehensive?

 

Mr. MacKenzie:  The way you set it up in XFDL allows for XML signatures. In fact, the XML Signature standard uses the same signature-filter approach as is used in XFDL.  One of our engineers served on the XML Signature team, and his contribution to the committee was the XFDL concept of signature filters. Because of signature filters, you can stipulate exactly what parts of the XML file are being secured for each signature on a document. So when you sign “Section A,” you only sign Section A. As a result, the idea of multiple, parallel signatures is also supported.
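[Editor’s note: The signature-filter idea can be sketched as follows. Hashing stands in for a real digital signature, and the form sections are invented.]

```python
# Illustrative signature filter: each "signature" covers only the named
# sections, so Section A can be locked while Section B stays editable.
import hashlib

form = {"A": "approved by clerk", "B": ""}

def sign(form, sections):
    """Hash only the named sections (a stand-in for a real signature)."""
    payload = "".join(f"{s}={form[s]}" for s in sorted(sections))
    return hashlib.sha256(payload.encode()).hexdigest()

sig_a = sign(form, ["A"])          # first signer covers Section A only
form["B"] = "endorsed by supervisor"
sig_ab = sign(form, ["A", "B"])    # second signer covers both, overlapping

# Filling in B did not break the first signature; any change to A
# would invalidate both.
assert sign(form, ["A"]) == sig_a
```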

 

Mr. Houser:  My concern is non-repudiation of previous parts of the form.

 

Mr. MacKenzie:  When you sign Section A, it locks it down, unless the person deletes the signature.

 

Mr. Houser:  If “B” depends upon “A,” can a person for example, change B after A is signed?

 

Mr. MacKenzie:  Yes, provided the derived value in B is not signed by A. Anything in B can change; A can’t change, however. Typically in a process, A gets signed by one person, then B and A get signed by a second person, etc. When you sign Section B, it’s overlapping, so you’re signing A and B in the filter. In order to change A, all signatures locking A must first be deleted by the person who applied their signature.

 

Mr. Houser:  Will there be a business rule for the signature?

 

Mr. MacKenzie:  It’s implemented through the signature filter and XFDL.

 

Mr. Houser:  How do you talk to workflow products?

 

Mr. MacKenzie:  Our format can be used by any parser or tool. You can send the whole document, or particular schemas, to the tool. It can pick it up and understand the document in a certain state.

 

Mr. Houser:  Have you done that?

 

Mr. MacKenzie:  Yes. With IBM, we built a toolkit for the IBM Content Manager. We were able to submit documents into the repository and have document routing, with the content manager doing the routing. The other is Software AG. They're here [participating in the meeting]. We've done work with them and gotten an excellent XML database; whole schemas can be put into the Tamino database.

 

Mr. Steve Jacek:  We’ve also integrated third-parties with Livelink [Open Text Corporation Livelink, http://www.opentext.com/livelink/]…

 

Mr. MacKenzie:  The advantage of XFDL is that it's a standard. The specification is published and freely available. Any tool can use XFDL. PureEdge builds one, but it doesn't have to be PureEdge.

 

Mr. Houser:  For a process outside the control of a workflow process tool—how does it go to an outside source? Say a veteran applies for a benefit, and he needs concurrence by an outside party: he needs to send it by email to a physician or university, and then have it come back again. How would it do those handoffs?

 

Mr. MacKenzie:  It’s the integrator’s decision. He may do the content or back-end side, or the form side. It’s an implementation choice. There are multiple ways on the back end; workflow rules, or email. You can attach it. When it fails a step, the engine can say, “There’s a problem. Attach this XFDL and send it to the person to take corrective action.” There’s no need to bring the template and data together, because they’re together as a single object.

 

Mr. Gorman:  At the failure point, you can involve a human; or, with a third-party, you can send it to a doctor, and he concurs (or not) and sends an email back. Since it’s one document, these things are recorded inside the document itself, so when you receive it back into your workflow, it knows whether the physician concurred. You can define an end-state, and say, “This document is complete.”

 

Mr. MacKenzie:  It depends upon the infrastructure. You can implement lots of kinds of solutions, so when the integrator walks up and says he has a big back end, or there’s a disconnected user, for example—with, say, no network functionality—then you can say, “Based on that, we better put the logic inside the document.” If it uses the whole back-end infrastructure, though, you might make the opposite choice: not put the logic into the XFDL document, but use the document as an object, and use the back end to route it through multiple stages.

 

Mr. Houser:  Is there a certain amount of logic for a “standalone,” that hands off to the workflow tool?

 

Mr. MacKenzie:  When you build the document, you determine that.

 

Mr. Houser:  How about multiple authors?

 

Mr. MacKenzie:  Yes. XFDL will support it.

 

Ms. Carrie McCaslin:  With out-of-the-box workflow functionality and established gates, do you think this is useful for capturing data elements and knowing when they're ready to promote?

 

Mr. MacKenzie:  We've had a number of partnerships in document management, content management, records management, with IBM and others…

 

Ms. McCaslin:  There's one in particular that I'm interested in: Windchill, by PTC (Parametric Technology Corporation) [http://www.ptc.com/appserver/it/icm/cda/icm01_list.jsp?group=201&num=1&show=y&keyword=37]. Do you see applicability there?

 

Mr. MacKenzie:  We try to focus on core computer technology, and leverage our partners in terms of putting in place those other tools on the back end.

 

Ms. McCaslin:  And the preference, if you put the business logic in the workflow—are there implications?

 

Mr. MacKenzie:  The advantage on the forms side is that when the logic is in the form, it runs on the end-user desktop, and we know there are a lot of spare processor cycles on those desktops. Alternatively, there is an advantage on the workflow side to having it centralized, i.e., run on the server side. XFDL provides flexibility of approach, so you can implement whichever way achieves the business needs.

 

Mr. Ambur:  At last month’s meeting, we had our first presentation by hardware vendors. It focused on XML acceleration in the hardware.

 

Mr. MacKenzie:  "Compute" statements are extremely fast (microseconds), so Compute on the client side is no problem. On the server side, XML takes time to parse, so there are some architecture considerations to weigh for any solution, including XFDL.

 

Slide 18  [Simplifying Transaction Process (continued)]:  This is the first page of a number of screens of the form.

 

Slide 19  [Simplifying Transaction Process (continued)]:  This is the form. It’s very difficult to fill out, so the wizard is key in helping the user.

 

Slide 20  [XFDL—Quick Feature Overview]:  So let’s get into XFDL in a little more detail. It’s document-centric. Everything is in that one file. The XFDL and schema are in their own namespaces, so it’s easy to extract the data from the document after it’s archived as a complete record. It simplifies record administration, and replicates the features of paper forms.
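The namespace separation the slide describes (presentation in one namespace, instance data in another, all in one file) can be sketched as follows. The namespace URIs and element names here are invented placeholders, not the real XFDL vocabulary.

```python
import xml.etree.ElementTree as ET

# One self-contained document: presentation markup and data payload
# live side by side, distinguished only by namespace.
DOC = """<f:form xmlns:f="urn:example:presentation"
                 xmlns:d="urn:example:loan-data">
  <f:page><f:field f:name="amount"/></f:page>
  <d:loan><d:amount>25000</d:amount></d:loan>
</f:form>"""

root = ET.fromstring(DOC)
ns = {"d": "urn:example:loan-data"}

# The archivable record payload is everything in the data namespace;
# presentation elements can simply be ignored by a downstream system.
instance = root.find("d:loan", ns)
amount = instance.find("d:amount", ns).text   # "25000"
```

Because the data namespace is distinct, extraction needs nothing more than a generic XML parser, which is the record-administration benefit the slide claims.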

 

Slide 21  [XFDL—Quick Feature Overview (continued)]:  XFDL is a fourth-generation programming language. It has an extensible computation language and can make its own smart decisions. We talk about the Compute engine being "aware." It can make fields invisible or visible; add pages; or change the allowable navigation of the user. Part of XFDL is to allow a form-design professional to create forms that control the user's experience. For example, as soon as a form is signed, you might lock it down so certain pages can't be used any more, but as long as it's not signed, you can change it to suit the user.

 

Ms. Robin Mattern:  An XML document is usually static. What do you do about that?

 

Mr. MacKenzie:  The Compute technology is part of the XFDL specification. It's prescribed. It can understand the functions, and you can write an engine that does the compute. We use the viewer: it takes the XFDL document and presents it to the user.

 

Ms. Mattern:  So the PureEdge viewer is the technology?

 

Mr. MacKenzie:  One or two other companies have written their own. That also speaks to the following XFDL has attracted. A lot of people are creating tools to convert formats into XFDL.

 

Mr. Houser:  You know you’ve made it when Microsoft and Adobe use it.

 

Mr. Jacek:  They’ve bought it.

 

Slide 22  [XFDL—Quick Feature Overview (continued)]:  One of the reasons is that XFDL's Compute system is assertion-based. All other technologies use JavaScript, which is procedural. XFDL is state-based: it makes assertions and makes sure all dependencies and mechanisms are kept in synch.
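The procedural-versus-assertion distinction can be sketched minimally. The assumption here is that "assertion-based" means a derived field is declared once as a formula over other fields and always reflects current state when read; this is a toy illustration, not PureEdge's Compute engine.

```python
class Form:
    """Toy assertion-based compute model: declared formulas stay in sync."""

    def __init__(self):
        self.values = {}     # plain input fields
        self.computes = {}   # field -> formula (an assertion, not a procedure)

    def set(self, field, value):
        self.values[field] = value

    def assert_compute(self, field, formula):
        self.computes[field] = formula

    def get(self, field):
        if field in self.computes:
            return self.computes[field](self)   # re-evaluated on demand
        return self.values[field]

form = Form()
form.set("subtotal", 100)
form.set("fee", 8)
form.assert_compute("total", lambda f: f.get("subtotal") + f.get("fee"))

print(form.get("total"))      # 108
form.set("subtotal", 250)     # change an input...
print(form.get("total"))      # ...and the assertion still holds: 258
```

In a procedural (JavaScript-style) model, by contrast, `total` would only change when an event handler happened to recompute it, which is how dependencies drift out of sync.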

 

Mr. Ken Sall:  Can you compare and contrast the features in terms of presentation and bringing up and hiding parts…compare it to InfoPath from Microsoft?

 

Mr. MacKenzie:  I'm no InfoPath expert; it's a different technology. You take an XML chunk and use XSLT transformations to style it as a user interface, so there are two or three components at run-time that represent the user experience. How is it different? XFDL saves three or four pieces as one representation, so it's pretty different. In terms of dynamic capability, InfoPath is tied to schema instances. XFDL is at the presentation layer, on top of the schema instance. You can also put fields in the document that don't appear in the schema. For exact details on InfoPath, please talk to Microsoft, since I am not an InfoPath expert.

 

Mr. Sall:  We had a presentation on it. I wanted to see the contrast.

 

Mr. MacKenzie:  We’ve done XML for seven years. We’ve run into lots of problems and solved them. They’re starting down that path.

 

Mr. Sall:  Standalone?

 

Mr. MacKenzie:  Standalone or together. Either.

 

Mr. Ambur:  The first time I heard of XFDL, I wondered about the degree to which it might be a competitor to PDF. You mentioned that with XFDL, presentation is in one namespace and data in another. I think of PDF as doing essentially the same thing, except that its presentation instructions are not human-readable.

 

Mr. MacKenzie:  It’s difficult to parse, and it’s not human-readable. It’s difficult to get the information into a usable format. There’s also a benefit in XFDL technology in terms of its dynamic nature. You can dynamically change on the fly. You’re not locked down with the signature. PDF, since it’s about static presentation, is not about being dynamically changeable. It has difficulty doing it. For exact PDF details, it would be better to ask an Adobe guy.

 

Mr. Ambur:  I give Adobe a lot of credit for listening to their customers. Right now, they're working with NARA and U.S. Courts on an international standard for archiving records called PDF/A. They're stripping out things like scripting that cause records to be unreliable. Adobe obviously has a large installed base, and we should take advantage of that to leverage the enlightened use of XML. We're not here to promote one vendor or technology over another. We're here to foster understanding. Their technology is very good for what it is designed to do. It provides an exact replication of documents. XFDL is about dynamic integration of data, as well as controlling the presentation of documents as reliable records over lengthy periods of time.

 

Slide 23  [XFDL—Quick Feature Overview (continued)]:  XFDL is human-readable plain text. The semantics are known publicly. The specification is available to whoever wants it. XFDL also has a schema definition, so you can schema-validate an XFDL document as well. You can also validate the instances within their own namespaces. Finally, XFDL is entirely XML, and all of it can be parsed and processed with any XML-compliant tool.

 

Slide 24  [XFDL—Quick Feature Overview (continued)]:  There is choice and flexibility. Typically, you like to see the whole document submitted to the back end. In some cases, you’re not worried about transaction records, so the ability to submit only data is supported. You can submit all or part of the documents.

 

Slide 25  [XForms]:  Our technology relative to XForms: XForms is an emerging standard, and we're active in its development. One of the challenges of XForms is its large constituency: PDA [Personal Digital Assistant] and telephone vendors are in the group. They wanted to separate the presentation and the data so form data could be rendered on various non-traditional platforms. XForms loosely describes what the presentation and data should look like. So you might have a select widget; how it specifically looks on a form is up to another language to prescribe, so XFDL can serve as a presentation language for XForms data.

 

Separation of data and presentation does result in some problems. In XForms 1.1, I believe the W3C will be looking toward addressing this problem. Until then, XFDL is a great way to bring XForms data together, with an XML representation of the presentation in order to provide good records with a high degree of data interoperability.

 

Slide 26  [XForms Components]:  On top of the XForms model is a user interface. There needs to be a specific XML presentation mechanism, and that's why we talk a little about XForms.

 

Slide 27  [XFDL and XForms]:  XFDL supports arbitrary data instances. It can host an XForms data instance. The instance exists in its own namespace, and is bound to the XFDL presentation. The instance data can be easily extracted for use in other systems.

 

Mr. Houser:  Does XForms support the business logic?

 

Slide 28  [XFDL and XForms (continued)]:  It does. It uses XFDL-like computation, which supports simple computes. A senior engineer of mine, who is on the committee, contributed this idea based on XFDL. Unfortunately, because of the scope of XForms and the need to get version 1 of the standard complete, they only went one step down the path. Later, in [versions] 1.1 and 2, they will add a lot of compute power (out in the 2005, 2006 timeframe).

 

Mr. Houser:  It looks as if there are problems with a conflicting specification.

 

Slide 29  [XForms Future]:  The latest XFDL adopts some of the XForms technology. In the future, when the XForms standard is more mature and can meet all of a customer's real-world needs, XFDL will further embrace it. Ideally, XFDL will become a premium host language for XForms where security and dynamic capability are key requirements.

 

Mr. Jacek:  We authored the Compute engine used in XForms.

 

Mr. MacKenzie:  We also adjust our development of XFDL based on the XForms committee’s work.

 

Mr. Ambur:  I'd pose some of those questions to other vendors. They may not have any incentive to coalesce around XFDL as a standard. Indeed, it is in PureEdge's interest for other vendors not to coalesce around an XML-based standard for presentation, because that means they have the market to themselves. So it's incumbent upon us as users to push the other vendors to support such a standard.

 

Slide 30  [Structural Overview of XFDL]:  This is an XFDL document. It has XML tags; a namespace declaration; a form global area. It contains the two options, then pages of a presentation layer.

 

Slide 31  [Structural Overview of XFDL (continued)]:  Here’s an example of fields, radio buttons, etc., all prescribed through markup tags.

 

Slide 32  [Structural Overview of XFDL (continued)]:  Here’s an example of the options used: so, within the title, you can see how the XFDL marks up and explains information about this label. It tells us it’s Times New Roman [font style], 48 pitch [font size], and it’s bold. A human can look at it and understand. It also has the width, and tells us it’s center-justified. That’s how XFDL describes it.

 

Mr. Sall:  Why, now that we have Cascading Style Sheets [CSS]?

 

Mr. MacKenzie:  XFDL evolved over a long period of time. It’s just the way we do it. CSS is more for transformation.

 

Mr. Sall:  Font information is in the realm of CSS.

 

Mr. MacKenzie:  You would like to see CSS as a standard within XFDL?

 

Mr. Sall:  Yes.

 

Mr. MacKenzie:  That’s a good question. I’ll have to take that and think about it.

 

Mr. Houser:  Your style and font information is parsable by XML, which CSS is not.

 

Mr. Sall:  That doesn’t mean you can’t combine the two.

 

Mr. Houser:  I’m not sure whether there’s an advantage to being parsable.

 

Mr. Sall:  If you look at an XML file in Internet Explorer, it’s using CSS to present the XML to you.

 

Mr. MacKenzie:  A good example is, adding RTF to the language. We do have RTF support, but carry a plain text value because you may have a need to reintegrate the data.

 

Mr. Gorman:  We explicitly state within the document what it looks like. It’s all there—one document, explicitly stated.

 

Mr. Sall:  The disadvantage is, what if you have a bunch of forms with the same style information? Why repeat it?

 

Mr. Jacek:  It goes to Owen’s comment of the value of a piece of paper. For example, say you’re trying to capture a legal transaction. What are the legal challenges you face in the future in capturing the paper? When you start separating pieces, you start losing some of what the paper is saying.

 

Mr. MacKenzie:  Are you talking about the CSS in the document?

 

Mr. Sall:  A reference.

 

Mr. MacKenzie:  If it’s contained within the document, it’s no problem, but if it’s a reference there’s a problem, because when it’s signed, what happens to the reference? The source of the reference can change, possibly altering the transaction meaning, yet the signature does not break.

 

Mr. Houser:  You’re beholden to the source of the CSS.

 

Mr. Ambur:  In terms of records management, that’s just another risk factor that should be taken into account.

 

Mr. MacKenzie:  To answer why CSS is not directly embedded into XFDL…I need to get a better answer. If we are talking about CSS references, however, there is clearly a non-repudiation risk with that approach.

 

Mr. Bryan Quinn:  The concept is that you'd store a template that maintains the central structure of the form; when you distribute the instance, the structure is still there, but you still have a centralized place.

 

Mr. MacKenzie:  You have a template repository. The user fills it out, and as he does so, he's filling in value tags, adding markup, and signing his instance of the template. The record is signed and archived. The data can potentially be used by another system as well.

 

Mr. Ambur:  Steve Levenson, at U.S. Courts, has also pointed out that fonts are an issue. It may be necessary to embed them in the record to ensure they are available when needed to present the record at a later time exactly as it appeared when first created.

 

Slide 33  [Structural Overview of XFDL (continued)]:  Here are some other form things we can do.

 

Slide 34  [Structural Overview of XFDL (continued)]:  Then here’s a view of a Compute statement.

 

Slide 35  [Structural Overview of XFDL (continued)]:  Then here’s the decision logic. It can do an “if, then, else” thing.

 

Mr. Houser:  Do you rely on the object model to do the identification of the instance?

 

Mr. MacKenzie:  Yes.

 

Slide 36  [Structural Overview of XFDL (continued)]:  Then here’s the “Label” slide. This is the logo of British Columbia, Canada. The image is in the document.

 

Slide 37  [Pertinent Features of XFDL Structure]:  This is the idea of an attachment to a file. When you attach a binary format, you have to Base64-encode it, so that's the Base64 encoding of the logo. The whole thing is signed, and the attachment can later be decoded from Base64 so you can see it.
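The encode-sign-decode round trip the slide describes can be sketched with Python's standard `base64` module. The element and attribute names are invented for illustration; they are not the actual XFDL attachment markup.

```python
import base64
import xml.etree.ElementTree as ET

logo_bytes = b"\x89PNG...not a real image..."   # any binary payload

# Encode: binary -> Base64 text, which is safe inside an XML text node
# (only A-Z, a-z, 0-9, +, /, = appear in the output).
att = ET.Element("attachment", {"filename": "logo.png"})
att.text = base64.b64encode(logo_bytes).decode("ascii")

# The attachment now travels, and can be signed, as ordinary XML text.
serialized = ET.tostring(att)

# Decode: anyone with a generic XML parser can recover the original bytes.
recovered = base64.b64decode(ET.fromstring(serialized).text)
assert recovered == logo_bytes
```

Because the Base64 text is part of the document, a signature filter covering the attachment element locks the image down along with the rest of the record.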

 

Slide 38  [Pertinent Features of XFDL Structure (continued)]:  It provides a rich ability to define what parts of the document are signed by different signatures.

 

Mr. Ambur:  The previous slide reminds me of another relationship I want to raise.  I see that you used Base 64-encoding for the agency logo. Wouldn’t that be an appropriate application for SVG (Scalable Vector Graphics)?

 

Mr. MacKenzie:  You can put SVG or any other appropriate one.

 

Slide 39  [XFDL Digital Signature Filters]:  One point is that forms, if they are to be intelligent data capture agents, need to define granular filters. They might have dynamically changing documents, based on the logic within them or on the server side, so the ability to sign or not sign parts that are dynamically created is very important. So being able to specifically control what’s signed and locked down is a key element.

 

Slide 40  [Conclusion]:  XFDL is document centric; it contains assertion-based document-control language; it’s machine- and human-readable; and it’s an extensible language, so it can extend as the situation needs.

 

Mr. Kevin Williams:  I feel for the government guys, because with emerging technologies it's kind of "see which way the wind's blowing." It seems that the W3C is building out this suite of technologies. I was wondering whether XFDL has a life in that over the next two to three years?

 

Mr. MacKenzie:  SVG seems to be moving forward toward the business space. How it will evolve, I’m not sure. One of the things is how emerging technologies partner with XForms, so “as XForms develops, how do the presentation layers fit?” I have strong hope that XFDL will be a primary, premium presentation layer of XForms.

 

Mr. Williams:  Let me rephrase the question: Microsoft, a year from now, implements (in Internet Explorer) an XForms interpreter. You have legacy XFDL stuff. Is there a migration path from XFDL to XForms available to citizens?

 

Mr. MacKenzie:  Yes. That's a strength. Data from an XFDL document can be picked up and placed into other host languages, because the XML data can be structured directly to be in whatever form you want. Also, XFDL is a freely available specification, so if you want to translate into something else, you can do that.

 

Mr. Williams:  The viewer is free?

 

Mr. Jacek:  We have a couple kinds.

 

Mr. Williams:  For the citizen.

 

Mr. Jacek:  We have a rendition that’s free.

 

Mr. Houser:  Lots of our users have problems even with having to download the Adobe Reader.

 

Mr. Gorman:  I just want to mention that our file size is small, and our reader is small.

 

Mr. MacKenzie:  PureEdge will be promoting XFDL as an implementation approach for XForms data.

 

Mr. Gorman: At least one vendor has said it won’t comply with XForms.

 

Ms. Theresa Yee:  For what reason?

 

Mr. Gorman:  Their own proprietary reasons for the forms business and how it relates to product lines.

 

Mr. Ambur:  Might that be an especially large vendor?

 

Mr. Gorman:  Very large.

 

Mr. Ambur:  This relates to Kevin’s comments. We need to tell the vendors what we need. Mark Forman has made strong statements regarding the FEA. We need to make sure the FEA gets the logic right, and if and when it does, we need to insist that vendors adhere to it in order to do business with us. In moving forward, in this forum, we’re trying to foster a clear understanding of the potential of XML in relationship to business needs of government. I’ve been maintaining for years that all the software programming logic anyone needs to fill out any form and file any record anywhere in the world should be built into their desktop client in a standards-compliant way.  That is the vision PureEdge seems to be espousing too.

 

Mr. Jacek:  PureEdge grew up in that way—very active in standards. We don’t believe in hiding our technology. We do have a proprietary product, but it’s freely available at our website, and it’s noted at the W3C. One of the things we’re driving for is that it be free, standardized, JITC certified—things we’ve done. That plays into what the government has been saying for 20 years are the most important things.

 

Mr. Ambur:  Any questions or comments? Thanks, Keith.

 

Mr. Ambur:  I’m glad to know that the teleconferencing capability facilitated the needs of my co-chair. The last thing I’d like to mention is that I’ve been exploring the possibility of forming a community of practice around the potential to render the records-management metadata elements in an XML schema. Jim Whitehead, the “father” of WebDAV, has some graduate students who put together a draft. It needs vetting and review from subject matter experts. At this point I’d just like to throw out that notion for your information, but if any of you is interested in participating, please let me know.

 

Mr. Houser:  More in the RDF spirit of things than Topic Map?

 

Mr. Ambur:  Yes.  In the DoD standard, there's a requirement that the basic metadata set for records-management purposes can be extended to include user-defined elements. I believe there should be a small core set of records-management metadata for application government-wide. In fact, the DoD standard does set out such a set. Then individual agencies can extend that set for their own, more specialized purposes.

 

For those who are interested, we’re reconvening at about 1:15 p.m. for the Registry Project Team meeting. Joe Chuisano from Booz Allen Hamilton will be here to talk about the ebXML registry specification and Joel Patterson will talk about native XML databases.

 

 

End meeting.

 

Attendees:

 

Last Name      First Name   Organization
Ambur          Owen         FWS
Bach-y-Rita    David        Treasury
Barr           Annie        GSA
Burling        Dennis       Nebraska Dept Env Qual
Ellis          Lee          GSA
Fong           Elizabeth    NIST
Gill           Ken          DOJ
Gorman         Will         PureEdge
Hamby          Steve        Software AG
Hassam         Amin         I411
Hilt           Chris        Altum
Horneman       Steve        XAware
Houser         Walt         VA
Hunt           Kathy        SAIC
Jacek          Steve        PureEdge
Kanaan         Muhan        DynCorp
Kane           John         NARA
MacKenzie      Keith        PureEdge
Mattern        Robin        INET Purchasing
McCaslin       Carrie       NASA
McKennirey     Matthew      Conclusive
Niemann        Brand        EPA
Patterson      Joel         Software AG
Quinn          Bryan        Software AG
Royal          Marion       GSA
Sall           Ken          SiloSmashers
Stein          Brett        XAware
Talib          Iqbal        I411
Tang           Ning         Fujitsu Consulting
Troutman       Bruce        8020Data
Weber          Lisa         NARA
Williams       Kevin        BlueOxide
Yee            Theresa      LMI