U.S. Census Bureau

Spatial Data Storage and Topology in the Redesigned MAF/TIGER System

David Galdi

Abstract: Since the late 1980s, the Geography Division has utilized the Topologically Integrated Geographic Encoding and Referencing (TIGER®) System to provide geographic support for Census Bureau surveys, censuses, estimates, and partnership programs. While this system has served its purpose well, changing technology and requirements are dictating its replacement. The new system will utilize commercial off-the-shelf software, including Oracle Spatial and Oracle Spatial Topology Data Model. As with its predecessor, the new system will utilize persistent topology. It will also merge the largely non-spatial Master Address File (MAF) with the spatial TIGER database to form a new database. This paper examines issues related to topology and selection of a spatial data storage mechanism for the new database, as well as issues regarding integration of spatial and non-spatial data. It also addresses some of the early design and implementation decisions for the new system.


Background

The Census Bureau has a long history of innovation in data collection and processing techniques. The Geography Division of the Census Bureau has pioneered efforts in the field of automated cartography, including development of the Address Coding Guide in the 1960s, the GBF/DIME files in the 1970s, and the TIGER system in the 1980s (Marx 1986).

The TIGER system, still in use 20 years after its inception, utilizes topological data structures based on the landmark work by Corbett and White (1979, 1981), which relates the mathematics of graph theory and topology to the storage of geographic and cartographic data. A significant feature of the TIGER system was the ability to automate production of all geographic support products from a single integrated digital database (LaMacchia 1990). The TIGER system was built using a Database Management System (DBMS) called TIGERdb. Geography Division staff developed the TIGERdb DBMS, along with associated software to support spatial functionality and indexing, and automated maintenance of the topological data (Boudriault 1987). The use of topology, stored persistently in the database, improved the efficiency of spatial data retrieval and data storage and helped enforce implementation of data integrity and consistency rules.


MAF/TIGER Redesign

While the TIGER system has served its purpose well, the time has come for a change. The homegrown database system does not integrate well with current commercial off-the-shelf (COTS) tools and Web technology. It is cumbersome to change, difficult for new developers to learn, does not allow multi-user access, and is not accessible via a standard query language.

The MAF/TIGER Redesign project, which is one of the five components of the MAF/TIGER Enhancement Program (MTEP), will result in a seamless national integrated database that will utilize commercial database software in place of TIGERdb. To the extent possible, homegrown applications will be replaced by COTS software.

Oracle's relational database management system will be used to store all data in the new database. Spatial data management, including spatial operations and indexing, will utilize Oracle Spatial. The Oracle Spatial Topology Data Model will store and manage geographic features and the topological data structures on which they are built.

MAF/TIGER Functionality and Content

The MAF/TIGER system provides geographic services in support of Census Bureau surveys, censuses, estimates, and partnership programs. These services include:

Provision of these services requires ongoing maintenance of the MAF and TIGER databases, which are being combined as a part of the redesign. The content of these databases is described in the following sections.

TIGER

The TIGER Database includes geographic features such as roads, railroads, geographic areas, landmarks, waterways, and other geographic information that is needed to support the programs of the Census Bureau. The database serves as the repository for all of the geographic information needed for census and survey data collection, data tabulation, data dissemination, geocoding services, geographic and statistical analysis, and production of maps.

MAF

The MAF is designed to be an accurate, national residential address inventory and supports data collection efforts and questionnaire deliveries to each residence. It provides for storage of a mailing address to support questionnaire mail-out, as well as location information to support personal interviews and address canvassing operations. Location information might consist of a city-style address, a latitude/longitude coordinate, a census block number, an E-911 address, and/or a textual location description.

While the existing TIGER system contains primarily spatial data, the integrated MAF/TIGER system will combine TIGER with the largely non-spatial MAF database. This mirrors a common trend in the Information Technology (IT) and Geographic Information System (GIS) industries: the integration of spatial and non-spatial data into a single enterprise data set.

Spatial Data in a Relational Database Management System

Historically, spatial data and GIS applications have been developed as stand-alone systems for spatial analysis and map production. Often, the spatial data in these "stovepipe" systems are stored in a separate, proprietary database, on separate hardware, and require proprietary data update and access tools, as well as separate archiving, system maintenance, and tuning. For many organizations, GIS projects represent a major integration challenge consuming a large portion of the IT budget. Much of this is directed towards integration between spatial and non-spatial systems (Batty 2004).

Over the last decade, however, isolated "stovepipe" GIS systems have begun to be integrated with mainstream IT and used to support location awareness in additional business applications. Database vendors have partnered with GIS vendors to make a concerted effort to ensure that spatial data can be blended seamlessly into the enterprise database (Gonzales 2000). When the Relational Database Management System is expanded to handle spatial data, core database capabilities, such as scalability, security, versioning, and replication, can be extended to spatial datasets. An open spatial database also allows for spatial enabling of many enterprise applications with associated improved functionality (Weinberger 2002).

One of the key components of incorporating spatial functionality into an RDBMS is implementation of spatial data types. Standard RDBMS attributes support data types such as character, date, and integer. In a spatial database, additional data types are required that can be used to represent point, line, and area features. In addition to data types, spatial databases must provide support for spatial indexing and clustering, as well as spatial operators (Van Oosterom 2002, Batty 2004). Spatial operators include:

The OpenGIS Consortium (OGC) recognized the importance of the integration of spatial data into the IT mainstream by standardizing the basic spatial data types and functions in the Simple Feature Specification (Van Oosterom 2002). The OGC Simple Feature Specification defines a "standard SQL schema language that supports storage, retrieval, query, and update of simple geospatial features via the ODBC API" (Open GIS Consortium 1999). Simple features are based on 2D geometry and have both spatial and non-spatial attributes.
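As a rough illustration of what spatial data types and a spatial operator amount to, the following Python sketch defines point, line, and area types together with a length function and a containment test. The class and method names are hypothetical; they are not the OGC schema or the Oracle Spatial API, and the polygon is assumed to be a single ring without holes.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Coord = Tuple[float, float]


@dataclass(frozen=True)
class Point:
    x: float
    y: float


@dataclass(frozen=True)
class LineString:
    coords: Tuple[Coord, ...]

    def length(self) -> float:
        # Sum of the straight-line distances between consecutive coordinates.
        return sum(math.dist(a, b) for a, b in zip(self.coords, self.coords[1:]))


@dataclass(frozen=True)
class Polygon:
    ring: Tuple[Coord, ...]  # exterior ring; closure back to the first point is implied

    def contains(self, p: Point) -> bool:
        # Even-odd ray casting: count ring segments crossed by a ray going right from p.
        inside = False
        n = len(self.ring)
        for i in range(n):
            (x1, y1), (x2, y2) = self.ring[i], self.ring[(i + 1) % n]
            if (y1 > p.y) != (y2 > p.y):
                x_cross = x1 + (p.y - y1) * (x2 - x1) / (y2 - y1)
                if p.x < x_cross:
                    inside = not inside
        return inside
```

A spatially enabled database exposes comparable types and operators through SQL, backed by spatial indexes, rather than through application code of this kind.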

Oracle, the database vendor selected for the MAF/TIGER Redesign, has been a leader in integration of spatial data and functionality into the RDBMS, maintaining a dominant share of the market for geospatial database management (IDC 2003). Utilization of Oracle and Oracle Spatial for the redesigned MAF/TIGER database will ensure a single database in which spatial data is stored seamlessly with the associated feature attributes and with non-spatial datasets. This will allow for an integrated and improved approach to scalability and database management that includes replication, archiving, and tuning. It will also provide for spatial enabling of enterprise data that includes the MAF. Oracle Spatial adheres to the OpenGIS Simple Features Specification for SQL, Revision 1.0, Normalized Geometry (Oracle Corporation 2004).


What is Topology?

Topology involves the mathematical study of spatial relationships. It describes the characteristics of a geometric figure that do not change under continuous transformation. In a graph, the number of line segments, intersection points, and polygons, and their relationship to each other, are constant as the plane in which they exist is stretched or distorted. In GIS applications, topology is the means to describe, manage, and retrieve these relationships explicitly without resorting to time-consuming spatial comparisons (Ramage and Woodsford 2002).

The principles of topology are utilized to implement a system that provides for:

Some GIS applications use "persistent topology", that is, they structure the data according to topological principles so that the topological relationships are stored and available persistently in the database. In addition to the above benefits, this persistent topology approach also provides for:


Persistent vs. On-the-Fly Topology

An alternative to persistent topology is called "on-the-fly" topology, in which intelligent client applications compute topology as needed for selected subsets of the database. This can potentially provide most of the benefits of persistent topology, although it does not address the redundant data storage issue. An additional drawback can be the time that it takes to determine the topological relationships as they are needed. The choice of "on-the-fly" or persistent topology would appear to depend largely on the nature of the GIS system being utilized and maintained. The majority of processing for the MAF/TIGER system is spent on very large batch processes that run on the whole nation and utilize topology to improve performance. Continual "on-the-fly" recalculation of topology to support these programs could prove problematic. In addition, due to the high degree of interdependency between different feature types in the TIGER system, editing becomes more straightforward with a system that minimizes coordinate redundancy.

The redesigned MAF/TIGER system will use the persistent topology data structure that is part of Oracle Spatial, starting with the release of Oracle 10g. This system, called Oracle Spatial Topology Data Model, provides persistent topology to support batch or interactive applications. In addition, since it is implemented via a server-side topology engine, this solution becomes much more interoperable.

Thick, topologically aware clients are not required, so any client applications, even thin web-based ones, can be used to update the database, while the topology engine manages the spatial updates and maintains the topological data structure (Lessware 2004).


The Building Blocks of Topology¹

Topology works by dividing spatial data into low-level primitives, which form the building blocks for spatial data. These building blocks include the following components:

The edge is the central component of two-dimensional topology. It is a linear or one-dimensional construct that has a starting point and an ending point. The end points are referred to as connecting nodes. An arbitrary direction is assigned to each edge, allowing designation of one of the nodes as the Start Node, and the other as the End Node. The edge may be defined by the line segment connecting these two nodes or it may have intermediate points called vertices. See Figure 1.

Figure 1. Edges: a straight-line edge with no vertices, and an edge with vertices (shape points).
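The edge description above can be sketched as a small data structure. This is an illustrative Python model only; the class and field names are assumptions and do not reflect the storage layout of TIGERdb or Oracle Spatial Topology Data Model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Coord = Tuple[float, float]


@dataclass(frozen=True)
class Node:
    node_id: int
    xy: Coord


@dataclass
class Edge:
    edge_id: int
    start_node: Node  # the assigned direction runs from start_node to end_node
    end_node: Node
    vertices: List[Coord] = field(default_factory=list)  # intermediate shape points

    def coords(self) -> List[Coord]:
        # Full geometry: start node, any vertices in order, then end node.
        return [self.start_node.xy, *self.vertices, self.end_node.xy]


n1, n2 = Node(1, (0.0, 0.0)), Node(2, (10.0, 0.0))
straight = Edge(100, n1, n2)                                    # edge with no vertices
shaped = Edge(101, n1, n2, vertices=[(3.0, 2.0), (7.0, 2.0)])   # edge with vertices
```

Both edges connect the same two nodes; only the second carries shape points between them.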

A face corresponds to a simple polygon bounded by edges, less any "holes" created by the formation of polygons within its boundaries. The formal definition of the face also includes any interior edges that have the face on both the left and right side. See Figure 2.

Figure 2. Faces: three polygons, with and without holes and with interior edges.

Because the face is built from edges, no coordinates are stored explicitly for the purpose of representing a face. Rather, the coordinates are stored at the edge level. In order to determine the coordinate geometry of a face, the geometries of the edges that bound the face must be retrieved.
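A minimal sketch of how a face's coordinate geometry might be assembled from its bounding edges follows. It assumes each edge records the face on its left and on its right, that the face has a single outer ring with no holes, and that no node is visited twice by the boundary; all names are hypothetical and the chaining logic is one simple approach, not the method used by TIGER or Oracle.

```python
from typing import List, NamedTuple, Tuple

Coord = Tuple[float, float]


class Edge(NamedTuple):
    edge_id: int
    coords: Tuple[Coord, ...]  # start node, vertices..., end node
    left_face: int
    right_face: int


def face_ring(face_id: int, edges: List[Edge]) -> List[Coord]:
    # Orient every bounding edge so that the face lies on its left.
    pieces = []
    for e in edges:
        if e.left_face == e.right_face:
            continue  # interior edge: same face on both sides, not part of the boundary
        if e.left_face == face_id:
            pieces.append(list(e.coords))
        elif e.right_face == face_id:
            pieces.append(list(reversed(e.coords)))
    if not pieces:
        raise ValueError("no bounding edges found for face")
    # Chain the oriented pieces end to start until the ring closes.
    ring = pieces.pop(0)
    while pieces:
        for i, piece in enumerate(pieces):
            if piece[0] == ring[-1]:
                ring.extend(piece[1:])
                del pieces[i]
                break
        else:
            raise ValueError("bounding edges do not form a single closed ring")
    return ring


# A unit square (face 1) surrounded by an outer face 0; edge 3 has one vertex.
SQUARE_EDGES = [
    Edge(1, ((0.0, 0.0), (1.0, 0.0)), left_face=1, right_face=0),
    Edge(2, ((1.0, 1.0), (1.0, 0.0)), left_face=0, right_face=1),
    Edge(3, ((1.0, 1.0), (0.5, 1.2), (0.0, 1.0)), left_face=1, right_face=0),
    Edge(4, ((0.0, 0.0), (0.0, 1.0)), left_face=0, right_face=1),
]
```

Calling `face_ring(1, SQUARE_EDGES)` walks the four edges in order, reversing edges 2 and 4, and returns a closed ring that starts and ends at (0.0, 0.0).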

Topologically consistent datasets have the following properties (Kainz 2004):

These rules apply to systems with full "polygonal" or "planar" two-dimensional topology, such as is used by the legacy TIGER system and the new MAF/TIGER system. Some applications use less rigorous types of topology. Examples are Network Topology, which consists of only nodes and edges, but not faces (Ramage and Woodsford 2002), or Winged-Edge Topology, which utilizes only edges and faces (Baumgart 1975).

Another component of topology that is recognized by Oracle Spatial Topology Data Model, but is not covered in the formal topology rules described above, is the isolated node. This is a point that does not have any attached edges. An isolated node might be used to represent a housing unit, a nursing home, or a mountain peak. It is commonly required to identify the relationships between these isolated nodes and other spatial features. For example, one might want to identify the closest road to a structure, or all of the point features within a face or areal feature. If the isolated nodes are not integrated into the topology, these queries can require significant spatial searching. The Topology Data Model allows isolated nodes to be integrated into a topology layer and linked to faces.

For the redesigned system, consideration was given to storing the isolated nodes in a separate topology layer, so that they would not interfere with updates to the linear network, especially with regard to snapping and tolerance rules. However, the decision was made to store isolated nodes in the same layer as the other topology components. In addition to the more onerous spatial searching that would be required if isolated nodes were in a separate layer, there could be other disadvantages related to the use of the geometric update methods provided with Oracle Spatial Topology Data Model. When geometric features are moved or reshaped, it is useful to know if the features have "crossed over" housing units, causing a housing unit to end up in a different relative location to the feature. If housing units are stored as isolated nodes and integrated with the other spatial data, then Oracle Spatial Topology Data Model will be aware of this change in the relationship of the road to the housing units, and can inform the calling application, providing the opportunity to abort the change. If the housing units are just stored as coordinates or are isolated nodes in a separate layer, the geometric methods will be unaware of their location or the change in relationship caused by the move. See Figure 3.

Figure 3. Move Edge: an edge shown before and after a move.
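The "crossed over" condition can be illustrated with a small geometric check. The sketch below is not how Oracle Spatial Topology Data Model detects the condition (the paper does not describe that); it is one simple technique, assuming the edge keeps its two connecting nodes and only its shape changes. Any isolated node that falls inside the region swept between the old and new geometry has changed sides. Function names and sample data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float]


def point_in_ring(pt: Coord, ring: Sequence[Coord]) -> bool:
    # Even-odd ray casting; the ring is implicitly closed.
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def crossed_nodes(old: Sequence[Coord], new: Sequence[Coord],
                  isolated_nodes: Dict[int, Coord]) -> List[int]:
    # Assumption: the move keeps both connecting nodes fixed.
    if old[0] != new[0] or old[-1] != new[-1]:
        raise ValueError("this sketch only handles reshaping between fixed end nodes")
    # Swept region: out along the old shape, back along the new shape.
    swept = list(old) + list(reversed(new))[1:-1]
    return sorted(nid for nid, xy in isolated_nodes.items() if point_in_ring(xy, swept))


old_shape = [(0.0, 0.0), (10.0, 0.0)]
new_shape = [(0.0, 0.0), (5.0, 4.0), (10.0, 0.0)]
housing_units = {1: (5.0, 2.0), 2: (5.0, 6.0), 3: (5.0, -1.0)}
```

With this sample data only housing unit 1 lies between the old and new road shapes, so an application could use the result to warn the operator or abort the move.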


Topology Differences Between the Legacy TIGER System and Oracle Spatial Topology Data Model

The storage of topology by Oracle's Spatial Topology Data Model is very similar to the legacy TIGER system. There are, however, a few differences:


Topology Mechanics

A distinct advantage of using topology derives from the relationships that are maintained between the "topological primitives", i.e., nodes, edges, and faces. These relationships speed queries concerned with adjacency, connectivity, and containment of features or topological primitives. For example, from a connecting node, it is straightforward to retrieve all of the edges that connect to it, through a simple database query. Similarly, it is a simple query to retrieve the faces on either side of an edge as well as to retrieve all of the edges belonging to a face. More complex queries are also fairly straightforward, such as retrieving all the faces that share a given node or all the faces adjacent to a given face. Similar operations are available to determine adjacency, connectivity, and containment of features. Again, these are available as direct database queries, without having to resort to expensive spatial data searches.
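The kinds of queries described above can be sketched against a tiny in-memory edge table. The layout (each edge storing its start node, end node, left face, and right face) follows the description in this paper, but the field and function names are hypothetical, and the linear scans stand in for what would be indexed lookups in a database.

```python
from typing import List, NamedTuple, Set, Tuple


class Edge(NamedTuple):
    edge_id: int
    start_node: int
    end_node: int
    left_face: int
    right_face: int


# Two adjacent square faces (1 and 2) inside an outer face 0.
# Edge 7 is the shared boundary between faces 1 and 2.
EDGES: List[Edge] = [
    Edge(1, 1, 2, 1, 0),
    Edge(2, 2, 3, 2, 0),
    Edge(3, 3, 4, 2, 0),
    Edge(4, 4, 5, 2, 0),
    Edge(5, 5, 6, 1, 0),
    Edge(6, 6, 1, 1, 0),
    Edge(7, 2, 5, 1, 2),
]


def edges_at_node(node_id: int) -> Set[int]:
    return {e.edge_id for e in EDGES if node_id in (e.start_node, e.end_node)}


def faces_of_edge(edge_id: int) -> Tuple[int, int]:
    e = next(e for e in EDGES if e.edge_id == edge_id)
    return (e.left_face, e.right_face)


def edges_of_face(face_id: int) -> Set[int]:
    return {e.edge_id for e in EDGES if face_id in (e.left_face, e.right_face)}


def adjacent_faces(face_id: int) -> Set[int]:
    # Faces on the other side of any edge that bounds face_id.
    result = set()
    for e in EDGES:
        if e.left_face == face_id:
            result.add(e.right_face)
        elif e.right_face == face_id:
            result.add(e.left_face)
    result.discard(face_id)
    return result


def faces_at_node(node_id: int) -> Set[int]:
    return {f for e in EDGES if node_id in (e.start_node, e.end_node)
            for f in (e.left_face, e.right_face)}
```

None of these functions touches a coordinate: adjacency, connectivity, and containment are answered purely from the stored identifiers.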

Egenhofer, Frank, and Jackson (1989) enumerated all the possible topological relationships that can occur between two spatial entities by examining the combinations of the intersections of the boundaries and interiors of the two objects. The eight possible relationships for objects of the same dimension are:

Calculation of these relationships and other functionality is possible in a database that does not utilize topology and stores a complete set of coordinates explicitly to represent each feature. However, calculations in such a database require spatial searches and comparisons rather than more efficient database queries. With the refined use of spatial indices, such as R-trees, spatial searches are getting more efficient, but they still are significantly more time-consuming than other types of data access.

In addition to faster retrieval and more efficient data storage, topology also provides for more efficient and effective data cleansing, error detection, and data integrity. If the topology is stored persistently, storage of redundant data is greatly reduced or eliminated, which simplifies enforcement of data consistency. Topology allows for easy detection of gaps and overshoots, and helps prevent inadvertent overlap of areal features. It also facilitates implementation of a snapping and tolerance system to avoid slivers, arbitrarily close nodes, very small polygons, and so forth.
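As one example of topology-based error detection, an overshoot or undershoot typically leaves a node that is attached to exactly one edge. The sketch below flags such nodes from the edge table alone. The function name and sample data are hypothetical, and a real check would need business rules to distinguish digitizing errors from legitimate dead ends such as cul-de-sacs.

```python
from collections import Counter
from typing import Dict, List, Tuple


def dangling_nodes(edges: Dict[int, Tuple[int, int]]) -> List[int]:
    # edges maps edge_id -> (start_node, end_node).
    degree = Counter()
    for start, end in edges.values():
        degree[start] += 1
        degree[end] += 1
    # A node touched by exactly one edge end is a candidate overshoot or gap.
    return sorted(node for node, count in degree.items() if count == 1)


# A closed triangle (nodes 1, 2, 3) with a short spur from node 2 to node 4.
sample_edges = {1: (1, 2), 2: (2, 3), 3: (3, 1), 4: (2, 4)}
```

Here `dangling_nodes(sample_edges)` reports node 4, the free end of the spur, without any coordinate comparison.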


Examples of Topology Use at the Census Bureau

Topology is utilized to a significant degree by many of the applications that comprise the Census Bureau's Geographic Support System. The following are some examples.

Areal Delineation

Areal Delineation programs, such as Automated Block Numbering and Automated Assignment Area Delineation, make extensive use of topology, "walking around" nodes or faces, and examining adjacent or connected primitives to determine the prospects for adding additional territory to a candidate block or assignment area.
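The "walking" idea can be sketched as growing a candidate area one adjacent face at a time. The actual delineation rules are not given in this paper; the stopping criterion below (a target housing unit count), the breadth-first order, and all names are assumptions made purely for illustration.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def grow_area(seed: int, adjacency: Dict[int, Set[int]],
              housing_units: Dict[int, int], target: int) -> Tuple[List[int], int]:
    # Start from a seed face and add adjacent faces until the target is reached.
    area = [seed]
    total = housing_units[seed]
    seen = {seed}
    queue = deque(sorted(adjacency[seed]))
    while queue and total < target:
        face = queue.popleft()
        if face in seen:
            continue
        seen.add(face)
        area.append(face)
        total += housing_units[face]
        # Newly reachable neighbors come from the topology, not from a spatial search.
        queue.extend(sorted(adjacency[face] - seen))
    return area, total


face_adjacency = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3}}
units_per_face = {1: 10, 2: 15, 3: 5, 4: 40}
```

Starting from face 1 with a target of 25 housing units, the sketch adds face 2 and stops with a total of exactly 25.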

Interactive Update Software

A common function in interactive update software is for the operator to select a feature on the map. This requires an initial spatial search to determine the closest node or edge to the selected point, or the containing face. However, once the appropriate topology primitive has been determined, all associated or adjacent features can be readily found via topological relationships, without having to do additional spatial queries.

Mapping

Spatial Data Matching

Spatial data matching, or conflation, makes extensive use of topology. The topological primitives play a central role in feature match recognition (Saalfeld 1993). The Geography Division uses spatial data matching software to support digital exchange with local governments and other sources, and to upload data received as part of the MAF/TIGER Accuracy Improvement Project (MTAIP), which is designed to improve the accuracy of coordinate locations of the road centerline spatial features in TIGER. Permanent identifiers assigned to edges (TIGER/Line IDs) also play a critical role in MTAIP and spatial data exchange activities.

Quality Assurance

Topology can be utilized to identify and correct or report anomalies in the dataset based on entity-specific rules for adjacency, coincidence, etc.

"I" of the TIGER (Integration)

The most straightforward way to represent spatial features in a GIS would be to independently store the geometry for each feature. "Geometry" in this context refers to the entire ordered set of latitude/longitude coordinates that represent the location and extent of the feature. In this scenario, features that share geometry would overlay each other resulting in redundant coordinate storage. However, independent or layered representation of the geometry of spatial features is not optimal if there is a high degree of interrelationship among features and many features share underlying coordinate strings or polygons. For example, roads, rivers, and other linear geography in the TIGER database often also serve as boundaries for geographic areas, such as places or counties. In addition, the Census Bureau manages and maintains the boundaries of over 75 different types of tabulation and collection geographic areas. And for many of these, multiple vintages must be maintained simultaneously. These areas often share portions of their boundaries with each other and/or with linear features. For example, for a given section of a road that serves as a county boundary, it would not be unusual for it to also comprise a boundary for some or all of the following:

If a layered approach were used, the independent storage of geometry for each feature would result in the string of coordinates, representing the section of road, being stored multiple times. If the shape of the road needs to be adjusted or corrected, the geometries for the geographic areas that follow the road must be rebuilt to reflect the change. Alternatively, if topology is used to organize spatial data, the coordinates for a section of road that also bounds geographic areas are only stored once, and the correction becomes more straightforward, although it is still subject to business rules that govern feature consistency.
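The difference between the two approaches can be made concrete with a small sketch. The feature names, edge identifiers, and helper functions are hypothetical; the point is only that features reference shared edges rather than carrying their own copies of the coordinates.

```python
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float]

# Coordinates are stored once, at the edge level.
EDGE_COORDS: Dict[int, List[Coord]] = {
    101: [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0)],  # road section that is also a boundary
    102: [(2.0, 0.0), (2.0, 3.0)],
}

# Features hold references to edges, not coordinates.
FEATURE_EDGES: Dict[str, List[int]] = {
    "road": [101],
    "county boundary": [101],
    "tract boundary": [101, 102],
}


def feature_geometry(name: str) -> List[List[Coord]]:
    return [EDGE_COORDS[edge_id] for edge_id in FEATURE_EDGES[name]]


def topological_coordinate_count() -> int:
    # Each edge's coordinates are counted once.
    return sum(len(coords) for coords in EDGE_COORDS.values())


def layered_coordinate_count() -> int:
    # What independent per-feature storage would hold: one copy per feature.
    return sum(len(EDGE_COORDS[e]) for edges in FEATURE_EDGES.values() for e in edges)


def reshape_edge(edge_id: int, new_coords: Sequence[Coord]) -> None:
    # A single update; every feature built on this edge sees the new shape.
    EDGE_COORDS[edge_id] = list(new_coords)
```

In this example the layered approach would store 11 coordinate pairs against 5 for the topological one, and a call to `reshape_edge(101, ...)` corrects the road, the county boundary, and the tract boundary together.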

Features

Features in the new MAF/TIGER database will be built upon the topological primitives. The feature records will contain name, class code, and other attributes, while the location information will be managed via linkage to the topological primitives. All of this linkage occurs through a central relationship table managed by Oracle Spatial Topology Data Model, which allows a many-to-many relationship between features and topological primitives. In general, the spatial features will be purely areal (comprised of faces), linear (comprised of edges), or point (comprised of nodes). However, complex features comprised of more than one primitive type are allowed. So, a river could be represented as a linear feature near its source, while changing to an areal feature as it widens.
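A many-to-many relationship table of this kind can be sketched as a list of rows linking a feature to a primitive. The column names and sample rows below are illustrative assumptions and are not the schema that Oracle Spatial Topology Data Model actually uses.

```python
from typing import List, NamedTuple, Set, Tuple


class Relation(NamedTuple):
    feature_id: str
    primitive_type: str  # "node", "edge", or "face"
    primitive_id: int


RELATIONS: List[Relation] = [
    Relation("river", "edge", 11),   # narrow stretch near the source
    Relation("river", "edge", 12),
    Relation("river", "face", 7),    # wide stretch represented as an area
    Relation("county", "face", 7),
    Relation("county", "face", 8),
    Relation("housing unit", "node", 31),
]

KIND = {"node": "point", "edge": "linear", "face": "areal"}


def primitives_of(feature_id: str) -> Set[Tuple[str, int]]:
    return {(r.primitive_type, r.primitive_id) for r in RELATIONS
            if r.feature_id == feature_id}


def features_on(primitive_type: str, primitive_id: int) -> Set[str]:
    return {r.feature_id for r in RELATIONS
            if (r.primitive_type, r.primitive_id) == (primitive_type, primitive_id)}


def feature_kind(feature_id: str) -> str:
    # A feature built from more than one primitive type is treated as complex.
    types = {ptype for ptype, _ in primitives_of(feature_id)}
    return KIND[types.pop()] if len(types) == 1 else "complex"
```

Face 7 belongs to both the river and the county, and the river spans edges and a face, which is exactly the many-to-many and complex-feature situation described above.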

In the legacy TIGER database, much of the attribution is assigned at the topological primitive level to nodes, edges, and faces. The redesigned MAF/TIGER makes greater use of higher level features and assigns attributes to these features where possible. However, certain attributes vary over the extent of a feature and are more appropriately stored at the topological primitive level. Examples of such edge attributes are the Permanent Edge Identifier (TIGER/Line ID), Number of Lanes (for roads), and Track Type (for railroads). For faces, attributes at the primitive level include Internal Point (of the face), Permanent Face Identifier, and Land/Water Flag.

Because Oracle Spatial Topology Data Model manages the creation of edges and faces, the user cannot readily modify these topological primitives when attributes need to be stored at the primitive level. For the redesigned MAF/TIGER implementation, the Geography Division has utilized Oracle's Feature Management API to create auxiliary edge, node, and face "feature" tables that mirror the topology primitive tables. The staff is also developing software to manage the maintenance of the attributes on these mirrored node, edge, and face tables. This maintenance software is invoked whenever Oracle Spatial Topology Data Model initiates updates to the node, edge, and face tables, thereby keeping the mirrored tables consistent with the base topology and assuring that the attributes on these mirrored tables are maintained in real time. Business rules manage the attribution on these mirrored tables, including assignment and tracking of permanent IDs (e.g., TIGER/Line ID) whenever edges are split or merged.
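The following sketch shows what maintenance of permanent IDs on a mirrored edge table might look like when edges are split or merged. The paper does not state the actual business rules; the rule used here (one child keeps the parent's ID, the other receives a new one, and a merge keeps the first edge's ID) is purely an assumption, as are the class and method names.

```python
from typing import Dict, List, Tuple


class PermanentIdRegistry:
    """Hypothetical mirror of edge_id -> permanent ID, updated on topology events."""

    def __init__(self, first_new_id: int) -> None:
        self._next_id = first_new_id
        self.permanent_id: Dict[int, int] = {}
        # Each history row: (event, permanent IDs before, permanent IDs after).
        self.history: List[Tuple[str, Tuple[int, ...], Tuple[int, ...]]] = []

    def _allocate(self) -> int:
        pid = self._next_id
        self._next_id += 1
        return pid

    def register(self, edge_id: int) -> int:
        pid = self._allocate()
        self.permanent_id[edge_id] = pid
        return pid

    def on_split(self, parent_edge: int, child_a: int, child_b: int) -> None:
        # Assumed rule: child_a inherits the parent's permanent ID, child_b gets a new one.
        parent_pid = self.permanent_id.pop(parent_edge)
        new_pid = self._allocate()
        self.permanent_id[child_a] = parent_pid
        self.permanent_id[child_b] = new_pid
        self.history.append(("split", (parent_pid,), (parent_pid, new_pid)))

    def on_merge(self, edge_a: int, edge_b: int, merged_edge: int) -> None:
        # Assumed rule: the merged edge keeps edge_a's permanent ID.
        pid_a = self.permanent_id.pop(edge_a)
        pid_b = self.permanent_id.pop(edge_b)
        self.permanent_id[merged_edge] = pid_a
        self.history.append(("merge", (pid_a, pid_b), (pid_a,)))
```

In a design like the one described, methods such as `on_split` and `on_merge` would be invoked whenever the topology engine changes the primitive tables, so the mirrored attributes never lag behind the base topology.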

Hierarchical Features

Areal features are represented as groups of faces and the faces are, as stated previously, represented by their bounding edges. However, areal features are often hierarchical in nature. For example, states are comprised of counties, and counties are comprised of county subdivisions. In order to improve efficiency and performance for these hierarchical features, Oracle Spatial Topology Data Model allows geographic areas to be defined in terms of other geographic areas, rather than defining them directly in terms of faces. See Figure 4. For these hierarchically defined features, the same API is used to query and manipulate them that is used for the non-hierarchical features. The fact that they are stored differently is transparent to the developer.

Figure 4. Geographic Hierarchies (illustration of geographic hierarchical layers).
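A hierarchical definition of this kind can be sketched as areas that reference child areas, with only the lowest level referencing faces directly. The area names, face numbers, and function name are hypothetical, and the recursion merely illustrates how a single call can hide the difference in storage.

```python
from typing import Dict, List, Set

# Higher-level areas are defined in terms of other areas.
CHILD_AREAS: Dict[str, List[str]] = {
    "state": ["county A", "county B"],
    "county A": ["cousub A1", "cousub A2"],
    "county B": ["cousub B1"],
}

# Only the lowest level is defined directly in terms of faces.
AREA_FACES: Dict[str, Set[int]] = {
    "cousub A1": {1, 2},
    "cousub A2": {3},
    "cousub B1": {4, 5},
}


def faces_of(area: str) -> Set[int]:
    # Same call for hierarchical and non-hierarchical areas.
    if area in AREA_FACES:
        return set(AREA_FACES[area])
    faces: Set[int] = set()
    for child in CHILD_AREAS[area]:
        faces |= faces_of(child)
    return faces
```

If a face moves from one county subdivision to another within the same county, only the lowest-level rows change; the county and state definitions are untouched.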

Product Creation Database

The MAF/TIGER database, also known as the Transaction Database, will be optimized for batch and interactive updates, with a minimum amount of redundant data. However, such a database is not necessarily optimal for data extraction, mapping, and creation of other geographic products and services. For example, in the Transaction Database, consider a geographic area that is stored as a collection of faces. In order to determine the "geometry" of its boundary, the application software must:

From the developer's perspective, this is simple because the Oracle Spatial Topology Data Model API provides a single call to do it. However, the processing time to determine boundaries for large or complex geographic areas can be significant. It is inefficient if every application that needs this boundary has to re-calculate it, especially if the boundary has not changed. That is essentially how the current system works and it can be problematic, especially for mapping applications (Trainor 2003). For this reason, a Product Creation Database will be generated that contains the information from the MAF/TIGER Transaction Database, plus additional calculated information, including explicitly stored boundaries for geographic areas, as well as explicit geometric representations of linear features such as roads. The Product Creation Database will be replicated or created from the Transaction Database on a regular basis. The exact nature and format of the Product Creation Database has not been determined and it could depend on the selection of COTS application tools for mapping and other applications.
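The trade-off can be sketched in a few lines. One common way to derive an area's boundary from its faces is to keep the edges that have exactly one side inside the area; storing the result avoids repeating that work for every application. This is an illustration under assumed names and data, not the Oracle API call or the eventual Product Creation Database design.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Edge(NamedTuple):
    edge_id: int
    left_face: int
    right_face: int


# Faces 1 and 2 share edge 7; face 0 is the surrounding outer face.
EDGES: List[Edge] = [
    Edge(1, 1, 0), Edge(2, 2, 0), Edge(3, 2, 0), Edge(4, 2, 0),
    Edge(5, 1, 0), Edge(6, 1, 0), Edge(7, 1, 2),
]


def boundary_edge_ids(area_faces: FrozenSet[int], edges: List[Edge]) -> List[int]:
    # An edge is on the boundary when exactly one of its sides is inside the area.
    return sorted(e.edge_id for e in edges
                  if (e.left_face in area_faces) != (e.right_face in area_faces))


class BoundaryStore:
    """Hypothetical stand-in for explicitly stored boundaries."""

    def __init__(self, edges: List[Edge]) -> None:
        self._edges = edges
        self._stored: Dict[FrozenSet[int], List[int]] = {}
        self.computations = 0  # how many times a boundary was actually derived

    def boundary(self, area_faces: FrozenSet[int]) -> List[int]:
        if area_faces not in self._stored:
            self.computations += 1
            self._stored[area_faces] = boundary_edge_ids(area_faces, self._edges)
        return self._stored[area_faces]
```

Repeated requests for the same area are served from the stored result. The cost is that stored boundaries must be regenerated when the underlying topology changes, which is why the paper describes replicating or recreating the product database on a regular basis.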


TIGER® is a registered trademark of the U.S. Census Bureau.


References

Batty, Peter, 2004, Future Trends & the Spatial Industry, Part One, http://www.geospatial-online.com/geospatialsolutions/article/articleDetail.jsp?id=101548, accessed 11 Oct. 2004.
Baumgart, Bruce G., 1975, Winged-Edge Polyhedron Representation for Computer Vision. National Computer Conference, Stanford University, Stanford, California. http://www.baumgart.org/winged-edge/winged-edge.html, accessed 11 Oct 2004.
Boudriault, Gerard, 1987, Topology in the TIGER File. Proceedings of the 8th International Symposium on Computer Assisted Cartography (Auto-Carto 8), Baltimore, MD.
Corbett, James P., 1979, Topological Principles in Cartography. Technical Paper No. 48, U.S. Bureau of the Census.
Egenhofer, Max J., Andrew U. Frank, and Jeffrey P. Jackson, 1989, A Topological Data Model for Spatial Databases. Design and Implementation of Large Spatial Databases, Lecture Notes in Computer Science, 409, 271-286.
Gonzales, Michael, 2000, Seeking Spatial Intelligence, 3(2), http://www.intelligententerprise.com/000120/feat1.jhtml, accessed 11 Oct. 2004.
Hoel, Erik, Sudhakar Menon, and Scott Morehouse, 2003, Building Robust Topologies. In Advances in Spatial and Temporal Databases. Proceedings of the 8th International Symposium on Spatial and Temporal Databases. SSTD 2003. Santorini Island, Greece, Springer-Verlag Lecture Notes in Computer Science 2750.
Kainz, Wolfgang, 2004, Geographic Information Science, Version 2.0, http://www.geografie.webzdarma.cz/GIS-skriptum.pdf, accessed 11 Oct. 2004.
LaMacchia, Robert A., 1990, The TIGER System. 1990 Exemplary Systems in Government Awards Competition, Urban and Regional Information Systems Association, Edmonton, Alberta, Canada.
Lessware, Seb, 2004, Comparing Topology Management Techniques, http://www.geoplace.com/gw/2004/0401/0401tch.asp, accessed 11 Oct. 2004.
Marx, Robert W., 1986, The TIGER System: Automating the Geographic Structure of the United States Census. Government Publications Review, 13: 181-201.
Open GIS Consortium, Inc., 1999, OpenGIS® Simple Features Specification for SQL Revision 1.1.
Oracle Corporation, 2004, Oracle® Spatial Option and Oracle® Locator Datasheet - Location Features in Oracle Database 10g, http://www.oracle.com/technology/products/spatial/htdocs/data_sheet_9i/10g_spatial_locator_ds.html, accessed 11 Oct. 2004.
Ramage, Steven and Peter Woodsford, 2002, The Benefits of Topology in the Database, http://spatialnews.geocomm.com/features/laserscan2/, accessed 11 Oct. 2004.
Saalfeld, Alan, 1993, Conflation: Automated Map Compilation. Center for Automation Research, CAR-TR-670, (CS-TR-3066), University of Maryland, College Park.
Schneider, Markus, 2002, Spatial Data Types: Conceptual Foundation for the Design and Implementation of Spatial Database Systems and GIS, http://www.cise.ufl.edu/~mschneid/Research/Tutorials/TutorialSDT.pdf, accessed 11 Oct. 2004.
Trainor, Timothy, 2003, U.S. Census Bureau Geographic Support: A Response to Changing Technology and Improved Data. Cartography and Geographic Information Science, 30(2), 217-223.
Van Oosterom, Peter, Jantien Stoter, Wilko Quak, and Sisi Zlatanova, 2002, The Balance Between Topology and Geometry. Symposium on Geospatial Theory, Processing and Applications, Ottawa.
Weinberger, Jason, 2002, The Spatial RDBMS in the Enterprise, http://www.directionsmag.com/article.asp?article_id=259, accessed 11 Oct. 2004.
White, Marvin S., Jr., 1984, Technical Requirements and Standards for a Multipurpose Geographic Data System. The American Cartographer, 11(1), 15-26.

¹ In this paper, the terms node, edge, and face are used to describe point, line, and area topological primitives. This is the terminology used by the OpenGIS Consortium and Oracle Spatial Topology Data Model. The legacy TIGER system uses the terms 0-cells, 1-cells, and 2-cells, but to avoid confusion, the newer terms are used, even when referring to the legacy system.