An Operational Social Science Digital Data Library

Harvard University
Cambridge, MA 02138

June 30, 1998

Principal Investigator:

Gary King, Department of Government

Co-Principal Investigators:

Dale Flecker, Harvard University Library
Nancy Cline, Harvard College Library
Sidney Verba, Department of Government and Harvard University Library

Co-Investigator:

Micah Altman, Harvard-MIT Data Center

Summary

This is a proposal to develop a Virtual Data Center (VDC) for quantitative social science data. The VDC will make a vast amount of social science data available to a wide range of users, from experienced researchers seeking data for advanced research, to undergraduates writing term papers, to citizens seeking numerical answers to specific questions. The VDC will not only make data available for use; it will provide technical and organizational means of capturing new data sets for the scholarly community and, thereby, provide a live, growing, and evolving resource.

Our current prototype, now in operation at Harvard and MIT, implements a production, special-purpose digital library service for social science data. It automatically fills orders, delivers data, enables exploration, subsetting, and conversion of data, and synchronizes our holdings with remote holdings at the ICPSR.

With NSF’s help, we will take this to the next level by (1) generalizing the software infrastructure and interfaces and adding an entire middle layer; (2) using formative user studies to refine and extend the interfaces; and (3) developing a modular design that allows free alternatives for all major software components.

These changes will allow us to (1) link together multiple, distributed collections of social science data, each using our system; (2) interoperate, to an extent, with other digital library services; and (3) freely distribute copies of the software.

This work will also serve as an applications testbed, and as a way to test and adapt previous digital library research against the rigors of a production environment. Many issues, such as naming, property rights, and payment, are hard problems that remain unsolved. Indeed, a full solution to many of these problems can come about only when communities as a whole adopt standard approaches. We do not expect to solve them here, but we will create an interim solution for social science data that explores how a real production system can begin to address such problems, using insights from previous digital library research. This interim solution will be one of the first to address a number of digital library issues in a production environment, and so might serve as a production framework for more complete solutions as technologies for naming, metadata, payment, and other services develop.

This work will also put us in a position to extend the project in two major ways: combining journal articles and data within the same system, and doing large-scale user studies.

  1. The Need

  The need to be met by the VDC can be seen from three perspectives: those of data users, data producers, and data managers.

    1. Data Users

    This is not a project in which information will be made available in the hope that someone will want it. The demand for social science data exists, and will only grow with easier availability. The usefulness of data to researchers is obvious, but students and citizens also need access to data if they are to understand the world and the issues of public policy that the nation faces. They also need to understand data to manage their own lives effectively -- whether that entails managing their health or their money. When we introduced our prototype system, use of social science data at Harvard and MIT increased dramatically. Our project will bring social science data closer to students in elite universities and in community colleges, and closer to citizens through public libraries.

      1. Types of Users:

      The VDC will serve:

        Fact seekers: Undergraduates writing term papers and citizens seeking general information need answers to specific numerical questions. Graduate students and advanced researchers also often need specific facts.

        Teachers and students: Statistics courses, and courses that use statistics in substantive areas, need access to data sets for class use. Many such data sets exist, designed for pedagogical use. The rich array of data available through the VDC will allow students to use materials close to their varied substantive interests -- and will also teach them something about the valuable skill of data location.

        Advanced researchers: Graduate students and faculty, of course, are prime users of quantitative data, needing easy access to a vast array of data sets for analysis.

      2. Types of Use:

      Finding data: A vast amount of social science data is available. Some of it, such as data deposited in national data archives like the ICPSR, is not difficult to find, if you know it is there. For data created by individual researchers or even private concerns, nothing ensures that the data will appear in searches or even exist in any form after a few years. The VDC will enhance the availability of the leading data sets, such as Census data, the National Election Studies, or the General Social Survey, which are already accessible through consortia and other well-run repositories. But our goal is to develop a system that encompasses the full range of data from large and small studies, much of which is in primitive form. These small studies are at the heart of much research, but are often used only once because they cannot be easily located or used. Built into the project will be the development of a technical and organizational capacity to capture and make accessible such data.

        An Initial Look: Data users -- whether novices seeking a few facts or a simple data set for a class project, or advanced researchers with more complex needs -- require a quick means to "browse" data sets to see if they serve their needs. They need simple frequency distributions, cross-tabulations, or scatter plots. But data come in many formats; they are often hard to browse, access, and use. A simple query about one data set may take an extraordinary effort. Another goal of the project is to make such preliminary inquiry easy.

        Acquiring the data: Users need to acquire a full data set, subset it by choosing relevant columns and rows, and convert the data to the format of their statistical, database, or spreadsheet program. This too is difficult because of the variety of formats.

      3. The Need for Speed

      Whatever the use, access to a data set needs to be fast. Until very recently, the only way to get a data set that was not locally available was to wait for it to arrive in the U.S. mail on 9-track tape, a process that normally took 4-6 weeks. And it would arrive in some strange format. This slowed the researcher and made original research in class projects virtually impossible. Recently, the ICPSR and other organizations have allowed selective electronic access to their data through a single representative (normally a librarian) at each university. It appears, however, that in many places the software necessary to make this an easy transition is not available, and customer service has not improved proportionately.

    2. Data Producers

    The VDC will make it possible for data producers to:

      1. Make Data Available

      Many data sets are never transferred to archives because researchers are reluctant to expend the effort needed to prepare them. Designed as it is for heterogeneous data sets, the VDC will make such transfers easier.

      2. Maintain Control Over Data

      Individual researchers and others who create data in the course of their work are often willing to provide data upon request so long as they retain control over the source. Ask for the data directly and you are welcome to it. But these producers do not wish to put their data in the public domain. Since individual researchers are not professional archivists, data like these tend to disappear. The largest data collections, like the National Election Studies or the General Social Survey, are routinely deposited in the ICPSR; however, only a small proportion of the data created in the course of research gets deposited in archives. Some unarchived data appear on the web, but they are not properly indexed or cross-referenced.

      3. Preserve Data

      The VDC will enable data producers to share and also preserve the data used for individual research articles.

    3. Data Managers

    The VDC is not aimed at replacing venerable national data archives like the ICPSR, but at complementing them and helping them do their present job even better. The VDC will remove a large customer service burden from the archives by automatically handling data acquisition and distribution, sending notifications of data updates, handling data conversion and documentation, and maintaining metadata in a consistent format. When VDC systems are widespread, data archives will be able to focus their scarce resources on critical issues of preserving data and furthering the science of data archiving: storing archive copies of worldwide data, enhancing large data collections, creating new data and documentation formats, and developing tools. Data archives will be able to automatically "crawl" the Internet, searching for new data sets, copying them, and preserving them against loss. The data archive staff will be able to devote more time to preparing and enhancing valuable data sets, and to developing tools and standards.

  2. Meeting the Need Through the Virtual Data Center
    1. Design Principles

    The VDC is a system to be used by a wide range of institutions in relation to a wide range of data. It is not meant only for institutions at the cutting edge of technology, or only for data sets formatted at the highest levels. Our primary goal is to produce and sustain an operational digital library that is easy to use, easy to adopt, and scalable. Universities should be able to use the system to create "main" libraries with many thousands of data sets. Individual researchers should be able to acquire software for all core features at negligible cost, and to easily open their own "branches" of the digital library, in which they would share a few data sets from their own research. Moreover, unsophisticated users should have no difficulty finding the data held in "main" and "branch" libraries alike. Because our focus is on operations and services, we do not offer radical new designs, standards, or algorithms, but instead propose to borrow as much as possible from previous digital library research, to use open standards, and, when available and robust, to use free software.

      Some data projects are developing general standards for linked data and codebooks; work like this is exceptionally important. Data provided in these high-end formats are more accessible and more interoperable than current formats, and provide the basis for metadata standards. We will work closely with these projects, such as the document-type definition (DTD) codebook project sponsored by the ICPSR, and the Digital Library Federation's Social Sciences Databases project.

      As a production system, the VDC must handle social science data in its present state: data in accessible standard forms and data in less standard forms. One can think of the data as lying on a continuum. At one end are data with high-quality coding. We will provide the highest level of service for these data. At the next level, data that are submitted in the format of one of several standard statistical packages, such as SPSS or SAS, will be available for automatic subsetting, format conversion, analysis, and distribution at the level of the variable or observation. Data in nonstandard statistical formats could be accommodated by continuing to expand our set of "standard" formats, by converting them to more standard formats, or by treating them as files. At the far "low" end of the continuum would be data sets in ASCII files with "README" files for documentation -- basically files with unknown contents. For this last case, we could not subset or convert (without special coding), but we would still be able to provide searching and various activities at the metadata level.
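
      To make the continuum concrete, the sketch below (illustrative Python; the format names and service labels are our assumptions, not a fixed VDC interface) shows how a library might map a deposit's documentation level onto the services it can safely offer:

        # Hypothetical mapping from documentation quality to service level.
        # Unknown formats fall back to metadata-level searching only.
        SERVICE_LEVELS = {
            "ddi-codebook": {"search", "browse", "subset", "convert", "analyze"},
            "spss":         {"search", "browse", "subset", "convert", "analyze"},
            "sas":          {"search", "browse", "subset", "convert", "analyze"},
            "ascii+readme": {"search"},   # files with unknown contents
        }

        def services_for(fmt):
            """Return the set of services offered for a given deposit format."""
            return SERVICE_LEVELS.get(fmt, {"search"})

        print(services_for("spss"))           # full service
        print(services_for("ascii+readme"))   # metadata-level only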

      Once the VDC is in place, we believe users will more quickly perceive the differences among the various types of data documentation, and will start to demand higher-level coding. Thus the VDC should not only complement high-end strategies; it should also encourage data producers to start following these guidelines.

    2. What the VDC Will Do

    The VDC will provide an institutional mechanism for capturing data, allow users to locate the data, and allow them to access and use the data. The ultimate goal is a user-friendly, large-scale network of social science data. How would this work? Consider a university with a few hundred unique data sets, perhaps from local researchers or dissertation students, that is setting up a new data center, perhaps as part of its library. The ICPSR or another organization might offer to take its unique data sets, but that is often not a popular proposal for data centers that wish to continue to offer a unique product. As an alternative, the VDC project will offer the following. First, we will provide a free distribution and a list of (relatively inexpensive) hardware to purchase. The distribution will be easily self-installable. Once this university has paid the appropriate access fees to the national data archives, and chosen a system (from the options the VDC provides, or others) for authenticating its users, it will have available to all its students and faculty the same services Harvard and MIT have now: they can type in keywords, search across all the data sets in several national data archives, view abstracts of data sets, look at lists of variable names, and, when they are available, read codebooks on line. They might then run some descriptive statistics for a few chosen data sets (if the data are not available locally, the system will transparently retrieve them in the background from the ICPSR or another organization).

    An undergraduate might stop here, but for a graduate student or senior researcher, the VDC will also subset the data set (by the chosen rows or columns), convert it to the appropriate format, and deposit it on the user's hard disk. This would all be available instantly and automatically. Once this basic service has been created, this new data center may also take its unique data sets and deposit them in a special subdirectory on the system. Without any additional preparation, the VDC system will automatically recognize these data sets. They will become searchable, subsettable, analyzable, and convertible as are the other data in the system.

    The VDC project will also be far more than a port of the product now available at Harvard and MIT. For example, it will handle unique data sets available at other universities. We will expand support for the federation of digital libraries through a publish-subscribe interface, along with publisher proxies for other services. High performance will be possible through caching and optional mirroring of metadata and digital object repositories, allowing both simple sharing and hierarchical distribution of data. This will allow administrators of the VDC to create collections that encompass remote sites, and universities to unify multiple collections within their departments. Once the VDC is installed at a site, it will be able to connect to the network of existing VDC sites to share metadata. This means that any user at any site doing a simple search can choose to automatically explore all the data in the ICPSR and other national archives, all local unique data, and all the unique data at other VDC sites. In return, the local site can choose to make its metadata and data available to others through the same system. Our extensive conversations with those generally reticent to make data available by depositing it in the ICPSR tell us that they would be willing, and in most cases eager, to provide data in this way.

    To facilitate this process, those with unique data sets will be able to choose among several methods of access: for example, unrestricted access, signing an authenticated "electronic guest book", writing an explanation of the desired use and asking for permission, or entering a credit card number. This gives the provider control over the master copies of the data (which would still reside locally), visibility in providing the data to the scholarly community, and access to the vast array of unique data at other sites. Each additional unit that connects to the VDC will make the whole network more valuable. We will even provide a personal version of the VDC that will enable a scholar to hook an individual web site into the system, thus capturing for the scholarly community the data being made available by the fast-growing and relatively undisciplined practice of putting data up at isolated web sites but not in any unified catalog.
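
    A minimal sketch of these per-dataset access modes (in Python; the mode names and flags are illustrative assumptions, not a specification of the VDC's interface):

      ACCESS_MODES = ("unrestricted", "guest_book", "permission", "payment")

      def may_retrieve(mode, signed=False, approved=False, paid=False):
          """Decide whether a request satisfies the provider's chosen mode."""
          if mode == "unrestricted":
              return True
          if mode == "guest_book":    # authenticated guest-book signature
              return signed
          if mode == "permission":    # provider approved the stated use
              return approved
          if mode == "payment":       # e.g., a credit card transaction cleared
              return paid
          raise ValueError("unknown access mode: %s" % mode)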

    We have found great interest in our system among data centers around the country, and we believe that it will be quickly adopted. Many will use our software directly. But we plan to design the software so that it is highly modular, and fully open and extensible. We imagine many "snap-in" modules that could be written to improve the system, and we hope to encourage the user community to contribute them. For example, we have experimented with a quick way to get a sense of data sets that are organized geographically, by automatically generating maps colored in by chosen variables. (The Harvard-MIT Data Center has much experience in this field and, in conjunction with the Harvard Map Collection, has begun a two-year "Geospatial Liboratory [sic]" project to create a separate digital library of geospatial data that can be linked to the present project; see http://data.fas.harvard.edu/hdc/hmdcproj/liboratory.shtml.) Other modules might include specialized statistical software or systems to handle unusual types of data organization or formats. These modules can be written to extend our system or even to replace parts of it. In fact, a few of the most sophisticated university data centers, with their own software already in place, might wish to avoid our software altogether. If this happens, these few sites will still be able to contribute, since we will also provide protocol gateways. Our specific software is a proof of possibility, and will be of use to the vast majority of sites, but it is not necessary in order to be part of the VDC system.

    Once the scholarly community is using the system, we hope to open access to commercial data providers, under the condition that they write a module that snaps their data into our system. Once that is done, users could purchase commercial data on our system and would no longer need to worry about unique data formats and specialized programs; they would get it in the same automatically convertible, subsettable, analyzable formats as the rest of the VDC data. (Conceivably, there could be a small tax on data providers to support the continuing development of the system.)

    We have very close connections with the ICPSR, its director, and the Council. Gary King gave a complete presentation of the Virtual Data Center project to a meeting of the ICPSR Council, on which he serves. This was followed up by much feedback and encouragement. As the largest ICPSR data user, the Harvard-MIT Data Center is well acquainted with the staff of the ICPSR, and we have received many helpful suggestions along the way. We plan to continue and reinforce these contacts, so the development of the VDC serves to strengthen the ICPSR.

    Eventually the ICPSR could perform this activity for the scholarly community, as it is one of the only organizations with professional practices such as cataloging, verifying, and archiving with off-site backups. Permissions would need to be secured from local contributors, with agreements written regarding future use of backed-up data (for example, the ICPSR might agree to distribute data only if the local site vanishes).

  3. What Has Been Done?

  We do not begin from scratch.

    1. No Digitization Needed

    The main previous work on which we draw is the vast amount of digital quantitative data that exists. The point is so obvious that it might be missed: the quantitative data in our system require no conversion to digital form. One of the advantages of this project is that it adds a large amount of value to material already in digital form.

    2. Drawing on the Work of Others

    Our goal is to create a system that integrates many tools and applications; we do not wish to develop new tools and applications when we can apply existing ones. The VDC will, as much as possible, be developed using robust pre-existing tools.

    3. Our Previous Work

    Most significantly, we can build on work we have done already. We have developed a prototype at the Harvard-MIT Data Center that is being used extensively to collect data automatically from remote archives, and to deliver a large and varied amount of data to a scattered and heterogeneous set of users. Our system will search across all available data sets at Harvard, MIT, and several national archives. It will automatically subset and convert data to chosen formats, and it scales up quickly, so that new data sets in a large variety of standard formats, once added to the system, are instantly subsettable and convertible. Our prototype (http://data.fas.harvard.edu/) has greatly accelerated data-based research and teaching within Harvard University and MIT. This data center is now one of the most heavily used academic data centers in the world. In 1997, it served over 10,000 data sets, and automatically answered over 100,000 queries from all over Harvard and MIT.

    We now want to build on our experience to make these resources more generally available. In fact, this process has already started. We have an agreement with the Henry A. Murray Research Center (http://www.radcliffe.edu/murray/) to make their unique holdings available to qualified investigators through this system, on an experimental basis.

  4. VDC Design and Development

  No tool duplicates the core functions of the VDC, but the VDC will be, as much as possible, created from robust pre-existing tools. This section discusses the current system, the design goals and principles behind the VDC plan, the primary features of the VDC, its functional components, and development methods and schedule.

    Our current system has been successful in enabling the Harvard and MIT communities to search for and obtain data through the World Wide Web. Some of these data are produced on-site, but most are retrieved from the ICPSR; our system automatically manages the retrieval and caching of ICPSR data, and the process of keeping our metadata and data holdings consistent with the holdings of remote archives as studies are updated, added, and deleted. It supports simple authentication through the use of Harvard and MIT i.d. numbers. It also enables users to summarize data, subset it, and convert it to their preferred format -- all while on-line.

    Our current system was produced in a service-oriented environment, and has a very simple architecture: a two-tier client-server design, with ad hoc extensions to synchronize automatically with other remote data collections. Description and normalization metadata, naming, and communication with other data archives are all ad hoc. In addition, some components of the system are based upon proprietary commercial software, such as SPSS.

    The current architecture is too simple to provide a general framework for either the treatment of digital objects or the management of federated collections. The VDC will be a significant redesign and re-implementation of this system. First, we will re-implement the prototype on a foundation of a general digital object infrastructure: this will involve re-conceptualizing the design in terms of objects, generalizing the data structures and interfaces we use to handle social science data so that they are flexible enough to be used for other types of digital objects, and building a middle layer into the system to enable interoperability. Second, we will incorporate free, open tools to provide alternatives to the commercial products we use now. Third, we will extend our current digital object preparation tools to help researchers migrate their data into the digital library. Development will be guided by user studies of the current services.

    1. VDC Features

The initial VDC features comprise four categories: data preparation, data access, user interface, and interoperability. Approximately half of these features are provided, in some form, by the current prototype, and are marked with a "*". We list the primary features of the VDC in Table 1, and outline the design below.

These features represent challenges with which any large-scale production system that operates in an open environment must come to terms. Yet many features, such as naming, property rights, and payment, raise hard research problems that are as yet unsolved. Indeed, a full solution to many of these problems can come about only when communities as a whole adopt standard approaches. We do not expect to solve these problems here, but we intend to create an interim solution for social science data that incorporates insights from previous digital library research to explore how these problems can be approached in a real production system. This interim solution will be one of the first to address a number of digital library issues in a production environment, and so might be used as a production framework for more complete solutions as technologies for naming, metadata, payment, and other services develop. We also expect to produce a framework in which we can develop services for other types of digital objects, such as journal articles, and which will allow us to launch major user studies.

Digital Object Preparation/Intake

  • Naming: uniquely identify digital objects.
  • Aids for preparing and converting data in common formats.*
  • Aids for preparing and converting metadata in common formats.

Digital Object Management

  • Repository management (addition, deletion, modification of objects).*
  • Metadata queries.*
  • Caching and mirroring: performance enhancements for repository and metadata management; location of digital objects.

User Interface

  • Views/browsers: including features to display*, summarize*, subset*, aggregate, convert* and merge data and documentation.
  • Conceptual maps: provide maps of the content of the library to users for searching and browsing.*
  • Administrative interfaces: for establishing sessions, authentication* and payment, and administering collections.

Middleware and Interoperability

  • Direct support for CORBA IIOP.
  • Gateway support for common protocols.*
  • Facilities for the integration of multiple collections.
  • Publisher proxies for the federation of heterogeneous collections.

Table 1: Primary features (* indicates features that are at least partially supported in the current version).

    2. Structure of the VDC

    Rather than designing a specialized digital library for social science data, we will build a general digital library that provides specialized services for these data. At the same time, we will maximize the use of existing, freely available, openly architected software.

      In order to handle digital objects in a general way, we rely on infrastructure services for naming objects, for communication among clients and servers, and for metadata. Free statistical products and databases will be used for the actual data manipulation.

      Figure 1 shows a sketch of the design of the system, illustrating some of the protocols that are likely candidates for incorporation. This design will, of course, change during the course of the project -- often, an initial design serves simply to make clear what does not work. A significant part of the project will be to refine the design through multiple rapid prototypes.

      User Interfaces. Initially, users will probably interact with the system using web browsers. Through an interface that combines HTML, XML, and Java applets, users will connect to web-servers that will act as proxies to the middle layers. Together, these will obtain objects and metadata through the middleware layer from repositories, and enable the user to display, summarize, subset, aggregate, convert and merge data and documentation.

      Naming services. Naming services are fundamental to the digital library. Digital objects must be uniquely identified, and this identification should not change if an object is moved to a different location or if the repository in which it resides changes location. A number of experimental schemes for uniform, location-independent naming exist, including PURLs and URNs, the CORBA naming service, and the Handle System (see also http://www.handle.net).

      Handles and PURLs are the best developed of these naming systems, and may be the best suited to the needs of digital libraries. However, naming systems have a number of limitations that we will have to address. First, naming schemes have either been implemented in a limited fashion (like handles) or not yet implemented at all (like URNs); it is unlikely that a universal scheme will be adopted across data archives in the near future. Second, most naming schemes do not distinguish among multiple copies of the same digital object, so a proxy system will have to be used with the handle resolution system to direct users to nearby instances of each object. Third, outside data collections, such as the ICPSR, will probably not adopt handles or any other true naming service in the near future, so proxies will have to be devised for each external service (we plan to create such a proxy for ICPSR holdings, based upon the study numbers that the ICPSR assigns to data collections). Fourth, because the system is likely to rely on CORBA for its middle tier, a service will need to be provided for mapping handles into CORBA names. In this project we will explore the use of handles and other naming schemes for objects.
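
      The proxy idea can be sketched as follows (illustrative Python; the handle names and the proxy URL scheme are invented for the example, not real endpoints):

        # A toy resolver: true handles resolve through a local table, while
        # ICPSR holdings -- which have study numbers, not handles -- resolve
        # through a per-archive proxy rule.
        LOCAL_HANDLES = {
            "hdl:vdc/road-1998": ["http://data.fas.harvard.edu/ROAD/"],
        }

        def resolve(name):
            """Map a location-independent name to a list of current locations."""
            if name.startswith("hdl:"):
                return LOCAL_HANDLES.get(name, [])
            if name.startswith("icpsr:"):
                study = name.split(":", 1)[1]   # proxy on ICPSR study numbers
                return ["http://archive.example.org/icpsr/study-%s" % study]
            raise ValueError("no resolver for naming scheme in %r" % name)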

      Middleware. Middleware provides a foundation for the interoperation of diverse clients and servers. The two largest competitors in this arena are DCOM and CORBA. DCOM is a closed standard developed and controlled by Microsoft. CORBA is an open standard, with a number of open implementations, that may be better suited for the development of a free, extensible system. CORBA provides particularly flexible object-oriented middleware services, and has been used successfully in a number of other digital library projects, but it has not been used in large-scale production systems. Stanford's Digital Library Interoperability Protocol serves as another protocol layer on top of CORBA that further defines how digital library elements can interoperate.

      There are three significant limitations of CORBA that we will have to work around. First, many of the secondary CORBA services are still in a state of flux, so we will rely only upon those core CORBA services that are relatively stable. Second, although external connections to CORBA services are well defined, the programmer APIs for CORBA are highly dependent on the CORBA implementation used; so we will use "bridges" to insulate the collection management services from these APIs. Third, for the immediate future, other digital libraries with which we will want to interoperate do not use CORBA (e.g., the ICPSR's digital library uses its own protocols); so we will construct or adopt proxies for the ICPSR's home-grown protocol as well as for other common protocols.
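
      The "bridge" idea is simple to sketch (Python used for illustration; the class and method names are our assumptions): collection-management code is written against an abstract repository interface, so that an ORB-specific CORBA binding, or a plain gateway, can be substituted behind it without touching the rest of the system.

        import abc

        class RepositoryBridge(abc.ABC):
            """Everything above this interface is ORB-agnostic."""
            @abc.abstractmethod
            def fetch(self, object_name): ...
            @abc.abstractmethod
            def store(self, object_name, payload): ...

        class GatewayBridge(RepositoryBridge):
            """One concrete bridge; a CORBA-backed bridge with the same two
            methods could replace it without changing calling code."""
            def __init__(self, base_url):
                self.base_url = base_url
            def fetch(self, object_name):
                import urllib.request
                url = "%s/%s" % (self.base_url, object_name)
                with urllib.request.urlopen(url) as response:
                    return response.read()
            def store(self, object_name, payload):
                raise NotImplementedError("read-only gateway")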

      Many simple automated clients may not understand CORBA or other potential middleware choices. So we will also explore the use of limited gateway interfaces for simple automated clients. Many libraries use Z39.50 to share metadata, and we will explore the use of Z39.50 as a gateway into our services. HTTP is also attractive because of its wide use, but the structure of HTML limits the services that can be provided to automated clients. The emerging XML standard offers better prospects for distributing structured data using HTTP.

      Metadata. Most existing schemas for social science metadata are relatively simple; they can be efficiently stored as relations, and efficiently queried using Boolean operators in an SQL89-type syntax. There is no universal metadata schema for social science data, but much of the Dublin Core is applicable, although not comprehensive, and it would provide a reasonable basis for users wishing to find sets of data of interest. The initial implementation will map metadata added to the system into the Dublin Core, and will support queries using SQL89. We will also explore using the Stanford STARTS protocol in this system.
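
      As a concrete sketch (Python with an embedded SQL89-style query; the table layout and values are invented for illustration), study-level records mapped into a few Dublin Core elements can be stored and queried with an ordinary relational engine:

        import sqlite3

        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE dc (identifier TEXT, title TEXT, "
                   "creator TEXT, subject TEXT, date TEXT)")
        db.execute("INSERT INTO dc VALUES ('study:0001', 'Example Opinion "
                   "Survey', 'Example Center', 'public opinion', '1997')")

        # Boolean predicates in plain SQL89-style syntax suffice for
        # study-level discovery queries.
        rows = db.execute("SELECT identifier, title FROM dc "
                          "WHERE subject = 'public opinion' AND date >= '1990'")
        print(rows.fetchall())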

      There are a number of limitations to this approach. First, significantly more detailed metadata are required for data services more complex than location and delivery of sets of data. Second, other digital objects will require different, and more complex metadata. Third, STARTS (although it is extensible) and SQL89 queries are inadequate for some knowledge domains.

      The Dublin Core provides general metadata that could be used to find data sets of interest, but it does not attempt to provide detailed descriptions of the components of digital objects. In order to provide data-exploration services, data viewers, and subsetting and conversion of data formats, much more metadata is needed -- typically at the variable level. Special challenges are raised by services that attempt to automatically aggregate and merge different data collections into coherent extracts (e.g., merging census data collected at the block-group level with public opinion poll data collected in congressional districts and economic indicators for metropolitan areas). These services require extensive metadata at the variable level to ensure correctness. There are a number of avenues we will explore to address this situation. First, the ICPSR and the DLF are developing an SGML DTD for data sets that would provide much of the needed variable-level information (http://www.icpsr.umich.edu/DDI/codebook.html), and we will develop tools to read codebooks conforming to this DTD. Second, many data formats for statistical programs (such as SAS and SPSS) contain embedded variable-level metadata, or variable-level data implicit in the study-level data, which we will capture, normalize, and convert. Third, we will experiment with ontologies for variable-level metadata to support the correct merging of heterogeneous sets of data on common fields such as time and geographic location.
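
      A sketch of how variable-level metadata can guard an automatic merge (Python; the record fields and the aggregation hint are illustrative assumptions):

        # Two variables from different collections, described at the
        # variable level. Their units of analysis differ, so a naive
        # row-join would be incorrect.
        census_var = {"name": "median_income", "unit": "block-group"}
        poll_var   = {"name": "approval",      "unit": "congressional-district"}

        def mergeable(a, b):
            """Safe to merge only when the units of analysis agree."""
            return a["unit"] == b["unit"]

        if not mergeable(census_var, poll_var):
            print("aggregate block-groups up to districts before merging")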

      Social science data categorizations will change, and other digital objects will have other metadata schemas better suited to their domains. A framework is needed that can support metadata for a variety of objects as the system develops. To avoid being wedded to a particular schema, we will explore the use of the Warwick Framework, which acts as a container for different "packages" of metadata, such as the Dublin Core. This framework has not yet been implemented, but it may provide a means to encapsulate different metadata schemes.
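
      A minimal sketch of such a container (Python; the structure and scheme names are assumptions, since, as noted, no implementation of the Warwick Framework yet exists):

        container = {
            "object": "hdl:vdc/example-study",
            "packages": [
                {"scheme": "dublin-core",  "payload": {"title": "Example Study"}},
                {"scheme": "ddi-codebook", "payload": "<codeBook>...</codeBook>"},
                {"scheme": "rights",       "payload": {"access": "guest_book"}},
            ],
        }

        def package(container, scheme):
            """Fetch one metadata package by scheme, ignoring the others."""
            for p in container["packages"]:
                if p["scheme"] == scheme:
                    return p["payload"]
            raise KeyError(scheme)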

      In addition to requiring different metadata, other domains may require different query semantics. For example, spatial relations are often used to query geospatial databases, but they can be only incompletely expressed in simple SQL89 dialects. We do not plan to support other query languages in the first round, but we will provide a mechanism for specifying the language associated with each section of a query, so that the system can be extended to handle queries crossing multiple languages.
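
      The mechanism can be sketched as a query whose sections carry language tags (illustrative Python; the tags and dispatch table are assumptions):

        query = [
            {"lang": "sql89",   "expr": "subject = 'crime' AND date >= '1990'"},
            {"lang": "spatial", "expr": "WITHIN(boundary, 'Middlesex County')"},
        ]

        ENGINES = {
            "sql89":   lambda e: "[relational engine evaluates: %s]" % e,
            "spatial": lambda e: "[future spatial engine evaluates: %s]" % e,
        }

        for section in query:       # each section routes to its own engine
            print(ENGINES[section["lang"]](section["expr"]))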

      Data manipulation. Most social science data schemas can be modeled easily as relations, once the peculiarities of proprietary data formatting are transcended. Data inducted into the system will be converted into a common base format, and freely available database and statistical packages will be used to provide conversion, summarization, subsetting, aggregation, and exploration services.
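
      For illustration (Python; the column-dictionary base format is our stand-in, not the VDC's actual representation), once a study is normalized, subsetting by row and column and conversion to a delivery format are mechanical:

        import csv, io

        def subset(table, columns=None, rows=None):
            """Select chosen columns and row indices from a normalized table."""
            columns = columns or list(table)
            nrows = len(next(iter(table.values())))
            rows = rows if rows is not None else range(nrows)
            return {c: [table[c][r] for r in rows] for c in columns}

        def to_csv(table):
            """One conversion target; SPSS, SAS, etc. would be others."""
            out = io.StringIO()
            writer = csv.writer(out)
            writer.writerow(table.keys())
            writer.writerows(zip(*table.values()))
            return out.getvalue()

        study = {"state": ["MA", "MI", "CA"], "turnout": [0.58, 0.54, 0.49]}
        print(to_csv(subset(study, rows=[0, 1])))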

      Distributed data: It is often inefficient or inconvenient for digital objects to have a single location. But allowing objects to exist in multiple locations raises a number of issues: How are the "closest" objects located? How are consistency and completeness maintained? How are updates performed? One approach will be to explore simple caching and mirroring of metadata and/or data, using a publish-and-subscribe model for updates. We will investigate the use of both simple single-level mirrors and more complex hierarchical collections of data.
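
      The publish-and-subscribe model can be sketched as follows (Python; the class names and push interface are illustrative assumptions):

        class Collection:
            """A publishing collection that pushes updates to its mirrors."""
            def __init__(self, name):
                self.name, self.subscribers, self.holdings = name, [], {}
            def subscribe(self, mirror):
                self.subscribers.append(mirror)
            def publish(self, study_id, metadata):
                self.holdings[study_id] = metadata
                for m in self.subscribers:      # notify every mirror
                    m.notify(self.name, study_id, metadata)

        class Mirror:
            """A subscribing cache that stays consistent without polling."""
            def __init__(self):
                self.cache = {}
            def notify(self, source, study_id, metadata):
                self.cache[(source, study_id)] = metadata

        archive, campus = Collection("archive"), Mirror()
        archive.subscribe(campus)
        archive.publish("study-0001", {"title": "Example Study, v2"})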

      Access services: In a production system, property rights must be honored, yet there is no uniform way of describing these rights, nor a universally accepted payment method. Still, payment and rights management are happening on the Internet now. Our goal will be to develop or borrow a simple set of authentication and payment mechanisms, informed by theoretical research, and to maintain simple property rights metadata, so that the system can be used in the real world.

      Figure 1: One Potential Architecture for the Digital Library

    3. Integrating Free Components into a Flexible, Portable System

    When Edison invented the electric light, he also had to design a system that would deliver electricity to his customers and found companies to manufacture it; even ten years ago, digital library development faced similar challenges. In the last five years, however, there has been unprecedented growth in the scope and technology of the national information infrastructure. Fortunately, there now exists a much larger base of open software, and a significantly greater convergence on common protocols, than existed ten years ago.

      It is now possible to build upon completely free and open systems, using open operating systems such as Linux and FreeBSD, and multi-platform and platform-independent development languages such as Perl, Python, and Java. The GNU project (http://www.gnu.org) has established a large base of applications and utilities, many of which run on both Windows NT and a variety of Unix platforms; these include file management utilities, databases, and statistical software. Users will be able to access the system through free web browsers (Netscape has now made the source code for the next generation of its browser freely available) and Java applets. The VDC project will take advantage of this base of open software components to create a flexible digital library system that is free and portable.

      We will use a combination of multi-system tools and multi-platform languages to develop code designed to be platform-independent. We realize, however, that even supposedly platform-independent tools do not actually behave identically on all platforms. Two operating systems are likely candidates for testing and development, each of which has distinct advantages. The primary advantage of Windows NT is that the various flavors of Windows are in common use, and Windows NT will continue to gain popularity as Microsoft phases out Windows 95 and Windows 98. The primary advantage of Linux is that it is a completely free, open, and relatively robust and flexible operating system. Producing a Linux-tested version of the VDC will mean that all features will be open, even those implemented by the operating system.

      Although much current data carries licensing restrictions, significant sets of data could be provided without restriction to the general public. In addition to "replication" data for individual articles, which can usually be distributed to the public, much government data could be shared more widely. Such data include studies, now held at the ICPSR, by the National Center for Health Statistics and the Department of Justice. Furthermore, much government data is available digitally at government documents libraries, but is not available on-line. We will work with Harvard's government documents library to make a selection of its digital holdings, now available only on CD-ROM, accessible via the VDC system.

      Moreover, the availability of the VDC system may encourage researchers to share their data more widely. For example, the VDC prototype at Harvard has enabled us to make the massive new NSF-supported Record of American Democracy data publicly available (http://data.fas.harvard.edu/ROAD/).

      Our P.I., Gary King, has long experience developing highly complex public-domain software (available from http://GKing.Harvard.Edu/stats.shtml). He developed software for evaluating redistricting plans and forecasting election results that has been used by academics, public officials, judges, and partisans in many legislative redistricting cases. He has also produced software for the statistical analysis of event counts that is widely used in the academic community. His new software on ecological inference is being used across a range of academic disciplines, as well as in government and private industry.

    4. Rapid Prototyping

    A guiding principle of designing complex systems is "plan to throw one away". Following this principle, we will employ a rapid-prototyping model for development: building and distributing internal versions of the software and external beta versions early, and releasing new versions often. Source code will be open, and comments, feedback, and contributions to the project will be solicited early.

      Development will proceed in several phases: design, alpha implementation, testing and quality assurance, beta testing, and major release. During the design phase, we will develop a complete formal description of each module and its methods, and of all interfaces. The alpha implementation will implement the objects and interfaces and test them internally at Harvard and MIT. In each phase of the process, we will seek comments from the general academic community via Internet RFCs, conferences, and beta-testing programs. The beta release will be an open release, soliciting comments from many sites at other universities.

      Harvard and MIT are ideal test-beds for this development because they are microcosms of the larger research community: together, they have a huge community of users of quantitative data from all fields and disciplines, and data is distributed (physically and politically) throughout both institutions, as it is in academia. In addition, a number of other university data centers will be able to use the new VDC immediately, and will also serve as alpha-test sites.

      We are working closely with librarians from the Harvard University Library, the Harvard College Library, and the MIT libraries to identify centers of quantitative data at Harvard and MIT that would serve as alpha-test sites for the new VDC. We expect that the new VDC would be used immediately by the Harvard Map Collection, one of the oldest collections in the University, which has assembled extensive electronic geographic data holdings, and by the Government Documents library, which has extensive holdings of data on CD-ROM.

    5. Formative Research with Users

Since quantitative data already have thousands of users, we plan to conduct studies at the start of and throughout the project to determine how users understand and work with the system: its interface, data organization, and other features. In addition, we will explore how faculty might plan to use the system in their teaching, and how faculty, students, and citizens might work with it for research purposes. These studies will thus explore the following issues: (a) interface design; (b) organization of information; (c) analysis features; (d) how the data might be used in higher education courses, and how they might affect course design (e.g., assignments, student work products, and the like); (e) key concepts in the teaching of the social sciences that might be especially well supported by the system; and (f) how use of the system might affect social features of course work, such as team projects.

We will work with a variety of users in these studies in order to understand thoroughly the needs of the range of people who would benefit from the system. These variations will include levels of disciplinary expertise: undergraduate students, graduate students, faculty, and citizens.

The studies will be designed to examine how people use the system, and how they understand and think about it. Methods will include individual task sessions, in which people will be asked to carry out various tasks with the system, thinking aloud as they do so. Individual in-depth interviews and focus group interviews will explore how people might use the system for different purposes and in different contexts.

  5. Enhancing the VDC

  The project for which we are seeking support is part of a longer-range set of projects. There are two enhancements to our current work that we anticipate: systematic user studies and linkage of the data in the VDC to text. During the current project, we will move in the direction of these enhancements, though their full development and implementation will await subsequent phases.

    1. Additional User Studies

    Computer applications, especially those in fields that cannot be completely automated, often fail because users are left out of the design process, or are brought in only at the end. Furthermore, there are few systematic studies of the use of technology in libraries and in learning. In this project, user studies will be integrated throughout, and will be used both to shape the design of services and to evaluate the results. Indeed, we plan to develop an extensive series of studies to determine how users understand and work with the system: its interface, data organization, and other features. We intend for these studies to go well beyond the usual informal analyses or classroom questionnaires. Professor Jan Hawkins of the Harvard Graduate School of Education is a specialist in the development of technology for classroom work and will design with us a series of studies of the use of the VDC in different types of classes and in different institutions. The research will feed back into the various stages of our design for the VDC, including the enhancements we hope to add, and should provide basic information of a more general sort about the use of technology in higher education. The first phase of this research will take place within the framework of the project for which we are applying. The next two phases will take place with support we hope to add in the future.

      A major component of our long-term plan is a series of studies of how these data are used in the classroom (or in the dorm room). We wish to move well beyond the usual methods of learning about use through intuition, anecdote, or simple questionnaires given out in class. We want an in-depth study of use across a wide range of types of users in varied institutions. The research will feed back into the various stages of our design for the VDC, including the enhancements we hope to add. And the research should provide basic information of a more general sort about the use of technology in higher education.

      In the first phase, under this grant, formative research studies will be conducted to determine how users understand and work with the system: its interface, data organization, and other features. (See section 4.5.)

      In subsequent phases, we will work with a variety of users in these studies in order to understand thoroughly the needs of the range of people who would benefit from the system. These variations will include levels of disciplinary expertise: undergraduate students, graduate students, faculty, and citizens. The studies will also explore variation by institutional context, including universities (Harvard, MIT), state and/or community colleges, and public libraries. These data are important for designing refinements that ensure simple and robust use in the intended contexts. We will thus conduct formative field studies beginning in the second project year. We will collaborate with a small group of institutions to conduct these studies, including Harvard and MIT, state and/or community colleges in the region, and one or more public libraries. At the higher education institutions, we will recruit a small group of faculty who teach a range of courses and who are willing to integrate the system into their teaching during the project year. It is likely that these faculty will be identified at the institutions in which the formative user studies were done. We will work with faculty to help them understand the features of the system, as support for incorporating it into their teaching. (At Harvard, we will be able to take advantage of the fact that Harvard College will be introducing a new set of courses on Quantitative Reasoning into its Core curriculum. Thus we will have a set of new courses, aimed mostly at non-quantitative students in the humanities and in some of the social sciences, for which data sets will be an important part of the basis of instruction.)

      We will seek a range of social science courses with respect to discipline and teaching style for the field studies. How is the system used in these various contexts? What kinds of assignments and student work are associated with its use? Our overall goals in this work will be to understand how the system can be used in higher education and for public information, the problems that may arise, and to identify specific potential impacts of its use on course design and student learning that will guide the design of the final outcome studies.

      Faculty will be interviewed prior to the use of the system in their course(s), including documentation of prior course design. A sample of students in each class will also be interviewed prior to participation. We will regularly observe classes, as well as sessions in which students use the system as part of course assignments. Course documentation will be collected, as will samples of student assignments that include work with the system. Interviews will again be conducted at the conclusion of the course. These data will be analyzed with particular attention to suggested refinements in design, and to accumulating understanding of the best use contexts in social science courses and of effective curriculum design for use of these resources. We will interview faculty research users about how the system functions for their research needs. We will also identify a public library where citizens will be given ready access to the system. We will collaborate with the library staff to encourage a variety of people to use the system for their inquiries. Interviews will be conducted before and after these (generally one-time) use sessions. This will enable us to determine the features of the design that are required for potentially one-time use for citizens' questions.

      The final research phase will focus on outcome studies: to understand the consequences of use of the system for the design and organization of courses, and for learning outcomes. The specific design will grow out of the outcomes of the formative field studies in which we will identify specific concepts, course designs and activities, and social organizations of course work that are most likely to be affected by the incorporation of the system. The research will be focused on pre/post studies of system use, and where possible, comparative studies of system use in similar courses taught by the same faculty member with and without the system. We plan to examine student learning of carefully selected concepts/material that are the focus of the system resources and course assignments, and the quality of student products. We will also collect data that characterize the course context and process of use through interviews, documentation, and observational methods.

      These studies will also be conducted in different types of institutions, including universities and state or community colleges in several locations around the country.

    2. Linking Data to Text

    Quantitative social science data are tools for research. The analysis of such data appears in professional journals, in scholarly books, and, more and more often, in popular media. For the scholar, the connection between text and data is natural, but these connections are sometimes cumbersome or difficult to make. Data that back up an article are often difficult to find and even more difficult to analyze. Thus, our ability to replicate the work of others and to build on it is diminished. A similar problem exists for scholars who move from data to published work based on the data. It may not be easy to trace the publications that emerge from a data set -- so that we can build on, rather than duplicate, that which has come before. Scholarship would be greatly enhanced if one could move easily from data to text and from text to data.

    The connections are, in some sense, even more vital for less sophisticated users of data resources. Researchers know that the results reported in a data table are the product of a complex, error-prone process. Students and others not aware of the nature of social science research may believe the analyses that appear in a publication come from some unquestionable scientific process and are to be copied down and believed. There is no lesson more important for a novice student in the social sciences to learn than the complexity and uncertainty of the research process that moves from data to published text.

    The major enhancement we plan for the VDC is the development of links to social science text. Imagine one is reading an article in a social science journal. Reference is made to a data set that is the basis of the article, or a footnote points to the work of a scholar whose data are relevant. The reader immediately locates the data and proceeds to check the findings in the article, or to go beyond what has been done. "The author should really have included this or that variable." "The author should have applied a different procedure." The reader can do it! Conversely, a researcher or student decides to write a paper on subject X. She finds a good data set. But has anyone else published on that topic based on these data? The student can call up the published works on the subject.

    The general goal is clear. Defining it more precisely is more difficult, and accomplishing it more difficult yet. One needs to locate a body of text and a body of data into which links can be placed, design a place in the data structures of each where the links can reside, and provide user interfaces to move from one to the other. In many cases, one would need the permission of the copyright holders of journals (though one might hope they would be permissive, since such links only enhance the value of their intellectual property). Naming schemes are needed to locate materials at both ends of the connection. Issues of user authentication and rights of access will be more complex, since users in varied locations would have to be linked to journals and data that also might be in varied locations.

    These are major and complex issues, but they are problems faced by many other digital library applications. We would, for instance, take advantage of developments in naming such as the Handle System being developed by CNRI. The database for the quantitative data will be the data under our control in the VDC -- or, to begin with, some selection of it. The database for the text may be the JSTOR collection of journals -- though, again, we might limit ourselves to the journals in one social science field. We have discussed these possibilities with both CNRI and JSTOR, and both are interested in pursuing a collaboration. This part of our long-term plan is not part of the current proposal. We cite it here because we think it is a very exciting next step, and we will design the VDC in consultation with these other entities so that the transition to this next stage will be possible. It is the intent of our project to explore how digital libraries can be used to deliver complete intellectual works in social science -- unifying data and research articles. We plan to create a digital library of such intellectual works.

  6. Coordinating with Other Institutions

  This project will build on the links we have established among social science data providers, data users, and the library community. This will be facilitated by the fact that this is a joint project between the Harvard-MIT Data Center (a center managed and used by active social scientists) and the Harvard University Library. The system will not be built in a vacuum; it will connect with Harvard's own digital library development.

    The Principal Investigators come from both the social sciences and the library. Sidney Verba is both Director of the Harvard University Library and a Professor of Government. Gary King is Director of the Harvard-MIT Data Center, a Professor of Government, and now also a member of the ICPSR Council.

    We are working closely with professionals at the ICPSR, as well as with others working in the field of digital information. Harvard is a founding member of the Digital Library Federation; we have the Federation's endorsement for this project, and we will explore with it ways in which the VDC can complement its initiatives to make high-profile data more widely available. We are working with the Corporation for National Research Initiatives to integrate its system for naming digital objects into the VDC. And, in the second phase of the project, we will work with JSTOR in an effort to join journal articles with sets of data.

    The Harvard-MIT Data Center serves both Harvard and MIT, and is in discussions with the University of Michigan about providing the same services to their faculty, students, and staff. Even without advertising our work, many other data centers have asked for early versions of our system when it becomes available, and have offered to be beta-testers.

    These contacts will be important throughout the development stage, both to remain consistent with developing standards and so that one or more of these institutions will eventually take over and institutionalize key parts of the project.

  7. In the Long Run

The Harvard-MIT Data Center, and its VDC prototype, is part of Harvard University's and MIT's library resources. We expect that, as well as extending the features of the current data center, and extending those features to other data centers, the new VDC will be the key to providing a much-sought-after "virtual union catalog" that indexes quantitative data at multiple institutions. Although Harvard will, for a time, keep such a catalog of the range of data holdings, the data themselves will remain distributed and controlled by the owning institutions. We will work closely with the Harvard and MIT libraries to integrate the VDC into the library system. If we are successful, we will not need to run the VDC indefinitely; its future will be ensured by its adoption at many sites inside and outside of Harvard. When the system becomes routine, we plan to transfer its operational management to another institution, such as the ICPSR and/or a consortium of participating universities.

Since the VDC will be capable of running on most modern personal computers and workstations, will be free and extensible, and will allow unprecedented access to the world's research data, we expect that this system will eventually be adopted by a large variety of research centers and universities. Since it will be based upon open, general protocols, such as DLIOP and Z39.50, the VDC will be a useful component of, and gateway to, federated digital libraries of many types. In addition, we will take a number of steps to encourage widespread use of the system: during the beta-testing phase, we will distribute the VDC to selected sites at other university data centers. We will also make the VDC freely available, and prepare installation programs that make it simple for researchers to install and run.

Widespread data sharing will have a broad and significant impact on research and teaching. Information access technology has the potential to speed the diffusion of research within and across disciplines. Easy access to the data at the root of published research will be an enormous boon for verification, extension of methodology, and secondary analysis. Easy access to real data will improve teaching.

We hope the VDC will also affect the norms of data sharing in academia. The sciences and social sciences have come to accept data sharing, at least in principle. There is also a movement in political science to provide data as a condition of publication, and a large number of journals have adopted such policies. In many other fields the premier journals -- such as Science, Nature, the American Economic Review, the Journal of the American Statistical Association, the Journal of the American Medical Association, Social Science Quarterly, Lancet (and other journals conforming to the "Uniform Requirements for Manuscripts Submitted to Biomedical Journals"), and the various journals published by the American Psychological Association -- all, in principle, make the provision of the important data on which an article is based a condition of publication. In addition, several government organizations, such as the National Science Foundation, require that data resulting from funded projects be made available to other users. Despite these requirements and benefits, much data is not shared, and the disciplines have been held back as a consequence. The norms of sharing data are weak, requirements are fuzzy, and verification of data sharing is lacking. By providing a system for quickly, easily, and verifiably sharing data, the VDC will enable more journals to adopt and enforce data sharing and replication policies. The VDC may even result in better public policy, by reducing the opportunity to manipulate the impact of a study by withholding the data on which it is based.