NDIIPP Partner Tools and Services Inventory
This is a list of tools and services designed, developed or used by NDIIPP partners during their projects. By making this list available, the Library encourages NDIIPP partners and others in the preservation community to share in, and take advantage of, the work and resources of our partners.
A
ACE (Audit Control Environment)
ACE is a prototype tool that validates the integrity of digital files through mathematical techniques. Its purpose is to ensure the authenticity of digital objects in long-term archives. ACE consists of a third-party Integrity Management Service (IMS), which issues integrity tokens for digital objects, and a local archive Audit Manager (AM), which periodically validates the repository. Consistency in ACE is guaranteed through the use of the archive-independent IMS to validate integrity tokens and through the publication of witness values to prove the correctness of the system.
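The token-and-audit idea can be sketched in a few lines of Python. This is an illustration of the concept, not ACE's own implementation (ACE is written in Java, and its tokens and witness values are produced by the IMS with a more elaborate scheme); the function names and the use of SHA-256 are assumptions made for the example.

```python
import hashlib

def integrity_token(data: bytes) -> str:
    """Stand-in for an IMS-issued integrity token: a digest of the object."""
    return hashlib.sha256(data).hexdigest()

def audit(data: bytes, stored_token: str) -> bool:
    """Audit Manager check: recompute the digest and compare to the token."""
    return integrity_token(data) == stored_token

def witness(tokens: list[str]) -> str:
    """Aggregate many tokens into one publishable witness value."""
    h = hashlib.sha256()
    for t in sorted(tokens):
        h.update(t.encode("ascii"))
    return h.hexdigest()
```

A periodic audit would recompute `integrity_token` for each stored object and flag any mismatch; publishing the `witness` value lets a third party confirm that the token set itself has not been altered.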
- Developer: University of Maryland
- Written in: Java
- OS and run-time environment: Web-based and platform independent. Requires Java 1.4 or greater.
- Application: (Demo) ACE Client and Mass Registerer, http://adaptwiki.umiacs.umd.edu/twiki/bin/view/Main/TimeStampingSystemClientDemo
- Documentation: http://adaptwiki.umiacs.umd.edu/twiki/bin/view/Main/ACEOverview
- License: To be decided
- Last tool update: 10/27/2007
Archive-It
Archive-It, a subscription service from the Internet Archive, allows institutions to build and preserve collections of born digital content. Through a web application, Archive-It partners can harvest, catalog, manage and browse their archived collections. Collections are hosted at the Internet Archive data center and are accessible to the public with full-text search. Over 65 memory institutions around the world partner with Internet Archive to archive the web using Archive-It.
- Developer: Internet Archive
- NDIIPP Project: Internet Archive
- Written in: Java
- OS and runtime environment: Web based
- Application: http://www.archive-it.org/
- Documentation: http://webteam.archive.org/confluence/display/ARIH/Welcome
- Licensing: Fee based
- Last updated: version 2.10 released September 9, 2008
B
BagIt
A format for transferring digital content. Content is packaged (the bag) along with a small amount of machine-readable text (the tag) to help automate the content's receipt, storage and retrieval. There is no software to install. A bag consists of a base directory containing the tag and a subdirectory that holds the content files. The tag is a simple text-file manifest, like a packing slip, that consists of two elements:
1. An inventory of the content files in the bag
2. A checksum for each file.
A slightly more sophisticated bag lists URLs instead of simple directory paths. A script then consults the tag, detects the URLs and retrieves the files over the Internet, ten or more at a time. This type of simultaneous multiple transfer reduces the overall data-transfer time. In another optional file, users can add content metadata.
- Developer: Library of Congress, California Digital Library
- NDIIPP Project: Web-at-Risk
- Written in: n/a
- OS and run-time environment: n/a
- Application: n/a
- Documentation: BagIt Specification (PDF, 83 KB)
- License: n/a
- Last tool update: 05/31/08
Bag Validator
The Bag Validator tool is a small Python script that validates a Bag, checking for files in the manifest that are missing from the disk, files on the disk that are not listed in the manifest, and duplicate entries in the manifest.
- Developer: Library of Congress
- Written in: Python
- OS and run-time environment: Unix
- Application: n/a
- Documentation: Contact Leslie Johnston at lesliej [at] loc.gov for information
- License: n/a
- Last tool update: 06/20/08
C
Conspectus Database for LOCKSS Private Network
LOCKSS software provides inexpensive digital preservation through replication of data storage in multiple locations; the Conspectus Database provides a central "catalog" with records that describe the data in each location. This is used by the MetaArchive project, whose group members are jointly developing:
- A prioritized survey of at-risk digital content held at the partner sites
- A harvested body of the most critical content at the partner sites to be preserved
- A distributed preservation network infrastructure based on the LOCKSS software.
The conspectus database is Web based, searchable, and browsable, and it requires only a login ID and password.
- Developer: Emory University
- NDIIPP Project: MetaArchive
- Written in: N/A. Web based.
- OS and run-time environment: N/A. Web based.
- Application: http://www.metaarchive.org/conspectus/
- Documentation: http://www.metaarchive.org/conspectus/
- License: n/a
- Last tool update: 11/28/2007
NOTE: The MetaArchive software engineer is currently working with the LOCKSS team to develop new tools to accomplish curation and monitoring tasks for Private LOCKSS Networks in Ruby on Rails. The software will likely be released under an open source license in the near future.
ContextMiner
ContextMiner is a framework for collecting, analyzing, and presenting contextual information along with the data. It is based on the idea that, when describing or archiving an object, contextual information helps make sense of the object and preserve it better.
- Developer: University of North Carolina at Chapel Hill, School of Information and Library Science
- NDIIPP Project: Vidarch
- Written in: N/A. Web based.
- OS and run-time environment: N/A. Web based.
- Application: http://www.contextminer.org/testdrive.php
- Documentation: http://www.contextminer.org/index.php
- License: n/a
- Last tool update: 11/17/2008
D
Dataverse Network
The Dataverse Network software is an open-source, digital library system for management, dissemination, exchange, and citation of virtual collections (dataverses) of quantitative data. Dataverses can be used or administered through Web-based clients that communicate with a host Dataverse Network.
A Dataverse Network, usually running at a major institution, requires installation of application software. Individual dataverses are self-contained virtual data archives, served out by a Dataverse Network, appearing on the Web sites of their owners (e.g., individuals, departments, projects, or publications). Dataverses are branded in the style of the owning entity, but are easy to set up, require no local software installations, and offer the services of a modern archive controlled by the dataverse owner. Data is displayed in a hierarchy; descriptive information (using DDI, Data Documentation Initiative) can be searched.
Depending on the policies of the dataverse owner, end users may be able not only to download files, but also to extract subsets, and perform statistical analysis online. Dataverses and Dataverse Networks can federate with each other, and with other systems through open protocols (OAI-PMH and Z39.50).
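Federation over OAI-PMH works by one system harvesting another's record identifiers and metadata through simple HTTP requests. The sketch below builds a standard `ListIdentifiers` request and parses the identifiers out of a response; the endpoint URL is hypothetical, and this is an illustration of the protocol, not Dataverse Network code.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_identifiers_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the ListIdentifiers request a harvesting peer would send."""
    return base_url + "?" + urlencode(
        {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix})

def parse_identifiers(response_xml: str) -> list[str]:
    """Pull record identifiers out of a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iter(OAI_NS + "identifier")]
```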
- Developer: Institute for Quantitative Social Science, Harvard University
- NDIIPP Project: Data-PASS
- Written in: Java Platform, Enterprise Edition (Java EE) 5, including Enterprise Java Beans (EJB) 3 and Java Server Faces.
- OS and run-time environment: Individual dataverses are managed and used through Web-based graphic user interface. Software installations are required only to create an entire Dataverse Network. Software runs on top of the Glassfish Application Server. Harvard uses PostgreSQL for database software. The data analysis component uses R and Zelig for statistical computing.
- Documentation: http://thedata.org/
- Application: http://dvn.iq.harvard.edu/
- License: Gnu Affero General Public License, version 3: http://gplv3.fsf.org/comments/agplv3-draft-1.html (a version of GPLv3: http://gplv3.fsf.org/).
- Last tool update: See http://thedata.org/software/releases and http://sourceforge.net/project/showfiles.php?group_id=194383
Digital Archive
The Digital Archive provides a secure storage environment to manage and monitor the health of master files and digital originals. It also provides a managed storage environment for digital master files that fits in with the workflows for acquiring digital content.
For users of CONTENTdm(R) (either hosted or direct) the Digital Archive is an optional capability integrated with the various workflows for building collections. Master files are secured for ingest to the Archive using the CONTENTdm Acquisition Station, the Connexion digital import capability, and the Web Harvesting service.
For users of other content management systems the Digital Archive provides a low-overhead mechanism for safely storing master files.
- Developer: OCLC
- Written in: Java
- OS and Runtime Environment: Linux, MySQL, Apache, Tomcat
- Application: http://oclc.org/digitalarchive
- Documentation: http://www.oclc.org/digitalarchive/support/default.htm
- Licensing: Fee based
- Last updated: Updated on a regular six-week development cycle.
DiscoverInfo
DiscoverInfo is a tool to explore a collection of documents. The tool enables the user to:
- Search: Run a full-text search in the collection. DiscoverInfo indexes text, HTML, XML, and PDF documents.
- Browse: Discover documents by clicking through term clouds built from term occurrences in the collection and across individual documents.
- Discover: Retrieve relevant information from the indexed collection, and evaluate the novelty of information in each document with respect to the other documents in the collection.
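The term clouds behind the browse feature amount to weighting terms by how often they occur. A minimal sketch of that computation (an illustration only, not DiscoverInfo's code; the tokenization and weighting scheme are assumptions):

```python
from collections import Counter
import re

def term_cloud(documents: list[str], top_n: int = 5) -> dict[str, float]:
    """Weight the most frequent terms across a collection, scaled so the
    most common term gets weight 1.0 (a proxy for its display size)."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: round(count / max_count, 2)
            for term, count in counts.most_common(top_n)}
```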
- Developer: University of North Carolina at Chapel Hill, School of Information and Library Science
- NDIIPP Project: Vidarch
- Written in: N/A. Web based.
- OS and run-time environment: N/A. Web based.
- Application: http://idl.ils.unc.edu/~chirag/DIToolkit/
- Documentation: http://idl.ils.unc.edu/~chirag/DiscoverInfo/index.html
- License: n/a
- Last tool update: 2/11/2007
E
EchoDep Hub and Spoke Framework Tool Suite
With a set of simple tools, Hub and Spoke provides a method for exchanging digital files and metadata among different types of digital management systems built on different platforms. It provides basic interoperability between repositories via a common METS-based profile, a standard programming API, and a series of scripts that use the API and METS profile for creating SIPs and DIPs that can be used across different repositories. Key architectural components are:
The METS profile, which remains mostly neutral regarding content files and structure but defines a minimum level of descriptive (MODS) and administrative (PREMIS) metadata, with an emphasis on preserving technical data and provenance.
The REST-based Lightweight Repository Create, Retrieve, Update, and Delete Service (LRCRUDS), which maps URIs to local identifiers and uses HTTP methods (PUT, GET, POST, and DELETE) to handle packages submitted to or disseminated from a repository. Packages are shipped as Zip archives containing a header, a METS file, and content files in a format suitable for repository import.
The Hub, which converts from and to the METS profile and manages generation and validation of technical and provenance metadata. Initially the Hub is a package-staging area; the goal is to develop the Hub into a digital repository capable of disseminating packages and handling submissions from other repositories.
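A submission in this style can be sketched in Python: bundle the METS file and content into a Zip archive, then PUT it to a repository URI. The URL pattern, package identifier, and file names here are assumptions for illustration; the LRCRUDS specification linked below defines the actual mapping.

```python
import io
import zipfile
from urllib.request import Request

def build_sip(mets_xml: bytes, content: dict[str, bytes]) -> bytes:
    """Package a METS file and content files as a Zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mets.xml", mets_xml)
        for name, data in content.items():
            zf.writestr(name, data)
    return buf.getvalue()

def submit_request(base_url: str, package_id: str, package: bytes) -> Request:
    """A PUT to the package URI submits (creates or updates) the package."""
    return Request(f"{base_url}/{package_id}", data=package,
                   headers={"Content-Type": "application/zip"}, method="PUT")
```

Retrieving a dissemination package would be the corresponding GET on the same URI, and DELETE would remove it, which is the whole point of mapping repository operations onto plain HTTP methods.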
- Developer: University of Illinois, Urbana-Champaign
- NDIIPP Project: ECHO DEPository: Exploring Collaborations to Harness Objects with a Digital Environment for Preservation
- Written in: An interpreted language (Java, Perl)
- OS and run-time environment: OS Independent
- METS Profile: http://www.loc.gov/standards/mets/profiles/00000015.html
- LRCRUDS: http://dli.grainger.uiuc.edu/echodep/HnS/LRCRUDS.htm
- Application: http://sourceforge.net/projects/echodep/
- Documentation: http://dli.grainger.uiuc.edu/echodep/hands/
- License: University of Illinois/NCSA Open Source License, http://www.opensource.org/licenses/UoI-NCSA.php
- Last updated: Version 0.5 released 02/2008. Continuing development and maintenance releases ongoing.
F
Federated Archive Cyberinfrastructure Testbed (FACIT)
FACIT is a technology testbed that explores the use of geographically distributed storage in a networked environment. It builds on logistical networking technology (see http://loci.cs.utk.edu/) using the Internet Backplane Protocol (IBP) (see http://loci.cs.utk.edu/ibp/) to provide a generic interface for managing distributed storage resources. Each FACIT archive will use L-Store (see http://www.lstore.org) to manage data storage in both its private infrastructure and in the shared storage pool that the federation makes available.
Using L-Store, and leveraging IBP, FACIT archives will automatically mirror each other's content to provide fault-tolerance and increased accessibility. For its wide area storage infrastructure, FACIT archives will participate in the larger Research and Education Data Depot Network (REDDnet) storage network (see http://www.reddnet.org/). Since REDDnet is based on IBP and supports L-Store, FACIT archives will have seamless access to this larger, shared pool of storage.
- Developer: UCSB, Vanderbilt, UTK
- NDIIPP Project: National Geospatial Digital Archive
- Written in: L-Store is written in Java; IBP is written in C
- OS and runtime requirements: Linux, Unix
- Application: Command line interface; GUI in development.
- Documentation: http://www.ngda.org/FACIT.php
- License: Berkeley BSD
- Last update: March 2008
G
GIS Archiving Toolset
The Toolset prepares vector and raster datasets for archive ingest. Basic pre-ingest functions include limited format validation, fidelity management, virus scanning, dataset characterization, metadata creation and remediation, and SIP organization.
- Developer: NCSU
- NDIIPP Project: North Carolina Geospatial Data Archiving Project
- Written in: Python
- OS and runtime requirements: The Toolset was written to run cross-platform, but has only been tested in Linux. Core requirements are met by Python. Extended functionality requires calls to external applications including ClamAV, NOID, 4Suite XML, Unix File, and JHOVE.
- Application: Tool is not shared.
- Documentation: Tool is not shared.
- License: Tool is not shared.
- Last update: 3/5/2008
H
Heritrix
Heritrix is a flexible, extensible, robust, and scalable Web crawler capable of fetching, archiving, and analyzing Internet-accessible content.
- Developer: Internet Archive
- NDIIPP Project: Internet Archive
- Written in: Java
- OS and runtime requirements: Requires a Java Runtime Environment (JRE, http://www.java.com/en/download/index.jsp), version 5.0 or later. Default heap size is 256 MB RAM. Heritrix is not tested, packaged, or supported on platforms other than Linux at this time.
- Application: http://crawler.archive.org
- Documentation: http://crawler.archive.org/articles/user_manual and http://webteam.archive.org/confluence/display/Heritrix/Home
- License: GNU Lesser General Public License 2.1 (http://crawler.archive.org/license.html); migrating to Apache License 2.0 in future
- Last update: 2/20/2008
I
integrated Rule-Oriented Data System (iRODS)
iRODS is a data grid that allows the end-user powerful control over storage management policies and procedures through definition of business rules tailored to the characteristics of the files being managed. It provides an abstraction for data management processes and policies in the same way that the Storage Resource Broker provided abstractions for data objects, collections, resources, users and metadata, but is flexible and customizable.
This is accomplished by coding the processes as micro-services that are controlled by explicit rules. Management policies are mapped to sets of rules. Management processes are mapped to sets of micro-services. Assessment criteria are mapped to queries on the persistent state information generated by execution of each micro-service. A distributed rule engine is installed at each storage location to ensure enforcement of policies independently of the choice of access mechanism. iRODS architecture features include:
- Peer-to-peer data grid servers, based on a client/server model and distributed storage resources
- A database system, for maintaining the attributes and states of data and operations
- A rule system, for enforcing and executing adaptive rules
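The rule-to-micro-service mapping can be illustrated with a toy rule engine. This is a conceptual sketch only (iRODS is written in C and its rule language is far richer); the rule conditions and micro-service names here are invented for the example.

```python
def checksum_service(obj: dict) -> dict:
    """Micro-service: record that the object has been checksummed."""
    obj["checksummed"] = True
    return obj

def replicate_service(obj: dict) -> dict:
    """Micro-service: make one more replica of the object."""
    obj["replicas"] = obj.get("replicas", 0) + 1
    return obj

# A management policy expressed as rules: (condition, micro-services to run).
RULES = [
    (lambda o: o.get("collection") == "archive", [checksum_service, replicate_service]),
    (lambda o: True, [checksum_service]),  # default policy for everything else
]

def apply_rules(obj: dict) -> dict:
    """Fire the first rule whose condition matches, running its
    micro-services in order, as a distributed rule engine would."""
    for condition, services in RULES:
        if condition(obj):
            for service in services:
                obj = service(obj)
            return obj
    return obj
```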
- Developer: San Diego Supercomputer Center
- Written in: iRODS servers are written in C; clients are written in the language appropriate to each interface: a Java I/O library, a PHP Web browser interface, and a Python Web browser interface.
- OS and runtime environment: Linux, Solaris, Macintosh, and AIX. The iCAT Platforms page at http://irods.sdsc.edu/index.php/iCAT_Platforms lists the supported operating systems and configurations for iCAT-enabled servers. Currently either a PostgreSQL or Oracle database may be used for managing state information.
- Application: http://irods.sdsc.edu/index.php/Downloads
- Documentation: http://irods.sdsc.edu/index.php/Documentation
- Licensing: BSD open source (http://irods.sdsc.edu/index.php/License)
- Last updated: Release 1.0 on Jan 23, 2008
J
JSTOR/Harvard Object Validation Environment
JHOVE is an extensible system designed to provide automated and efficient identification and validation of the format of digital files with minimal human intervention. JHOVE can:
- Identify the format to which a digital object conforms
- Determine the compliance of an object to its format's specification, both in terms of syntax (well-formedness) and semantics (validity)
- Characterize an object in terms of its format-specific significant properties
JHOVE defines a Java API and also provides a stand-alone application that runs in either command line or GUI mode. JHOVE supports the following formats: AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, UTF-8, WAVE, and XML.
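The identification step can be illustrated with file signatures ("magic numbers"). This is a simplified Python sketch of the idea, not JHOVE's Java implementation; real validation goes much further, parsing each format's full syntax and semantics. The signatures shown are the standard magic numbers for a few of the supported formats.

```python
# (magic-byte prefix, format name) for a few formats JHOVE supports
SIGNATURES = [
    (b"GIF87a", "GIF"), (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"), (b"MM\x00*", "TIFF"),
]

def identify(data: bytes) -> str:
    """Identify the format a byte stream appears to conform to."""
    for magic, fmt in SIGNATURES:
        if data.startswith(magic):
            return fmt
    return "unknown"
```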
- Developer: Harvard University
- Written in: Java 1.4
- OS and run-time environment: JHOVE should be usable on any UNIX, Windows, or OS X platform with an appropriate J2SE installation. It should run on any operating system that supports Java 1.4 and has a directory-based file system.
- Application: The download includes both a command line (http://hul.harvard.edu/jhove/using.html#invocation) and a GUI (http://hul.harvard.edu/jhove/using.html#gui) version.
- Documentation: http://hul.harvard.edu/jhove/documentation.html
- License: GNU Lesser General Public License (LGPL) (http://www.gnu.org/licenses/lgpl.html)
- Last update: 12/17/2007
K
L
L-Store (Logistical Storage)
L-Store is low-level system software that leverages the basic powerful protocols of the Internet to move and manage large chunks of data through digital networks, much as the Internet moves and manages emails and other traffic. It is built on the Internet Backplane Protocol (IBP) (see http://loci.cs.utk.edu/ibp/). The L-Store client provides a storage framework for distributed, scalable, and secure access to data. It is to be used on the Research and Education Data Depot Network (REDDnet) infrastructure (see http://www.reddnet.org/). L-Store is designed to provide:
- High scalability in both raw storage and associated file system metadata
- A decentralized management system
- Security
- Fault-tolerant metadata support
- User-controlled replication and striping of data on file and directory level
- Scalable performance in both raw data movement and metadata queries
- A virtual file system interface in both a Web and command line form
- Support for the concept of geographical locations for data migration to facilitate quicker access.
- Developer: Vanderbilt University [University of Tennessee?]
- Written in: Java
- OS and runtime requirements: Java 1.6 or better
- Application: http://www.lstore.org/pwiki/pmwiki.php?n=Docs.CLI-ClientIntro (latest client)
- Documentation: http://www.lstore.org/pwiki/pmwiki.php?n=Docs.CLI-ClientIntro
- License: BSD: http://www.opensource.org/licenses/bsd-license.php
- Last tool update: 2/27/2008
LOCKSS
LOCKSS provides inexpensive digital preservation through replication of data storage in multiple locations. Copies of the same content in multiple LOCKSS replicas are automatically compared to each other, and can be repaired by the comparisons automatically.
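The compare-and-repair idea can be sketched as a majority vote over replica digests. This is a deliberate simplification for illustration: the actual LOCKSS polling protocol is considerably more sophisticated (it is designed to resist adversaries, not just bit rot), and the function below is not LOCKSS code.

```python
from collections import Counter
import hashlib

def repair(replicas: list[bytes]) -> list[bytes]:
    """Poll the replicas: the content held by the majority wins,
    and disagreeing copies are repaired to match it."""
    digests = [hashlib.sha256(r).hexdigest() for r in replicas]
    winner_digest, _count = Counter(digests).most_common(1)[0]
    winner = replicas[digests.index(winner_digest)]
    return [winner for _ in replicas]
```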
- Developer: Stanford University
- Written in: Java
- OS and run-time environment: All POSIX (Linux/BSD/UNIX-like OS), Linux. Most LOCKSS installations use a CD which bundles the LOCKSS daemon with an operating system based on OpenBSD. The LOCKSS team also supports running the daemon on RPM-based Linux distributions and on Solaris. The LOCKSS daemon can run in any environment with a Java VM 1.5 or above and a Unix-like file system. The hosting PC needs at least 1 GB of memory, a CD drive, and at least 250 GB of storage. The current CD distribution supports parallel (PATA) and serial (SATA) ATA and SCSI drives. On Linux and Solaris the daemon can use the full set of storage options.
- Application: http://sourceforge.net/projects/lockss/
- Documentation: http://www.lockss.org/lockss/Installing_LOCKSS
- License: BSD, http://www.lockss.org/lockss/Software_License
- Last tool update: LOCKSS boxes check for software updates daily. The daemon is updated every 6-8 weeks; the last such release was 12/20/07. The CD is updated about every 6 months; the last such release was 07/06/07.
Logistical Distribution Network (LoDN)
An experimental content distribution tool. LoDN allows users to store content on REDDnet and to manage or retrieve that stored content without installing anything or learning any complicated software. LoDN comprises three elements: 1) an Upload client and 2) a Download client (powered by Java Web Start) for storing and retrieving data, and 3) a Web interface for managing stored data and browsing public content.
LoDN uses the Logistical Networking infrastructure provided by the Internet Backplane Protocol (IBP) (see http://loci.cs.utk.edu/ibp/) deployed on REDDnet (http://www.reddnet.org) to store file content on IBP storage "depots." Content publishers can use LoDN's Web interface to manage stored data. Content distributors can make LoDN data files available by including an active LoDN link on a Webpage, in an email, or through the LoDN content directory. Users access a file by clicking a LoDN link, thereby starting the LoDN Download Client, and then using the download client to retrieve the file content directly from IBP storage.
- Developer: UTK
- Written in: Web based, uses Java Webstart
- OS and runtime requirements: Any Java-capable platform, Java 1.4.2 or later
- Application: https://ln.eecs.utk.edu/lodn/
- Documentation: https://ln.eecs.utk.edu/lodn/
- License: BSD, http://www.opensource.org/licenses/bsd-license.php
- Last update: 2/27/2008
M
N
National Geospatial Digital Archive Tools: main page
NGDA Tools is a suite of tools for graphical search and display of geospatial and map data.
http://www.ngda.org/research.php
NGDA/ Alexandria Digital Library: ADL Middleware Server
A distributed, peer-to-peer software component that provides mediated access to digital library collections.
- Developer: UCSB
- NDIIPP Project: National Geospatial Digital Archive
- Written in: Java and Python
- OS and runtime environments: It can be run as a Web application inside a servlet container, as an RMI server, or both; has been tested and run under Tomcat in Windows, *nix, and MacOSX. Build requirements include Java and the Apache Ant build tool. Python modules are run inside of Java through an interpreter, so Python is not a requirement.
- Application: Not directly accessible to the public. Users can send queries through a User Interface, http://clients.alexandria.ucsb.edu/globetrotter/
- Documentation: http://www.alexandria.ucsb.edu/~gjanee/middleware/
- License: Open Source for non-commercial use with attribution; see source code for details. We use CVS (Concurrent Versioning System) to store the most up-to-date versions of our code. If you are interested in downloading the source code for this tool, please contact programmers@library.ucsb.edu.
- Last update: 3/2007
NGDA/ Alexandria Digital Library: Globetrotter
A Google Maps-based Web client for the Alexandria Digital Library middleware. Globetrotter enables a user to perform spatial searches on spatial data. The user can tune his or her search by adjusting a number of different constraints.
- Developer: UCSB
- NDIIPP Project: National Geospatial Digital Archive
- Written in: The client is written in XHTML, JavaScript, XSLT, and the Velocity Templating language.
- OS and runtime environment: Runs under Tomcat, tested only on *nix. Build requirements include Java (1.5 or 1.6) and the Apache Ant build tool.
- Application: http://clients.alexandria.ucsb.edu/globetrotter/
- Documentation: http://clients.alexandria.ucsb.edu/globetrotter/
- Licensing: Open Source for non-commercial use with attribution; see source code for details. We use CVS (Concurrent Versioning System) to store the most up-to-date versions of our code. If you are interested in downloading the source code for this tool, please contact programmers@library.ucsb.edu.
- Last Updated: Production release on 10/2007 (v1.1)
NGDA: Format Registry
A wiki-based expert community Web site for collaborative description of geospatial formats.
- Developer: UCSB
- NDIIPP Project: National Geospatial Digital Archive
- Written in: Built out of the Mediawiki software: http://www.mediawiki.org. Written in PHP.
- OS and runtime environment: Tested and released on *nix with Apache 2, PHP 5, and MySQL 5.
- Application: http://ngda.library.ucsb.edu/format
- Documentation: http://ngda.library.ucsb.edu/format/index.php/Help:The_Process (Community Participation Rules), http://ngda.library.ucsb.edu/format/index.php/FormatRegistry:FlatSpace (FlatSpace Extension)
- Licensing: Open Source for non-commercial use with attribution. We use CVS (Concurrent Versioning System) to store the most up-to-date versions of our code. If you are interested in downloading the source code for this tool, please contact programmers@library.ucsb.edu.
- Last updated: May 2007
NGDA: NGDA Server
Software responsible for the creation of Archive Objects within the archive. Accepts requests with attached data, and properly formats and places that data within the archive.
- Developer: UCSB
- NDIIPP Project: National Geospatial Digital Archive
- Written in: Written in Java using the Spring Framework. The Spring Framework is an open-source Web application framework that works with any servlet container.
- OS and runtime: Runs in a servlet container. Tested and run under Tomcat 5 in Windows and *nix. Build requirements include Java (1.5 or 1.6) and the Apache Ant build tool.
- Application: The NGDA Server is not available for public use.
- Documentation: http://www.ngda.org/research.php
- Licensing: Open source for non-commercial applications with attribution. We use CVS (Concurrent Versioning System) to store the most up-to-date versions of our code. If you are interested in downloading the source code for this tool, please contact programmers@library.ucsb.edu.
- Last updated: 12/2006
NGDA: Bulk Ingest Tool
A tool used for preparing large collections of data for addition to the archive. After the user has created a template and a configuration file, the Ingest Tool is able to collect files and other data and tie them to an Archive Object identifier. This information is later used to create objects within the Archive itself.
- Developer: UCSB
- NDIIPP Project: National Geospatial Digital Archive
- Written in: Java
- OS and runtime environment: Uses a MySQL database for persistent data storage. Users will need to have a user account with write access to a MySQL database. Tested in Windows and *nix. Requires Java (1.5 or 1.6). Current build runs in NetBeans, but code is not dependent on NetBeans as a platform.
- Application: An offline tool.
- Documentation: http://www.ngda.org/research.php
- Licensing: Open source for non-commercial applications with attribution. We use CVS (Concurrent Versioning System) to store the most up-to-date versions of our code. If you are interested in downloading the source code for this tool, please contact programmers@library.ucsb.edu.
- Last updated: 3/2007
NGDA: Workflow Tool
A GUI-based tool for taking items prepared by the Bulk Ingest Tool and inserting them into the Archive.
- Developer: UCSB
- NDIIPP Project: National Geospatial Digital Archive
- Written in: Java
- OS and runtime environment: As a Java-based GUI, the Workflow Tool should work on any OS that supports Java. Tested under Windows XP and Ubuntu with Java 1.5 and 1.6.
- Application: An offline tool.
- Documentation: To be posted.
- Licensing: Freely available for non-commercial use with attribution. We use CVS (Concurrent Versioning System) to store the most up-to-date versions of our code. If you are interested in downloading the source code for this tool, please contact programmers@library.ucsb.edu.
- Last updated: 10/2007
NGDA: ArchiveView
A service that provides a consistent, styled view of objects in the NGDA Archive. XSLT stylesheets can be added or customized to change the available views of an object.
- Developer: UCSB
- NDIIPP Project: National Geospatial Digital Archive
- Written in: Java
- OS and runtime environment: ArchiveView is powered by a servlet written in Java and requires a servlet container. Tested under Tomcat 5 in Windows and *nix. Build requirements include Java (1.5 or 1.6) and the Apache Ant build tool.
- Application: http://www.ngda.org/ArchiveView/
- Documentation: Currently only documented within the code
- Licensing: Freely available for non-commercial use with attribution. We use CVS (Concurrent Versioning System) to store the most up-to-date versions of our code. If you are interested in downloading the source code for this tool, please contact programmers@library.ucsb.edu.
- Last updated: 5/2007
NutchWAX
Software for indexing ARC files (archived Web sites gathered using Heritrix) for full text search. NutchWAX is based on the open-source Web-search software, Nutch.
- Developer: Internet Archive
- NDIIPP Project: Internet Archive
- Written in: Java
- OS and runtime environment: Platform-independent Java, though only tested and primarily used on Linux machines.
- Application: http://archive-access.sourceforge.net/projects/nutchwax/
- Documentation: http://archive-access.sourceforge.net/projects/nutchwax/apidocs/overview-summary.html
- Licensing: GNU Lesser General Public License 2.1; Nutch itself is under Apache License 2.0. Future goal is to merge all NutchWAX functionality into Nutch.
- Last updated: 1/17/07
O
P
Parallel Retriever
The Parallel Retriever is a simple Python-based wrapper around wget and rsync that produces a package conforming to the BagIt spec when given a file manifest and a "fetch.txt" file. It has been used to transfer content from several transfer partners hosting rsync and HTTP servers, at rates exceeding 200 Mbps over Internet2. It was initially built specifically for Internet Archive rsync transfers, but was extended to support the BagIt spec, and HTTP as well as rsync.
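The core pattern (read fetch.txt, run many transfers at once) can be sketched as below. This is an illustration, not the Library's tool: the fetch.txt line format (URL, expected length or "-", destination path) follows the BagIt spec, but the fetcher is left pluggable here rather than shelling out to wget or rsync.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_fetch(text: str) -> list[tuple[str, str, str]]:
    """Each fetch.txt line gives a URL, an expected length ('-' if
    unknown), and the path the file should occupy inside the bag."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            url, length, path = line.split(None, 2)
            entries.append((url, length, path))
    return entries

def retrieve_all(entries, fetch, workers=10):
    """Run up to `workers` transfers at once; `fetch` performs one
    transfer (e.g., an HTTP GET or an rsync invocation)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: fetch(*e), entries))
```

Running ten or more transfers concurrently is what lets the tool saturate a fast link, which matches the throughput figures reported above.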
- Developer: Library of Congress
- Written in: Python
- OS and run-time environment: Unix
- Application: http://sourceforge.net/projects/loc-xferutils/
- Documentation: http://sourceforge.net/projects/loc-xferutils/
- License: n/a
- Last tool update: 12/18/08
PAWN
Producer-Archive Workflow Network (PAWN) is a workflow system designed for individuals who have small collections of digital files that need to be processed into preservation systems for management and future access. PAWN is not a long-term archiving or content-management system; rather, it is a flexible environment that can map the requirements of different producers into various archival states. It can be used to provide bulk ingestion from distributed producers into an archive. A gateway to a Storage Resource Broker (SRB) archive environment is already in place; gateways to other commonly used digital management systems (Fedora and DSpace) are planned. Components include:
- Client: ingests data, manages users and record organization, and triggers transfer into an archive.
- Management server: tracks accounts, record schedules, record sets, and package lists, and provides security for multiple domains.
- Scheduler: allocates space on a receiving server for the transfer; also controls security and configuration for receiving servers.
- Receiving server: receives data from clients into a package, allows modification of data depending on user credentials, and transfers data to a backend archive at the direction of an approved user.
- Developer: University of Maryland
- Written in: Java
- OS and runtime environment: Web-based application. Requires Java 1.5 (Java 5) or higher, an account on a PAWN manager, and a keystore to secure traffic through PAWN.
- Application: http://adaptwiki.umiacs.umd.edu/twiki/bin/view/Main/PawnDemoClient (client software, keystore, and demonstration accounts)
- Documentation: http://narawiki.umiacs.umd.edu/twiki/bin/view/Main/PAWN
- Licensing: No point release of PAWN is available yet; anyone interested in using PAWN can contact joseph@umiacs.umd.edu.
- Last updated: Demo last updated to version 0.6.2 in 01/2008.
Q
R
Replication Monitor and Verification
The Replication Monitor is currently designed to monitor copies of data in a federated SRB installation. The monitor periodically checks a master site for new data and ensures that copies are created at designated sites. Each replica site operates independently of the others, ensuring that replication will occur even if the entire data grid is in a degraded state. Every action on data in a collection is logged extensively. In addition, the monitor provides a Web interface for quickly reporting the current state of a distributed collection and its copies. Extensions of the current tools to other distributed environments are planned.
- Developer: University of Maryland
- Written in: Java
- OS and runtime environment: Web-based application. Installation requires Tomcat 6.0+, mysql 4.1+ and Java 1.6+.
- Application: http://adaptwiki.umiacs.umd.edu/twiki/bin/view/Main/SrbRepMon2
- Documentation: same as for application
- Licensing: source code TBD; binaries are available for download without restriction
- Last updated: January 2008
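The core check the monitor performs each cycle can be sketched as follows. This is a simplified illustration, not the actual SRB-aware code: site holdings are modeled here as plain dictionaries mapping object names to checksums, and the function merely reports which replicas are missing or stale.

```python
# Illustrative sketch of the replication-check cycle: compare the
# master site's holdings against each replica and report objects a
# replica lacks or holds with a different checksum. All data
# structures here are hypothetical stand-ins for SRB catalog queries.
def missing_replicas(master, replicas):
    """Return {site: [object names needing (re)replication]}."""
    report = {}
    for site, holdings in replicas.items():
        bad = [name for name, checksum in master.items()
               if holdings.get(name) != checksum]
        if bad:
            report[site] = sorted(bad)
    return report
```

A real monitor would run this check on a schedule, queue copy operations for each reported object, and log every action taken.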
S
Storage Resource Broker
The Storage Resource Broker (SRB) is a software tool that allows end users to organize their digital files in a way that is meaningful to them, without having to be knowledgeable about the underlying storage technologies. Data may be stored in file systems, tape archives, object-relational databases, and object ring buffers. State information is maintained for each registered entity, enabling uniform access support.
A Metadata Catalog (MCAT) supports retrieval based on queries on attributes instead of physical names and/or locations. The logical name used to identify a file does not change as the file is moved to other storage systems. The access controls on the file do not change as the file is moved, and the metadata associated with the file remain attached to the file or directory.
- Developer: San Diego Supercomputer Center
- Written in: The SRB servers are written in C. Clients are available in several languages: Perl and Python load libraries, a Java I/O library, and C library calls.
- OS and runtime environment: SRB has been ported to UNIX platforms including Linux, Mac OS X, AIX (e.g., SP-2 machines), Solaris, SunOS, SGI IRIX, and Windows. Setting up an MCAT-enabled SRB requires an Oracle, DB2, Sybase, MySQL, or PostgreSQL database. The SRB software itself requires only about 200 MB of storage; for MCAT-enabled servers, the DBMS will require additional space (on Linux, for example, SRB with PostgreSQL and ODBC takes about 700 MB). Any Linux system with a 1.5 GHz CPU should perform well, and 512 MB to 1 GB of memory is sufficient. For a heavily loaded SRB instance, it is best to use a commercial DBMS such as Oracle; PostgreSQL works fine for initial testing and light to moderate data loads.
- Application: http://www.sdsc.edu/srb/index.php/Downloads
- Documentation: http://www.sdsc.edu/srb/index.php/Documentation
- Licensing: Freely available only to academic organizations and government agencies through a source code distribution. http://www.sdsc.edu/srb/index.php/Client_License
- Last updated: Release 3.5 on Dec 3, 2007
T
TubeKit
TubeKit is a toolkit for creating YouTube crawlers. It allows the user to build a tool that can crawl YouTube based on a set of seed queries and collect up to 17 different attributes. TubeKit assists in all the phases of the process, from database creation to browsing and searching interfaces that provide access to the collected data.
- Developer: University of North Carolina at Chapel Hill, School of Information and Library Science
- NDIIPP Project: Vidarch
- Written in: PHP. Web based.
- OS and run-time environment: N/A. Web based.
- Application: http://www.tubekit.org/download.php
- Documentation: http://www.tubekit.org/index.php
- License: n/a
- Last tool update: 10/5/2008
U
V
VerifyIt
The VerifyIt tool is a script that verifies an MD5 BagIt manifest using 11 parallel md5sum processes.
- Developer: Library of Congress
- Written in: Shell script
- OS and run-time environment: Unix
- Application: n/a
- Documentation: Contact Leslie Johnston at lesliej [at] loc.gov for information
- License: n/a
- Last tool update: 07/22/08
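A rough Python equivalent of the manifest-verification idea (the real VerifyIt is a Unix shell script driving parallel md5sum processes; `verify_manifest` below is a hypothetical stand-in using a thread pool) might look like this:

```python
# Sketch of BagIt MD5 manifest verification. Assumes the standard
# manifest-md5.txt format: each non-empty line is "CHECKSUM FILENAME".
# This is NOT the VerifyIt script; it approximates the same idea.
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def md5_of(path):
    """Compute a file's MD5 digest, reading in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_manifest(bag_dir, manifest="manifest-md5.txt", workers=11):
    """Return the names of files whose MD5 does not match the manifest."""
    entries = []
    with open(os.path.join(bag_dir, manifest)) as f:
        for line in f:
            if line.strip():
                checksum, name = line.split(None, 1)
                entries.append((checksum.lower(), name.strip()))
    # Hash files concurrently, mirroring VerifyIt's 11 parallel workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = pool.map(md5_of,
                           (os.path.join(bag_dir, n) for _, n in entries))
    return [name for (expected, name), got in zip(entries, digests)
            if got != expected]
```

An empty return value means every file in the bag matched its manifest checksum.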
W
Wayback Machine
The Wayback Machine is a powerful search and discovery tool for use with collections of Web site "snapshots" collected through Web harvesting, usually with Heritrix (ARC or WARC files).
- Developer: Internet Archive
- NDIIPP Project: Internet Archive
- Written in: Java
- OS and runtime environment: Platform independent. Apache Tomcat is the only server under which Wayback has been tested and is known to work; other servers may work, but they are neither tested nor supported.
- Application: http://archive-access.sourceforge.net/projects/wayback/
- Documentation: http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html
- Licensing: GNU Lesser General Public License 2.1 (http://archive-access.sourceforge.net/projects/wayback/license.html); migrating to Apache License 2.0 in future
- Last updated: 2/6/2008
Web Archives Workbench
The Web Archives Workbench is a suite of Web capture tools based on the principle of managing archived content in aggregates rather than as individual objects. The suite comprises:
- Discovery Tool, which helps identify potentially relevant Web sites by crawling "seed" entry points and generating a list of the domains they link to
- Properties Tool, which enables you to maintain information about content creators, associate them with the Web sites they are responsible for, and enter high-level metadata
- Analysis Tool, which enables you to examine a Web site's structure to see what kind of content its file directory represents
- Harvest Tool, which is used to monitor crawl status, review and modify harvest settings, and package harvests for transfer to a repository. The Harvest Tool also offers a separate Quick Harvest feature, which schedules one-time harvests of content.
Harvest packages are encoded in METS with Dublin Core metadata embedded.
- Developer: OCLC
- NDIIPP Project: ECHO DEPository: Exploring Collaborations to Harness Objects with a Digital Environment for Preservation
- Written in: Java, JavaScript, JSP
- OS and runtime environment: Linux
- Application: Download from SourceForge, http://sourceforge.net/projects/webarchivwkbnch
- Documentation: Available on SourceForge
- Licensing: Available on SourceForge
- Last updated: 9/12/2007
Web Archiving Service
The Web Archiving Service (WAS) is a Web-based curatorial tool that enables libraries and archivists to capture, curate, analyze, and preserve Web-based government and political information. The WAS allows users to set parameters of Web crawls, capture sites, provide metadata for archived sites, and build collections of archived Web sites.
- Developer: California Digital Library
- NDIIPP Project: Web-at-Risk
- Written in: Java, Ruby on Rails
- OS and runtime environment:
WEB PAGE: Javascript must be enabled in the user’s browser. User must be able to install browser bookmarklets to use the "add sites while browsing" feature. Login and password are required.
BACKEND: Infrastructure consists of Solaris 10 and Linux machines. The heaviest infrastructure demands are processing power for crawling, processing power for indexing, and storage. Other tools used are Heritrix, NutchWAX, Open Source Wayback Machine, MySQL, and Storage Resource Broker.
- Application: Still under development
- Documentation: http://was.cdlib.org
- Licensing: n/a
- Last updated: 10/2007
Web Harvester
The Web Harvester is a service that enables users to harvest content from the Web, review it, and add the harvested items to their CONTENTdm® collections during the Connexion cataloging process. By integrating digital collection development and capture with standard cataloging workflows, the Web Harvester provides an additional option for expanding participation in growing and maintaining digital collections.
Harvested items added to CONTENTdm Digital Collection Management Software using the Web Harvester are discoverable from the CONTENTdm Web interface, as well as WorldCat.org, WorldCat Local and OCLC FirstSearch. Each harvested item added to CONTENTdm using the Web Harvester is associated with its WorldCat record via a persistent URL based on the OCLC number of the WorldCat record. With an additional subscription to the OCLC Digital Archive, master files will be automatically placed in the Archive's secure, managed storage system.
- Developer: OCLC
- Written in: Java
- OS and Runtime Environment: Linux, MySQL, Apache, Tomcat, Heritrix
- Application: http://oclc.org/webharvester
- Documentation: http://www.oclc.org/webharvester/support/default.htm
- Licensing: Fee based
- Last updated: Updated on a regular six-week development cycle.