Everybody’s Libraries

January 15, 2009

Repository services, Part 2: Supporting deposit and access

Filed under: discovery, formats, repositories — John Mark Ockerbloom @ 6:01 pm

A couple of days ago, I talked about how we provided multiple repository services, and why an institutional scholarship repository needs to provide more than just a place to store stuff.  In this post, I’ll describe some of the useful basic deposit and access services for institutional scholarly repositories (IRs).

The enumeration of services in this series is based in part on discussions I’ve had with our scholarly communications librarian, Shawn Martin, but any wrong-headed or garbled statements you find here can be laid at my own feet.  (Whereupon I can pick them up, smooth them out, and find the right head for them.)

Ingestion:

One of the major challenges of running an institutional repository is filling it up with content: finding it, making sure it can go in, and making sure it goes in properly, in a manageable format, with informative metadata.  Among other things, this calls for:

  • Efficient, flexible, user-friendly deposit workflows. Most of your authors will not bother with anything that looks like it’s wasting their time.  And you shouldn’t waste your staff’s time either, or drive them mad, with needlessly tedious deposit procedures they have to do over and over and over and over again.
  • Conversion to  standard formats on ingestion. Word processing documents, and other formats tied to a particular software product, have a way of becoming opaque and unreadable a few years after the vendor has moved on to a new version, a new product, or that dot-com registry in the sky.  Our institutional repository, for instance, converts text documents to PDF on ingestion, which both helps preserve them and ensures wide readability.  (PDF is an openly specified format, readable by programs from many sources, available on virtually all kinds of computers.)
  • Journal workflows. Much of what our scholars publish is destined for scholarly journals, which in turn are typically reviewed and edited by those scholars.  Letting scholars review, compile, and publish those journals directly in the repository can save their time, and encourage rapid, open electronic access.   (And you don’t have to go back and try to get a copy for your repository when it’s already in the repository.)  Our BePress IR software has journal workflows and publication built into it.  Alternatively, specialized journal editing and publishing systems, such as Open Journal Systems, also serve as repositories for their journal content.
  • Support for automated submission protocols such as SWORD. Manual repository deposit can be tedious and error-prone, especially if there are multiple repositories that want your content (such as a funder-mandated repository, your own institutional repository, and perhaps an independent subject repository).  Manual deposit also often wastes people’s time re-entering information that’s already available online.  If you can work with a protocol that puts content into a repository automatically, though, things can get much better: you can support multiple simultaneous deposits, ingestion procedures designed especially for your own environment that use the protocol for deposit, and automated bulk transfer of content from one repository to another.  SWORD is an automated repository deposit protocol that is starting to be supported by various repositories. (BePress does not yet support it, but we’re hoping they will soon.)
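
To make the idea of multiple simultaneous deposits a bit more concrete, here is a minimal Python sketch that prepares (but does not send) one HTTP POST per target repository for the same package. The endpoint URLs, the function name, and the exact header set are assumptions for illustration only; a real SWORD client would take its deposit URLs and packaging requirements from each repository's own documentation.

```python
from urllib.request import Request

# Hypothetical collection endpoints (assumptions, not real services).
DEPOSIT_TARGETS = {
    "funder": "https://funder.example.org/sword/deposit/grants",
    "institution": "https://repository.example.edu/sword/deposit/papers",
    "subject": "https://subject.example.net/sword/deposit/preprints",
}


def build_deposit_requests(package: bytes, filename: str) -> dict:
    """Prepare one POST request per target repository for the same package."""
    prepared = {}
    for name, url in DEPOSIT_TARGETS.items():
        prepared[name] = Request(
            url,
            data=package,
            method="POST",
            headers={
                # Illustrative headers; the required set depends on the
                # protocol version and the target repository.
                "Content-Type": "application/zip",
                "Content-Disposition": f"filename={filename}",
            },
        )
    return prepared


# The same package is queued for all three repositories; nothing is sent here.
prepared = build_deposit_requests(b"placeholder package bytes", "article.zip")
for name, request in prepared.items():
    print(name, request.get_method(), request.full_url)
```

The point of the sketch is only that the package and its description are assembled once and then reused for every target, which is what saves depositors from re-entering the same information by hand.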

From a practical standpoint, if you want a significant stream of content coming into your repository, you’ll probably need to have a content wrangler as well: someone who makes sure that authors’ content is going into the repository as intended. (In practice, they often end up doing the deposit themselves.)

Discovery:

You want it to be easy and enjoyable for readers to explore your site and find content of interest to them.  Here are a few important ways to enable discovery:

  • Search of full text and/or metadata, either over the repository as a whole, or over selected portions of the repository.  Full text search can be simple and turn up lots of useful content that might not be discovered through metadata search alone.  More precise, metadata-based searches can also be important for specialized needs.   Full text indexing is not always available (in some cases, you might only have page images), but it should be supported where possible.
  • Customization of discovery for different communities and collections.  Different communities may have different ways of organizing and finding things.  Some communities may want to organize primarily by topic, or author, or publication type, or date.  Some may have specialized metadata that should be available for general and targeted searching and browsing.  If you can customize how different collections can be explored, you can make them more usable to their audiences.
  • Aggregator feeds using RSS or Atom, so people can keep track of new items of interest in their favorite feed readers.  This needs to exist at multiple levels of granularity.   Many repositories give RSS feeds of everything added to the repository, but most people will be more interested in following what’s new from a particular department or author, or in a particular subject.
  • Search engine friendliness. Judging from our logs, most of the downloads of our repository papers occur not via our own searching and browsing interfaces, but via Google and other search engines that have crawled the repository.  So you need to make sure your repository is set up to make it easy and inviting for search engines to crawl.  Don’t hide things behind Flash or Javascript unless you don’t want them easily found.  Make sure your pages have informative titles, and the site doesn’t require excessive link-clicking to get to content.  You also need to make sure that your site can handle the traffic produced by search-engine indexers, some of which can be quite enthusiastic about frequently crawling content.
  • Metadata export via protocols like OAI-PMH.  This is useful in a number of ways:  It allows your content to be indexed by content aggregators; it lets you maintain and analyze your own repository’s inventory; and, in combination with automated deposit protocols like SWORD (and content aggregation languages like OAI-ORE), it may eventually make it much simpler to replicate and redeposit content in multiple repositories.
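
As a rough illustration of the metadata export item above, the following Python sketch builds an OAI-PMH ListRecords request URL and pulls identifiers and Dublin Core titles out of a response. The base URL, function names, and sample record are made up for the example; the verb, the metadataPrefix and from parameters, and the XML namespaces come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request, optionally limited to recent changes."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_xml):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


# A hand-written sample response (hypothetical repository and paper).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.edu:paper-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repository.example.edu/oai", from_date="2009-01-01"))
print(extract_titles(SAMPLE))
```

An aggregator, or your own inventory script, would fetch that URL on a schedule and feed the response to something like extract_titles.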

Access:

  • Persistent URIs for items. Content is easier to find and cite when it doesn’t move away from its original location.  You would think it would be well known that cool URLs don’t change, but I still find a surprisingly large number of documents put in content management systems where I know the only visible URIs will not survive the next upgrade of the system, let alone a migration to a new platform.  If possible, the persistent URI should be the only URI the user sees.  If not, the persistent URI should at least be highly visible, so that users link to it, and not the more transient URI that your repository software might use for its own purposes.
  • An adequate range of access control options for particular collections and items.  I’m all in favor of open access to content, but sometimes this is not possible or appropriate.  Some scholarship includes information that needs to be kept under wraps, or in limited release, temporarily or permanently.  We want to still be able to manage this content in the repository when appropriate.
  • Embargo management is an important part of access control.  In some cases, users may want to keep their content limited-access for a set time period, so that they can get a patent, obey a publishing contract, or prepare for a coordinated announcement.  Currently, because of BePress’ limited embargo support, we sit on embargoed content and have to remember to put it into the repository, or manually turn on open access, when the embargo ends.  It’s much easier if depositors can just say “keep this limited access until this date, and then open it up,” and the repository service handles matters from there.
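
The “keep this limited access until this date” rule is simple enough to sketch in a few lines of Python. This is not how BePress or any other repository implements embargoes; the class and field names are hypothetical, and the only point is that the embargo date is recorded once at deposit time and checked automatically afterwards.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DepositedItem:
    title: str
    embargo_until: Optional[date] = None  # None means no embargo was requested


def is_openly_accessible(item: DepositedItem, today: date) -> bool:
    """An item opens up on its embargo date; items without one are always open."""
    return item.embargo_until is None or today >= item.embargo_until


items = [
    DepositedItem("Open preprint"),
    DepositedItem("Patent-pending report", embargo_until=date(2009, 7, 1)),
]

# Evaluate access as of a given day; no one has to remember to flip a switch.
for item in items:
    print(item.title, is_openly_accessible(item, today=date(2009, 1, 15)))
```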

That may seem like a lot to think about, but we’re not done yet.  In the next part, I’ll talk about services for managing content in the IR, including promoting it, letting depositors know about its impact, and preserving it appropriately.

January 13, 2009

Repository services, Part 1: Galleries vs. self-storage units

Filed under: repositories — John Mark Ockerbloom @ 6:00 pm

Back near the start of my occasional series on repositories, I noted that we had not just one but a number of repositories, each serving different purposes.

In tight budgetary times, this approach might seem questionable.  Right now, we’re putting up a new repository structure (in addition to our existing ones) to keep our various digitized special collections and make them available for discovery and use.  We hope this will make our digital special collections more uniformly manageable, and less costly to maintain.

At the same time, we’re continuing to maintain an institutional repository of our scholars’ work on a completely different platform, one for which we pay a subscription fee annually.  I’ve heard more than one person ask “Well, once our new  repository is up, can’t we just move the existing institutional repository content into it, and drop our subscription?”

To which I generally answer: “We might do that at some point, but right now it’s worth maintaining the subscription past the opening date of our new repository.”  The basic reason is that the two repositories not only have different purposes, but also, at least in their current uses, support very different kinds of interactions, with different kinds of audiences.

The interactions we need initially for the repository we’re building for our special collections are essentially internal ones.  Special collections librarians create (or at least digitize) a thematic set of items, give them detailed cataloging, and deposit them en masse into the collection.  The items are then exposed via machine interfaces to our discovery applications, which then let users find and interact with the contents in ways that our librarians think will best show them off.

The repository itself, then, can work much like a self-storage unit.  Every now and then we move in a bunch of stuff, and then later we bring it out into a nicer setting when people want to look at it.  Access, discovery, and delivery are built on top of the repository, in separate applications that emphasize things like faceted browsing, image panning and zooming, and rare book page display and page turning.

Our institutional repository interacts with our community quite differently.  Here, the content is created by various scholars who are largely outside the library, who may deposit items bit by bit whenever they get around to it (or when library staff can find the time to bring in their content).  They want to see their work widely read, cited, and appreciated.  They don’t want to spend more time than they have to putting stuff in (they’ve got work to do), and they want their work quickly and easily accessible.  And they’d like to know when their work is being viewed.  In short, they need a gallery, not just a self-storage unit.  They want something that lets them show off and distribute their work in elegant ways.

Our institutional repository applications, bundled with the repository, thus emphasize things like full text search and search-engine openness, instant downloads of content, and notification of colleagues uploading and downloading papers.

We could in theory build similar applications ourselves, and layer them on top of the same “self-storage” repository structure we use for special collections.   (Museums likewise often have their exhibit galleries literally on top of the bulk of their collection kept in their basements, or other compact storage areas.)  But it would take us a while to build the applications we need, so for now we see it as a better use of our resources to rely on the applications bundled with our institutional repository service.

(An alternative, of course, would be to see if an existing open source application would serve our needs.  I hope to talk more about open source repository software in a future post, but we haven’t to date decided to run our institutional repository that way.)

I hope I’ve at least made it clear that for a viable institutional repository, you need quite a bit more than just “a place to put stuff”: you need a suite of services that support its purposes.  In Part 2,  I’ll enumerate some of the specific services that we need or find useful in our institutional scholarship repository.

January 1, 2009

Public Domain Day 2009: Freeing the libraries

Filed under: copyright, discovery, open access — John Mark Ockerbloom @ 2:30 pm

In many countries, January 1 isn’t just the start of a new year: it’s the time when a new year’s worth of works are welcomed into the public domain.  As I noted in last year’s Public Domain Day post, countries that use the copyright terms specified by the Berne Convention bring works into the public domain on the first January 1 that’s more than 50 years after the death of their authors.  So today, most works by authors who died in 1958 join the public domain in those countries.  This page at authorandbookinfo.com lists many such authors, and their books.  Some of the more notable names include James Branch Cabell, Rachel Crothers, Dorothy Canfield Fisher, C. M. Kornbluth, Mary Roberts Rinehart, Robert W. Service, and Ralph Vaughan Williams.

Many countries, however, have extended their copyright terms in recent years.  Most European Union countries, for instance, took 20 years’ worth of works out of the public domain in the 1990s when the EU mandated that copyright terms be extended to run for the life of the author plus 70 years.  This year, they come a little bit closer to recovering their lost public domain, welcoming back works by authors who died in 1938, including people like Karel Čapek, Zona Gale, Georges Méliès, Constantin Stanislavsky, Osip Mandelstam, Owen Wister, and Thomas Wolfe.

In some other countries, very little is entering the public domain today.  Here in the US, we’re midway through a freeze on most copyright expirations, resulting from a term extension enacted in 1998.  We now have 10 years to go until copyrights on published works start expiring again due to age. (By 1998, all works copyrighted prior to 1923 had entered the public domain.  Remaining copyrights from 1923 are scheduled to expire at the start of 2019.) Some special interests would like to make copyright terms even longer (even “forever less one day”, as Congresswoman Mary Bono requested on behalf of the movie industry).  Those of us who value the public domain will need to ensure that it is not further eroded, and that copyrights are allowed to expire on schedule.  This is in keeping with the intents of the country’s founders, who specified in the Constitution that copyrights were meant to last only for “limited times”.
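
The term arithmetic in the last few paragraphs can be checked with a one-line rule: a work enters the public domain on the first January 1 after its term has fully run. The short Python sketch below applies that rule to the three cases mentioned. The function name is invented for this example, and the 95-year figure for US works published in 1923 is not stated above; it is the term that yields the 2019 date.

```python
def public_domain_year(start_year: int, term_years: int) -> int:
    """Year whose January 1 is the first one after the term has fully elapsed.

    start_year is the author's death year (life-plus terms) or the
    publication year (fixed terms counted from publication).
    """
    return start_year + term_years + 1


print(public_domain_year(1958, 50))  # life + 50, author died 1958 -> 2009
print(public_domain_year(1938, 70))  # life + 70, author died 1938 -> 2009
print(public_domain_year(1923, 95))  # US work published 1923      -> 2019
```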

But even though few works are entering the public domain in the US today, many more works are freely and easily available to the public now than a year ago.  Much of this is thanks to initiatives like Google Books and the Open Content Alliance, which are digitizing books and other works that libraries have acquired and preserved.  Many of the digitized works are in the public domain, and these projects have been making them freely readable and downloadable when they can confirm their public domain status.  And now that Google has negotiated a settlement with book publisher and author groups, they plan to be more proactive about identifying and releasing public domain works, including works published after 1922 that are out of copyright (but are not as easily identified as public domain as older books are).

These works have been part of the public domain for years, but when they were simply sitting on the shelves of a few research libraries, they weren’t doing the public much good.  Once they’re digitized, though, and their digitizations and descriptions are shared online, they can be much more easily found, read, adapted, and reused by anyone online.  By opening up the treasure trove of public domain expression that libraries have preserved, we magnify its value.  When libraries share their intellectual endowment, they better fulfill their mission to bring art and knowledge to readers, and make it easy for readers to learn, build on, and be enriched by this knowledge.

I wish I could say that libraries always acted with this understanding.   Unfortunately, all too often libraries and affiliated organizations have been resistant or slow to share the information they compile and control.  The effective value of what libraries offer has been significantly diminished as a result.

Sometimes libraries simply have not moved as quickly as they could.  The Copyright Office has long provided online access to copyright records, but only from 1978 onward.  I started digitizing older copyright records over 10 years ago, and a few libraries started doing so as well, but many older records have not yet been publicly digitized, though they’re available in printed form in many government depository libraries.  These records can make it much easier to verify public domain status of many works, and then make them available to the public.

Sometimes libraries and affiliated organizations put up their own restrictions on sharing information they already have in digital form.  I had a series of posts in November, for instance, criticizing OCLC’s newly revised restrictions on sharing and reusing catalog records that libraries have contributed to WorldCat, the largest shared cataloging resource for libraries. The data in WorldCat can be the basis for many useful and innovative applications to direct readers towards useful information resources, and information about those resources.  And in December, an extremely useful downloadable semantic web representation of Library of Congress subject headings, the basis for information discovery applications like this one, was ordered taken down by LC administrators.

In the new year, I hope to encourage libraries to be more open in sharing their knowledge resources (and to support partners that also enable such openness).  My gifts to the public domain this year are in that spirit.

The first one, dedicated immediately to the public domain, is the start of a simple, free decimal classification system, intended to be reasonably compatible with certain existing library standards, but freely available and usable by anyone for any purpose.  (I created this after someone requested such a system for their institutional repository, and I found out that the current Dewey Decimal system is subject to usage restrictions based on copyright and trademark.)  While this is more of a proof of concept than something I expect libraries to adopt in great numbers, I hope it inspires further open sharing of library metadata and standards.

Also, as I did last year, I’m dedicating another year’s worth of copyrights that I can control, this time from 1994, to the public domain, so that they follow the initial 14 year copyright term originally prescribed by this country’s founders.  These copyrights include the first versions of Banned Books Online, and the first database-driven versions of The Online Books Page.  Versions of these resources from 1994 and earlier are now given to the public domain.

I hope readers find value in these, and all the other public domain and freely licensed works they can enjoy and use online. Happy Public Domain Day!

Update: See also the Public Domain Day posts at Creative Commons, and the Center for Internet and Society.

December 9, 2008

Revised ILS-Discovery interface recommendation released

Filed under: architecture, discovery, libraries — John Mark Ockerbloom @ 3:40 pm

I’ve just sent the following announcement out to the ILS-Discovery Interface Google Group:

The Digital Library Federation’s ILS-DI task group has officially released revision 1.1 of their recommendation for standard interfaces for integrating the data and services of the Integrated Library System (ILS) with new applications supporting user discovery.

Our initial official release (“revision 1.0”) was made in June, and included a recommendation of a basic level of interoperability (the Basic Discovery Interfaces, or “Level 1” interoperability) that was agreed to by many ILS vendors in the “Berkeley Accord”.

In August, the DLF convened an implementors’ meeting in Berkeley that was attended by a number of developers and vendors of ILS and discovery software.  In the meeting, we agreed to make certain changes to clarify the requirements of the basic level of compliance, and to make them more useful for discovery applications.  A revised draft that included these changes was made available for comment at the end of October.  We now release the final version.

We hope that this revision will be useful for people implementing ILS’s, ILS interaction layers, and discovery applications, and enable easier interoperation between ILS’s (existing and planned) and innovative discovery applications of all kinds.  We look forward to seeing implementations of these recommendations (some of which are already in progress), and further progress towards interoperability and improved discovery of the knowledge resources of libraries.

I’d like to repeat the thanks I gave on the release of our “1.0 revision” back in the summer, and thank everyone who helped write, comment on, and support this recommendation.

And now, I think I’ve got some implementation work to do…

October 30, 2008

DLF ILS Discovery Interfaces: Revised recommendation draft open for comments

Filed under: architecture, discovery, libraries — John Mark Ockerbloom @ 3:41 pm

Today we released a draft of “revision 1.1” of the ILS Discovery Interfaces recommendation. As I discussed in my previous post, this revision is intended to clarify the implementation of the Basic Discovery Interfaces recommended for integrated library systems (ILS’s), and make them more useful for discovery applications.

On the DLF ILS Discovery Interfaces web site, you’ll find the revision draft and the accompanying schema, along with the initial official recommendation (or “revision 1.0”). My last post included a summary of the major changes from version 1.0.

We’d like to give folks a chance to comment on the changes before we make them official. We’ll take comments until November 18, shortly after the end of the DLF Fall Forum, so folks wanting to go to our birds of a feather session on implementing the recommendations can talk with us there and still have some time to send in written comments. (Or, you can send them in ahead of time so we can think on them at the forum.) Comments may be emailed to me, and I will pass them along to the rest of the task group. There’s also still the open Google Group for discussions.

I’m hoping we’ll start to see Basic Discovery Interfaces implementations, clients, and test suites soon based on the new recommendations and schema. They’re not that different from version 1.0, but should be more useful. I’m working on revising my example implementation now, and hope to see more implementations in the not too distant future. And I look forward to hearing interested people’s thoughts and comments as well.

October 20, 2008

Update on ILS-Discovery Interface work

Filed under: architecture, discovery, libraries — John Mark Ockerbloom @ 4:01 pm

It’s been a while since I posted about the official release of the Digital Library Federation’s ILS Discovery interface recommendation. Marshall Breeding recently posted a useful update on the further development of the interfaces at Library Technology Guides. As the chair of the ILS-DI task group, which is now charged with some followup work described in Marshall’s article, I’d like to add some further updates.

As Marshall mentions, the DLF convened a meeting in August inviting potential developers of the ILS-Discovery interfaces to discuss implementations of the recommendations of the DLF’s ILS-Discovery Interface task group. In the course of the discussion, a few changes were suggested and generally agreed upon by the participants. Updating the recommendation was not the main purpose of the meeting, but as we discussed things, it became clear that some clarifications and small updates to the recommendation would be helpful for producing more consistent and useful implementations of the Basic Discovery Interfaces, the interoperability “Level 1” that was agreed to in the Berkeley Accord.

The ILS-DI task group is therefore preparing a slight revision, to be known as “version 1.1” of the recommendation. A draft of this revision will be released for comment shortly, and will include the following changes, summarized here to give developers some idea of what to expect:

  • For the HarvestBibliographicRecords and HarvestExpandedRecords functions, it will be clarified that the function should return the records that are available for discovery. (That is, suppressed records and others that might be in the ILS but aren’t intended for discovery will not be shown, except possibly as deleted records as described below).
  • Support for the OAI-PMH binding for these functions will be noted as required. (That is, it must be supported for full ILS-BDI compliance; other bindings can be supported too.) It will also be noted that Dublin Core is a minimum requirement for returned records (as it is for OAI-PMH in general), and that if MARC records exist in the ILS (or are produced by it), MARC XML should also be available.
  • We also will require some level of support for deleted records (which includes records no longer available for discovery), to make it feasible for discovery apps to keep in sync with the ILS’s records via incremental harvesting. We’ll note that ILSs should document how long they keep deleted-record information.
  • For GetAvailability, the simple availability schema defined in the document will be noted as required. (That is, it should be returned for full ILS-BDI compliance; other schemas can be supported too if asked for and supported.) There was some talk at the August meeting about completely dropping the alternative NCIP and ILS-Holdings schemas as replies to GetAvailability, because of their complexity. The draft at this point doesn’t go that far, but it will specify the simple availability schema as the default, and the required, schema to support in the ILS-BDI profile.
  • That simple availability schema will also be augmented slightly to include an optional location element, distinct from the availability-message element. Location was the one specific data field that many implementors said was essential to include that wasn’t in the original schema.
  • We will also add a request parameter to GetAvailability for specifying whether bib-level or item-level availability is desired if a bib identifier is given. (Formerly the server had the option of choosing the level in that case; there was a strong sentiment in discussions that the client should be able to specify this.)
  • We expect to leave GoToBibliographicRequestPage alone.
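
To show why deleted-record support matters for incremental harvesting, here is a small Python sketch of a discovery application applying one harvested batch to its local index. It is only an illustration under stated assumptions: the index is a plain dictionary, the function name and sample identifiers are made up, and the response format is that of the standard OAI-PMH binding, in which a deleted record carries status="deleted" on its header.

```python
import xml.etree.ElementTree as ET

OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def apply_harvest(index: dict, response_xml: str) -> dict:
    """Update a local index (identifier -> datestamp) from one harvested batch."""
    root = ET.fromstring(response_xml)
    for record in root.iterfind(".//oai:record", OAI):
        header = record.find("oai:header", OAI)
        identifier = header.findtext("oai:identifier", namespaces=OAI)
        if header.get("status") == "deleted":
            # Record was removed or suppressed in the ILS: drop it locally too.
            index.pop(identifier, None)
        else:
            index[identifier] = header.findtext("oai:datestamp", namespaces=OAI)
    return index


# A hand-written batch: one changed record and one deleted record.
BATCH = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:ils.example.edu:bib-100</identifier>
        <datestamp>2008-10-19</datestamp>
      </header>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:ils.example.edu:bib-200</identifier>
        <datestamp>2008-10-20</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""

local_index = {
    "oai:ils.example.edu:bib-100": "2008-09-01",
    "oai:ils.example.edu:bib-200": "2008-09-01",
}
print(apply_harvest(local_index, BATCH))
```

Without the deleted-record entry, the discovery application would have no way to learn that bib-200 should disappear short of re-harvesting everything, which is why it also helps for an ILS to document how long it keeps that information.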

The new draft will be released shortly, and be open to public comment for at least a couple of weeks before we make a last edit for an official release. Feedback is welcome and encouraged, and public discussion can take place in the ILS-DI Google Group, among other places.

The new draft will be accompanied by a revised XML schema. The current schema, reflecting the original or “version 1.0” official recommendation, can be found here. For the location of the new one (which is not yet posted), substitute “1.1” for “1.0” in the schema URL. (We intend to keep the old schema up for a good while after the new one is posted, for compatibility with implementations based on the original recommendation.)

I will also be leading a Birds of a Feather session at the upcoming Digital Library Federation fall forum in Providence next month. This will be an opportunity for developers of interfaces implementing the DLF’s ILS-Discovery interface recommendations to present their work to others, ask and answer questions about the recommendations and their implementations, and discuss further development initiatives and coordination. If you’d like us to set aside some time to show or discuss a particular initiative or project you’re working on, let me know.

Watch this space and the ILS-DI Google Group for further developments. And if you can come to the session at DLF in November, I hope we’ll have an interesting and enlightening discussion there as well.

(Update, Oct. 30: The draft of the revision is now out for comment.)
