London’s Natural History Museum is one of the initiators of the Biodiversity Heritage Library project. It’s a building block in one of the most important global digitisation programmes ever undertaken. Graham Higley explains to Elspeth Hyams.

Researchers in the biological sciences will soon have a comprehensive reference source, an encyclopedia of all the species ever named.

The Encyclopedia of Life (EOL), officially launched in May 2007, is a huge project, which aims to set up a website for every species on the planet – 1.8m of them that we know about so far. A high proportion of these websites will contain very little information, because about a million species have only ever been described once. But other species, such as polar bears, have a wealth of literature. More intensively studied species will have long and complex documentation, and links to hundreds of other websites.
Each species’ site will have two components: an edited, controlled, scientific view of the species, managed by a scientific editor, who will look after ‘authoritative content’; and a component ‘just like Wikipedia’, to which members of the public can contribute information.

The ‘editors’ will trawl each species’ community’s contributions for material that can be included in the authoritative content. In biology, as with so many specialist topics, the world’s most knowledgeable person on a particular species may be an amateur, not a scientist.

The EOL project will take about 10 years to complete. There is already $50m of funding in place, and the cost of the whole project is expected to be in the region of $70m-$100m.
EOL would not be possible without a fundamental building block, a database containing the majority of all literature on systematic biology. This material will be digitised, under the auspices of the Biodiversity Heritage Library (BHL) project. That initiative came out of ‘Libraries and Laboratories’, a conference at the Natural History Museum (NHM) just over two years ago. The conference was organised by the NHM Head of Library & Information Services, Graham Higley, with funding from the US Sloan Foundation, and it addressed the interaction between biological science and biological literature.

Role of literature in biological science
‘In biology there is a unique relationship. A name for a new species only becomes valid when it has been published in a peer-reviewed journal. Every species that has a name will have a journal reference somewhere, and a description somewhere in the literature. There is an integral relationship between the biodiversity literature, the practical biological field science, and species nomenclature.

‘Nomenclature and identification is absolutely critical to understanding how biology is done in the field,’ says Graham. ‘If you are counting blackbirds, you need to be sure you are counting a specific species, not a group of species that look the same.’
In some respects, biology is a unique science. Its early literature – the first time a species was described, maybe several hundred years ago – is as important as that related to the most recently identified specimens. Field biologists must have access to the whole of this literature to determine whether the specimens they find are examples of a new or a previously identified species. Cataloguing new species, and monitoring concentrations of previously known species, their appearance in new locations and their disappearance from old ones, is critical – for example, to monitor the impact of climate change.

The problem for researchers involves location. Most of the interesting biology is in the developing world. The Amazon rainforest, for example, may have some 100,000 species per sq km, while the whole of Britain has only 80,000 species in total. The level of species richness is much higher in the developing world. Unfortunately, most of the literature is in the developed world, in print, out of reach of the scientists who need it in the field. In the past many expeditions were sent out from the West’s great museums and botanical gardens to capture and identify species in the jungles of Africa, the Amazon, the deep sea and so on. This information is captured in the literature in our libraries.

Those in poor or developing countries who need to consult the literature can spend much of their research funding on visits to, and accommodation in, expensive cities like London. Some 7,500 visitors a year come to the museum from developing countries.

A classic example of the problem is the Biologia Centrali-Americana. It consists of more than 60 volumes of some 300 pages each, compiled between 1900 and 1907. It was a comprehensive listing at that time of all the known animals and plants in Central America, from Mexico to Colombia – a massive piece of work. There are only a few sets of the complete work in the world, none of them in Central America.

The obvious way to overcome the problem and make resources available to the scientists who need them is to digitise. That is why, following the conference, librarians from the NHM, eight US institutions, and Kew Gardens, set up the BHL consortium, with a compelling business case to scan the biodiversity literature. They believed that it was so strong that funding would be found, from governments or charitable foundations.

One of the attendees at the conference was Brewster Kahle, champion of the Internet Archive, who is ‘keen on getting as much published information on to the web for free as possible’. Indeed, the Internet Archive will do the scanning at an affordable cost – 20p a page in Britain, 20c in the US. This low price was critical. ‘At 20c per page, suddenly doing 200 million pages doesn’t look like a really big funding problem.’
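As a rough check on the quoted figures, the arithmetic can be sketched as follows. The only inputs are the 20c-per-page US rate and the 200 million pages mentioned in the quote; the function and variable names are illustrative, and the sum is done in whole cents to avoid floating-point rounding.

```python
# Back-of-envelope scanning cost, using only the figures quoted above.
CENTS_PER_PAGE = 20          # US rate quoted for Internet Archive scanning
PAGES_TO_SCAN = 200_000_000  # page count mentioned in the quote


def scanning_cost_dollars(pages: int, cents_per_page: int) -> int:
    """Return the total scanning cost in whole US dollars."""
    total_cents = pages * cents_per_page
    return total_cents // 100


total = scanning_cost_dollars(PAGES_TO_SCAN, CENTS_PER_PAGE)
print(f"Estimated scanning cost: ${total:,}")
```

On these figures the scanning bill comes to $40m, less than the $70m-$100m expected for the EOL project as a whole.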

A number of technical meetings took place, and some small sums of money had already been raised when the EOL project came over the horizon. A total of $3.5m has already come through the EOL, and other funding possibilities are in the pipeline.

Additionally, the Malaysian government has decided it wants to build a natural history museum of its own. It plans to work with the BHL project to scan all the literature relevant to Southeast Asia. The programme is growing at breathtaking speed. China and Australia are planning to scan their literature too. The whole idea has snowballed in the last 12 months. There have been big breakthroughs elsewhere. In Europe a proposal has been developed in Germany to scan not only all the literature in German, but also the literature in all the languages of the former Austro-Hungarian Empire. The French are securing funding through the Bibliothèque Nationale de France scanning project. Dutch colleagues, led by the Naturalis Museum in Leiden, will be working with the Dutch National Library mass digitisation project.

Despite the project’s short life, 10 scanning units (a ‘pod’) are already at work in Boston, dealing with material from the Harvard Zoological and Botanical Museums and the Marine Biological Laboratory in Woods Hole. Another 10 units are going to Washington DC for the Smithsonian Institution. And the project is also working with the New York Public Library scanning centre.
 
A database is being built, listing all the book and journal titles to be scanned, indicating which institution is planning to scan each, and against what timescale. It will also be possible to see which material has already been scanned, and which titles are in preparation.
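A minimal sketch of what one record in such a tracking database might hold is given below. It is an assumption for illustration only, not the project's actual schema: the class, field and status names are hypothetical, and the titles and institutions are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Hypothetical scanning states for a title."""
    PLANNED = "planned"
    IN_PREPARATION = "in preparation"
    SCANNED = "scanned"


@dataclass
class TitleRecord:
    """One book or journal title in the (hypothetical) scanning list."""
    title: str
    institution: str   # partner planning to scan the title
    target_year: int   # timescale for the scan
    status: Status


def titles_with_status(records, status):
    """Return the titles whose scanning status matches `status`."""
    return [r.title for r in records if r.status is status]


# Placeholder records, not real holdings.
records = [
    TitleRecord("Example Journal A", "Institution X", 2008, Status.SCANNED),
    TitleRecord("Example Monograph B", "Institution Y", 2009, Status.IN_PREPARATION),
    TitleRecord("Example Journal C", "Institution X", 2010, Status.PLANNED),
]

print(titles_with_status(records, Status.SCANNED))
print(titles_with_status(records, Status.IN_PREPARATION))
```

Keeping the institution and status on each record is what would let partners see at a glance who is scanning what, and so avoid scanning the same title twice.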
The project will take about five years, and will need to ‘push volume through the machines in an “industrial” process. If we pick off logical groups of material, and do those first, we can keep the costs down.’ The NHM has more rare and unusual materials than anyone else, but staffing and accommodation costs in London are much higher. So the other partners are scanning what they have, and the NHM will fill in the gaps, scanning only the material nobody else has. Kew, too, has much early, unique material.

IPR issues
How could such a project take place, when so much material is still in copyright? It is perhaps here that the project has had its biggest breakthroughs.

For material in the public domain – out of copyright – there are no problems. But a lot of material comes from learned societies in the biological sciences, which publish journals and monographs. The project has been contacted by many of them, offering permission to scan their material.

‘The interesting thing is that we thought we would have to spend quite a lot of time and effort persuading people to let us have their journals right up to date. The reverse is happening. A wall of people is coming to us, saying, “Will you do our title?”. We’ve signed agreements with more than 20 different institutions. And we only had the agreement document in place a month ago! A lot more are talking to us.’ The agreements will be attached to the scanned copies on the BHL Portal – in many cases, on the basis of a Creative Commons licence.

There have even been potential breakthroughs with two significant commercial publishers. Though not yet prepared to name them (look out for announcements in the New Year), Graham says that between them they have 300 relevant titles and will enable the project to make this material available through the BHL Portal. An OCR copy will be made available by the publisher, which will keep the PDF masters on its own system. ‘The users will be able to search through our portal for the OCR and read the text of the articles. If they want an original, they will buy the article through a commercial portal, just as they would if they went to the companies’ own websites.’

Graham had few difficulties persuading them. ‘I made the point that we are going to have a massive biology portal, where most people will come to find their biological publications. Why wouldn’t they want to have their material in that portal as well, given that, when people want the full master copy – the PDF – they will be able to collect the income, as they would normally?’ Why would anyone want to go to a publisher’s site first – where there might be 20, 30, 100 titles – when they could access the BHL site with 25,000 titles?

‘The key is that this is a free access portal, and they can’t expect to be “in” without giving something away. That thing is the OCR, but it won’t contain the images, and the graphics. They are retaining control over the component where they have added value.’

There are other issues – orphan works, for example. ‘We have thought hard about orphan works, and the changes to orphan works legislation, which we hope will be beneficial, but it’s not quite clear exactly how yet.’ Another problem is the different legislative regimes in different countries.

Once, technical issues would have been a problem. Not now. ‘We’re using a package of standards from TDWG – the Taxonomic Databases Working Group. My colleague Neil Thompson has been one of the key players. We’ve agreed a whole range of core metadata standards for information about specimens, and about the literature, so we can link it up in an effective way, using the binomial nomenclature (Latin name of the species) as the linkage between the multiple components in the biological data sphere.

‘The NHM here has 70m specimens. The original “type” (the specimen to which the name was given) will often exist in one of the museums and botanical gardens in the partnership, and we can link the specimen information to the BHL literature in a seamless way. In addition, we’ve developed a tool called Taxonomic Intelligence, which is a rather sophisticated indexing system that enables you to group both the biological name and all the common names.’
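The idea of using the Latin binomial as the join key between specimen records and literature, with common names resolved to the same key, can be sketched roughly as follows. This is a toy illustration under stated assumptions: it is not the TDWG standards or the Taxonomic Intelligence tool, the function names are hypothetical, and the record identifiers are made-up placeholders.

```python
from collections import defaultdict

# Small illustrative lookup from common names to Latin binomials.
COMMON_NAMES = {
    "blackbird": "Turdus merula",
    "common blackbird": "Turdus merula",
    "polar bear": "Ursus maritimus",
}


def normalise(name: str) -> str:
    """Normalise a binomial to 'Genus species' capitalisation and spacing."""
    parts = name.split()
    if len(parts) != 2:
        raise ValueError(f"not a binomial: {name!r}")
    genus, species = parts
    return f"{genus.capitalize()} {species.lower()}"


def resolve(name: str) -> str:
    """Map a common name or loosely formatted binomial to one canonical key."""
    key = " ".join(name.lower().split())
    if key in COMMON_NAMES:
        return COMMON_NAMES[key]
    return normalise(name)


def link(specimens, literature):
    """Group specimen and literature identifiers under the same binomial."""
    index = defaultdict(lambda: {"specimens": [], "literature": []})
    for record_id, name in specimens:
        index[resolve(name)]["specimens"].append(record_id)
    for record_id, name in literature:
        index[resolve(name)]["literature"].append(record_id)
    return dict(index)


# Placeholder identifiers, not real catalogue or article numbers.
specimens = [("SPEC-0001", "Turdus merula"), ("SPEC-0002", "Ursus maritimus")]
literature = [
    ("LIT-0001", "turdus merula"),
    ("LIT-0002", "polar bear"),
    ("LIT-0003", "Blackbird"),
]

linked = link(specimens, literature)
for binomial, records in sorted(linked.items()):
    print(binomial, records)
```

Because every record is filed under one canonical name, a search for ‘blackbird’ and a search for Turdus merula would reach the same specimens and the same literature.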

The technical challenges are ‘not that leading edge. Other people have done the kind of semantic web work involved in the management of nomenclature. We’re adapting it to the biological world. It’s a question of stitching together rather than developing completely new stuff.’

As you might expect with projects of this kind and scale, plans are afoot to build on open access architecture. This will mean that, eventually, scientists, educators, and others, will be able to build their own toolkits on the EOL or BHL systems. ‘We don’t believe we will be able to think of, or have the money to develop, all the different sorts of application that, ultimately, will derive from this data – either from the EOL species pages or the BHL literature.’

A global role model?
The arguments for this project are compelling, but it is too early to say whether it will become an exemplar for other types of global, collaborative, open access projects or simply remain a stellar exception to the rule. If universal public good is a factor, though, the speed of its acceptance could be an indication. ‘It’s a project model that could be used in a number of domains in future, many of them not scientific. Social sciences and history are likely candidates.’

What is certain is that, with compelling arguments, and the right leadership, extraordinary common purpose can be achieved in a remarkably short time.

‘We thought that at this point in 2007 we would still be pushing to get serious funding. It’s all come much quicker than we expected.’

Graham Higley is Head of Library & Information Services, Natural History Museum (g.higley@nhm.ac.uk).

www.nhm.ac.uk  
www.eol.org  
www.biodiversitylibrary.org  

Updated: 27 November 2007