OSTIblog: Articles and comments about accelerated science discovery

OSTI accelerates the pace of discovery by making R&D findings available to researchers and the American people


The Science Knowledge Imperative: Making non-Googleable Science Findable

Just as science progresses only if knowledge is shared, accelerating the sharing of knowledge accelerates science. All of us engaged in disseminating science knowledge have the opportunity and obligation to do our jobs better, for to do so accelerates science itself. 

To this end, I propose a grand challenge: to make more science available to, and searchable by, more people than ever before. A momentous milestone will be achieved once we give everyone with web access the ability to search a billion pages of authoritative science with unparalleled precision. Already, considerable progress has been made.

My organization, the U.S. Department of Energy (DOE) Office of Scientific and Technical Information (OSTI), is responsible for the scientific and technical information operations of the Department. Over the last 11 years, OSTI has become entirely web-based. Of course, we are just one among many entities that connect people to knowledge using the web. Most notably, Google, Yahoo!, and other conventional search engine providers do this, too.

Google and other conventional search engines do for the web what publishers have long done for books—they create an index so that customers can quickly find information. Web users value this service so highly that search companies have become phenomenally successful enterprises. 

However, an important misunderstanding has sprung up about Google and the others: the false presumption, especially among young people, that most useful information is available via conventional search engines such as Google and Yahoo!

In fact, much of the information on the web is inherently unavailable to Google and Yahoo! This key limitation would come as a surprise to many web users. The belief that you can find anything if you "Google" long enough is so firmly entrenched in the web-cognizant public that the word "Google" has been elevated to a verb.

In the web-savvy vernacular, "To Google" of course means to search the web using the Google search engine. For example, to find information about "Teddy Roosevelt", one "Googles" Teddy Roosevelt.

So we accept that "Google" has become a verb. This has led folks at OSTI to do some word-creation of our own. It naturally follows that the adjective derived from that verb is "Googleable," referring, of course, to information that can be found by "Googling." It is just a short jump to arrive at the antonym "non-Googleable," referring to information that cannot be found by Google. And there is a lot of it!  Analysts who study this topic estimate that over 90 percent of the web is non-Googleable.

By the way, I want to make it clear that I am not saying anything more about Google than Google is saying about itself. Google founder Larry Page personally delivered a speech at the 2007 annual meeting of the American Association for the Advancement of Science (AAAS) where he lamented that much of science is not available for Google to retrieve. The July 27, 2007, issue of “Science” Magazine presented an article by a Google Research Director who acknowledged the same thing.

I coined the term non-Googleable because the concept is so critically important to science. It turns out that great quantities of science knowledge are non-Googleable. This observation is profoundly important for science in general – and for my organization in particular.

The limitations of Googling are inherent in the underlying technology used by Google and by each of the other conventional search engine companies to index the web. To get ready for searches by web patrons, a web crawler (or "spider" or "robot") visits many web sites, mostly by following links. An index of each such site is thus created, slowly building a vast composite index of all the sites visited. Later, when web patrons perform a search, they are actually searching the composite index.

Difficulty arises because vast numbers of web pages cannot be accessed by following links. In other words, such web pages are not crawlable.  For example, to find an e-print on a database of e-prints, it is typically necessary to enter a search term on the front page of the database. At this point, a crawler is stumped. As a consequence, the content of the database is not accessed by the crawler, and that content is non-Googleable.
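To make the limitation concrete, here is a minimal sketch of a link-following crawler run against a tiny in-memory "web." Everything in it is a hypothetical stand-in invented for illustration (the page names, the `PAGES` and `EPRINT_DATABASE` dictionaries, the `crawl` function); it is not how any particular search engine is implemented. The point is only that records reachable solely through a search form never enter the index.

```python
from collections import deque

# Hypothetical miniature "web": each page has some text and outgoing links.
PAGES = {
    "home": {"text": "welcome to the science site", "links": ["about", "eprint-search"]},
    "about": {"text": "about our laboratory", "links": ["home"]},
    # The database front page offers a search box but no links to records.
    "eprint-search": {"text": "enter a search term", "links": []},
}

# Records that are served only in response to a query typed into the form.
EPRINT_DATABASE = {
    "eprint-1": "lithium battery cathode study",
    "eprint-2": "neutron scattering results",
}

def crawl(start):
    """Follow links breadth-first and build a word -> set-of-pages index."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        page = PAGES[url]
        for word in page["text"].split():
            index.setdefault(word, set()).add(url)
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl("home")
print(sorted(index.get("laboratory", set())))  # ['about']
print(sorted(index.get("battery", set())))     # []
```

The crawler reaches the database's front page, but because no link leads to the individual e-prints, the word "battery" never appears in the composite index.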

Google recognizes this problem and implicitly acknowledges that it cannot solve the problem alone. Rather, Google implores database owners to take special steps to accommodate its crawlers. Some owners take such steps; others do not. But the root cause of the problem is that crawling technology is inherently limited.

Because search capability is key to our mission, the limitations of crawling have motivated us to find another way to make information in multiple databases searchable. It is called federated search. Federated search allows users to search multiple data sources simultaneously, in parallel, using a single query from a single user interface. Unlike the Google solution, federated search places no burdens on database owners. Already, vast new virtual collections have been opened that were previously unsearchable as a practical matter.

Specifically, we have developed products that use federated search to search web information which is unavailable through Google. Federated search is used in gateways such as Science.gov and WorldWideScience.org, which make searchable enormous quantities of non-Googleable scholarly scientific and technical information.

Here is how federated search works. A web patron seeking science information opens a portal search tool like Science.gov and enters a query, just as he or she would do at Google. But, while the patron's experience looks like Google, the architecture behind federated search is entirely different.

The query is transmitted to a central server (in Oak Ridge, Tennessee, in the case of Science.gov and WorldWideScience.org) and is then fanned out to each of a suite of databases spread geographically across the US, or even the entire world.

At each database, the query causes a search to be executed and produces a hit list of search result summaries, which might include title, author, and snippet. The hit list is then transmitted back to the central server, where the hits are ranked by relevance and sent on to the web patron. So, in the span of about 20 seconds, the query has been transmitted to numerous databases, searches have been executed at these databases, and the results have been brought back and ranked for the patron.
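The fan-out, gather, and rank steps described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Science.gov or WorldWideScience.org implementation: the `DATABASES` dictionary stands in for remote sources, and the scoring rule (fraction of query terms found in the title) is a deliberately simple placeholder for real relevance ranking.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote databases. A real gateway would send the
# query over the network to each source's own search interface.
DATABASES = {
    "db_a": [{"title": "Electric vehicles and battery storage", "author": "Lee"},
             {"title": "Wind turbine blade design", "author": "Ortiz"}],
    "db_b": [{"title": "Battery chemistry for electric vehicles", "author": "Chen"},
             {"title": "Electric grid reliability", "author": "Patel"}],
}

def search_one(name, query):
    """Run the query at one source and return its hit list of summaries."""
    terms = query.lower().split()
    hits = []
    for record in DATABASES[name]:
        title_words = record["title"].lower().split()
        matched = sum(1 for term in terms if term in title_words)
        if matched:
            hits.append({"source": name, "title": record["title"],
                         "author": record["author"],
                         "score": matched / len(terms)})
    return hits

def federated_search(query):
    """Fan the query out to every source in parallel, then merge and rank."""
    with ThreadPoolExecutor(max_workers=len(DATABASES)) as pool:
        hit_lists = list(pool.map(lambda name: search_one(name, query), DATABASES))
    merged = [hit for hits in hit_lists for hit in hits]
    # Highest score first; ties are broken alphabetically by title.
    return sorted(merged, key=lambda hit: (-hit["score"], hit["title"]))

for hit in federated_search("electric vehicles battery"):
    print(f'{hit["score"]:.2f}  {hit["source"]}  {hit["title"]}')
```

Note that the central server never holds a copy of the databases; it only forwards the query and merges the hit lists that come back.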

Federated search is inexpensive to implement and places little burden on the database owner. It also allows for fielded searching, which gives users who know very specifically what they're looking for (e.g., author, title, or publisher) the capability to perform a precision search.
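As a rough illustration of why fielded searching gives more precision, the sketch below compares a plain keyword match with a match restricted to a named field. The records and the `keyword_search` and `fielded_search` helpers are hypothetical examples, not part of any OSTI product.

```python
# Hypothetical records, as a single database might store them.
RECORDS = [
    {"title": "Battery chemistry for electric vehicles",
     "author": "Chen", "publisher": "Example Press"},
    {"title": "Chen model of grid stability",
     "author": "Patel", "publisher": "Sample Journals"},
]

def keyword_search(records, text):
    """Match the text against every field of each record."""
    text = text.lower()
    return [r for r in records if any(text in v.lower() for v in r.values())]

def fielded_search(records, **criteria):
    """Match each value only against the field it was supplied for."""
    return [r for r in records
            if all(value.lower() in r.get(field, "").lower()
                   for field, value in criteria.items())]

print(len(keyword_search(RECORDS, "chen")))         # 2
print(len(fielded_search(RECORDS, author="chen")))  # 1
```

The keyword search returns both records because "Chen" appears in one title as well as one author field; the fielded search returns only the record actually written by Chen.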

WorldWideScience.org is a brand new global science gateway that relies on federated search. It is built on the same architecture as Science.gov—the U.S. science gateway—but taken to the international level.

Suppose you want to search international databases for science information on, say, battery-powered electric vehicles. You go to WorldWideScience.org, a gateway that now has over 40 portals from over 50 countries, and enter your search term into the search box. The query is immediately sent to all the portals in parallel, in real time. The results from all the portals are returned to your desktop, compiled and ranked for you by relevance.

Federated Search has a number of advantages:

• Topic-specific resources are selected in advance

• Resources are filtered for quality in advance

• Current, real-time results

• No burden for database owner

• Inexpensive to implement

• No need for user to know about resources

• No need for user to search resources one at a time

• Allows for fielded searching

• Interoperability is automatically achieved

Even at this early stage, WorldWideScience.org searches across about 400 million pages of important scientific portals worldwide. That's a lot of science information accessible from one search box—equivalent to a shelf of documents 20 miles long. This is the first time of which we're aware that federated searching has been accomplished on a global scale.
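As a back-of-the-envelope check on that comparison, the two figures quoted above imply a thickness per page. The sketch below assumes the "shelf" means printed pages standing side by side, which is an interpretation and not something stated in the article.

```python
METERS_PER_MILE = 1609.344   # exact definition of the international mile

pages = 400_000_000          # figure quoted in the article
shelf_miles = 20             # figure quoted in the article

shelf_meters = shelf_miles * METERS_PER_MILE
mm_per_page = shelf_meters / pages * 1000

print(f"{shelf_meters:,.0f} m of shelf")
print(f"{mm_per_page:.3f} mm per page")
```

The result is about 0.08 mm per page, which is in the same range as the thickness of an ordinary sheet of office paper (roughly 0.1 mm), so the two figures are consistent with each other.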

Without WorldWideScience.org to search the national portals, information customers face a task so forbidding that it is a practical impossibility: they would have to overcome three formidable roadblocks. First, to search individual national portals, they would have to know that those portals exist. We have yet to encounter anyone who knows of more than a few such portals.

But let's do some arm waving and magically assume that roadblock away. Let's assume for a moment that the information customer somehow knows about all 40 national portals. Then the customer would face the second formidable task of visiting each portal and searching it, one by one.

But let's do some more arm waving and magically assume this roadblock away, too. Let's assume for a moment that the information customer did visit and search each portal. Then the customer would face the third roadblock of sorting through 50 long hit lists, yet another imposing task.

Thus, WorldWideScience.org changes a practical impossibility to an easy and rewarding function by searching portal upon portal of science information typically not searched by conventional search engines, in parallel, with only one query, ranking the results, and thus saving tremendous time and effort.

So where once we had isolated portals of information, we now have portals working as a unit, an integrated whole. Federated search, through gateways such as WorldWideScience.org and Science.gov, speeds communication, accelerates discovery and expedites scientific and economic progress.

So, to recap: Google by and large does not search within scientific databases, which are thus non-Googleable. Federated search opens a part of the web that is non-Googleable.

Of course, there is much to be done. The world is dotted with large and often isolated, web-based collections of scientific information. Once found, any one of these databases can be searched. But finding specific databases is a challenge, and searching them all collectively, until just recently, was only a dream. Now this dream is within reach.

Federated searching has its own set of limitations. It has advanced very rapidly over the last few years and should continue to do so. However, neither crawling nor federated searching is a panacea. Federated searching does things that crawling can't do, and crawling does things that federated searching can't do—they are complementary technologies.

Portals like WorldWideScience.org, which make searchable non-Googleable information via federated search, are complementary to conventional search tools like Google and Yahoo! that rely upon crawling. 

So, what is next? There is no inherent reason that a single tool cannot rely upon both a crawled index and a live federated search in parallel. Indeed, OSTI's largest product, the E-print Network, does just that. All in parallel, it searches 1.5 million e-prints that have been crawled plus an additional 5 million e-prints hosted in 50 e-print databases, comprising in all about 100 million pages. As far as we know, there is no other tool in the world that virtually integrates such a quantity of e-prints. Further, we are not aware of another publicly available search tool that searches federated databases and crawled indexes in parallel.
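The idea of querying a crawled index and live databases side by side can be sketched as follows. This is a toy illustration only: the `CRAWLED_INDEX` and `LIVE_DATABASES` data, the function names, and the de-duplication rule are all assumptions made for the example, and the E-print Network's actual design is not described in this article.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical index built ahead of time by crawling selected sites.
CRAWLED_INDEX = {
    "plasma": ["Plasma confinement notes"],
    "fusion": ["Plasma confinement notes", "Fusion reactor materials"],
}

# Hypothetical e-print databases reachable only through their own search.
LIVE_DATABASES = {
    "eprints_x": ["Fusion reactor materials", "Tokamak fusion diagnostics"],
    "eprints_y": ["Stellar fusion rates"],
}

def search_crawled(term):
    """Look the term up in the pre-built crawled index."""
    return [("crawled", title) for title in CRAWLED_INDEX.get(term.lower(), [])]

def search_live(name, term):
    """Run the term against one live database at query time."""
    return [(name, title) for title in LIVE_DATABASES[name]
            if term.lower() in title.lower().split()]

def hybrid_search(term):
    """Query the crawled index and every live database in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_crawled, term)]
        futures += [pool.submit(search_live, name, term) for name in LIVE_DATABASES]
        hits = [hit for future in futures for hit in future.result()]
    # Keep the first occurrence of each title so duplicates appear once.
    seen = set()
    unique = []
    for source, title in hits:
        if title not in seen:
            seen.add(title)
            unique.append((source, title))
    return unique

for source, title in hybrid_search("fusion"):
    print(source, "|", title)
```

In this toy data, "Fusion reactor materials" is found both in the crawled index and in a live database, and the merged result lists it only once.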

Please note that the crawling done by the E-print Network is different from that done by conventional search engines. The E-print Network crawls only those sites of known quality.  Such filtering produces a high quality search tool. There would seem to be great potential to build on this theme of combining into a single information product searches of crawled indexes and federated search of databases. 

Here is one potential application. We estimate that the amount of scholarly science searched by WorldWideScience.org is of about the same magnitude as the amount searched by Google. It would be technologically possible to combine WorldWideScience.org and crawled indexes. In addition, it would be technologically straightforward to add more federated search tools to make an enormous search tool. The builders of this "uber tool" of the future would need to be careful about relevance ranking, but that challenge is manageable.

Thus, the combination of crawled indexes and federated searches is an extremely promising path to the future. A billion-page, high quality science search tool may soon be available to further accelerate the progress of science.

Walt Warnick

 

 

Comments:

Fantastic article!! Especially the discussion of ePrints Network, what so much of my work time is spent on; gives me further respect for it. Thank you.

Posted by Doug LaVerne on August 18, 2008 at 08:48 AM EDT #
