InQuery and Relevance Ranking

Networked InQuery and Client-Server Architecture
Advantages of Using a Distributed InQuery

Relevance Ranking
The Concept of Belief Value (Weight) in Relevance Ranking
Natural Language Queries
Advanced Structured Queries
Search Statement Operators
Boolean Operators
Proximity Operators
Field Operator
Relevance-Ranking (Weighted) Operators
Filter Operators
Combining Search Operators

What Is InQuery?

InQuery is a probabilistic information retrieval system developed at the University of Massachusetts at Amherst that is now available commercially from Sovereign Hill Software. Given a database of text documents, InQuery can retrieve those documents most relevant to a user's query. The documents do not need to be manually classified, though they do need to be indexed by InQuery's parsing system.

User queries can be formulated in natural language (i.e., regular English phrases and sentences) or in a more exact structured query language. InQuery provides a set of development tools for creating and customizing sophisticated search and retrieval applications.

Networked InQuery and Client-Server Architecture

InQuery supports networked versions of its retrieval engine. The architecture consists of a connection server, an InQuery server, and client applications. The connection server manages service requests from any number of clients to one or more InQuery servers. The servers are the retrieval engines for one or more InQuery databases.

InQuery's client-server architecture supports a networked or distributed computing environment (DCE). At a minimum, DCE consists of communication between two (or more) computers. An InQuery client application is basically a user interface capable of initiating a network connection to a connection server, sending requests to it and receiving replies from it. Web servers like Netscape and NCSA HTTPd can function as InQuery clients through CGI (Common Gateway Interface) programs, which use the InQuery API (Application Programmer's Interface). InQuery API functions allow system designers to create several different types of InQuery client interfaces, including those to a single InQuery database (e.g., THOMAS), a multi-database connection (e.g., American Memory) and an administrative client.

An InQuery server functions as the retrieval engine that processes queries forwarded by the connection server from the user clients. All communication with the InQuery server is through the connection server. The InQuery server opens a database and services requests sent to it by evaluating a client's query or retrieving a specified document. Several InQuery servers can run on a single host and communicate with a single connection server.

Advantages of Using Distributed InQuery Databases

Client-server InQuery also allows access to resources that may be distributed across host networks. In a networked environment, InQuery supports the placement of an InQuery server on the host(s) where the InQuery database(s) reside in order to reduce the processing of potentially large databases across the network. The benefits of using a distributed InQuery include:

Clients do not require substantial resources to access InQuery databases.
InQuery server(s) can be resident on the same host(s) as InQuery database(s), thereby reducing traffic on the network.
Services requested across the network should be as fast as, or perhaps faster than, those requested by accessing a local disk.
Multi-database requests to InQuery servers on different machines are processed in parallel.

How Does InQuery Work Under THOMAS?

How InQuery works in any given application depends on how it has been developed, what network architecture it has and what algorithms have been employed to assign and calculate document weights. The discussion below attempts to document the basic development of THOMAS as an InQuery application.

THOMAS accepts search queries of two types: basic "natural language queries" and more advanced structured queries. The natural language queries allow the user to simply type the information request as an English sentence or series of terms or phrases. The Library of Congress InQuery programmers have designed an algorithm that is used by the InQuery query-processor to transform these queries into a structured form that can then be processed by the query engine. This natural language query will be the choice of most searchers.

However, by putting a query into structured form directly, the user is able to provide a precise definition of term relationships in the query, possibly resulting in improved performance. This requires a knowledgeable user to properly formulate a query using the special operators provided.

Relevance-Ranking Under THOMAS

Relevancy-ranked searching is conceptually different from traditional Boolean searching: search results are not simply those documents containing the key words searched. Instead, a whole range of documents is returned, with those considered most relevant displayed first.

The Concept of Belief Value (Weight)

Because InQuery operators can be combined and nested to produce desired search results, and because they include both traditional Boolean and sophisticated relevancy ranking operators, InQuery documentation often speaks of the "belief value," more commonly called "the weight" of a document. Thus, a document returned in response to a user's query is neither "right" nor "wrong," "good" nor "bad," "relevant" nor "irrelevant." Instead, it can be statistically evaluated and ranked according to a quantifiable belief value in its specific relevance to the query with respect to all the other documents in the database.

In a search, each document in the database is assigned its weight as a mathematical value between 0 and 1 according to:

The overall system algorithms designated by the system designer, in a natural language search; or
The search operators invoked by the user's own search strategy, in an advanced structured search.

The overall weight of a document, which is calculated from the weight of each search term, determines whether that document appears in the search results list and where in the list it appears: at the top, middle, or bottom. The system designer sets a minimum weight or threshold so that documents with weights below that value do not appear in the search results; otherwise, every document in the database would always appear in every search result. THOMAS currently sets this value to 0.4.
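The threshold step described above can be sketched in a few lines of Python. This is only an illustration: the function name, the document identifiers and the sample weights are hypothetical, and only the 0.4 cutoff comes from the text.

```python
# Hypothetical sketch of threshold filtering and ranking; not THOMAS code.
THRESHOLD = 0.4  # minimum weight the text gives for THOMAS


def rank_results(weights, threshold=THRESHOLD):
    """Drop documents whose weight is below the threshold and
    return the rest as (doc_id, weight) pairs, highest weight first."""
    kept = [(doc, w) for doc, w in weights.items() if w >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up weights for three documents.
sample = {"doc_a": 0.62, "doc_b": 0.38, "doc_c": 0.47}
print(rank_results(sample))  # doc_b falls below 0.4 and is dropped
```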

Natural Language Queries

In general, for a natural language query, InQuery calculates the weight of each term for a given document by dividing the number of times the term appears in the document (term frequency) by the number of documents in which the term appears (document frequency), so that terms occurring in many documents count for less (the inverse document frequency effect). A factor is added to compensate for the size of the document. The weights of all the terms in the search statement are then averaged to give an overall weight to the document and rank it in the search results. (The 250 or so stopwords maintained in a THOMAS stopword list are not weighted and are not reflected in search results.)
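As a rough illustration of the arithmetic just described, the sketch below computes the raw ratio for each term and averages the results. It is a simplification under stated assumptions: the document-size factor is omitted, the raw ratios are not scaled into the 0 to 1 range of a belief value, and the function names and counts are hypothetical.

```python
# Simplified sketch of the per-term ratio and the averaging step.
# Assumptions: no document-length factor, no scaling to the 0..1 range.

def term_weight(count_in_document, documents_with_term):
    """Occurrences of the term in this document divided by the
    number of documents in which the term appears."""
    return count_in_document / documents_with_term


def document_weight(term_weights):
    """Average the weights of all query terms for one document."""
    return sum(term_weights) / len(term_weights)


# Made-up counts: one term appears 6 times here and in 3 documents overall,
# another appears 4 times here and in 8 documents overall.
weights = [term_weight(6, 3), term_weight(4, 8)]
print(document_weight(weights))  # (2.0 + 0.5) / 2
```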

In a natural language query, you can simply type the search terms you are looking for as key words or as a complete sentence. InQuery automatically transforms these phrases or sentences into a more structured form, which is then processed according to certain rules or "algorithms."

The algorithms currently assigned as defaults in THOMAS are configured as follows:

1. If all the search terms occur in proximity as an exact phrase, the phrase weight is multiplied by 90.
2. If all search terms occur within a fixed-size unordered window, their combined weight is multiplied by 45. The window size is determined by the number of words in the search, with the window increasing in size by 10 words for each term in the query, e.g., the phrase "Library of Congress" has a search window of 20 words (since "of" is a stopword).
3. The entire algorithm (1 and 2 above) is repeated for title fields alone, and weights are multiplied by an additional 20 if the construct appears in the title. Thus, if an exact phrase appears in the title, its weight is multiplied by 1800 overall (90 x 20) to move that particular document to the top (or close to the top) of the list.
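The multipliers and the window-size rule above can be restated as a short Python sketch. The function names and the three-word stopword set are hypothetical stand-ins (the text says the real list holds about 250 words), and the sketch only reproduces the stated numbers; it does not show how THOMAS combines the boosted weights into a final score.

```python
# Hypothetical restatement of the default boosts described in the text.
STOPWORDS = {"of", "the", "and"}  # illustrative subset only
BOOSTS = {"phrase": 90, "window": 45}
TITLE_BOOST = 20


def window_size(query, stopwords=STOPWORDS):
    """10 words of window for each non-stopword term in the query."""
    terms = [w for w in query.lower().split() if w not in stopwords]
    return 10 * len(terms)


def boost(match_kind, in_title=False):
    """Multiplier for an exact-phrase or unordered-window match,
    with the additional title factor when the match is in the title."""
    factor = BOOSTS[match_kind]
    return factor * TITLE_BOOST if in_title else factor


print(window_size("Library of Congress"))  # 20
print(boost("phrase", in_title=True))      # 1800
```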

Under the existing THOMAS search algorithm, short phrases may be one of the best ways to search with natural language queries. Search logs from THOMAS show that short phrases are the most common form of user query.

Advanced Structured Searches

In contrast to natural language searches, advanced structured queries allow you to assign powerful search commands or "operators" to your search statement. By inputting your query directly in structured form, you can gain greater control over the relationship of the terms in the query vis-à-vis the search results. This can dramatically improve search performance; however, direct input requires knowledge of the special operators used by THOMAS's InQuery search engine. Advanced structured searches using InQuery operators override the default relevance-ranking algorithms and radio and checkbox selections.

Unlike natural language searches, advanced structured searches must begin with a "#" sign. Document weights in structured searches are calculated by running all the operators in a search query - Boolean, proximity and relevancy - against all the documents in a database and then ranking the documents in the order of highest value. In THOMAS, the user can determine the number of documents to be retrieved; the default is 100 documents, and users can increase this number up to 1000, a limit set by the system designer. In any case, documents with the highest weight are presented first, then those of lesser value and so on, until the number of documents requested or the number of documents that satisfy the search has been displayed, whichever is less. Alternatively, in other InQuery applications, a relevancy threshold can be set by the user, and then all documents with a weight above that threshold will be retrieved. Those with a weight beneath the threshold will not be retrieved, so the number of documents retrieved varies from search to search.

Search Statement Operators

InQuery supports 19 different special search statement operators, 18 of which are currently implemented in the THOMAS service. These include the basic Boolean commands of "and," "or" and "not," as well as phrase, adjacency and proximity searching. In addition, there is a whole series of operators designed to enable users to improve relevancy-ranked search results.

All InQuery search query operators must be preceded by a "#" sign. The search query operators currently available in THOMAS are:

Boolean Operators

Boolean AND Operator

Form: #band(T1 . . . Tn)
Example: #band (baseball football hockey)
Explanation: All three terms within the parentheses must appear in a document for it to be retrieved.

Boolean OR Operator

Form: #or(T1 . . . Tn)
Example: #or(baseball football hockey)
Explanation: Any one or more of the three terms within the parentheses must appear in a document for it to be retrieved.

Boolean AND NOT Operator

Form: #bandnot (T1 . . . Tn)
Example: #bandnot(football baseball)
Explanation: The first word within the parentheses must be present in the document, but not the second, for the document to be retrieved. In InQuery, the traditional Boolean "not" operator, which excludes any document containing a specified term from the search results list, is used only in conjunction with the "and" operator -- thus, the name Boolean "and not" operator.
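The three Boolean operators behave like ordinary set operations on the lists of documents containing each term. The sketch below shows that idea with a tiny made-up index; the index contents and the function names are hypothetical and are not part of InQuery.

```python
# Hypothetical inverted index: term -> set of document numbers.
INDEX = {
    "baseball": {1, 2, 4},
    "football": {1, 3, 4},
    "hockey": {1, 4, 5},
}


def band(index, *terms):
    """Documents containing every term (set intersection)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def bor(index, *terms):
    """Documents containing at least one term (set union)."""
    return set().union(*(index.get(t, set()) for t in terms))


def bandnot(index, required, excluded):
    """Documents containing the first term but not the second."""
    return index.get(required, set()) - index.get(excluded, set())


print(band(INDEX, "baseball", "football", "hockey"))  # {1, 4}
print(bor(INDEX, "baseball", "football", "hockey"))   # {1, 2, 3, 4, 5}
print(bandnot(INDEX, "football", "baseball"))         # {3}
```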

Proximity Operators

Ordered Distance (Proximity) Operator

Form: #N (T1 . . . Tn) or #odN (T1 . . . Tn)
Example: #3(library congress)
Explanation: The "Ordered Distance Operator" requires that the first term be within a "window" of N words of the second term, and that it precede the second term, to receive a higher weight. In the example, any form of the word "library" must be within 3 words of, and precede, any form of the word "congress" to be accorded a higher weight. InQuery counts stopwords in its proximity calculations. Documents containing only the phrase "congressional library" (not "Library of Congress") would not be weighted, as the terms are not ordered as in the query.
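A minimal Python sketch of the ordered-distance test follows. It rests on several assumptions: the function name is hypothetical, "within N words" is read as a position difference of at most N, stopwords are counted (as the text states), and no stemming is done, so only exact word forms match.

```python
# Hypothetical sketch of an ordered-distance check on a tokenized document.

def ordered_within(tokens, first, second, n):
    """True if `first` occurs before `second` and their positions
    differ by at most n (stopwords count toward the distance)."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(0 < j - i <= n
               for i in first_positions
               for j in second_positions)


print(ordered_within("the library of congress".split(),
                     "library", "congress", 3))   # True
print(ordered_within("congress library".split(),
                     "library", "congress", 3))   # False: wrong order
```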

Unordered Window Operator

Form: #uwN(T1 . . . Tn)
Example: #uw150(chesapeake watershed pollution erosion water quality)
Explanation: The documents given the highest belief value (weight) will be those with any form of the terms in the parentheses within a "window" of 150 words in any order.
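The unordered window test can be sketched the same way. The function name is hypothetical, the window is taken to be N consecutive words, and stemming is again left out, so this is a simplified picture rather than InQuery's own matching.

```python
# Hypothetical sketch of an unordered-window check on a tokenized document.

def unordered_window(tokens, terms, n):
    """True if every term occurs, in any order, inside some run of
    n consecutive tokens."""
    wanted = set(terms)
    for start in range(len(tokens)):
        if wanted <= set(tokens[start:start + n]):
            return True
    return False


text = "pollution in the chesapeake watershed".split()
print(unordered_window(text, ["chesapeake", "pollution"], 4))  # True
print(unordered_window(text, ["chesapeake", "pollution"], 3))  # False
```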

Ordered Phrase Operator

Form: #phrase (T1 . . . Tn)
Example: #phrase(certified public accountant)
Explanation: Terms within this operator are evaluated to determine if they occur together frequently in the database. If they do, the operator is treated as an ordered distance operator of 3 (#od3). If the terms are not found to co-occur in the database, the phrase operator is converted to a "sum" operator.

Passage Operator

Form: #passageN (T1 . . . Tn)
Example: #passage50(student financial aid assistance grants loans)
Explanation: The Passage Operator looks for the terms within the parentheses to be within a window of "N" words in any order. Every document is segmented into the passage window of user-defined length, and each passage is treated as a separate document in the InQuery database. The original document is then weighted based upon the scores of its "best passages" and ranked accordingly. In the THOMAS Text of Legislation and Congressional Record databases, this operator is used to bring back the "Best Sections" (selectable from the "navigation panel") -- that is, those passages of the bill or record best matching the terms in the user's query.
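The idea of scoring a document by its best passage can be sketched as follows. The details are assumptions made for illustration: passages here are non-overlapping runs of N words, each passage is scored by the fraction of query terms it contains, and the function name is hypothetical. InQuery's actual passage scoring is not specified in this document.

```python
# Hypothetical sketch: split a document into passages and keep the best score.

def best_passage_score(tokens, terms, n):
    """Score each n-word passage by the fraction of query terms it
    contains and return the highest passage score."""
    wanted = set(terms)
    if not wanted:
        return 0.0
    passages = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    scores = [len(wanted & set(p)) / len(wanted) for p in passages]
    return max(scores, default=0.0)


doc = "student loans and grants are forms of financial aid".split()
# The first 4-word passage contains 2 of the 3 query terms.
print(best_passage_score(doc, ["student", "loans", "aid"], 4))
```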

Literal Operator

Form: #lit(T1 . . . Tn)
Example: #lit(Budget Reconciliation Act of 1995)
Explanation: This operator gives an exact match to the phrase contained in it, without any stemming of the words and without discarding stopwords. Since THOMAS currently employs only stemmed databases in its full-text files, the #lit operator will not work on THOMAS full-text files, because InQuery indexes only word stems rather than entire words. The THOMAS system will be implementing a synonym dictionary to replace the stemmer now in use. THOMAS files will then be reindexed with words in their entirety (unstemmed words), so users will have a choice of finding exact phrases (the #lit operator will be used "behind the scenes" to accomplish this) or "words plus variants" by invoking the synonym dictionary.

Field Operator

Form: #field(fieldname #REL-OP T1 . . . Tn)
Example: #field(TITLE welfare)
Explanation: The terms contained in the field operator are searched only within the fieldname specified. The relational operator (REL-OP) allows fields to be searched for a range of values. If the REL-OP is not used, equality is used by default.

Relevancy-Ranking (Weighted) Operators

SUM Operator

Form: #sum(T1. . .Tn)
Example: #sum(france french nuclear atomic testing)
Explanation: SUM is the simplest of the search operators and is the default operator assigned by InQuery when it receives a natural language search statement. Each term is assigned a weight based on the number of occurrences of the term in the document divided by the number of documents in which the term has occurred. Each search term contained within the parentheses following the sum operator is treated as having an equal influence on the final result, i.e., their weights are averaged to produce the single weight of the document as a whole. (A document returned need not contain all the terms within the #sum operator, but the more of them it contains, the higher its weight.)

Weighted SUM Operator

Form: #wsum(Ws W1 T1 . . . Wn Tn)
Example: #wsum (1.0 0.4 railroads 0.4 airlines 0.1 trucks 0.1 automobiles)
Explanation: The #wsum operator allows the user to define weights for (associate more importance with) various terms in the query. The weight associated with the entire query is 1.0. Of this total, an equal weight of 0.4 is assigned to "railroads" and "airlines," while a lesser equal weight of 0.1 is assigned to "trucks" and "automobiles." This operator should be used when the user is most interested in documents containing any form of the words "railroads" or "airlines" and less interested in documents containing any form of the terms "trucks" or "automobiles."
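One plausible reading of the weighted sum is a weighted average of the per-term beliefs, sketched below with the weights from the example. The per-term belief values for the two documents are made up, the function name is hypothetical, and the overall query weight (the leading 1.0 in the example) is left out.

```python
# Hypothetical sketch of a weighted sum as a weighted average of term beliefs.
WEIGHTS = [0.4, 0.4, 0.1, 0.1]  # railroads, airlines, trucks, automobiles


def wsum(weights, beliefs):
    """Weighted average: sum(w * b) divided by sum(w)."""
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, beliefs)) / total


# Made-up beliefs for two documents, in the same term order as WEIGHTS.
rail_doc = [0.8, 0.6, 0.0, 0.0]   # strong on railroads and airlines
truck_doc = [0.0, 0.0, 0.8, 0.6]  # strong on trucks and automobiles

# The document about railroads and airlines ends up with the higher weight.
print(wsum(WEIGHTS, rail_doc) > wsum(WEIGHTS, truck_doc))  # True
```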

Relevancy AND Operator

Form: #and(T1 . . . Tn)
Example: #and(loans grants aid assistance Pell Stafford student education)
Explanation: The more terms within the #and operator parentheses found in the document, the higher the weight of that document. If one or more terms are not found, the document is still retrieved, but its weight is lowered. Some searchers have called this a "fuzzy and" to distinguish it from the more conventional Boolean "and."

NOT (Negation) Operator

Form: #not (T1 . . . Tn)
Example: #not(budget debt appropriations authorization)
Example: #sum(dumping #not(garbage refuse landfills trash disposal))
Explanation: This "not" operator behaves differently from the traditional Boolean "not" operator. The words within the parentheses are negated so that documents that do NOT contain these words are weighted more highly than those that do. The "not" operator is best used in compound searches, in combination with the #sum operator, for example. In the first example, the #not operator increases the weight of any documents not containing any of the words "budget," "debt," "appropriations" or "authorization." In the second example, documents with the term "dumping" will be ranked highly, whereas those containing references to "garbage," "refuse," "landfills," "trash" and "disposal" will be ranked lower on the list, or will not appear at all if their weight falls under 0.4 in THOMAS.

Synonym Operator

Form:#syn(T1 . . . Tn)
Example: #syn(lawyer attorney)
Explanation: The terms of this operator are treated as instances of the same term. Rather than calculating a separate weight for each term, synonyms are given a joint weight calculated by dividing the total number of occurrences of all the synonym terms by the total number of documents in which the synonyms occur. In a database where the term "lawyer" occurs 10 times in 5 documents (10/5 = 2) and the term "attorney" occurs 15 times in 3 documents (15/3 = 5), treating them as distinct terms gives an average weight of 3.5 (2 + 5 = 7; 7/2 = 3.5). This would be the result if the #sum operator were used. However, if the words are treated as synonyms, a weight of 3.125 results: (10 + 15)/(5 + 3) = 25/8 = 3.125.
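The arithmetic in this example can be checked with a few lines of Python. The counts are the ones given above; the function names are hypothetical and the sketch covers only the two ratios being compared.

```python
# Worked check of the lawyer/attorney example; names are illustrative.
# Each entry is (total occurrences, number of documents containing the term).
COUNTS = {"lawyer": (10, 5), "attorney": (15, 3)}


def averaged_weight(counts):
    """Treat terms separately: average their occurrence/document ratios."""
    ratios = [occ / docs for occ, docs in counts.values()]
    return sum(ratios) / len(ratios)


def synonym_weight(counts):
    """Treat terms as one: pooled occurrences over pooled documents."""
    total_occ = sum(occ for occ, _ in counts.values())
    total_docs = sum(docs for _, docs in counts.values())
    return total_occ / total_docs


print(averaged_weight(COUNTS))  # 3.5   (as with #sum)
print(synonym_weight(COUNTS))   # 3.125 (as with #syn)
```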

MAX Operator

Form: #max(T1 . . . Tn)
Example: #max(executive manager officer director)
Explanation: The maximum weight of all the individual weights of terms in parentheses is taken to be the weight of this operator. This search for the concept of manager will take the highest weighted synonym among "executive," "manager," "officer" and "director" in order to determine the relevancy of each document.

Weight Plus Operator

Form: #+ T1
Example: #sum(militia #3(domestic terrorism) #+wiretap)
Explanation: The effect of using the Weight Plus operator before a term is to increase that term's weight relative to the rest of the query. It is useful in complex searches when a particular term may be under-represented in the initial search results and you want to give that particular term more weight in your next search. In the example, more weight is given to the term "wiretap" than to the other two elements of the search statement.

Weight Minus Operator

Form: #- T1
Example: #sum(waste #-hazardous #-toxic #-nuclear)
Explanation: The Weight Minus operator decreases the weight given to terms that follow it, thereby decreasing the overall weight of documents containing the terms relative to other documents retrieved by the query. Only one term can follow each minus operator. While serving a function similar to NOT, MINUS decreases the weight of documents containing the term; NOT, on the other hand, increases the weight of documents which do not contain the term.

Filter Operators

Filter Require Operator

Form: #filreq(arg1 arg2)
Example: #filreq (#uw10(defense appropriations) #uw10(defense authorization))
Explanation: The Filter Require Operator shows the documents returned (belief list) of the first argument if and only if the second argument would return documents. The value of the second argument does not affect the weights of the first statement, only whether the search results will be returned or not. Thus, a user can ask for everything in search statement A if and only if there are results for search statement B. The example query returns only documents that have any form of the words "defense" and "appropriations" within 10 words of each other (in any order) if and only if there are documents that have the words "defense" and "authorization" within 10 words of each other (in any order).

Filter Reject Operator

Form: #filrej(arg1 arg2)
Example: #filrej(#uw10(defense appropriations) #uw10(defense authorization))
Explanation: The Filter Reject Operator shows the documents returned (belief list) of the first search argument if and only if NO documents are returned by the second argument. The results of the second argument do not affect the weights of the first argument's results, only whether or not those results are returned. This command enables users to search in the alternative rather than sequentially, i.e., if there is nothing on search statement B, then give me the results for search statement A instead. The example query returns only documents that have any form of the words "defense" and "appropriations" within 10 words of each other (in any order) if and only if there are NO documents that have the words "defense" and "authorization" within 10 words of each other (in any order).
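The two filter operators, as they are described here, amount to a simple condition on the second argument. The sketch below follows that description literally; the function names mirror the operator names, but the code and the sample result lists are hypothetical and not InQuery's implementation.

```python
# Hypothetical sketch of the filter operators as described in the text.
# Each argument is the ranked result list of an already-evaluated query.

def filreq(first_results, second_results):
    """Return the first list only if the second list is non-empty."""
    return first_results if second_results else []


def filrej(first_results, second_results):
    """Return the first list only if the second list is empty."""
    return first_results if not second_results else []


appropriations_hits = [("doc_7", 0.71), ("doc_2", 0.55)]  # made-up results
authorization_hits = []                                   # nothing found

print(filreq(appropriations_hits, authorization_hits))  # []
print(filrej(appropriations_hits, authorization_hits))  # the first list
```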

Combining Search Operators

The operators available in THOMAS can be combined and nested to produce desired results. For example, a simple structured query might be a SUM of a term and a WORD PROXIMITY ordered distance operator:

#sum(reform #2(health care))

Explanation: This query would find documents that contained the term "reform" and the terms "health" and "care" occurring no more than 2 words apart.

A BANDNOT search might involve nested commands:

#bandnot(dumping #sum(garbage refuse disposal landfill))

Explanation: Documents containing the term "dumping" will be retrieved but not if they contain the terms "garbage," "refuse," "disposal" or "landfill."

A #wsum search allows the user to add extra importance to one phrase in a query:

#wsum (20 15 #1 (White House) 5 #phrase (historic preservation))

Explanation: This query places emphasis on the "White House" concept and less on the concept of "historic preservation."

Some operators, however, are restricted in which type of operators they are allowed to contain. A primary rule in formulating structured queries is that "belief operators" may not occur inside of "proximity operators."

That is, operators that require positional information cannot contain operators that return more general results. Thus, the word proximity #OD (#n), #UW (#uwn), #PHRASE, #PASSAGE and #SYN operators cannot contain #SUM, #WSUM, #AND, #BAND, #OR, #NOT, #BANDNOT, #FIELD, #FILREQ or #FILREJ operators. Said another way, weight operators may not occur inside of proximity operators. This is because proximity lists (the basic unit of InQuery knowledge) can be converted to weights, but weights cannot be converted to proximity lists.
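The nesting rule can be checked mechanically before a query is submitted. The sketch below is a hypothetical helper, not part of InQuery or THOMAS: it scans a query string, tracks which operators are currently open, and reports any weight (belief) operator found inside a proximity operator. The two operator sets are taken from the lists above.

```python
import re

# Operator families as listed in the text (lower-cased base names).
PROXIMITY = {"od", "uw", "phrase", "passage", "syn"}
BELIEF = {"sum", "wsum", "and", "band", "or", "not",
          "bandnot", "field", "filreq", "filrej"}

# "#name(" with an optional space before "(", or a bare "(" or ")".
TOKEN = re.compile(r"#([A-Za-z]*\d*)\s*\(|(\()|(\))")


def operator_kind(name):
    """Strip a trailing number: 'uw10' -> 'uw'; a bare '3' -> 'od'."""
    base = re.sub(r"\d+$", "", name.lower())
    return base or "od"


def violations(query):
    """Return (outer, inner) pairs where a belief operator is nested
    inside a proximity operator."""
    found, stack = [], []
    for m in TOKEN.finditer(query):
        if m.group(1) is not None:          # an operator opens
            kind = operator_kind(m.group(1))
            if kind in BELIEF:
                outer = next((k for k in reversed(stack)
                              if k in PROXIMITY), None)
                if outer:
                    found.append((outer, kind))
            stack.append(kind)
        elif m.group(2):                    # a plain "("
            stack.append(None)
        elif stack:                         # a ")"
            stack.pop()
    return found


print(violations("#sum(reform #2(health care))"))      # []
print(violations("#uw10(defense #sum(budget debt))"))  # [('uw', 'sum')]
```

A check like this only looks at the nesting of operators; it does not validate term syntax or the numeric arguments of operators such as #wsum.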