SRU VERSION 1.1 ARCHIVE
Common Query Language
CQL Version 1.1, 13th February 2004
Sample Queries - BNF - Rules - Features - Conformance - Context
Sets - the CQL Context Set - Relations - Modifiers - Masking - Result
Sets - Proximity
CQL is a formal language for representing queries to information retrieval
systems such as web indexes, bibliographic catalogs and museum collection
information. The design objective is that queries be human readable and
writable, and that the language be intuitive while maintaining the expressiveness
of more complex languages.
Traditionally, query languages have fallen into two camps: powerful,
expressive languages that are not easily readable or writable by non-experts
(e.g. SQL, PQF, and XQuery); or simple and intuitive languages that are not powerful
enough to express complex concepts (e.g. CCL and Google). CQL tries to
combine simplicity and intuitiveness of expression for simple, everyday
queries with the richness of more expressive languages to accommodate
complex concepts when necessary.
Sample Queries
Following are examples of simple CQL queries. These are all self-explanatory:
dinosaur
"complete dinosaur"
title = "complete dinosaur"
title exact "the complete dinosaur"
dinosaur or bird
dinosaur and "ice age"
dinosaur not reptile
dinosaur and bird or dinobird
(bird or dinosaur) and (feathers or scales)
"feathered dinosaur" and (yixian or jehol)
publicationYear < 1980
lengthOfFemur > 2.4
bioMass >= 100
The following are a bit more complicated:
Example | Explanation
title all "complete dinosaur" | Title contains all of the words "complete" and "dinosaur"
title any "dinosaur bird reptile" | Title contains any of the words "dinosaur", "bird", or "reptile"
(caudal or dorsal) prox vertebra | A proximity query: either "caudal" or "dorsal" near "vertebra"
ribs prox/distance<=5 chevrons | A more specific proximity query: "ribs" within 5 words of "chevrons"
ribs prox/unit=sentence chevrons | "ribs" in the same sentence as "chevrons"
ribs prox/distance>0/unit=paragraph chevrons | "ribs" and "chevrons" occurring in the same document in different paragraphs
subject any/relevant "fish frog" | Find documents that would seem relevant either to "fish" or "frog"
subject any/rel.lr "fish frog" | Same as previous, but use a specific relevance algorithm (linear regression)
Formal Definition: CQL BNF
Following is the Backus Naur Form (BNF) definition for CQL. ["::=" represents
"is defined as"]
cqlQuery ::= prefixAssignment cqlQuery | scopedClause
prefixAssignment ::= '>' prefix '=' uri | '>' uri
scopedClause ::= scopedClause booleanGroup searchClause | searchClause
booleanGroup ::= boolean [modifierList]
boolean ::= 'and' | 'or' | 'not' | 'prox'
searchClause ::= '(' cqlQuery ')' | index relation searchTerm | searchTerm
relation ::= comparitor [modifierList]
comparitor ::= comparitorSymbol | namedComparitor
comparitorSymbol ::= '=' | '>' | '<' | '>=' | '<=' | '<>'
namedComparitor ::= identifier
modifierList ::= modifierList modifier | modifier
modifier ::= '/' modifierName [comparitorSymbol modifierValue]
prefix, uri, modifierName, modifierValue, searchTerm, index ::= term
term ::= identifier | 'and' | 'or' | 'not' | 'prox'
identifier ::= charString1 | charString2
charString1 ::= Any sequence of characters that does not include any of
the following:
whitespace
( (open parenthesis)
) (close parenthesis)
=
<
>
'"' (double quote)
/
If the final sequence is a reserved word, that token is
returned instead. Note that '.' (period) may be included, and
a sequence of digits is also permitted. Reserved words are
'and', 'or', 'not', and 'prox' (case insensitive). When a reserved
word is used in a search term, case is preserved.
charString2 ::= Double quotes enclosing a sequence of any characters except
double quote (unless preceded by backslash (\)). Backslash
escapes the character following it. The resultant value includes
all backslash characters except those releasing a double quote
(this allows other systems to interpret the backslash character).
The surrounding double quotes are not included.
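The lexical rules for charString1 and charString2 can be illustrated with a small tokenizer. This is only a sketch, not part of the specification: the function name tokenize, the token kinds ("bare", "quoted", "symbol", "reserved") and the choice to match two-character comparitor symbols greedily are all assumptions made for illustration.

```python
import re

RESERVED = {"and", "or", "not", "prox"}

TOKEN_RE = re.compile(r'''
    \s*(?:
        (?P<quoted>"(?:\\.|[^"\\])*")   # charString2: double quoted, backslash escapes
      | (?P<symbol><=|>=|<>|[()=<>/])   # parentheses, slash and comparitor symbols
      | (?P<bare>[^\s()=<>"/]+)         # charString1: none of the excluded characters
    )''', re.VERBOSE | re.DOTALL)


def tokenize(query):
    """Split a CQL query into (kind, value) pairs."""
    query = query.strip()
    tokens, pos = [], 0
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at offset {pos}: {query[pos:]!r}")
        kind = match.lastgroup
        text = match.group(kind)
        if kind == "quoted":
            # Drop the surrounding quotes; keep every backslash except
            # those releasing a double quote.
            text = text[1:-1].replace('\\"', '"')
        elif kind == "bare" and text.lower() in RESERVED:
            kind = "reserved"  # the original case of the text is preserved
        tokens.append((kind, text))
        pos = match.end()
    return tokens
```

A parser built on such a tokenizer would still have to decide from context whether a reserved word is acting as a boolean operator or as a term, since the grammar allows both.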
General Rules
-
CQL Query
A CQL query is essentially a search clause, or multiple
search clauses connected by boolean operators. (In addition
it may include prefix assignments which assign short names to known
contexts. See context sets.)
-
Search Clause
A search clause consists of an index, relation, and search term,
or a search term alone. Thus every search clause has a search term,
but both the index and relation may be omitted - the clause
must include either both or neither of the index and relation.
(Note that the use of the "index" concept in CQL is not
intended to have any implementation implications; it does not imply
the presence of a physical index.)
Examples:
Index/relation/search term: title = cat
Search term only: cat
-
Search Term
Search terms may be enclosed in double quotes. Search terms must be
enclosed in double quotes if they contain any of the following characters: < > =
/ ( ) and whitespace. The search term may be empty, but must be present
in a search clause. An empty search term is expressed as "" and has
no defined semantics.
-
Index Name
An index name always includes a base name and may also include a
prefix that identifies the context set to which the index belongs.
If the context is not supplied, it is determined by the server. If
the index is not supplied, it is determined by the server. (Note
that the index may be omitted only when the relation is also omitted.
Either both must be supplied, or both omitted.)
Examples:
title = cat | context determined by the server
dc.title = cat | index context is dc
cat | context and index determined by the server
-
Relation
The relation in a search clause specifies the relationship
between the index and search term. It also always includes
a base name and may also include a prefix providing a context for
the relation. If a relation is supplied with no accompanying context,
the context is 'cql' (the cql
context set). If no relation is supplied, then cql.scr
(server choice) is assumed, which means that the relation is determined
by the server. (Note that the relation may be omitted only when the
index is also omitted. Either both must be supplied, or both omitted.)
Examples:
title = cat | context for relation is 'cql'; fully qualified relation is cql.=
title cql.any cat | relation is 'any'; relation context is 'cql'. Equivalent to: title any cat
cat | index and relation are determined by the server (formally the relation is 'cql.scr')
- Relation Modifiers
Relation modifiers may accompany a relation. These also may be accompanied
by a context. If a context is not supplied for a modifier,
the default is the cql context set.
Relation modifiers are separated from each other and from the relation
by slashes (/). Whitespace may be present on either side of a /
character, but the relation plus modifiers group may not end in a
/.
Examples:
dc.title any/relevant/rel.CORI "cat fish" | the relation 'any' is modified by (1) 'relevant' whose context is 'cql' and (2) 'CORI' whose context is 'rel'.
dc.author exact/stem "smith, j." | the relation 'exact' is modified by 'stem' whose context is 'cql'.
-
Boolean Operators
Search clauses may be linked by boolean operators. These are: and, or, not and prox.
(Note that not is really and-not; that is,
it may not be used as a unary operator.) Boolean operators all have
the same precedence; they are evaluated left-to-right. Parentheses
may be used to override left-to-right evaluation.
-
Boolean Modifiers
Just as a relation may have modifiers, a boolean operator
may have modifiers, separated by '/' characters. Boolean modifiers
may come from any context set. If not supplied, the context is
the CQL context set. (Note that
Boolean operators themselves are limited to the built-in set of
four.)
Example: dc.title=cat and/rel.sum dc.title=dog
-
Case Insensitive
All parts of CQL are case insensitive apart from user-supplied search
terms, which may or may not be case sensitive. 'OR', 'or', 'Or'
and 'oR' are all the same boolean operator, just as 'dc.title',
'DC.Title' and 'dC.TiTLe' are all the same context set plus index
name.
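The rule that boolean operators share one precedence and are evaluated left-to-right can be illustrated with a minimal sketch. It assumes a flat, whitespace-separated query of bare terms with no parentheses, and a hypothetical postings dictionary mapping each term to a set of matching record identifiers; prox is left out because it cannot be computed from such sets.

```python
def evaluate_left_to_right(query, postings):
    """Evaluate a flat 'term op term op term' query over sets of record ids."""
    tokens = query.split()
    result = set(postings[tokens[0]])
    # Operators sit at odd positions, their right-hand terms at even positions.
    for operator, term in zip(tokens[1::2], tokens[2::2]):
        operator = operator.lower()      # boolean operators are case insensitive
        matches = postings[term]
        if operator == "and":
            result &= matches
        elif operator == "or":
            result |= matches
        elif operator == "not":          # 'not' is and-not, never unary
            result -= matches
        else:
            raise ValueError(f"unsupported operator: {operator}")
    return result


# Hypothetical record ids, chosen so that grouping changes the answer.
postings = {"dinosaur": {1, 2, 3}, "bird": {2, 3, 4}, "dinobird": {3, 5}}
left_to_right = evaluate_left_to_right("dinosaur and bird or dinobird", postings)
# The parenthesised form, dinosaur and (bird or dinobird), gives a different set.
grouped_right = postings["dinosaur"] & (postings["bird"] | postings["dinobird"])
```

With these sample sets the unparenthesised query is read as (dinosaur and bird) or dinobird, which is why parentheses are needed whenever another grouping is intended.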
Additional CQL Features
The following are all formally defined by the CQL context set but described
here for convenience.
Relations
For ordered (e.g. numeric) terms:
<, >, <=, >=,
and <> mean "less than", "greater than", "less
or equal", "greater or equal", and "not equal".
When the term is a list of words:
-
'=' is used for word adjacency: the words
appear in that order with no others intervening. (Note the
dual use of '=': it is also used for numeric equality.)
-
'any' means "any of these words"
-
'all' means "all of these words"
When the term is a character string:
'exact' is used for exact
string matching.
When the term has multiple dimensions:
'within' may be used to search for values that
fall within the range, area or volume described by the search term.
When the index's data has multiple dimensions:
'encloses' may be used to search for values
where the database's term fully encloses the search term.
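The relations described above can be sketched as simple predicates. This is an illustration under stated assumptions rather than a definition: words are split on whitespace and compared in lower case, 'within' and 'encloses' are restricted to one-dimensional numeric ranges written as "low high", and the range bounds are treated as inclusive. The function names are hypothetical. The sample values reuse titles and ranges from this section's examples.

```python
def words(text):
    """Split a value into lower-cased words (an assumed normalisation)."""
    return text.lower().split()


def adjacent(field, term):
    """'=' on a word list: the words appear in order with none intervening."""
    f, t = words(field), words(term)
    return any(f[i:i + len(t)] == t for i in range(len(f) - len(t) + 1))


def any_word(field, term):
    """'any': at least one word of the term occurs in the field."""
    return bool(set(words(field)) & set(words(term)))


def all_words(field, term):
    """'all': every word of the term occurs in the field, in any order."""
    return set(words(term)) <= set(words(field))


def exact(field, term):
    """'exact': the whole field equals the term as a string."""
    return field == term


def within(value, term):
    """'within': the indexed value falls inside the range given as the term."""
    low, high = (float(part) for part in term.split())
    return low <= float(value) <= high


def encloses(indexed_range, term):
    """'encloses': the indexed range contains the value given as the term."""
    low, high = (float(part) for part in indexed_range.split())
    return low <= float(term) <= high


long_title = "a day in the life of the cat in the hat"
adjacency_hit = adjacent(long_title, "cat in the hat")
adjacency_miss = adjacent("cat in the green hat", "cat in the hat")
```

A real server would apply its own tokenisation and case rules; the point of the sketch is only the difference between adjacency, any, all and exact matching.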
Examples:
This query | Would match this | but not this
title = "cat in the hat" | "a day in the life of the cat in the hat" | "hat in the cat" or "cat in the green hat"
title all "cat hat" | "hat in the cat" | "cat in the grass"
title any "cat hat" | "cat in the grass" | "dog in the grass"
title exact "cat in the hat" | "cat in the hat" | "a day in the life of the cat in the hat"
date within "2002 2005" | 2004 | 2006
dateRange encloses 2003 | "2002 2005" | "2004 2005"
Relation Modifiers - Term Functions
These relation modifiers request that the server perform some algorithm
on the term before processing.
-
stem
The server should apply a stemming algorithm to the words within the
term. For example, walked, walking, walker etc. would all
be represented by the stem word walk. This allows a search like
title =/stem "these completed dinosaurs" to match The Complete
Dinosaur.
-
relevant
The server should use a relevancy algorithm for determining matches
and the order of the result set.
Example: subject any/relevant "fish frog"
would find records relevant to "fish" or "frog" and
order the result set by relevance to fish or frog.
Relation Modifiers - Qualifiers
These modifiers qualify the relation to more precisely determine its
semantics.
-
word
The term consists of words (rather than being an opaque string).
-
string
The term is a single item, and should not be broken up.
-
isoDate
Each item within the term conforms to the ISO 8601 specification for
expressing dates.
-
number
Each item within the term is a number.
-
uri
Each item within the term is a URI.
-
masked
This means that the masking rules (see next) apply. Masking is assumed
even if not specified, unless 'unmasked' is specified (so there
is never any reason to include 'masked').
- unmasked
Do not apply masking rules.
Masking Rules
-
A single asterisk (*) is used to mask zero or more characters.
-
A single question mark (?) is used to mask a single character, thus
N consecutive question-marks means mask N characters.
-
Caret/hat (^) is used as an anchor character for terms that are word
lists, that is, where the relation is 'all' or 'any', or '=' when
used for word adjacency. It may not be used to anchor a string, that
is, when the relation is 'exact' (string matches are, by definition, anchored).
It may occur at the beginning or end of a word (with no intervening
space) to mean left or right anchored, respectively. "^" has no special meaning when
it occurs within a word (not at the beginning or end) or string, but
must be escaped nevertheless.
-
Backslash (\) is used to escape '*', '?', quote (") and '^', as
well as itself. The use of a backslash not followed immediately by
one of these characters is reserved for future definition.
Masking examples:
-
dc.title = c*t (matches cat and coast etc.)
-
dc.title = c?t (matches cat and cot, not coast or ct)
-
" ?" (matches any single character)
-
dc.title = "^cat in the hat" (matches 'cat in the hat' where it
is at the beginning of the field)
-
dc.title any "^cat eats rat" (matches 'cat eats rat', 'cat eats
dog', 'cat', but not 'rat eats cat')
-
dc.title any "^cat ^dog eats rat" (matches 'cat eats rat', 'dog
eats cat', 'cat loves bat', but not 'bat loves cat')
-
dc.title = "\"Of Course\" she said"
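The masking rules for '*', '?' and backslash can be sketched by translating a single masked word into a regular expression. This is a minimal illustration, not a prescribed implementation: the function name mask_to_regex is hypothetical, matching is case sensitive by assumption, and anchoring with an unescaped '^' is deliberately left out because it concerns the position of a word within the field rather than the word itself.

```python
import re

ESCAPABLE = '*?"^\\'   # characters that a backslash may escape


def mask_to_regex(word):
    """Translate one masked word into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == "\\":
            # Any other use of backslash is reserved for future definition.
            if i + 1 >= len(word) or word[i + 1] not in ESCAPABLE:
                raise ValueError("backslash not followed by an escapable character")
            parts.append(re.escape(word[i + 1]))
            i += 2
            continue
        if ch == "*":
            parts.append(".*")       # zero or more characters
        elif ch == "?":
            parts.append(".")        # exactly one character
        elif ch == "^":
            raise ValueError("anchoring with ^ is outside this sketch")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


pattern = mask_to_regex("c?t")
cot_matches = pattern.fullmatch("cot") is not None
coast_matches = pattern.fullmatch("coast") is not None
```

Using fullmatch keeps the mask tied to the whole word, so that c?t accepts cot but rejects both coast and ct.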
Result Set Name Used in Query
A search clause may be a result set name. This is a special case, employing
the context set 'cql'. The index and
relation are expressed as "cql.resultSetId =" and the term is
a result set name that has been returned by the server in the 'resultSetName'
parameter of the response. It may be used by itself in a query to refer
to an existing result set from which records are desired. It may also
be used in conjunction with other resultSetName clauses or other indexes,
combined by boolean operators. The semantics of resultSetId with relations
other than "=" is undefined.
Example: cql.resultSetId = "resultA" and cql.resultSetId = "resultB"
Proximity
The proximity boolean operator is expressed in terms of distance,
unit, and ordering.
Examples:
- dc.title = "cat" prox/distance=1/unit=word dc.title = "in"
- "cat" prox/distance>2/ordered "hat"
distance takes the form:
distance [relation] [value]
where relation is one of "<", ">", "<=", ">=", "=", "<>" (default "<=")
and value is a non-negative integer (default 1 for word, zero otherwise).
unit takes the form:
unit=[value]
where value is one of "word", "sentence", "paragraph", or "element" (default "word").
ordering is "ordered" or "unordered" (default "unordered").
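The defaults for distance, unit and ordering can be made concrete with a short sketch that reads a prox operator and fills in whatever was omitted. The function name parse_prox and the returned dictionary layout are hypothetical. The sketch assumes that a distance modifier, when present, carries both a relation and a value, and it does not handle modifiers qualified by a context set prefix.

```python
import re

UNITS = {"word", "sentence", "paragraph", "element"}
MODIFIER_RE = re.compile(
    r"(?P<name>[a-z]+)\s*(?:(?P<relation><=|>=|<>|=|<|>)\s*(?P<value>\S+))?")


def parse_prox(operator):
    """Return the effective proximity settings for an operator such as
    'prox/distance<=5/unit=sentence', applying the documented defaults."""
    head, *modifiers = [part.strip().lower() for part in operator.split("/")]
    if head != "prox":
        raise ValueError("not a proximity operator")
    relation, distance, unit, ordering = None, None, "word", "unordered"
    for modifier in modifiers:
        match = MODIFIER_RE.fullmatch(modifier)
        if match is None:
            raise ValueError(f"malformed modifier: {modifier!r}")
        name, symbol, value = match.group("name", "relation", "value")
        if name == "distance" and symbol and value.isdecimal():
            relation, distance = symbol, int(value)
        elif name == "unit" and symbol == "=" and value in UNITS:
            unit = value
        elif name in ("ordered", "unordered") and symbol is None:
            ordering = name
        else:
            raise ValueError(f"unsupported modifier: {modifier!r}")
    if relation is None:
        # No distance given: "<=" with 1 for words, zero for other units.
        relation = "<="
        distance = 1 if unit == "word" else 0
    return {"relation": relation, "distance": distance,
            "unit": unit, "ordering": ordering}


same_sentence = parse_prox("prox/unit=sentence")
```

Under these defaults a bare prox means "within one word, in either order", while prox/unit=sentence means "distance zero sentences apart", that is, in the same sentence.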
CQL Context Sets
Context sets permit CQL users to create their own indexes, relations,
relation modifiers and boolean modifiers without fear of choosing the same
name as someone else and thereby having an ambiguous query. All of these
four aspects of CQL must come from a context set; however, there are rules
for determining the prevailing default if one is not supplied. Context
sets allow CQL to be used by communities in ways which the designers could
not have foreseen, while still maintaining the same rules for parsing
which allow interoperability.
When defining a new context set, it is necessary to provide a description
of the semantics of each item within it. While context sets may contain
indexes, relations, relation modifiers and boolean modifiers, there is
no requirement that all should be present; in fact it is expected that
most context sets will only define indexes.
Each context set has a unique identifier, a URI. When sending the context
set in a query, a short form is used. These short names may be sent as
a mapping within the query itself (see next), or be published by the recipient
of the query in some protocol dependent fashion. The prefix 'cql' is reserved
for the CQL context set, but authors may wish to recommend a short name
for use with their set.
An index, relation, or modifier qualified by a context is represented
in the form prefix.value, where prefix is a short
name for a unique context set identifier.
Binding Short Name to URI
The binding of short name to URI is defined either within the
query or by the server. A prefix map may occur at any place in the query
and applies to anything which follows. Example:
>dc="http://www.dublincore.org/" dc.title = "cat"
In the following query:
>a="http:/x.com/y" a.title=cat and (>a="http:/f.com/g" a.title=hat)
and a.title=rat
both the "a" in "a.title=cat" and the "a" in "a.title=rat" refer
to http:/x.com/y, while the "a" in "a.title=hat" refers
to http:/f.com/g.
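The scoping of prefix bindings in the query above can be sketched as follows. The nested-list representation of a parsed query is an assumption made for illustration (a list is a parenthesised sub-query, a ("bind", prefix, uri) tuple is a prefix assignment, a ("clause", index, relation, term) tuple is a search clause), as is the function name resolve_contexts. URIs are copied as they appear in the example.

```python
def resolve_contexts(query, inherited=None):
    """Return (index, term, context URI) for each clause, honouring scope."""
    bindings = dict(inherited or {})     # copy, so bindings made here stay local
    resolved = []
    for item in query:
        if isinstance(item, list):       # parenthesised sub-query
            resolved.extend(resolve_contexts(item, bindings))
        elif isinstance(item, tuple) and item[0] == "bind":
            _, prefix, uri = item
            bindings[prefix.lower()] = uri   # applies to everything that follows
        elif isinstance(item, tuple) and item[0] == "clause":
            _, index, relation, term = item
            prefix = index.split(".", 1)[0].lower() if "." in index else None
            # None means the binding is left to the server.
            resolved.append((index, term, bindings.get(prefix)))
        # plain strings such as "and" are boolean operators and carry no index
    return resolved


parsed = [
    ("bind", "a", "http:/x.com/y"),
    ("clause", "a.title", "=", "cat"),
    "and",
    [("bind", "a", "http:/f.com/g"), ("clause", "a.title", "=", "hat")],
    "and",
    ("clause", "a.title", "=", "rat"),
]
contexts = resolve_contexts(parsed)
```

Copying the bindings on entry to each sub-query is what keeps the inner assignment of "a" from leaking out to the final clause.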
Default Context
When no context is attached to a relation, relation modifier, or boolean
modifier, the context is the cql context set. When no context
is attached to an index the context is determined by the server.
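The default context rule can be expressed as a small helper that splits a name into its context prefix and base name. The function name qualify and the kind labels are hypothetical; names are lower-cased because these parts of CQL are case insensitive, and None is used here to stand for "determined by the server".

```python
KINDS = {"index", "relation", "relation modifier", "boolean modifier"}


def qualify(name, kind):
    """Return (context prefix, base name) for an index, relation or modifier."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind}")
    if "." in name:
        prefix, base = name.split(".", 1)
        return prefix.lower(), base.lower()
    # No explicit context: 'cql' for relations and modifiers,
    # server-determined (None) for indexes.
    default = None if kind == "index" else "cql"
    return default, name.lower()


explicit = qualify("dc.title", "index")
implicit = qualify("any", "relation")
```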
Conformance
In order to claim conformance to CQL a server must support one of the
following three levels:
Level 0
-
Must be able to process a term-only query.
(The term is either a single word or, if it consists of multiple words separated by
spaces, the entire search term is quoted. If the term includes
quote marks, they must be escaped by preceding them with a backslash,
e.g. "raising the \"titanic\"".)
-
If an unsupported query is supplied, must be able to respond with
a diagnostic to say that the query is not supported.
Level 1
-
Support for Level 0.
-
Ability to parse both:
(a) search clauses consisting of 'index relation searchTerm'; and
(b) queries where search terms are combined with booleans, e.g. "term1
AND term2"
-
Support for at least one of (a) and (b).
Note that (b) does not necessarily include queries such as:
index relation term1 AND index relation term2
but rather queries where the search clauses are terms-only (do not include
index or relation).
Level 2
-
Support for Level 1.
-
Ability to parse all of CQL and respond with appropriate
diagnostics.
Note that Level 2 does not require support for all of CQL; it
requires that the server be able to parse all of CQL (and respond
with proper diagnostics for the parts not supported).