The current Lemur Toolkit search interface accepts simple queries, but is also flexible enough to accept full Indri query language queries. The Indri query language offers an extensive array of query options for advanced users who are comfortable formulating their own queries. See the technical introduction for more information and a full listing of the advanced operators.
If you want to write simple queries, the Lemur Toolkit Search Engine CGI will accommodate you as well. For a basic query, just type in the terms you wish to search for. For example, to search for the terms "oil" and "fields", type the two terms into the search box and press the "Search" button.
If you wish to search for a specific phrase, you can wrap your terms using the ordered window operator. For example, if you searched for:
#1(white house)
Your search results would return only those documents where the terms "white" and "house" appear in that order and within one term of each other. Using this notation, you can also specify larger window sizes. For example, if you typed:
#5(white house)
The search results would return any documents where the term "white" occurs before "house" and they are within five terms of each other.
Much like the #N operator above, you can use the #uwN operator to search for terms that occur within a certain window size, but in any order. For example, if we wanted to look for the terms "white" and "house" occurring within 5 terms of each other, but we did not care whether "white" came before "house" or not, we would write this as:
#uw5(white house)
By default, the search engine will return a document if any of the terms occur in the document; documents that contain more terms will generally be ranked above documents that contain fewer terms. If you wish to specify that all of your search terms must be included, you can use the "boolean and" operator (#band). For example, if you want to ensure that the terms "white" and "house" both exist, use:
#band(white house)
Also note that you can mix and match query operators. For example, if you wanted to search for documents that must have the terms "oil" and "fields" and that optionally contain the phrase "white house", you could write this as:
#band(oil fields) #1(white house)
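The operators above are plain text, so a query can be assembled with ordinary string formatting. The following is a minimal Python sketch, not part of the Lemur Toolkit; the helper names ordered_window, unordered_window and boolean_and are hypothetical and exist only for this illustration.

```python
def ordered_window(n, *terms):
    """Build #N(...): terms must appear in order, within n terms of each other."""
    if n < 1:
        raise ValueError("window size must be a positive integer")
    return "#{}({})".format(n, " ".join(terms))


def unordered_window(n, *terms):
    """Build #uwN(...): terms may appear in any order within a window of n terms."""
    if n < 1:
        raise ValueError("window size must be a positive integer")
    return "#uw{}({})".format(n, " ".join(terms))


def boolean_and(*terms):
    """Build #band(...): every listed term must be present."""
    return "#band({})".format(" ".join(terms))


# Mix operators by joining them with spaces, as in the example above.
query = " ".join([boolean_and("oil", "fields"), ordered_window(1, "white", "house")])
print(query)  # #band(oil fields) #1(white house)
```

The resulting string can be pasted into the search box as-is.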
The search engine allows you to search against all of the text associated with a document (the default condition), or to restrict your search to just certain portions ("fields") of the document. There are two different ways of searching against fields that behave somewhat differently.
Field Restrictions on Query Terms:
The restriction indicating what to match against is appended to the
term or query operator. For example:
lemur.title project.title (return documents that have titles containing "lemur" and "project")
#band(lemur project).title (return documents that have titles containing "lemur" and "project")
#1(cs cmu edu).url (return documents that have urls containing the adjacent terms "cs" "cmu" and "edu")
lemurproject.url indri (return documents that have "lemurproject" in the url and "indri" anywhere in the document)
Extent Restrictions on Query Operators:
The restriction indicating what to match against appears between the
query operator and its arguments. For example:
#combine[title](lemur project) (return titles that contain "lemur" and "project")
#or[body](ios android) (return bodies that contain "ios" or "android")
#weight[url](1.0 cs 2.0 cmu) (return urls that contain "cs" and "cmu", but give higher weight to "cmu")
There are three important differences between the two methods.
Documents vs. Extents:
The first method returns documents. The second method returns the extent
that you specified. If a document contains multiple extents
that match your query, each is returned separately unless you
combine their scores to produce a document score. For example:
#combine[document] (#combine[inlink](android apps))
The second #combine matches inlinks that contain "android"
and "apps". If a document has multiple inlinks that match, the first
#combine combines the scores from each matching inlink to produce a
score for the document. If no extent is specified, "document" is
assumed, thus an equivalent query is:
#combine (#combine[inlink](android apps))
Multiple Matches:
If a document has several fields of the specified type (e.g., two "title" fields, or several "inlink" fields), the first method merges them into a single bag of words, whereas the second method considers them individually.
Smoothing:
The first method smooths the field probabilities using the full collection model, whereas the second method smooths them using the document model and the full collection model. The first method is therefore a stricter interpretation of the field restriction, whereas the second method gives a little credit if a query term is missing from the field but appears elsewhere in the document.
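The syntactic difference between the two methods can be sketched in a few lines of Python. This is an illustration only: the helper names field_restrict and extent_restrict are hypothetical, and the code merely formats strings in the two notations described above.

```python
def field_restrict(expr, field):
    """First method: append the restriction to a term or operator (expr.field)."""
    return "{}.{}".format(expr, field)


def extent_restrict(op, extent, *args):
    """Second method: place the restriction between the operator and its arguments."""
    return "#{}[{}]({})".format(op, extent, " ".join(args))


# Field restriction on a term and on an operator.
print(field_restrict("lemur", "title"))                 # lemur.title
print(field_restrict("#band(lemur project)", "title"))  # #band(lemur project).title

# Extent restriction on an operator.
print(extent_restrict("combine", "title", "lemur", "project"))  # #combine[title](lemur project)

# Combine the scores of matching inlink extents into one document score.
inner = extent_restrict("combine", "inlink", "android", "apps")
print(extent_restrict("combine", "document", inner))
# #combine[document](#combine[inlink](android apps))
```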
The Indri query language, based on the InQuery query language, was designed to be robust. It can handle both simple keyword queries and extremely complex queries. Such a query language sets Indri apart from many other available search engines. It allows complex phrase matching, synonyms, weighted expressions, Boolean filtering, numeric (and date) fields, and the extensive use of document structure (fields), among others.
Although Indri handles unstructured documents, many of the query language features make use of structured (tagged)
documents. Consider the following document:
<html>
<head>
<title>Department Descriptions</title>
</head>
<body>
The following list describes ...
<h1>Agriculture</h1> ...
<h1>Chemistry</h1> ...
<h1>Computer Science</h1> ...
<h1>Electrical Engineering</h1> ...
</body>
</html>
In Indri, a document is viewed as a sequence of text that may contain arbitrary tags. In the example above, the document consists of text marked up with HTML tags.
For each tag type T within a document (e.g. title, body, h1, etc.), we define the context of T to be all of the text and tags that appear within tags of type T. In the example above, all of the text and tags appearing between <body> and </body> tags defines the body context. A single context is generated for each unique tag name. Therefore, a context defines a subdocument. Note that, because of nested tags, certain word occurrences may appear in many contexts. It is also the case that there may be nested contexts. For example, within the <body> context there is a nested <h1> context made up of all of the text and tags that appear within the body context and within <h1> and </h1> tags. Here are the tags for the title, h1, and body contexts:
title context:
<title>Department Descriptions</title>
h1 context:
<h1>Agriculture</h1>
<h1>Chemistry</h1> ...
<h1>Computer Science</h1> ...
<h1>Electrical Engineering</h1> ...
body context:
<body>
The following list describes ...
<h1>Agriculture</h1> ...
<h1>Chemistry</h1> ...
<h1>Computer Science</h1> ...
<h1>Electrical Engineering</h1> ...
</body>
Finally, each context is made up of one or more extents. An extent is a sequence of text that appears within a single begin/end tag pair of the same type as the context. For the example above, in the <h1> context, there are extents "<h1>Agriculture</h1>", "<h1>Chemistry</h1>", etc. Both the title and body contexts contain only a single extent because there is only a single pair of <title> ... </title> and <body> ... </body> tags, respectively. The number of extents for a given tag type T is determined by the number of sequences of the form: <T> text </T> that occur within the document.
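To make the counting rule concrete, the following Python sketch lists the extents of a tag type in the example document. It is only an illustration of the definition above, not how Indri parses documents: the function name extents is hypothetical, and the regular expression assumes attribute-free tags with no nesting of the same tag type.

```python
import re

DOC = """<html>
<head>
<title>Department Descriptions</title>
</head>
<body>
The following list describes ...
<h1>Agriculture</h1> ...
<h1>Chemistry</h1> ...
<h1>Computer Science</h1> ...
<h1>Electrical Engineering</h1> ...
</body>
</html>"""


def extents(doc, tag):
    """Return the text inside each <tag> ... </tag> pair, one entry per extent."""
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
    return [text.strip() for text in pattern.findall(doc)]


# One extent each for title and body, four extents for h1.
for tag in ("title", "h1", "body"):
    print(tag, len(extents(DOC, tag)))

print(extents(DOC, "h1"))
# ['Agriculture', 'Chemistry', 'Computer Science', 'Electrical Engineering']
```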
query := ( beliefOp )+

beliefOp := "#weight" ( extentRestrict )? weightedList
          | "#combine" ( extentRestrict )? unweightedList
          | "#or" ( extentRestrict )? unweightedList
          | "#not" ( extentRestrict )? '(' beliefOp ')'
          | "#wand" ( extentRestrict )? weightedList
          | "#wsum" ( extentRestrict )? weightedList
          | "#max" ( extentRestrict )? unweightedList
          | "#prior" '(' FIELD ')'
          | "#filrej" '(' unscoredTerm beliefOp ')'
          | "#filreq" '(' unscoredTerm beliefOp ')'
          | termOp ( '.' fieldList )? ( '.' '(' fieldList ')' )?

termOp := ( "#od" POS_INTEGER | "#od" | '#' POS_INTEGER ) '(' ( unscoredTerm )+ ')'
        | ( "#uw" POS_INTEGER | "#uw" ) '(' ( unscoredTerm )+ ')'
        | "#band" '(' ( unscoredTerm )+ ')'
        | "#date:before" '(' date ')'
        | "#date:after" '(' date ')'
        | "#date:between" '(' date ',' date ')'
        | "<" ( unscoredTerm )+ ">"
        | "{" ( unscoredTerm )+ "}"
        | "#syn" '(' ( unscoredTerm )+ ')'
        | "#wsyn" '(' ( weight unscoredTerm )+ ')'
        | "#any" ':' TERM
        | "#less" '(' TERM integer ')'
        | "#greater" '(' TERM integer ')'
        | "#between" '(' TERM integer integer ')'
        | "#equals" '(' TERM integer ')'
        | "#base64" '(' ( "\t" | " " )* ( BASE64_CHAR )+ ( "\t" | " " )* ')'
        | "#base64quote" '(' ( '\t' | ' ' )* ( BASE64_CHAR )+ ( '\t' | ' ' )* ')'
        | '"' text '"'
        | POS_INTEGER
        | POS_FLOAT
        | TERM

extentRestrict := '[' "passage" POS_INTEGER ':' POS_INTEGER ']'
                | '[' FIELD ']'

weightedList := '(' ( weight beliefOp )+ ')'
unweightedList := '(' ( beliefOp )+ ')'
unscoredTerm := termOp ( '.' fieldList )?
fieldList := FIELD ( ',' FIELD )*

date := POS_INTEGER '/' TERM '/' POS_INTEGER
      | POS_INTEGER TERM POS_INTEGER
      | TERM

integer := POS_INTEGER | NEG_INTEGER
weight := POS_FLOAT | POS_INTEGER

TERM := ( '0'..'9' )+ ('a'..'z' | 'A'..'Z' | '-' | '_') | TEXT_TERM
FIELD := TEXT_TERM
TEXT_TERM := ( '\u0080'..'\u00ff' | ('a'..'z' | 'A'..'Z' | '0'..'9' | '-' | '_') )+
POS_INTEGER := ( '0'..'9' )+
NEG_INTEGER := '-' ( '0'..'9' )+
POS_FLOAT := ( '0'..'9' )+ '.' ( '0'..'9' )*
BASE64_CHAR := ('a'..'z' | 'A'..'Z' | '0'..'9' | '+' | '/')
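The lexical rules at the end of the grammar translate directly into regular expressions. The Python sketch below transcribes four of them and classifies a single token. It is a reading aid under stated assumptions, not Indri's actual lexer: the function name classify is hypothetical, and the order in which the rules are tried (numeric rules before TEXT_TERM) is an assumption made here because tokens such as "42" and "-7" also satisfy TEXT_TERM.

```python
import re

# Transcribed from the grammar's lexical rules.
RULES = [
    ("POS_FLOAT", re.compile(r"[0-9]+\.[0-9]*")),
    ("POS_INTEGER", re.compile(r"[0-9]+")),
    ("NEG_INTEGER", re.compile(r"-[0-9]+")),
    ("TEXT_TERM", re.compile(r"[\u0080-\u00ffa-zA-Z0-9_-]+")),
]


def classify(token):
    """Return the name of the first rule matching the whole token, or None."""
    for name, pattern in RULES:
        if pattern.fullmatch(token):
            return name
    return None


for token in ("42", "-7", "0.75", "1.", "lemur", "#combine"):
    print(repr(token), classify(token))
```

Note that POS_FLOAT permits zero digits after the decimal point, so "1." is a float under this grammar, while "#combine" matches none of these token rules because '#' is not a TEXT_TERM character.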
Terms are the basic building blocks of Indri queries. Terms come in the form of single terms, ordered and unordered phrases, and synonyms, among others. In addition, there are a number of options that allow you to specify whether a term should appear within a certain field, or whether it should be scored within a given context.
Examples:
Examples:
The first three expressions are equivalent. They each treat all of the expressions listed as synonyms. The #wsyn operator treats the terms as synonyms, but allows weights to be assigned to each term.
Examples:
Examples:
Examples:
Belief operators allow you to combine beliefs (scores) about terms, phrases, etc. There are both unweighted and weighted belief operators. With the weighted operators, you can assign varying weights to certain expressions. This allows you to control how much of an impact each expression within your query has on the final score.
Examples:
NOTE: If you are unsure which belief operator to use, it is always "safest" to default to using the #combine or #weight operator. These operators are often the best choice for combining evidence. NEVER use #wsum or #wand unless you really know what you're doing!
Example:
Filter operators allow you to score only a subset of an entire collection by restricting which documents actually get scored.
Examples:
Numeric and date field operators provide a number of facilities for matching different criteria. These operators are very useful when used in combination with the filter operators.
Acceptable date formats:
Examples:
NOTE: The general numeric operators only work on indexed numeric fields, whereas the date operators are only applicable to a specially indexed numeric field named "date". See the indexing documentation for more on numeric fields.
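As a rough illustration of combining a numeric operator with a filter operator, the Python sketch below formats strings according to the grammar rules for #less, #greater, #between, #equals, #filreq and #filrej. The helper names are hypothetical, and the field name "year" is an assumption: it stands in for whatever numeric field your index actually contains.

```python
# Number of integer arguments each numeric operator takes in the grammar.
ARITY = {"less": 1, "greater": 1, "equals": 1, "between": 2}


def numeric(op, field, *values):
    """Build #less, #greater, #equals or #between over an indexed numeric field."""
    if op not in ARITY:
        raise ValueError("unknown numeric operator: " + op)
    if len(values) != ARITY[op]:
        raise ValueError("#{} takes {} value(s)".format(op, ARITY[op]))
    if not all(isinstance(v, int) for v in values):
        raise ValueError("the grammar only allows integer values")
    return "#{}({} {})".format(op, field, " ".join(str(v) for v in values))


def filter_require(condition, belief):
    """Build #filreq(condition belief)."""
    return "#filreq({} {})".format(condition, belief)


def filter_reject(condition, belief):
    """Build #filrej(condition belief)."""
    return "#filrej({} {})".format(condition, belief)


# "year" is a hypothetical numeric field used only for this example.
print(filter_require(numeric("greater", "year", 1999), "#combine(hubble telescope)"))
# #filreq(#greater(year 1999) #combine(hubble telescope))
```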
Document priors allow you to impose a "prior probability" over the documents in a collection.
Example:
Here we list suggested uses of the language for several common information retrieval tasks.
Ad hoc retrieval is the standard information retrieval task of finding documents that are topically relevant to a given information need (query). One common probabilistic approach to ad hoc retrieval is the query likelihood retrieval paradigm from language modeling. It is very simple to construct an Indri query that ranks documents the same as query likelihood. For the query, "literacy rates africa", we construct the following Indri query:
#combine( literacy rates africa )
Both pseudo-relevance feedback and query expansion methods typically begin with some initial query, do some processing, and then return a list of expansion terms. The original query is then augmented with the expansion terms and rerun. Given the original query "hubble telescope achievements" and the expansion terms "universe", "system", "mission", "search", "galaxies", we can then construct the following Indri query:
#weight( 0.75 #combine ( hubble telescope achievements ) 0.25 #combine ( universe system mission search galaxies ) )
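Assembling such an expanded query programmatically might look like the Python sketch below. The function names combine and expanded_query are hypothetical, and the default weights of 0.75 and 0.25 simply mirror the example above; they are not recommended values.

```python
def combine(terms):
    """Build an unweighted #combine over a list of terms."""
    return "#combine( {} )".format(" ".join(terms))


def expanded_query(original, expansion, w_original=0.75, w_expansion=0.25):
    """Weight the original query against the expansion terms with #weight."""
    return "#weight( {} {} {} {} )".format(
        w_original, combine(original), w_expansion, combine(expansion)
    )


original = ["hubble", "telescope", "achievements"]
expansion = ["universe", "system", "mission", "search", "galaxies"]
print(expanded_query(original, expansion))
# #weight( 0.75 #combine( hubble telescope achievements ) 0.25 #combine( universe system mission search galaxies ) )
```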
Named page finding and homepage finding are examples of known-item search. That is, the user knows some page exists, and is attempting to find it. One popular approach to known-item search is to use a mixture of context language models. This can easily be expressed in the Indri query language. For example, for the query "bbc news", the following query would be constructed:
#combine( #wsum( 5.0 bbc.(title) 3.0 bbc.(anchor) 1.0 bbc ) #wsum( 5.0 news.(title) 3.0 news.(anchor) 1.0 news ) )
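The same pattern generalizes to any keyword query: each term gets its own #wsum over the contexts, and the per-term mixtures are joined with #combine. The Python sketch below shows one way to generate it. The function names are hypothetical, and the context names and weights are taken from the example above as defaults, not as tuned values.

```python
# (context, weight) pairs; None means the whole document.
DEFAULT_MIXTURE = (("title", 5.0), ("anchor", 3.0), (None, 1.0))


def context_mixture(term, mixture=DEFAULT_MIXTURE):
    """Build a #wsum that scores one term in each context with its weight."""
    parts = []
    for context, weight in mixture:
        expr = term if context is None else "{}.({})".format(term, context)
        parts.append("{} {}".format(weight, expr))
    return "#wsum( {} )".format(" ".join(parts))


def known_item_query(terms, mixture=DEFAULT_MIXTURE):
    """Combine the per-term context mixtures into a single query."""
    return "#combine( {} )".format(
        " ".join(context_mixture(term, mixture) for term in terms)
    )


print(known_item_query(["bbc", "news"]))
# #combine( #wsum( 5.0 bbc.(title) 3.0 bbc.(anchor) 1.0 bbc ) #wsum( 5.0 news.(title) 3.0 news.(anchor) 1.0 news ) )
```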
Note: many thanks to Don Metzler for this information.