ქართული ენის ეროვნული კორპუსი
The Georgian National Corpus
The query language

Corpuscle has its own query language, modeled on CQL, the regular expression language of Corpus Workbench; it extends CQL in some respects and deviates from it in others.

In addition to the full Corpuscle query language, there is a much more restricted, simplified query language that allows simple queries to be entered without extra quotes, escaping, or complicated regular expressions. See The simplified query language.

Choose Advanced search when writing a query in the full Corpuscle query language.

The basic building blocks of a query are Positional constraints. A positional constraint matches a corpus position if all attribute values in that corpus position satisfy the conditions of the positional constraint.

In addition, a query can contain constraints that delineate a subcorpus that the query should be constrained to. They are called Subcorpus constraints.

Incompatible changes

The syntax for suppressing parts of matches has changed. You have to use double instead of single curly braces. See below.

Positional constraints

Normally, a positional constraint is an expression written in brackets ([...]). In the simplest case, the brackets are empty ([]): they contain no condition, which means that the constraint matches any corpus position.
Otherwise, the brackets contain Attribute constraints, which in the simplest case are of the form attribute="value". See below for details.

Positional constraints can be combined into a regular expression using the following operators:

  • sequence operator (juxtaposition of positional constraints)

and operators that apply to the expression (a positional constraint or a complex expression in parentheses) in front of them:

  • Kleene star (*, arbitrarily many repetitions, including zero),
  • Kleene plus (+, at least one repetition),
  • bounded repetition ({n,m}, between n and m repetitions),
  • optionality (?) and
  • disjunction (|).
  • Parentheses ((...)) can be used to group expressions.
  • Proximity search (comma-separated list of expressions) is implemented as an experimental feature.

A query expression that illustrates the use of some of those operators would be

"I" [pos="V"] []{1,3} "to" [pos="V"]

which would match the phrase “I want these things to happen”, but not “I have to go”.
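
The behaviour of this example query can be sketched in Python. This is only a minimal illustration of sequence and bounded repetition over a token list, not Corpuscle's implementation; the toy tagger, the tag name "X" and all function names are assumptions made for the example.

```python
# Each token is a dict of attribute values; a constraint is a predicate on a token.
def word(w):
    return lambda tok: tok["word"] == w

def pos(p):
    return lambda tok: tok["pos"] == p

def anything(tok):
    return True

# A pattern is a list of (constraint, min_repeats, max_repeats) triples.
QUERY = [
    (word("I"), 1, 1),
    (pos("V"), 1, 1),
    (anything, 1, 3),      # []{1,3}
    (word("to"), 1, 1),
    (pos("V"), 1, 1),
]

def match_at(tokens, start, pattern):
    """Return the end index of a match beginning at start, or None."""
    if not pattern:
        return start
    constraint, lo, hi = pattern[0]
    i, count = start, 0
    # Consume the mandatory repetitions first.
    while count < lo:
        if i >= len(tokens) or not constraint(tokens[i]):
            return None
        i += 1
        count += 1
    # Try the shortest continuation, then extend up to hi repetitions.
    while True:
        end = match_at(tokens, i, pattern[1:])
        if end is not None:
            return end
        if count == hi or i >= len(tokens) or not constraint(tokens[i]):
            return None
        i += 1
        count += 1

def tag(sentence):
    """Toy tagger for the example only; the tag names are hypothetical."""
    verbs = {"want", "happen", "have", "go"}
    return [{"word": w, "pos": "V" if w in verbs else "X"} for w in sentence.split()]

print(match_at(tag("I want these things to happen"), 0, QUERY))
print(match_at(tag("I have to go"), 0, QUERY))
```

In the second sentence the wildcard has to consume at least one position, so nothing is left to match "to".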

Structural positions

Structural positions (XML tags) can also be queried for. They are written in standard XML notation (e.g., <s> or </s>). When a start and an end tag with the same element name appear in a query expression, it is assumed that they should be balanced in a match. This means for example that a query like

<s>  "jeg"  "deg" </s>

should always find “jeg” and “deg” in the same sentence. It also means that in a situation where XML elements can be embedded in elements of the same type, the retrieved start and end tags always correspond. A query like

<NP>  "girl"  </NP>

would match the phrase

<NP>the girl with <NP>the telescope</NP></NP>

and not the shorter sequence

<NP>the girl with <NP>the telescope</NP>

since here, the </NP> is not the corresponding end tag of the leftmost <NP>.
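
The balancing rule can be illustrated with a small Python sketch that counts nesting depth. It assumes the corpus is given as a flat list of positions in which tags are plain strings; the function name is hypothetical and the code says nothing about how Corpuscle actually evaluates such queries.

```python
def matching_end(positions, start, name):
    """Index of the end tag that balances the start tag at positions[start]."""
    depth = 0
    for i in range(start, len(positions)):
        if positions[i] == f"<{name}>":
            depth += 1
        elif positions[i] == f"</{name}>":
            depth -= 1
            if depth == 0:
                return i
    return None  # no balancing end tag found

positions = ["<NP>", "the", "girl", "with", "<NP>", "the", "telescope", "</NP>", "</NP>"]
print(matching_end(positions, 0, "NP"))
print(matching_end(positions, 4, "NP"))
```

The outer start tag is paired with the last </NP>, the inner one with the first.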

Ignoring structural positions and ranges

When writing a query, one has to be aware that structural positions are corpus positions in their own right. (This is in contrast to CWB.) A query like

"with" "the" "telescope"

will not, as perhaps intended, match the subsequence "with <NP>the telescope" of the phrase above, because of the intervening <NP>. A query that would match that subsequence can be written like this:

"with" <> "the" <> "telescope"

to allow for XML tags between the words, where <> denotes an arbitrary start or end tag. The same query can be formulated in a more succinct and elegant way using the ‘ignore tag’ operator (written as backslash, \), followed by the tags that should be ignored:

"with" "the" "telescope" \ <>

In some corpora, text can be interspersed with non-textual material. Here it is desirable to ignore not only XML tags (structural positions), but entire XML elements (structural ranges). This can be achieved using the ignore element operator (written as double backslash, ‘\\’). In query evaluation, all elements with start tags listed after the ignore element operator will be disregarded.

Critical editions are a good example: single words are often annotated with <note> or similar elements. When searching for a phrase, one would like to disregard those elements.

The query

"de" "ti" "Aar"

will not match the XML fragment

... de ti <note> ( fulgt af Martensen ) </note> Aar ...

because of the intervening <note> element. The corresponding query

"de" "ti" "Aar" \\ <note>

on the other hand, will match the fragment.
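
The difference between ignoring tags and ignoring whole elements can be sketched in Python. The sketch assumes a flat list of positions with tags as plain strings; the helper names are hypothetical and the code only mimics the observable effect of the two operators.

```python
def is_tag(p):
    return p.startswith("<") and p.endswith(">")

def drop_tags(positions):
    """Sketch of the ignore tag operator: skip every structural position."""
    return [p for p in positions if not is_tag(p)]

def drop_elements(positions, name):
    """Sketch of the ignore element operator: skip whole <name>...</name> ranges."""
    kept, depth = [], 0
    for p in positions:
        if p == f"<{name}>":
            depth += 1
        elif p == f"</{name}>":
            depth -= 1
        elif depth == 0:
            kept.append(p)
    return kept

def contains_phrase(positions, phrase):
    n = len(phrase)
    return any(positions[i:i + n] == phrase for i in range(len(positions) - n + 1))

fragment = ["de", "ti", "<note>", "(", "fulgt", "af", "Martensen", ")", "</note>", "Aar"]
phrase = ["de", "ti", "Aar"]

print(contains_phrase(fragment, phrase))
print(contains_phrase(drop_tags(fragment), phrase))
print(contains_phrase(drop_elements(fragment, "note"), phrase))
```

Dropping only the tags leaves the text of the note in place, so the phrase is found only when the whole element is skipped.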

Attribute constraints

A constraint containing an attribute constraint would be

[word="fish"]

which would match every corpus position where the word attribute has the value fish. This constraint can be abbreviated as ["fish"] or even "fish".

In a positional constraint, attribute constraints can be combined using the boolean operator and (&), e.g.:

["fish" & pos="verb"]

In addition to a literal string, the value expression in an attribute constraint can be a regular expression, e.g.,

[lemma="book.*"]

To give characters that have a special meaning in regular expressions their literal meaning, they have to be escaped with a backslash. For instance, "\." matches a dot, whereas "." would raise an error. Here is the complete list of characters that have to be escaped:

. , ? + * ( [ | ^ $

(There are, however, more characters that have a special meaning.)
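
When query strings are generated by a script, escaping can be automated. The following Python helper is a hypothetical convenience, not part of Corpuscle; it escapes exactly the characters listed above and leaves everything else, including the backslash itself, untouched.

```python
# The characters from the list above, split on spaces.
SPECIAL = set(". , ? + * ( [ | ^ $".split())

def escape_literal(text):
    """Backslash-escape the listed characters so that text is matched literally."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)

print(escape_literal("3.5"))
print(escape_literal("a+b"))
```

The result still has to be wrapped in double quotes to form a value expression.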

See here for details on regular expressions on strings.

Attributes of XML tags that have been indexed can be queried. The syntax is the same as the XML tag syntax, with the difference that values of attributes can be regular expressions. A valid query would be:

<s type="main" lang="nob|nno">

Numerical operators

If an attribute is integer-valued, the following numerical operators can be used:

 <, <=, >, >=

If an attribute has intervals of positive integers as values, the following operators are defined:

  • intersection: #
  • inclusion: #<
  • subsumption: #>

An interval is coded in the form "n-m", meaning the integer interval [n,m].
The nineteenth century as value of a date range could thus be coded as "1800-1899". If n=m, the interval can be abbreviated as "n"; intervals unbounded to the left or to the right are coded as "n-" and "-m", respectively.
A convenient abbreviation for centuries and decades is to use a literal x to mean the numbers 0-9; so the 20th century would be "19xx", equivalent to "1900-1999", and the 60s would be "196x", equivalent to "1960-1969".
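
The interval notation can be made concrete with a short Python sketch. The parser and the two comparison helpers are hypothetical; using 1 as the lower bound of a left-unbounded interval is an assumption based on the values being positive integers, and which argument order corresponds to #< and which to #> is not stated here, so the helpers are named by what they compute.

```python
import math

def parse_interval(code):
    """Parse "n-m", "n", "n-", "-m" or an x-pattern such as "19xx" into (lo, hi)."""
    if "x" in code:
        # A literal x stands for the digits 0-9.
        return int(code.replace("x", "0")), int(code.replace("x", "9"))
    if "-" not in code:
        return int(code), int(code)
    lo, hi = code.split("-")
    # Assumption: a missing left bound means 1, a missing right bound means infinity.
    return (int(lo) if lo else 1, int(hi) if hi else math.inf)

def intersects(a, b):
    """True if the two intervals share at least one integer."""
    return a[0] <= b[1] and b[0] <= a[1]

def included_in(a, b):
    """True if interval a lies entirely within interval b."""
    return b[0] <= a[0] and a[1] <= b[1]

print(parse_interval("19xx"))
print(included_in(parse_interval("196x"), parse_interval("19xx")))
```

Combining x-patterns with a hyphen is not handled by this sketch.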

Multi-valued and set-valued attributes

Attribute values can be atomic or structured. The ‘word’ attribute is atomic, but attributes that encode grammatical features (part of speech, morphosyntax etc.) and other linguistic annotations can be structured in two different ways: attributes can be multi-valued, and they can be set-valued, or even a combination of both. A typical multi-valued attribute is the ‘lemma’ attribute in a lemmatized corpus where the readings are not totally disambiguated. When a statistical tagger is used for morphosyntactic parsing, all readings are normally fully disambiguated, but rule-driven parsers, such as Constraint Grammar parsers, output ambiguous readings. The morphosyntactic tags that a Constraint Grammar parser attaches to a reading are a good example of a set-valued attribute (‘morph’). Since readings can be ambiguous, that attribute is multi-valued at the same time.

Corpuscle has built-in support for structured attributes, both for multi-valued and set-valued attributes and combinations thereof. This support is also reflected in the query language. Plain multi-valued attributes do not need special syntax; a corpus position matches a constraint on such an attribute if at least one of the values matches the constraint.

Consider for example the Norwegian word fisker, which can mean ‘fishes’ (pl) or ‘fisherman’. Thus, a non-fully disambiguated reading could have

word = fisker, lemma = fisk | fisker

Both a search for [lemma = "fisk"] and for [lemma = "fisker"] will match that position, as one would expect. It is, however, also possible to search for unambiguous readings only, by using the ‘unambiguously equal’ or ‘strict equal’ operator (‘==’):

[ lemma == "fisk" ]

This query would not match the corpus position shown above. Similarly, there is, besides the ‘not-equal’ operator (‘!=’), a ‘strict-non-equal’ operator (‘!!=’), which can be used to demand that none of the readings match the value in the query:

[ lemma !!= "fisk" ]

Again, this query would not match the corpus position shown above.

The values of set-valued attributes are stored as strings with a separator character (space, ‘|’ or similar) between the set members. Thus, the ‘morph’ value set (N m pl) (i.e., “Noun”, “masculine”, “plural”) would be encoded as " N m pl ", and in principle, a regular expression could be used to search for subset containment. Corpuscle has, however, a much more efficient implementation of subset search, which is based on suffix arrays, and a syntax extension that makes it easy to formulate queries on set-valued attributes: the values that are searched for can be given as a set themselves, or, more generally, as a boolean expression of set values. Consider the morphosyntactic annotation of the previous example:

word=fisker, lemma=fisk|fisker, morph=(N m pl)|(N m sg)

Here, the following query would match the corpus position above:

[ morph = ("N" "pl" | "A" !"sup") ]

The morph attribute is in fact an example of an attribute that is both multi-valued and set-valued.
The available operators are, as can be seen from the example, boolean AND (juxtaposition), OR (|), and NOT (!), where parentheses are used for grouping.
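
The matching rules for multi-valued and set-valued attributes can be summarised in a Python sketch. It models a multi-valued attribute as a list of readings and a set-valued reading as a Python set; the function names are hypothetical, and reading the example query as (N AND pl) OR (A AND NOT sup) is an assumption about operator precedence. The suffix-array implementation is not modelled.

```python
def eq(values, v):
    """=   : at least one reading has the value."""
    return v in values

def strict_eq(values, v):
    """==  : every reading has the value, i.e. the position is unambiguous."""
    return all(x == v for x in values)

def strict_ne(values, v):
    """!!= : none of the readings has the value."""
    return v not in values

lemma = ["fisk", "fisker"]                       # lemma=fisk|fisker
morph = [{"N", "m", "pl"}, {"N", "m", "sg"}]     # morph=(N m pl)|(N m sg)

def query(reading):
    # ("N" "pl" | "A" !"sup"), read here as (N AND pl) OR (A AND NOT sup)
    return ({"N", "pl"} <= reading) or ("A" in reading and "sup" not in reading)

def matches(readings, predicate):
    """A position matches if at least one of its readings satisfies the predicate."""
    return any(predicate(r) for r in readings)

print(eq(lemma, "fisk"), strict_eq(lemma, "fisk"), strict_ne(lemma, "fisk"))
print(matches(morph, query))
```

The position matches the set query because its first reading contains both N and pl.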

Abbreviated syntax for some attributes

As we have seen, the query [word="krabbene"] can be abbreviated as "krabbene". Similar abbreviations are available for the attributes lemma (/…/) and features ({…}).

You may write

/krabbe/

instead of

[lemma="krabbe"]

and

{N (Masc | Fem) !Pl}

instead of

[features=("N" ("Masc" | "Fem") !"Pl")]

Of course, this works only for corpora that have the lemma and/or features attribute(s).

Some corpora have attributes with different names but similar semantics. In these corpora, the abbreviated syntax relates to those attributes. E.g., if a ‘bag of tags’ attribute is called morph instead of features, {V Aor} will stand for [morph=("V" "Aor")].

Which attributes have abbreviated syntax can be seen on the Overview page of each single corpus.

Subcorpus constraints

A query can be constrained to a subcorpus by appending the subcorpus constraints at the end of the query, after double colons. Since subcorpus constraints are not positional constraints (they apply to a whole document or other structural unit), they are not to be put into brackets. The following query would give all occurrences of “book” in texts of genre “prose” and author “Smith”:

"book" :: genre="prose" & author="Smith"

In subcorpus constraints, all attributes that have a scope other than cpos can be used. (On the Overview page, all attributes of a given corpus are listed, together with their scope and other information.)
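
The effect of a subcorpus constraint can be pictured as filtering documents before looking at individual positions. The Python sketch below assumes document-level metadata stored in plain dicts; the data and the function name are made up for the illustration.

```python
documents = [
    {"genre": "prose", "author": "Smith", "tokens": ["the", "book", "is", "a", "book"]},
    {"genre": "poetry", "author": "Smith", "tokens": ["book"]},
    {"genre": "prose", "author": "Jones", "tokens": ["book"]},
]

def search(documents, word, **constraints):
    """Return (document index, token index) pairs for word within the subcorpus."""
    hits = []
    for d, doc in enumerate(documents):
        # The constraints apply to the document as a whole, not to single positions.
        if all(doc.get(att) == val for att, val in constraints.items()):
            hits += [(d, i) for i, tok in enumerate(doc["tokens"]) if tok == word]
    return hits

print(search(documents, "book", genre="prose", author="Smith"))
```

Without constraints, the same search returns hits from all three documents.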

Predicates

Predicates are functions on corpus positions that evaluate to true or false; they can be seen as generalizations of feature constraints. (A feature constraint of the form att=val is a predicate that evaluates to true in all corpus positions where att assumes the value val.)

A predicate has the general form

_name(arg1,…,argn)

and can be used as part of a positional constraint, inside square brackets.

So far, only the predicate _amb(f,n) is defined. It is true in those corpus positions where the feature f has exactly ambiguity n. The variant _amb(f,n,+) is true where the feature f has at least ambiguity n. Thus, the query

[_amb(lemma,2,+)]

matches all positions where the lemma form is ambiguous.

This predicate can only meaningfully be used for multi-valued attributes.
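
A minimal Python sketch of the two variants follows. It assumes that the ambiguity of a feature at a position is simply the number of values it has there; the function name and the at_least parameter are hypothetical stand-ins for _amb(f,n) and _amb(f,n,+).

```python
def amb(values, n, at_least=False):
    """True if the list of values has exactly n entries, or at least n if at_least is set."""
    return len(values) >= n if at_least else len(values) == n

print(amb(["fisk", "fisker"], 2, at_least=True))
print(amb(["fisk"], 2, at_least=True))
```

An unambiguous position has a single value and therefore fails the test for n = 2.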

Variables

Positional constraints in a query expression can be assigned to variables. These variables can be used to impose additional (non-regular) constraints on matching corpus positions.

A variable name consists of a hash (#) character and one or more alphanumeric characters. It is prepended to a positional constraint in brackets, with a colon in between:

#x1:[…] #x2:[…]

Relations between positional constraints are expressed at the end of the query, separated by double colons. The only relations presently supported are equality and inequality of the values of a positional attribute.

The following query matches all occurrences of two consecutive equal words:

#x1:[] #x2:[] :: #x1.word = #x2.word
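
What this query computes can be written down directly in Python. The sketch works on a plain list of word forms and ignores structural positions; the function name is hypothetical.

```python
def repeated_words(words):
    """Start indices i where words[i] and words[i + 1] are equal."""
    return [i for i in range(len(words) - 1) if words[i] == words[i + 1]]

print(repeated_words("it was was a very very long day".split()))
```

A constraint of this kind compares the values at two positions with each other, which is why it cannot be expressed by the regular part of the query alone.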

Match ranges and match targets

Match ranges

By default, only shortest matches are returned.

A query like

[ morph = ("Adj") ]+ [ morph = ("N") ]

would thus only return matches consisting of one adjective and a noun, even if the corpus contained nouns preceded by more than one adjective.

If longest matches are required, the greedy match flag %g has to be set. The query

[ morph = ("Adj") ]+ [ morph = ("N") ] %g

will thus return matches consisting of a noun together with the full sequence of adjectives preceding it.

Greedy match cannot (yet) be used in query expressions containing tags.
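
The difference between the two modes can be sketched in Python for the adjective and noun example. The code handles only this one pattern over a list of tags and returns one span per noun; it is an illustration of the described behaviour, and the function and parameter names are hypothetical.

```python
def adj_noun_matches(tags, greedy=False):
    """(start, end) spans, end exclusive, for one or more "Adj" followed by "N"."""
    spans = []
    for n, tag in enumerate(tags):
        # A match needs a noun that is directly preceded by an adjective.
        if tag != "N" or n == 0 or tags[n - 1] != "Adj":
            continue
        start = n - 1
        if greedy:
            # Extend to the left over the whole run of adjectives.
            while start > 0 and tags[start - 1] == "Adj":
                start -= 1
        spans.append((start, n + 1))
    return spans

tags = ["Det", "Adj", "Adj", "Adj", "N", "V", "Adj", "N"]
print(adj_noun_matches(tags))
print(adj_noun_matches(tags, greedy=True))
```

For the first noun, the shortest match covers one adjective and the greedy match all three.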

Match targets

By default, attribute values that are shown in a KWIC concordance are those of the first corpus position in a match. In order to view the attribute values of some other position, the position can be declared to be the target of the query. This is done by prepending an @ to the positional constraint.

"to" @[pos="V"]

Suppressing parts of matches

Sometimes only some of the positions in a query should count as match positions, while the other positions merely serve to further constrain the query. To achieve this, the relevant positions can be enclosed in double curly braces. (Note: in previous versions of Corpuscle, single curly braces were used.) E.g., the query

"big" {{ [pos="N"] }}

would find all nouns preceded by “big”, but in the match column of the KWIC concordance, only the nouns themselves would occur. Braces at the left or right end of a query are not needed, so the example query could also be formulated in a slightly simpler way:

"big" {{ [pos="N"]

Querying parallel corpora

In a parallel corpus, corpus positions or structural regions of the source corpus are aligned with corpus positions or structural regions of the target corpus.

When querying a parallel corpus, source and target corpus are queried independently (with queries Qs and Qt), and the query results can be combined in different ways to give a set of matches:

  • With the intersection operator >i:

Qs >i Qt

Matching alignment regions of Qs and Qt are intersected; that is, a match m of Qs is a match of the combined query if there is a match of Qt aligned to m. (This corresponds to “Qs :HANSARD Qt” in Corpus Workbench.)

  • With the difference or minus operator >m:

Qs >m Qt

Matching alignment regions of Qt are subtracted from matching regions of Qs; that is, a match m of Qs is a match of the combined query if there is no match of Qt aligned to m. (This corresponds to “Qs :HANSARD ! Qt” in Corpus Workbench.)

Examples:

In a sentence-aligned Norwegian-English parallel corpus, the query “"jente" %c >i "girl" %c” gives you all occurrences of “jente” such that the corresponding aligned sentence contains the word “girl.”

The query “"jente" %c >m "girl" %c” gives you all occurrences of “jente” such that the corresponding aligned sentence does not contain the word “girl.”

The queries on both sides of the intersection and minus operators can be fully general queries.
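
The two operators behave like set intersection and set difference on aligned regions. The Python sketch below assumes that aligned sentences share a numeric id and uses made-up sentences; the %c flag is not modelled and all names are hypothetical.

```python
# Aligned sentences share the same id in the source and the target corpus.
source = {1: ["en", "jente", "leser"], 2: ["jente", "og", "gutt"], 3: ["en", "gutt"]}
target = {1: ["a", "girl", "reads"], 2: ["lass", "and", "boy"], 3: ["a", "boy"]}

def regions_with(corpus, word):
    """Ids of the alignment regions that contain the word."""
    return {rid for rid, toks in corpus.items() if word in toks}

def intersect(source_hits, target_hits):   # Qs >i Qt
    return source_hits & target_hits

def minus(source_hits, target_hits):       # Qs >m Qt
    return source_hits - target_hits

qs = regions_with(source, "jente")
qt = regions_with(target, "girl")
print(intersect(qs, qt))
print(minus(qs, qt))
```

Sentence 2 contains “jente” but its aligned sentence lacks “girl”, so it is returned only by the minus variant.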

A combination of >i and >m has yet to be implemented.

The graphical query interface cannot be used to build aligned queries.



Design & implementation: Paul Meurer, Universitetet i Bergen, CLARINO Centre, 2026 | Copyright (C) GNC Project 2011 – 2026