Set Definitions (Vocabulary)

Introduction

The sets and classes used by the various linguistic annotation types are never defined in the FoLiA documents themselves, but externally in set definitions.

By using set definitions, a FoLiA document can be validated on a deep level, i.e. the validity of the used classes can be tested. Set definitions provide semantics to the FoLiA documents that use them and are an integral part of FoLiA. When set definitions are absent, validation can only be conducted on a shallow level that is agnostic about all sets and the classes therein.

Recall that all sets that are used need to be declared in the Annotation Declarations section in the document header and that they point to URLs holding FoLiA set definitions. If no set definition files are associated, a full in-depth validation cannot take place.

The role of FoLiA Set Definitions is:

  • to define which classes are valid in a set
  • to define which subsets and classes are valid in Features in a set
  • to constrain which subsets and classes may co-occur in an annotation of the set
  • to allow enumeration over classes and subsets
  • to assign human-readable labels to symbolic classes
  • to relate classes to external resources defining them (data category registries, linked data)
  • to define a hierarchy/taxonomy of classes

Prior to FoLiA v1.4, set definitions were stored in a simple custom XML format, distinct from FoLiA itself, which we call the legacy format and which is still supported for backward compatibility. Since FoLiA v1.4, however, we strongly recommend storing set definitions as RDF [RDF], i.e. the technology that powers the semantic web. In this way, set definitions provide a formal semantic layer for FoLiA.

Set definitions may be stored in various common RDF serialisation formats. The format can be indicated on the declarations in the document metadata using the format attribute. Recognised values are:

  • application/rdf+xml – XML for RDF (assumed for rdf.xml or rdf extensions)
  • text/turtle – Turtle (for RDF) (assumed for ttl extensions)
  • text/n3 – Notation 3 (for RDF) (assumed for n3 extensions)
  • application/foliaset+xml – Legacy FoLiA Set Definition format (XML) (assumed for xml extensions and in most other cases)

FoLiA applications should attempt to autodetect the format based on the extension. Not all applications may be able to deal with all formats/serialisations, however.
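
The extension-based autodetection described above can be sketched as follows. This is a minimal illustration, not the behaviour of any particular FoLiA library: the function name guess_set_definition_format is hypothetical, and letting an explicitly declared format take precedence over the extension is an assumption.

```python
from urllib.parse import urlparse

# Extension table taken from the list of recognised values. The order matters:
# ".rdf.xml" has to be tested before the more general ".xml".
_EXTENSION_FORMATS = [
    (".rdf.xml", "application/rdf+xml"),
    (".rdf", "application/rdf+xml"),
    (".ttl", "text/turtle"),
    (".n3", "text/n3"),
    (".xml", "application/foliaset+xml"),
]

# The legacy format is assumed "in most other cases".
LEGACY_FORMAT = "application/foliaset+xml"


def guess_set_definition_format(url, declared_format=None):
    """Return the MIME type to assume for a set definition URL.

    Assumption: a format declared in the document metadata wins over
    whatever the file extension suggests.
    """
    if declared_format:
        return declared_format
    path = urlparse(url).path.lower()
    for extension, mime_type in _EXTENSION_FORMATS:
        if path.endswith(extension):
            return mime_type
    return LEGACY_FORMAT
```

Only the path component of the URL is inspected, so a query string does not interfere with the detection.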

In this documentation, we will use the Turtle format for RDF, alongside our older legacy format. In all cases, FoLiA requires that only one set is defined per file; any other defined sets must be subsets of the primary set. In our legacy XML format, an otherwise empty set definition would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<set
 xmlns="http://ilk.uvt.nl/folia"
 xml:id="your-set-id" type="closed" label="Human readable label for your set">
</set>

Note that the legacy XML format takes an XML namespace that is always the same (the FoLiA namespace).
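
To illustrate how such a legacy file can be read, the following sketch parses the empty set definition above with Python's standard xml.etree.ElementTree module. The function name read_set_header is hypothetical, and the attribute value "open" for open sets is an assumption, since only type="closed" appears in the examples here.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
# xml:id lives in the predefined XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

LEGACY_SET = """<?xml version="1.0" encoding="UTF-8"?>
<set
 xmlns="http://ilk.uvt.nl/folia"
 xml:id="your-set-id" type="closed" label="Human readable label for your set">
</set>
"""


def read_set_header(xml_text):
    """Return the ID, openness and label of a legacy set definition."""
    # Parse from bytes so that the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != "{%s}set" % FOLIA_NS:
        raise ValueError("root element is not a <set> in the FoLiA namespace")
    return {
        "id": root.get(XML_ID),
        # Assumption: open sets are marked with type="open"; closed is the default.
        "open": root.get("type", "closed") == "open",
        "label": root.get("label"),
    }
```

The namespace check reflects the note above: the root element is always in the FoLiA namespace, whatever the set itself is called.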

In RDF, FoLiA Set Definitions follow a particular model. The model we use is a small superset of the SKOS model. SKOS is a W3C standard for the representation of Simple Knowledge Organization Systems [SKOS]. Not everything can be expressed in the SKOS model, so we have some extensions to it which are formally defined in our set definition schema at https://raw.githubusercontent.com/proycon/folia/master/schemas/foliasetdefinition.ttl. The RDF namespace for our extension is http://folia.science.ru.nl/setdefinition#, for which we use the prefix fsd: generally, though this is mere convention.

Some familiarity with RDF and Turtle is recommended for this chapter, but it is also still possible to work with the XML legacy format, which is a bit more concise and simple, and automatically convert it to Turtle format using our superset of the SKOS model.

Your own set definition typically has its own RDF namespace, which in Turtle syntax is defined by the @base directive at the top of your set definition.

Warning

Never reuse the SKOS or FoLiA Set Definition namespaces!

@base <http://your/namespace/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix fsd: <http://folia.science.ru.nl/setdefinition#> .

SKOS uses a different terminology than we do, which may be the source of some confusion. We attempt to map the terms in the following table:

Our term      SKOS          SKOS class
Set/Subset    Collection    skos:Collection
ID            Notation      skos:notation

After this preamble, we can define a set as follows:

<#your-set-id>
    a skos:Collection ;
    skos:notation   "your-set-id" ;
    skos:prefLabel  "Human readable label for your set" ;
    fsd:open        false .

The first two lines state that http://your/namespace/#your-set-id is a [1] SKOS Collection, which is what we use for FoLiA Sets. The skos:notation property corresponds to the ID of the Set; only one is allowed [2].

A set can be either open or closed (default). An open set allows any classes, even if they are not defined, which can be used for open vocabularies. The fsd:open property indicates this; it is not part of SKOS but an extension of ours, hence the different namespace prefix.

References

[RDF] Richard Cyganiak, David Wood and Markus Lanthaler (2014). RDF 1.1 Concepts and Abstract Syntax (website)
[SKOS] Alistair Miles and Sean Bechhofer (2009). SKOS: Simple Knowledge Organization System Reference (website)

Footnotes

[1] The a in Turtle syntax is shorthand for rdf:type.
[2] Technically, SKOS allows multiple, but we restrict it for Set Definitions.

Classes

A set (collection in SKOS terms) consists of classes (concepts in SKOS terms). Consider a simple part-of-speech set with three classes. First we define the set and refer to all the classes it contains:

<#simplepos>
    a skos:Collection ;
    skos:notation   "simplepos" ;
    skos:prefLabel "A simple part of speech set" ;
    skos:member <#N> , <#V> , <#A> .

Then we define the classes:

<#N>
    a skos:Concept ;
    skos:notation   "N" ;
    skos:prefLabel "Noun" .

<#V>
    a skos:Concept ;
    skos:notation   "V" ;
    skos:prefLabel "Verb" .

<#A>
    a skos:Concept ;
    skos:notation   "A" ;
    skos:prefLabel "Adjective" .

The ID (skos:notation) of the class is mandatory for FoLiA Set Definitions and determines a value that the class attribute may take in the FoLiA document, for elements of this set. The skos:prefLabel property, both on the set itself and on the classes, carries a human-readable description for presentational purposes. It is optional but highly recommended.

In our legacy set definition format this is fairly straightforward and more concise:

<set
  xmlns="http://ilk.uvt.nl/folia"
  xml:id="simplepos" type="closed"
  label="Simple Part-of-Speech">
  <class xml:id="N" label="Noun" />
  <class xml:id="V" label="Verb" />
  <class xml:id="A" label="Adjective" />
</set>
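
Enumerating the classes of such a legacy definition, and testing whether a class value is valid, might look like the sketch below. It uses only the standard library. The helper names load_classes and is_valid_class are hypothetical, and type="open" as the marker for open sets is an assumption.

```python
import xml.etree.ElementTree as ET

NS = {"folia": "http://ilk.uvt.nl/folia"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SIMPLEPOS = """<set
  xmlns="http://ilk.uvt.nl/folia"
  xml:id="simplepos" type="closed"
  label="Simple Part-of-Speech">
  <class xml:id="N" label="Noun" />
  <class xml:id="V" label="Verb" />
  <class xml:id="A" label="Adjective" />
</set>"""


def load_classes(xml_text):
    """Map each top-level class ID to its label (None when no label is given)."""
    root = ET.fromstring(xml_text)
    return {c.get(XML_ID): c.get("label") for c in root.findall("folia:class", NS)}


def is_valid_class(xml_text, class_id):
    """Check a class attribute value against the set definition."""
    root = ET.fromstring(xml_text)
    # Assumption: type="open" marks an open set, which accepts any class.
    if root.get("type", "closed") == "open":
        return True
    return class_id in load_classes(xml_text)
```

With the definition above, load_classes(SIMPLEPOS) yields the three IDs N, V and A with their labels, which is the kind of enumeration an annotation tool needs to present a choice to the user.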

Class Hierarchy

In FoLiA Set Definitions, classes can be nested to create more complex hierarchies or taxonomy trees, in which both nodes and leaves act as valid classes. This is best illustrated in our legacy XML format first. Consider the following set definition for named entities, in which the location class has been extended into more fine-grained subclasses.

<set xml:id="namedentities" type="closed" xmlns="http://ilk.uvt.nl/folia">
  <class xml:id="per" label="Person" />
  <class xml:id="org" label="Organisation" />
  <class xml:id="loc" label="Location">
    <class xml:id="loc.country" label="Country" />
    <class xml:id="loc.street" label="Street" />
    <class xml:id="loc.building" label="Building">
      <class xml:id="loc.building.hospital" label="Hospital" />
      <class xml:id="loc.building.church" label="Church" />
      <class xml:id="loc.building.station" label="Station" />
    </class>
  </class>
</set>

In the SKOS model, this is more verbose as the hierarchy has to be modelled explicitly using the skos:broader property, as shown in the following excerpt:

<#namedentities>
    a skos:Collection ;
    skos:member <#loc> , <#loc.country> .

<#loc>
    a skos:Concept ;
    skos:notation   "loc" ;
    skos:prefLabel "Location" .

<#loc.country>
    a skos:Concept ;
    skos:notation   "loc.country" ;
    skos:prefLabel "Country" ;
    skos:broader <#loc> .

It is recommended, but not mandatory, to set the class ID (skos:notation) of any nested classes to represent a full path, as a full path makes substring queries possible. FoLiA, however, does not dictate this and neither does it prescribe a delimiter for such paths, so the period in the above example (loc.country) is merely a convention. Each ID, however, does have to be unique in the entire set.
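
The correspondence between the nested legacy notation and the explicit skos:broader relation can be made concrete with a small sketch. It walks the nested class elements of the named entity example, records the parent of each class, and enforces the rule that IDs are unique in the entire set. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

CLASS_TAG = "{http://ilk.uvt.nl/folia}class"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

NAMED_ENTITIES = """<set xml:id="namedentities" type="closed" xmlns="http://ilk.uvt.nl/folia">
  <class xml:id="per" label="Person" />
  <class xml:id="org" label="Organisation" />
  <class xml:id="loc" label="Location">
    <class xml:id="loc.country" label="Country" />
    <class xml:id="loc.street" label="Street" />
    <class xml:id="loc.building" label="Building">
      <class xml:id="loc.building.hospital" label="Hospital" />
      <class xml:id="loc.building.church" label="Church" />
      <class xml:id="loc.building.station" label="Station" />
    </class>
  </class>
</set>"""


def broader_map(xml_text):
    """Map every class ID to the ID of its parent class (None at the top level)."""
    parents = {}

    def walk(element, parent_id):
        for child in element.findall(CLASS_TAG):  # direct children only
            class_id = child.get(XML_ID)
            if class_id in parents:
                raise ValueError("duplicate class ID: %s" % class_id)
            parents[class_id] = parent_id
            walk(child, class_id)

    walk(ET.fromstring(xml_text), None)
    return parents


def ancestors(parents, class_id):
    """Follow the broader relation upwards, nearest ancestor first."""
    chain = []
    current = parents[class_id]
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain
```

Because the IDs in this example encode the full path, a plain prefix test such as class_id.startswith("loc.building.") selects all building subclasses without consulting the hierarchy at all, which is the benefit of the recommended naming convention.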

Subsets

The section on Features introduced subsets. Please ensure you are familiar with this notion before continuing with the current section.

Subsets can be defined in a similar fashion to sets. Consider the legacy XML format first:

<set xml:id="simplepos" type="closed" xmlns="http://ilk.uvt.nl/folia">
  <class xml:id="N" label="Noun" />
  <class xml:id="V" label="Verb" />
  <class xml:id="A" label="Adjective" />
  <subset xml:id="gender" type="closed">
      <class xml:id="m" label="Masculine" />
      <class xml:id="f" label="Feminine" />
      <class xml:id="n" label="Neuter" />
  </subset>
</set>

In RDF, subsets are defined as SKOS Collections, just like the primary set. The primary set refers to the subsets using the same skos:member relation as is used for classes/concepts.

<#simplepos>
    a skos:Collection ;
    skos:member <#N> , <#V> , <#A> , <#gender> .

<#gender>
    a skos:Collection ;
    skos:notation   "gender" ;
    skos:member <#gender.m> .

<#gender.m>
    a skos:Concept ;
    skos:notation   "m" ;
    skos:prefLabel "Masculine" .

Note that in this example, we prefixed the resource name for the class (#gender.m instead of #m). This is just a recommended convention, as URIs have to be unique and we may want to re-use the m ID in other subsets as well. The ID in the skos:notation property does not need to carry this prefix, as it need only be unique within the subset. This property always determines how it is referenced from the FoLiA document, so we would still get <feat subset="gender" class="m" />.
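
Checking a feature such as <feat subset="gender" class="m" /> against the subsets of a legacy definition can be sketched as follows. The helper names load_subsets and is_valid_feature are hypothetical, and the sketch treats every subset as closed.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SIMPLEPOS = """<set xml:id="simplepos" type="closed" xmlns="http://ilk.uvt.nl/folia">
  <class xml:id="N" label="Noun" />
  <class xml:id="V" label="Verb" />
  <class xml:id="A" label="Adjective" />
  <subset xml:id="gender" type="closed">
      <class xml:id="m" label="Masculine" />
      <class xml:id="f" label="Feminine" />
      <class xml:id="n" label="Neuter" />
  </subset>
</set>"""


def load_subsets(xml_text):
    """Map each subset ID to the set of class IDs defined inside it."""
    root = ET.fromstring(xml_text)
    return {
        subset.get(XML_ID): {c.get(XML_ID) for c in subset.findall("f:class", NS)}
        for subset in root.findall("f:subset", NS)
    }


def is_valid_feature(subsets, subset_id, class_id):
    """True if class_id is defined in the named subset (closed subsets assumed)."""
    return class_id in subsets.get(subset_id, set())
```

Class IDs are looked up per subset, so the n of the gender subset and the N of the primary set never collide.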

Constraints

It is possible to define constraints on which subsets can be used with which classes and which classes within subsets can be combined, though SKOS has no mechanism to express such constraints. We introduce our own resources and properties to define constraints, in the namespace of our extension (http://folia.science.ru.nl/setdefinition#, with prefix fsd: in this documentation).

The core of the constraints is the fsd:constrain relation which can be made between any subset (skos:Collection) and class (skos:Concept). Consider the following Part-of-Speech tag example in which we constrain the subset gender to only occur with nouns:

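# Assumption: the example: prefix stands for your own namespace
@prefix example: <http://your/namespace/#> .

example:simplepos a skos:Collection ;
    skos:member example:N , example:gender .

example:N a skos:Concept ;
    skos:notation "N" ;
    skos:prefLabel "Noun" .

example:gender a skos:Collection ;
    skos:member example:masculine, example:feminine, example:neuter ;
    fsd:constrain example:N .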

The same can be expressed in our legacy format as follows. Note that we left out the definition for the three genders in the RDF example for brevity.

<set xml:id="simplepos" type="closed" xmlns="http://ilk.uvt.nl/folia">
   <class xml:id="N" label="Noun" />
   <subset xml:id="gender" type="closed">
     <class xml:id="masculine" label="masculine" />
     <class xml:id="feminine" label="feminine" />
     <class xml:id="neuter" label="neuter" />
     <constrain id="N" />
   </subset>
</set>

Multiple constrain relations may be specified, but one has to be aware that this then counts as a conjunction or intersection. When multiple relations are needed, what we often see instead is the use of an fsd:Constraint class, which acts as a collection of constrain relations and can explicitly express the type (fsd:constraintType) of matching to apply to the constraints. The type can be any of the following:

  • "any" - Only one of the constrain relations must match for the constraint to pass
  • "all" - All constrain relations must match for the constraint to pass
  • "none" - None of the constrain relations may match for the constraint to pass
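
The three matching types amount to the following logic. This is only a sketch: the function name constraint_passes is hypothetical, and each element of matches is assumed to be the already computed outcome (True or False) of one constrain relation.

```python
def constraint_passes(constraint_type, matches):
    """Combine the outcomes of individual constrain relations.

    constraint_type is one of "any", "all" or "none"; matches is an
    iterable of booleans, one per constrain relation.
    """
    matches = list(matches)
    if constraint_type == "any":
        return any(matches)
    if constraint_type == "all":
        return all(matches)
    if constraint_type == "none":
        return not any(matches)
    raise ValueError("unknown constraint type: %r" % constraint_type)
```

For a subset that is valid with nouns or adjectives, a call such as constraint_passes("any", [cls == "N", cls == "A"]) expresses the check for a given class ID cls.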

The other main purpose of the fsd:Constraint class is to avoid repetition, as it allows a complex constraint to be referenced from multiple locations. Consider the following example, first in our legacy format:

<set xml:id="simplepos" type="closed" xmlns="http://ilk.uvt.nl/folia">
   <class xml:id="N" label="Noun" />
   <class xml:id="A" label="Adjective" />
   <class xml:id="V" label="Verb" />
   <subset xml:id="gender" type="closed">
     <class xml:id="masculine" label="masculine" />
     <class xml:id="feminine" label="feminine" />
     <class xml:id="neuter" label="neuter" />
     <constrain id="constraint.1" />
   </subset>
   <subset xml:id="case" type="closed">
     <class xml:id="nom" label="nominative" />
     <class xml:id="gen" label="genitive" />
     <class xml:id="dat" label="dative" />
     <class xml:id="acc" label="accusative" />
     <constrain id="constraint.1" />
   </subset>
   <constraint xml:id="constraint.1" type="any">
     <constrain id="N" />
     <constrain id="A" />
   </constraint>
</set>

In RDF, the constraint would be formulated as follows:

example:constraint.1 a fsd:Constraint ;
    fsd:constraintType "any" ;
    fsd:constrain example:N ;
    fsd:constrain example:A .

An fsd:constrain relation may be used within sets (skos:Collection), classes (skos:Concept) as well as constraints (fsd:Constraint). Similarly, an fsd:constrain relation may point to any of the three. All this combined allows for complex nesting logic.
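
How a shared constraint is resolved when it is referenced from several subsets can be sketched with a small in-memory model of the gender/case example. The dictionary layout and the function names are assumptions made for illustration only. The sketch covers just one direction, namely whether a subset may be used with a given class, and does not claim to reflect how any FoLiA library evaluates constraints.

```python
# Hypothetical in-memory model of the gender/case example:
# each subset lists the targets of its constrain relations, and each
# named constraint holds its type plus its own constrain targets.
DEFINITION = {
    "subset_constrains": {"gender": ["constraint.1"], "case": ["constraint.1"]},
    "constraints": {"constraint.1": ("any", ["N", "A"])},
}


def target_matches(definition, target, class_id):
    """Resolve one constrain target, recursing into named constraints."""
    constraints = definition["constraints"]
    if target not in constraints:
        # A plain class ID matches only the class being annotated.
        return target == class_id
    constraint_type, targets = constraints[target]
    results = [target_matches(definition, t, class_id) for t in targets]
    if constraint_type == "any":
        return any(results)
    if constraint_type == "all":
        return all(results)
    if constraint_type == "none":
        return not any(results)
    raise ValueError("unknown constraint type: %r" % constraint_type)


def subset_allowed(definition, subset_id, class_id):
    """Multiple constrain relations on a subset count as a conjunction."""
    targets = definition["subset_constrains"].get(subset_id, [])
    return all(target_matches(definition, t, class_id) for t in targets)
```

A subset without any constrain relation is accepted with every class, since the conjunction over an empty list is true.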

The following example shows a more complete set definition with various kinds of constraints. We show it both in legacy XML and in Turtle RDF:

<set xml:id="simplepos" type="closed" xmlns="http://ilk.uvt.nl/folia">
   <class xml:id="N" label="Noun">
     <constrain id="constraint.2" />
   </class>

   <class xml:id="A" label="Adjective">
     <constrain id="constraint.2" />
   </class>

   <class xml:id="V" label="Verb">
     <constrain id="tense" />
     <constrain id="number" />
   </class>

   <subset xml:id="gender" type="closed">
     <class xml:id="m" label="masculine" />
     <class xml:id="f" label="feminine" />
     <class xml:id="n" label="neuter" />
     <constrain id="constraint.1" />
   </subset>

   <subset xml:id="case" type="closed">
     <class xml:id="nom" label="nominative" />
     <class xml:id="gen" label="genitive" />
     <class xml:id="dat" label="dative" />
     <class xml:id="acc" label="accusative" />
     <constrain id="constraint.1" />
   </subset>

   <subset xml:id="number" type="closed">
     <class xml:id="s" label="singular" />
     <class xml:id="p" label="plural" />
   </subset>

   <subset xml:id="tense" type="closed">
     <class xml:id="present" label="present" />
     <class xml:id="past" label="past" />
     <constrain id="V" />
   </subset>

   <constraint xml:id="constraint.1" type="any">
     <!-- This is a constraint expressing which classes the subset using this constraint is valid -->
     <constrain id="N" />
     <constrain id="A" />
   </constraint>

   <constraint xml:id="constraint.2" type="all">
     <!-- This is a constraint expressing which subsets are required by the class using it-->
     <constrain id="gender" />
     <constrain id="case" />
     <constrain id="number" />
   </constraint>
</set>

The same set definition in Turtle:

@prefix fsd: <http://folia.science.ru.nl/setdefinition#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix simplepos: <http://folia.science.ru.nl/setdefinition/simplepos#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

simplepos:Set a skos:Collection ;
    skos:member simplepos:A,
        simplepos:N,
        simplepos:Subset.case,
        simplepos:Subset.gender,
        simplepos:Subset.number,
        simplepos:Subset.tense,
        simplepos:V ;
    skos:notation "simplepos" .

simplepos:Subset.tense a skos:Collection ;
    fsd:constrain simplepos:V ;
    skos:member simplepos:past,
        simplepos:present ;
    skos:notation "tense" .

simplepos:acc a skos:Concept ;
    fsd:sequenceNumber 4 ;
    skos:notation "acc" ;
    skos:prefLabel "accusative" .

simplepos:dat a skos:Concept ;
    fsd:sequenceNumber 3 ;
    skos:notation "dat" ;
    skos:prefLabel "dative" .

simplepos:f a skos:Concept ;
    fsd:sequenceNumber 2 ;
    skos:notation "f" ;
    skos:prefLabel "feminine" .

simplepos:gen a skos:Concept ;
    fsd:sequenceNumber 2 ;
    skos:notation "gen" ;
    skos:prefLabel "genitive" .

simplepos:m a skos:Concept ;
    fsd:sequenceNumber 1 ;
    skos:notation "m" ;
    skos:prefLabel "masculine" .

simplepos:n a skos:Concept ;
    fsd:sequenceNumber 3 ;
    skos:notation "n" ;
    skos:prefLabel "neuter" .

simplepos:nom a skos:Concept ;
    fsd:sequenceNumber 1 ;
    skos:notation "nom" ;
    skos:prefLabel "nominative" .

simplepos:p a skos:Concept ;
    fsd:sequenceNumber 2 ;
    skos:notation "p" ;
    skos:prefLabel "plural" .

simplepos:past a skos:Concept ;
    fsd:sequenceNumber 2 ;
    skos:notation "past" ;
    skos:prefLabel "past" .

simplepos:present a skos:Concept ;
    fsd:sequenceNumber 1 ;
    skos:notation "present" ;
    skos:prefLabel "present" .

simplepos:s a skos:Concept ;
    fsd:sequenceNumber 1 ;
    skos:notation "s" ;
    skos:prefLabel "singular" .

simplepos:A a skos:Concept ;
    fsd:constrain simplepos:constraint.2 ;
    fsd:sequenceNumber 2 ;
    skos:notation "A" ;
    skos:prefLabel "Adjective" .

simplepos:N a skos:Concept ;
    fsd:constrain simplepos:constraint.2 ;
    fsd:sequenceNumber 1 ;
    skos:notation "N" ;
    skos:prefLabel "Noun" .

simplepos:Subset.case a skos:Collection ;
    fsd:constrain simplepos:constraint.1 ;
    skos:member simplepos:acc,
        simplepos:dat,
        simplepos:gen,
        simplepos:nom ;
    skos:notation "case" .

simplepos:Subset.gender a skos:Collection ;
    fsd:constrain simplepos:constraint.1 ;
    skos:member simplepos:f,
        simplepos:m,
        simplepos:n ;
    skos:notation "gender" .

simplepos:Subset.number a skos:Collection ;
    skos:member simplepos:p,
        simplepos:s ;
    skos:notation "number" .

simplepos:V a skos:Concept ;
    fsd:constrain simplepos:Subset.number,
        simplepos:Subset.tense ;
    fsd:sequenceNumber 3 ;
    skos:notation "V" ;
    skos:prefLabel "Verb" .

simplepos:constraint.1 a fsd:Constraint ;
    fsd:constrain simplepos:A,
        simplepos:N ;
    fsd:constraintType "any" .

simplepos:constraint.2 a fsd:Constraint ;
    fsd:constrain simplepos:Subset.case,
        simplepos:Subset.gender,
        simplepos:Subset.number ;
    fsd:constraintType "all" .

SKOS

SKOS allows for more expressions to be made, and of course the full power of open linked data is available for use with FoLiA Set Definitions. The previous subsections laid out the minimal requirements for FoLiA Set Definitions using the SKOS model.

The use of skos:OrderedCollection is not supported yet; skos:Collection is mandatory. Ordering of classes (SKOS Concepts) can currently be indicated through a separate fsd:sequenceNumber property.
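
An application that wants to present classes in their intended order can sort on this property. The sketch below uses the values of the case subset from the Turtle example. The function name in_sequence is hypothetical, and it assumes that every class carries a sequence number.

```python
# (skos:notation, fsd:sequenceNumber) pairs of the case subset
CASE_CLASSES = [("acc", 4), ("dat", 3), ("gen", 2), ("nom", 1)]


def in_sequence(classes):
    """Return the class IDs ordered by their sequence number."""
    return [notation for notation, _ in sorted(classes, key=lambda item: item[1])]
```

Here in_sequence(CASE_CLASSES) gives nom, gen, dat, acc, regardless of the order in which the classes appear in the serialised file.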

FoLiA Set Definitions must be complete, that is to say that all sets (SKOS collections) and classes (SKOS concepts) must be fully defined in one and the same set definition file.

Note

The file need not be static but can be dynamically generated server-side; it must, however, be publicly available from a URL. A set definition must contain one and only one primary set (SKOS collection); all other sets must be subsets (SKOS collections that are a member of the primary set; no deeper nesting is supported).