Set Definitions (Vocabulary)¶
Introduction¶
The sets and classes used by the various linguistic annotation types are never defined in the FoLiA documents themselves, but externally in set definitions.
By using set definitions, a FoLiA document can be validated on a deep level, i.e. the validity of the used classes can be tested. Set definitions provide semantics to the FoLiA documents that use them and are an integral part of FoLiA. When set definitions are absent, validation can only be conducted on a shallow level that is agnostic about all sets and the classes therein.
Recall that all sets that are used need to be declared in the Annotation Declarations section in the document header and that they point to URLs holding FoLiA set definitions. If no set definition files are associated, then a full in-depth validation cannot take place.
The role of FoLiA Set Definitions is:
- to define which classes are valid in a set
- to define which subsets and classes are valid in Features in a set
- to constrain which subsets and classes may co-occur in an annotation of the set
- to allow enumeration over classes and subsets
- to assign human-readable labels to symbolic classes
- to relate classes to external resources defining them (data category registries, linked data)
- to define a hierarchy/taxonomy of classes
Prior to FoLiA v1.4, set definitions were stored in a simple custom XML format, distinct from FoLiA itself, which we call the legacy format and which is still supported for backward compatibility. Since FoLiA v1.4, however, we strongly recommend storing set definitions as RDF [RDF], i.e. the technology that powers the semantic web. In this way, set definitions provide a formal semantic layer for FoLiA.
Set definitions may be stored in various common RDF serialisation formats. The format can be indicated on the declarations in the document metadata using the format attribute; recognised values are:
- application/rdf+xml – XML for RDF (assumed for rdf.xml or rdf extensions)
- text/turtle – Turtle (for RDF) (assumed for ttl extensions)
- text/n3 – Notation 3 (for RDF) (assumed for n3 extensions)
- application/foliaset+xml – Legacy FoLiA Set Definition format (XML) (assumed for xml extensions and in most other cases)
FoLiA applications should attempt to autodetect the format based on the extension. Not all applications may be able to deal with all formats/serialisations, however.
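The extension-based autodetection described above could be sketched as follows. This is a minimal illustration, not the behaviour of any particular FoLiA library: the function name guess_set_format is hypothetical, and letting an explicitly declared format take precedence over the extension is an assumption.

```python
from urllib.parse import urlparse

# (suffix, MIME type) pairs taken from the list of recognised values.
# Order matters: ".rdf.xml" must be tested before the plain ".xml" suffix.
EXTENSION_FORMATS = [
    (".rdf.xml", "application/rdf+xml"),
    (".rdf", "application/rdf+xml"),
    (".ttl", "text/turtle"),
    (".n3", "text/n3"),
    (".xml", "application/foliaset+xml"),
]
LEGACY_FORMAT = "application/foliaset+xml"


def guess_set_format(url, declared_format=None):
    """Return the set definition format for a URL (hypothetical helper).

    Assumption: a format declared in the document metadata wins over
    whatever the file extension suggests.
    """
    if declared_format:
        return declared_format
    path = urlparse(url).path.lower()
    for suffix, mimetype in EXTENSION_FORMATS:
        if path.endswith(suffix):
            return mimetype
    # The legacy format is assumed "in most other cases".
    return LEGACY_FORMAT
```

Calling guess_set_format("https://example.org/sets/pos.ttl") would select Turtle, while an extensionless URL falls back to the legacy format.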
In this documentation, we will use the Turtle format for RDF, alongside our older legacy format. In all cases, FoLiA requires that only one set is defined per file; any other defined sets must be subsets of the primary set. In our legacy XML format, an otherwise empty set definition would look like this:
<?xml version="1.0" encoding="UTF-8"?>
<set
xmlns="http://ilk.uvt.nl/folia"
xml:id="your-set-id" type="closed" label="Human readable label for your set">
</set>
Note that the legacy XML format takes an XML namespace that is always the same (the FoLiA namespace).
In RDF, FoLiA Set Definitions follow a particular model. The model we use is a
small superset of the SKOS model. SKOS is a W3C standard for the representation
of Simple Knowledge Organization Systems [SKOS]. Not everything can be
expressed in the SKOS model, so we have some extensions to it which are
formally defined in our set definition schema at
https://raw.githubusercontent.com/proycon/folia/master/schemas/foliasetdefinition.ttl.
The RDF namespace for our extension is http://folia.science.ru.nl/setdefinition#, for which we generally use the prefix fsd:, though this is mere convention.
Some familiarity with RDF and Turtle is recommended for this chapter, but it is also still possible to work with the XML legacy format, which is a bit more concise and simple, and automatically convert it to Turtle format using our superset of the SKOS model.
Your own set definition typically has its own RDF namespace, which in Turtle syntax is defined by the @base directive at the top of your set definition.
Warning
Never reuse the SKOS or FoLiA Set Definition namespaces!
@base <http://your/namespace/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix fsd: <http://folia.science.ru.nl/setdefinition#> .
SKOS uses a different terminology than we do, which may be the source of some confusion. We attempt to map the terms in the following table:
Our term | SKOS | SKOS class |
---|---|---|
Set/Subset | Collection | skos:Collection |
ID | Notation | skos:notation |
After this preamble, we can define a set as follows:
<#your-set-id>
a skos:Collection ;
skos:notation "your-set-id" ;
skos:prefLabel "Human readable label for your set" ;
fsd:open false .
The first two lines state that http://your/namespace/#your-set-id is a [1] SKOS Collection, which is what we use for FoLiA Sets. The skos:notation property corresponds to the ID of the Set; only one is allowed [2].
A set can be either open or closed (default). An open set allows any classes, even if they are not defined, which can be used for open vocabularies. The fsd:open property is used to indicate this; it is not part of SKOS but an extension of ours, hence the different namespace prefix.
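To make the difference between shallow validation, open sets and closed sets concrete, here is a small sketch. The SetDefinition dataclass and the class_is_valid function are hypothetical names used only for illustration; they do not reflect the API of an existing FoLiA library.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SetDefinition:
    """Hypothetical in-memory view of a loaded set definition."""
    set_id: str                                       # the skos:notation of the set
    is_open: bool = False                             # mirrors fsd:open; closed is the default
    classes: Set[str] = field(default_factory=set)    # skos:notation of each class


def class_is_valid(class_id: str, definition: Optional[SetDefinition]) -> bool:
    """Check one class against a set definition."""
    if definition is None:
        # No set definition available: only shallow validation is possible,
        # so the class cannot be rejected.
        return True
    if definition.is_open:
        # Open sets accept any class, defined or not.
        return True
    # Closed sets accept only the classes they define.
    return class_id in definition.classes
```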
References
[RDF] | Richard Cyganiak, David Wood and Markus Lanthaler (2014). RDF 1.1 Concepts and Abstract Syntax (website) |
[SKOS] | Alistair Miles & Sean Bechhofer (2009). SKOS: Simple Knowledge Organization System Reference (website) |
Footnotes
[1] | the a in Turtle syntax is shorthand for rdf:type |
[2] | Technically, SKOS allows multiple, but we restrict it for Set Definitions. |
Classes¶
A set (collection in SKOS terms) consists of classes (concepts in SKOS terms). Consider a simple part-of-speech set with three classes. First we define the set and refer to all the classes it contains:
<#simplepos>
a skos:Collection ;
skos:notation "simplepos" ;
skos:prefLabel "A simple part of speech set" ;
skos:member <#N> , <#V> , <#A> .
Then we define the classes:
<#N>
a skos:Concept ;
skos:notation "N" ;
skos:prefLabel "Noun" .
<#V>
a skos:Concept ;
skos:notation "V" ;
skos:prefLabel "Verb" .
<#A>
a skos:Concept ;
skos:notation "A" ;
skos:prefLabel "Adjective" .
The ID (skos:notation) of the class is mandatory for FoLiA Set Definitions and determines a value the class attribute may take in the FoLiA document, for elements of this set. The skos:prefLabel property, both on the set itself and on the classes, carries a human-readable description for presentational purposes; this is optional but highly recommended.
In our legacy set definition format this is fairly straightforward and more concise:
<set
xmlns="http://ilk.uvt.nl/folia"
xml:id="simplepos" type="closed"
label="Simple Part-of-Speech">
<class xml:id="N" label="Noun" />
<class xml:id="V" label="Verb" />
<class xml:id="A" label="Adjective" />
</set>
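As an illustration of enumerating classes and their labels, the legacy XML above can be read with Python's standard library. This is only a sketch: read_classes is a hypothetical helper, and it ignores nested classes and subsets.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

LEGACY_XML = """<set xmlns="http://ilk.uvt.nl/folia"
     xml:id="simplepos" type="closed" label="Simple Part-of-Speech">
  <class xml:id="N" label="Noun" />
  <class xml:id="V" label="Verb" />
  <class xml:id="A" label="Adjective" />
</set>"""


def read_classes(xml_text):
    """Return (set ID, {class ID: label}) for the top-level classes."""
    root = ET.fromstring(xml_text)
    classes = {}
    for node in root.findall(f"{{{FOLIA_NS}}}class"):
        classes[node.get(XML_ID)] = node.get("label")
    return root.get(XML_ID), classes


set_id, classes = read_classes(LEGACY_XML)
```

The resulting mapping is what an application would consult to show a human-readable label for a symbolic class such as N.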
Class Hierarchy¶
In FoLiA Set Definitions, classes can be nested to create more complex hierarchies or taxonomy trees, in which both nodes and leaves act as valid classes. This is best illustrated in our legacy XML format first. Consider the following set definition for named entities, in which the location class has been extended into more fine-grained subclasses.
<set xml:id="namedentities" type="closed" xmlns="http://ilk.uvt.nl/folia">
<class xml:id="per" label="Person" />
<class xml:id="org" label="Organisation" />
<class xml:id="loc" label="Location">
<class xml:id="loc.country" label="Country" />
<class xml:id="loc.street" label="Street" />
<class xml:id="loc.building" label="Building">
<class xml:id="loc.building.hospital" label="Hospital" />
<class xml:id="loc.building.church" label="Church" />
<class xml:id="loc.building.station" label="Station" />
</class>
</class>
</set>
In the SKOS model, this is more verbose as the hierarchy has to be modelled explicitly using the skos:broader property, as shown in the following excerpt:
<#namedentities>
a skos:Collection ;
skos:member <#loc> , <#loc.country> .
<#loc>
a skos:Concept ;
skos:notation "loc" ;
skos:prefLabel "Location" .
<#loc.country>
a skos:Concept ;
skos:notation "loc.country" ;
skos:prefLabel "Country" ;
skos:broader <#loc> .
It is recommended, but not mandatory, to set the class ID (skos:notation) of any nested classes to represent a full path, as a full path makes substring queries possible. FoLiA, however, does not dictate this, nor does it prescribe a delimiter for such paths, so the period in the above example (loc.country) is merely a convention. Each ID, however, does have to be unique in the entire set.
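The relation between nesting in the legacy format and skos:broader in RDF can be sketched by flattening the nested classes into a child-to-parent mapping. The helper name broader_relations is hypothetical, and the XML is a shortened copy of the named-entity example; the uniqueness check mirrors the requirement that each ID is unique in the entire set.

```python
import xml.etree.ElementTree as ET

CLASS_TAG = "{http://ilk.uvt.nl/folia}class"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

NESTED_XML = """<set xml:id="namedentities" type="closed" xmlns="http://ilk.uvt.nl/folia">
  <class xml:id="per" label="Person" />
  <class xml:id="loc" label="Location">
    <class xml:id="loc.country" label="Country" />
    <class xml:id="loc.building" label="Building">
      <class xml:id="loc.building.hospital" label="Hospital" />
      <class xml:id="loc.building.church" label="Church" />
    </class>
  </class>
</set>"""


def broader_relations(xml_text):
    """Map every class ID to the ID of its parent class (None at top level)."""
    broader = {}

    def walk(node, parent_id):
        for child in node.findall(CLASS_TAG):
            class_id = child.get(XML_ID)
            if class_id in broader:
                raise ValueError(f"duplicate class ID: {class_id}")
            broader[class_id] = parent_id
            walk(child, class_id)

    walk(ET.fromstring(xml_text), None)
    return broader


broader = broader_relations(NESTED_XML)
# With full-path IDs, a simple prefix query finds all buildings.
buildings = sorted(c for c in broader if c.startswith("loc.building."))
```

Each non-None entry in the mapping corresponds to one skos:broader statement in the RDF form.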
Subsets¶
The section on Features introduced subsets. Please ensure you are familiar with this notion before continuing with the current section.
Subsets can be defined in a similar fashion to sets. Consider the legacy XML format first:
<set xml:id="simplepos" type="closed" xmlns="http://ilk.uvt.nl/folia">
<class xml:id="N" label="Noun" />
<class xml:id="V" label="Verb" />
<class xml:id="A" label="Adjective" />
<subset xml:id="gender" type="closed">
<class xml:id="m" label="Masculine" />
<class xml:id="f" label="Feminine" />
<class xml:id="n" label="Neuter" />
</subset>
</set>
In RDF, subsets are defined as SKOS Collections, just like the primary set. The primary set refers to the subsets using the same skos:member relation as is used for classes/concepts.
<#simplepos>
a skos:Collection ;
skos:member <#N> , <#V> , <#A> , <#gender> .
<#gender>
a skos:Collection ;
skos:notation "gender" ;
skos:member <#gender.m> .
<#gender.m>
a skos:Concept ;
skos:notation "m" ;
skos:prefLabel "Masculine" .
Note that in this example, we prefixed the resource name for the class (#gender.m instead of #m). This is just a recommended convention, as URIs have to be unique and we may want to re-use the m ID in other subsets as well. The ID in the skos:notation property does not need to carry this prefix, as it need only be unique within the subset. This property always determines how the class is referenced from the FoLiA document, so we would still get:
<feat subset="gender" class="m" />
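A feature such as the one above could be checked against the subsets of a legacy set definition roughly as follows. This is a sketch under simplifying assumptions: read_subsets and feature_is_valid are hypothetical helpers, every subset is treated as closed, and constraints are not taken into account.

```python
import xml.etree.ElementTree as ET

CLASS_TAG = "{http://ilk.uvt.nl/folia}class"
SUBSET_TAG = "{http://ilk.uvt.nl/folia}subset"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SUBSET_XML = """<set xml:id="simplepos" type="closed" xmlns="http://ilk.uvt.nl/folia">
  <class xml:id="N" label="Noun" />
  <subset xml:id="gender" type="closed">
    <class xml:id="m" label="Masculine" />
    <class xml:id="f" label="Feminine" />
    <class xml:id="n" label="Neuter" />
  </subset>
</set>"""


def read_subsets(xml_text):
    """Return {subset ID: set of class IDs defined in that subset}."""
    root = ET.fromstring(xml_text)
    return {
        subset.get(XML_ID): {c.get(XML_ID) for c in subset.findall(CLASS_TAG)}
        for subset in root.findall(SUBSET_TAG)
    }


def feature_is_valid(subsets, subset_id, class_id):
    """True if the subset exists and defines the class (closed subsets only)."""
    return class_id in subsets.get(subset_id, set())


subsets = read_subsets(SUBSET_XML)
```

Note that the lookup uses the plain class ID (m), matching the skos:notation value rather than the prefixed resource name.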
Constraints¶
It is possible to define constraints on which subsets can be used with which classes and which classes within subsets can be combined, though SKOS has no mechanism to express such constraints. We introduce our own resources and properties to define constraints, in the namespace of our extension (http://folia.science.ru.nl/setdefinition#, with prefix fsd: in this documentation).
The core of the constraints is the fsd:constrain relation, which can be made between any subset (skos:Collection) and class (skos:Concept). Consider the following Part-of-Speech tag example in which we constrain the subset gender to only occur with nouns:
<#simplepos>
a skos:Collection ;
skos:member <#N> .
example:N a skos:Concept ;
skos:notation "N" ;
skos:prefLabel "Noun" .
example:gender a skos:Collection ;
skos:member example:masculine, example:feminine, example:neuter ;
fsd:constrain example:N .
The same can be expressed in our legacy format as follows. Note that we left out the definition for the three genders in the RDF example for brevity.
<set xml:id="simplepos" type="closed" xmlns="http://ilk.uvt.nl/folia">
<class xml:id="N" label="Noun" />
<subset xml:id="gender" type="closed">
<class xml:id="masculine" label="masculine" />
<class xml:id="feminine" label="feminine" />
<class xml:id="neuter" label="neuter" />
<constrain id="N" />
</subset>
</set>
Multiple constrain relations may be specified, but one has to be aware that this then counts as a conjunction or intersection. What we often see instead when there are multiple relations is the use of a fsd:Constraint class, which acts as a collection of constrain relations and can explicitly express the type (fsd:constraintType) of matching to apply to the constraints. The type can be any of the following:
- "any" - Only one of the constrain relations needs to match for the constraint to pass
- "all" - All constrain relations must match for the constraint to pass
- "none" - None of the constrain relations may match for the constraint to pass
The other main purpose of the fsd:Constraint class is to avoid repetition, as it allows a complex constraint to be referenced from multiple locations. Consider the following example, first in our legacy format:
<set xml:id="simplepos" type="closed" xmlns="http://ilk.uvt.nl/folia">
<class xml:id="N" label="Noun" />
<class xml:id="A" label="Adjective" />
<class xml:id="V" label="Verb" />
<subset xml:id="gender" type="closed">
<class xml:id="masculine" label="masculine" />
<class xml:id="feminine" label="feminine" />
<class xml:id="neuter" label="neuter" />
<constrain id="constraint.1" />
</subset>
<subset xml:id="case" type="closed">
<class xml:id="nom" label="nominative" />
<class xml:id="gen" label="genitive" />
<class xml:id="dat" label="dative" />
<class xml:id="acc" label="accusative" />
<constrain id="constraint.1" />
</subset>
<constraint xml:id="constraint.1" type="any">
<constrain id="N" />
<constrain id="A" />
</constraint>
</set>
In RDF, the constraint would be formulated as follows:
example:constraint.1 a fsd:Constraint ;
fsd:constraintType "any" ;
fsd:constrain example:N ;
fsd:constrain example:A .
A fsd:constrain relation may be used within sets (skos:Collection), classes (skos:Concept) as well as constraints (fsd:Constraint). Similarly, a fsd:constrain relation may point to any of the three. All this combined allows for complex nesting logic.
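The three constraint types can be illustrated with a small evaluation function. This is a deliberately flat sketch: constraint_passes is a hypothetical name, nested constraints are not resolved, and a relation is assumed to "match" when the ID it points to is among the classes or subsets present on the annotation.

```python
def constraint_passes(constraint_type, targets, present):
    """Evaluate one fsd:Constraint-like rule.

    constraint_type -- "any", "all" or "none"
    targets         -- IDs the constrain relations point to
    present         -- IDs of classes/subsets found on the annotation
    """
    matches = [target in present for target in targets]
    if constraint_type == "any":
        return any(matches)        # one matching relation is enough
    if constraint_type == "all":
        return all(matches)        # every relation has to match
    if constraint_type == "none":
        return not any(matches)    # no relation may match
    raise ValueError(f"unknown constraint type: {constraint_type}")


# constraint.1 from the example above: valid with nouns or adjectives.
on_noun = constraint_passes("any", ["N", "A"], {"N"})
on_verb = constraint_passes("any", ["N", "A"], {"V"})
```

Several bare fsd:constrain relations without an fsd:Constraint wrapper behave like the "all" case, since they count as a conjunction.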
The following example shows a more complete set definition with various kinds of constraints. We show it both in legacy XML and in Turtle RDF:
<set xml:id="simplepos" type="closed" xmlns="http://ilk.uvt.nl/folia">
  <class xml:id="N" label="Noun">
    <constrain id="constraint.2" />
  </class>
  <class xml:id="A" label="Adjective">
    <constrain id="constraint.2" />
  </class>
  <class xml:id="V" label="Verb">
    <constrain id="tense" />
    <constrain id="number" />
  </class>

  <subset xml:id="gender" type="closed">
    <class xml:id="m" label="masculine" />
    <class xml:id="f" label="feminine" />
    <class xml:id="n" label="neuter" />
    <constrain id="constraint.1" />
  </subset>

  <subset xml:id="case" type="closed">
    <class xml:id="nom" label="nominative" />
    <class xml:id="gen" label="genitive" />
    <class xml:id="dat" label="dative" />
    <class xml:id="acc" label="accusative" />
    <constrain id="constraint.1" />
  </subset>

  <subset xml:id="number" type="closed">
    <class xml:id="s" label="singular" />
    <class xml:id="p" label="plural" />
  </subset>

  <subset xml:id="tense" type="closed">
    <class xml:id="present" label="present" />
    <class xml:id="past" label="past" />
    <constrain id="V" />
  </subset>

  <constraint xml:id="constraint.1" type="any">
    <!-- This is a constraint expressing which classes the subset using this constraint is valid -->
    <constrain id="N" />
    <constrain id="A" />
  </constraint>

  <constraint xml:id="constraint.2" type="all">
    <!-- This is a constraint expressing which subsets are required by the class using it-->
    <constrain id="gender" />
    <constrain id="case" />
    <constrain id="number" />
  </constraint>
</set>
@prefix fsd: <http://folia.science.ru.nl/setdefinition#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix simplepos: <http://folia.science.ru.nl/setdefinition/simplepos#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

simplepos:Set a skos:Collection ;
    skos:member simplepos:A,
        simplepos:N,
        simplepos:Subset.case,
        simplepos:Subset.gender,
        simplepos:Subset.number,
        simplepos:Subset.tense,
        simplepos:V ;
    skos:notation "simplepos" .

simplepos:Subset.tense a skos:Collection ;
    fsd:constrain simplepos:V ;
    skos:member simplepos:past, simplepos:present ;
    skos:notation "tense" .

simplepos:acc a skos:Concept ;
    fsd:sequenceNumber 4 ;
    skos:notation "acc" ;
    skos:prefLabel "accusative" .

simplepos:dat a skos:Concept ;
    fsd:sequenceNumber 3 ;
    skos:notation "dat" ;
    skos:prefLabel "dative" .

simplepos:f a skos:Concept ;
    fsd:sequenceNumber 2 ;
    skos:notation "f" ;
    skos:prefLabel "feminine" .

simplepos:gen a skos:Concept ;
    fsd:sequenceNumber 2 ;
    skos:notation "gen" ;
    skos:prefLabel "genitive" .

simplepos:m a skos:Concept ;
    fsd:sequenceNumber 1 ;
    skos:notation "m" ;
    skos:prefLabel "masculine" .

simplepos:n a skos:Concept ;
    fsd:sequenceNumber 3 ;
    skos:notation "n" ;
    skos:prefLabel "neuter" .

simplepos:nom a skos:Concept ;
    fsd:sequenceNumber 1 ;
    skos:notation "nom" ;
    skos:prefLabel "nominative" .

simplepos:p a skos:Concept ;
    fsd:sequenceNumber 2 ;
    skos:notation "p" ;
    skos:prefLabel "plural" .

simplepos:past a skos:Concept ;
    fsd:sequenceNumber 2 ;
    skos:notation "past" ;
    skos:prefLabel "past" .

simplepos:present a skos:Concept ;
    fsd:sequenceNumber 1 ;
    skos:notation "present" ;
    skos:prefLabel "present" .

simplepos:s a skos:Concept ;
    fsd:sequenceNumber 1 ;
    skos:notation "s" ;
    skos:prefLabel "singular" .

simplepos:A a skos:Concept ;
    fsd:constrain simplepos:constraint.2 ;
    fsd:sequenceNumber 2 ;
    skos:notation "A" ;
    skos:prefLabel "Adjective" .

simplepos:N a skos:Concept ;
    fsd:constrain simplepos:constraint.2 ;
    fsd:sequenceNumber 1 ;
    skos:notation "N" ;
    skos:prefLabel "Noun" .

simplepos:Subset.case a skos:Collection ;
    fsd:constrain simplepos:constraint.1 ;
    skos:member simplepos:acc, simplepos:dat, simplepos:gen, simplepos:nom ;
    skos:notation "case" .

simplepos:Subset.gender a skos:Collection ;
    fsd:constrain simplepos:constraint.1 ;
    skos:member simplepos:f, simplepos:m, simplepos:n ;
    skos:notation "gender" .

simplepos:Subset.number a skos:Collection ;
    skos:member simplepos:p, simplepos:s ;
    skos:notation "number" .

simplepos:V a skos:Concept ;
    fsd:constrain simplepos:Subset.number, simplepos:Subset.tense ;
    fsd:sequenceNumber 3 ;
    skos:notation "V" ;
    skos:prefLabel "Verb" .

simplepos:constraint.1 a fsd:Constraint ;
    fsd:constrain simplepos:A, simplepos:N ;
    fsd:constraintType "any" .

simplepos:constraint.2 a fsd:Constraint ;
    fsd:constrain simplepos:Subset.case,
        simplepos:Subset.gender,
        simplepos:Subset.number ;
    fsd:constraintType "all" .
SKOS¶
SKOS allows for more expressions to be made, and of course the full power of open linked data is available for use with FoLiA Set Definitions. The previous subsections laid out the minimal requirements for FoLiA Set Definitions using the SKOS model.
The use of skos:OrderedCollection is not yet supported; skos:Collection is mandatory. Ordering of classes (SKOS Concepts) can currently be indicated through a separate fsd:sequenceNumber property.
FoLiA Set Definitions must be complete, that is to say that all sets (SKOS collections) and classes (SKOS concepts) must be fully defined in one and the same set definition file.
Note
The file need not be static but can be dynamically generated server-side, provided it is publicly available from a URL. A set definition must contain one and only one primary set (SKOS collection); all other sets must be subsets (SKOS collections that are a member of the primary set; no deeper nesting is supported).
See also