Part-of-Speech Annotation

Part-of-speech annotation is one of the most common types of linguistic annotation. It assigns a lexical class to each word.

Specification

Annotation Category:
 

Inline Annotation

Declaration:

<pos-annotation set="...">

Version History:
 

Since the beginning

Element:

<pos>

API Class:

PosAnnotation

Required Attributes:
 
  • set – The set of the element, ideally a URI linking to a set definition (see Set Definitions (Vocabulary)) or otherwise a uniquely identifying string. The set must be referred to also in the Annotation Declarations for this annotation type.
  • class – The class of the annotation, i.e. the annotation tag in the vocabulary defined by set.
Optional Attributes:
 
  • xml:id – The ID of the element; this has to be unique in the entire document or collection of documents (corpus). All identifiers in FoLiA are of the XML NCName datatype, which roughly means it is a unique string that has to start with a letter (not a number or symbol), may contain numbers, but may never contain colons or spaces. FoLiA does not define any naming convention for IDs.
  • set – The set of the element, ideally a URI linking to a set definition (see Set Definitions (Vocabulary)) or otherwise a uniquely identifying string. The set must be referred to also in the Annotation Declarations for this annotation type.
  • class – The class of the annotation, i.e. the annotation tag in the vocabulary defined by set.
  • processor – This refers to the ID of a processor in the Provenance Data. The processor in turn defines exactly who or what was the annotator of the annotation.
  • annotator – This is an older alternative to the processor attribute, without support for full provenance. The annotator attribute simply refers to the name or ID of the system or human annotator that made the annotation.
  • annotatortype – This is an older alternative to the processor attribute, without support for full provenance. It is used together with annotator and specifies the type of the annotator, either manual for human annotators or auto for automated systems.
  • confidence – A floating point value between zero and one; expresses the confidence the annotator places in the annotation.
  • datetime – The date and time when this annotation was recorded. The format is YYYY-MM-DDThh:mm:ss (note the literal T in the middle to separate date from time), as per the XSD Datetime data type.
  • n – A number in a sequence, corresponding to a number in the original document, for example chapter numbers, section numbers, list item numbers. This does not have to be an actual number; other sequence identifiers are also possible (think alphanumeric characters or roman numerals).
  • textclass – Refers to the text class this annotation is based on. This is an advanced attribute; if not specified, it defaults to current. See Text class attribute (advanced).
  • src – Points to a file or full URL of a sound or video file. This attribute is inheritable.
  • begintime – A timestamp in HH:MM:SS.MMM format, indicating the begin time of the speech. If a sound clip is specified (src), the timestamp refers to a location in the sound clip.
  • endtime – A timestamp in HH:MM:SS.MMM format, indicating the end time of the speech. If a sound clip is specified (src), the timestamp refers to a location in the sound clip.
  • speaker – A string identifying the speaker. This attribute is inheritable. Multiple speakers are not allowed; simply do not specify a speaker on a certain level if you are unable to link the speech to a specific (single) speaker.
Accepted Data:

<comment> (Comment Annotation), <desc> (Description Annotation), <metric> (Metric Annotation)

Valid Context:
Feature subsets (extra attributes):
 
  • head
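
Several of the attributes listed above carry value constraints (an identifier shape for xml:id, a range for confidence, a fixed format for datetime). The following is a minimal Python sketch of how such constraints might be checked before writing a pos element. The function name check_pos_attributes and the ASCII-only identifier pattern are assumptions made for illustration; they are not part of FoLiA or of any FoLiA library, and the real NCName datatype admits more characters than this pattern does.

```python
import re
from datetime import datetime

# Simplified, ASCII-only approximation of an XML NCName (assumption: the real
# datatype also admits many non-ASCII letters and a few other characters).
NCNAME_ASCII = re.compile(r"^[A-Za-z_][A-Za-z0-9._-]*$")


def check_pos_attributes(attrs):
    """Return a list of problems found in a dict of <pos> attribute values."""
    problems = []
    if "class" not in attrs:
        problems.append("missing class")
    if "xml:id" in attrs and not NCNAME_ASCII.match(attrs["xml:id"]):
        problems.append("xml:id is not a valid identifier")
    if "confidence" in attrs:
        try:
            value = float(attrs["confidence"])
        except ValueError:
            problems.append("confidence is not a number")
        else:
            if not 0.0 <= value <= 1.0:
                problems.append("confidence is outside [0, 1]")
    if "datetime" in attrs:
        try:
            # Note: strptime tolerates missing zero padding, so this is a
            # looser check than the XSD datatype.
            datetime.strptime(attrs["datetime"], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            problems.append("datetime is not YYYY-MM-DDThh:mm:ss")
    if attrs.get("annotatortype", "auto") not in ("auto", "manual"):
        problems.append("annotatortype must be auto or manual")
    return problems


print(check_pos_attributes({"class": "N", "confidence": "1.5"}))
```

A schema-based validator remains the authoritative check; a helper like this is only a quick sanity test on attribute values.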

Explanation & Examples

Part-of-speech annotation allows the annotation of lexical categories using the pos element. The following example shows a simple part-of-speech annotation. In this example, we declare PoS annotation to use the tagset from the Brown corpus (although we do not have an actual set definition for it).

<?xml version="1.0" encoding="utf-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0" xml:id="example">
  <metadata>
      <annotations>
          <text-annotation />
          <token-annotation set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-eng.foliaset.ttl">
			 <annotator processor="p1" />
		  </token-annotation>
          <sentence-annotation>
			 <annotator processor="p1" />
          </sentence-annotation>
          <paragraph-annotation>
			 <annotator processor="p1" />
          </paragraph-annotation>
          <pos-annotation set="brown"> <!-- This is an ad-hoc set declaration as it is no URL and therefore not really defined -->
			 <annotator processor="p1" />
          </pos-annotation>
      </annotations>
      <provenance>
         <processor xml:id="p1" name="proycon" type="manual" />
      </provenance>
  </metadata>
  <text xml:id="example.text">
    <s xml:id="example.p.1.s.2">
     <w xml:id="example.p.1.s.2.w.1" class="WORD">
        <t>This</t>
        <pos class="DT"/>
     </w>
     <w xml:id="example.p.1.s.2.w.2" class="WORD">
        <t>is</t>
        <pos class="VBZ"/>
     </w>
     <w xml:id="example.p.1.s.2.w.3" class="WORD">
        <t>an</t>
        <pos class="AT"/>
     </w>
     <w xml:id="example.p.1.s.2.w.4" class="WORD" space="no">
        <t>example</t>
        <pos class="NN"/>
     </w>
     <w xml:id="example.p.1.s.2.w.5" class="PUNCTUATION">
        <t>.</t>
        <pos class="."/>
     </w>
    </s>
  </text>
</FoLiA>
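
To illustrate how the pos elements in a document like the one above can be read back, here is a minimal Python sketch that uses only the standard library's xml.etree.ElementTree rather than a dedicated FoLiA library. The SAMPLE string is a trimmed fragment of the example (metadata omitted, two words only), and the helper name tagged_words is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# FoLiA elements live in this XML namespace (taken from the example document).
FOLIA_NS = {"f": "http://ilk.uvt.nl/folia"}

# Trimmed fragment of the example above; the metadata block is left out.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.p.1.s.2">
      <w xml:id="example.p.1.s.2.w.1" class="WORD"><t>This</t><pos class="DT"/></w>
      <w xml:id="example.p.1.s.2.w.2" class="WORD"><t>is</t><pos class="VBZ"/></w>
    </s>
  </text>
</FoLiA>"""


def tagged_words(root):
    """Yield (text, pos class) for every <w> that has both <t> and <pos>."""
    for w in root.iterfind(".//f:w", FOLIA_NS):
        t = w.find("f:t", FOLIA_NS)
        pos = w.find("f:pos", FOLIA_NS)
        if t is not None and pos is not None:
            yield t.text, pos.get("class")


root = ET.fromstring(SAMPLE)
print(list(tagged_words(root)))  # [('This', 'DT'), ('is', 'VBZ')]
```

The sketch ignores the set attribute and assumes a single declared PoS set, as in the example; a document with several PoS sets would need to filter on set as well.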

Lexical annotation can take more complex forms than assignment of a single part-of-speech tag. There may for example be numerous features associated with the part-of-speech tag, such as gender, number, case, tense, mood, etc. FoLiA introduces a special paradigm for dealing with such features. This is described in Features; please ensure you are familiar with it before reading the remainder of this section.

Two scenarios can be envisioned: one in which the class of the pos element encodes all features, and one in which the class is only the foundation that the features expand upon. Which one is used is entirely up to the defined set.

Option one:

<w xml:id="example.p.1.s.1.w.2">
    <t>boot</t>
    <pos head="N" class="N(singular)">
        <feat subset="number" class="singular" />
        <feat subset="gender" class="none" />
        <feat subset="case" class="none" />
    </pos>
</w>

In FoLiA, the head attribute is a predefined subset for PoS annotation, i.e. the subset is commonly used and has clear semantics; however, it still needs to be defined in the set definition. Predefined subsets such as this one can be written as XML attributes.

Option two:

<w xml:id="example.p.1.s.1.w.2">
    <t>boot</t>
    <pos class="N">
        <feat subset="number" class="singular" />
        <feat subset="gender" class="none" />
        <feat subset="case" class="none" />
    </pos>
</w>
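
Both options expose the same feature information to a reader of the document. As a rough Python sketch (standard library only, no FoLiA library), the features of a pos element can be collected into a dictionary keyed by subset, with the head attribute treated as one more subset. The function name pos_features is hypothetical, and the fragments are parsed without the FoLiA namespace to keep the example short.

```python
import xml.etree.ElementTree as ET


def pos_features(pos):
    """Return a {subset: class} dict for one <pos> element (no namespace)."""
    features = {feat.get("subset"): feat.get("class") for feat in pos.iterfind("feat")}
    # A predefined subset such as head may appear as an XML attribute
    # instead of a <feat> child.
    if pos.get("head") is not None:
        features["head"] = pos.get("head")
    return features


option_one = ET.fromstring(
    '<pos head="N" class="N(singular)">'
    '<feat subset="number" class="singular" />'
    '<feat subset="gender" class="none" />'
    '<feat subset="case" class="none" />'
    "</pos>"
)
option_two = ET.fromstring(
    '<pos class="N">'
    '<feat subset="number" class="singular" />'
    '<feat subset="gender" class="none" />'
    '<feat subset="case" class="none" />'
    "</pos>"
)

print(option_one.get("class"), pos_features(option_one))
print(option_two.get("class"), pos_features(option_two))
```

Only the class value and the presence of head differ between the two results; the number, gender and case features are identical.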

The last example demonstrates a full FoLiA document with part-of-speech tagging and features:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="example.deep" generator="libfolia-v1.5" version="2.0.0">
  <metadata type="native">
    <annotations>
      <text-annotation>
			 <annotator processor="p1" />
      </text-annotation>
      <sentence-annotation>
			 <annotator processor="p1" />
      </sentence-annotation>
      <token-annotation set="https://raw.githubusercontent.com/LanguageMachines/uctodata/folia1.4/setdefinitions/tokconfig-nld.foliaset.ttl">
			 <annotator processor="p2" />
      </token-annotation>
      <pos-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn">
			 <annotator processor="p3.1" />
      </pos-annotation>
      <lemma-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mblem-nl">
			 <annotator processor="p3.2" />
      </lemma-annotation>
    </annotations>
    <provenance>
       <processor xml:id="p1" name="proycon" type="manual" />
       <processor xml:id="p2" name="ucto" version="0.14" />
       <processor xml:id="p3" name="frog" version="0.16" begindatetime="2016-11-15T15:12:00">
           <processor xml:id="p3.0" name="libfolia" version="1.14" type="generator" />
           <processor xml:id="p3.1" name="mbpos" version="1.0" />
           <processor xml:id="p3.2" name="mblem" version="1.1" />
       </processor>
    </provenance>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="example.deep.text">
      <s xml:id="example.deep.p.1.s.1">
        <t>De Russen kennen Nova Zembla sinds de 11e of 12e eeuw, toen handelaars van Novgorod het eiland al aandeden.</t>
        <w xml:id="example.deep.p.1.s.1.w.1" class="WORD">
          <t>De</t>
          <pos class="LID(bep,stan,rest)" confidence="0.779762" head="LID">
            <feat class="bep" subset="lwtype"/>
            <feat class="stan" subset="naamval"/>
            <feat class="rest" subset="npagr"/>
          </pos>
          <lemma class="de"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.2" class="WORD">
          <t>Russen</t>
          <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
            <feat class="deeleigen" subset="spectype"/>
          </pos>
          <lemma class="Russen"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.3" class="WORD">
          <t>kennen</t>
          <pos class="WW(pv,tgw,mv)" confidence="0.833333" head="WW">
            <feat class="pv" subset="wvorm"/>
            <feat class="tgw" subset="pvtijd"/>
            <feat class="mv" subset="pvagr"/>
          </pos>
          <lemma class="kennen"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.4" class="WORD">
          <t>Nova</t>
          <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
            <feat class="deeleigen" subset="spectype"/>
          </pos>
          <lemma class="Nova"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.5" class="WORD">
          <t>Zembla</t>
          <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
            <feat class="deeleigen" subset="spectype"/>
          </pos>
          <lemma class="Zembla"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.6" class="WORD">
          <t>sinds</t>
          <pos class="VZ(init)" confidence="0.999078" head="VZ">
            <feat class="init" subset="vztype"/>
          </pos>
          <lemma class="sinds"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.7" class="WORD">
          <t>de</t>
          <pos class="LID(bep,stan,rest)" confidence="0.981886" head="LID">
            <feat class="bep" subset="lwtype"/>
            <feat class="stan" subset="naamval"/>
            <feat class="rest" subset="npagr"/>
          </pos>
          <lemma class="de"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.8" class="NUMBER-ORDINAL">
          <t>11e</t>
          <pos class="TW(rang,prenom,stan)" confidence="0.990632" head="TW">
            <feat class="rang" subset="numtype"/>
            <feat class="prenom" subset="positie"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="11"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.9" class="WORD">
          <t>of</t>
          <pos class="VG(neven)" confidence="0.855677" head="VG">
            <feat class="neven" subset="conjtype"/>
          </pos>
          <lemma class="of"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.10" class="NUMBER-ORDINAL">
          <t>12e</t>
          <pos class="TW(rang,prenom,stan)" confidence="0.990632" head="TW">
            <feat class="rang" subset="numtype"/>
            <feat class="prenom" subset="positie"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="12"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.11" class="WORD" space="no">
          <t>eeuw</t>
          <pos class="N(soort,ev,basis,zijd,stan)" confidence="0.999633" head="N">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="zijd" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="eeuw"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.12" class="PUNCTUATION">
          <t>,</t>
          <pos class="LET()" confidence="1" head="LET"/>
          <lemma class=","/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.13" class="WORD">
          <t>toen</t>
          <pos class="VG(onder)" confidence="0.571429" head="VG">
            <feat class="onder" subset="conjtype"/>
          </pos>
          <lemma class="toen"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.14" class="WORD">
          <t>handelaars</t>
          <pos class="N(soort,mv,basis)" confidence="0.99944" head="N">
            <feat class="soort" subset="ntype"/>
            <feat class="mv" subset="getal"/>
            <feat class="basis" subset="graad"/>
          </pos>
          <lemma class="handelaar"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.15" class="WORD">
          <t>van</t>
          <pos class="VZ(init)" confidence="0.999469" head="VZ">
            <feat class="init" subset="vztype"/>
          </pos>
          <lemma class="van"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.16" class="WORD">
          <t>Novgorod</t>
          <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
            <feat class="deeleigen" subset="spectype"/>
          </pos>
          <lemma class="Novgorod"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.17" class="WORD">
          <t>het</t>
          <pos class="LID(bep,stan,evon)" confidence="0.996855" head="LID">
            <feat class="bep" subset="lwtype"/>
            <feat class="stan" subset="naamval"/>
            <feat class="evon" subset="npagr"/>
          </pos>
          <lemma class="het"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.18" class="WORD">
          <t>eiland</t>
          <pos class="N(soort,ev,basis,onz,stan)" confidence="0.996804" head="N">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="onz" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="eiland"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.19" class="WORD">
          <t>al</t>
          <pos class="BW()" confidence="0.90383" head="BW"/>
          <lemma class="al"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.20" class="WORD" space="no">
          <t>aandeden</t>
          <pos class="WW(pv,verl,mv)" confidence="0.999559" head="WW">
            <feat class="pv" subset="wvorm"/>
            <feat class="verl" subset="pvtijd"/>
            <feat class="mv" subset="pvagr"/>
          </pos>
          <lemma class="aandoen"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.21" class="PUNCTUATION">
          <t>.</t>
          <pos class="LET()" confidence="1" head="LET"/>
          <lemma class="."/>
        </w>
      </s>
  </text>
</FoLiA>
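
The confidence and head attributes in the document above lend themselves to simple filtering. The following Python sketch (standard library only) lists the words whose PoS tag falls below a confidence threshold. The SAMPLE string is a heavily trimmed fragment of the sentence above, and both the function name uncertain_tags and the threshold of 0.8 are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}

# Three words taken from the example sentence, with features and lemmas omitted.
SAMPLE = """<s xmlns="http://ilk.uvt.nl/folia">
  <w class="WORD"><t>De</t><pos class="LID(bep,stan,rest)" confidence="0.779762" head="LID"/></w>
  <w class="WORD"><t>Russen</t><pos class="SPEC(deeleigen)" confidence="1" head="SPEC"/></w>
  <w class="WORD"><t>toen</t><pos class="VG(onder)" confidence="0.571429" head="VG"/></w>
</s>"""


def uncertain_tags(root, threshold):
    """Return (text, head, confidence) for words tagged below the threshold."""
    result = []
    for w in root.iterfind(".//f:w", NS):
        pos = w.find("f:pos", NS)
        # Skip words without a PoS tag or without a recorded confidence.
        if pos is None or pos.get("confidence") is None:
            continue
        confidence = float(pos.get("confidence"))
        if confidence < threshold:
            t = w.find("f:t", NS)
            text = t.text if t is not None else None
            result.append((text, pos.get("head"), confidence))
    return result


for text, head, confidence in uncertain_tags(ET.fromstring(SAMPLE), 0.8):
    print(text, head, confidence)
```

Words without a confidence attribute are skipped rather than treated as certain or uncertain; whether that is the right default depends on the application.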