Provenance Data

It is often desirable to know exactly which tools (in which versions, and even with which parameters) were invoked, and in which order, to produce a FoLiA document. This is called provenance data. In the metadata section, right after the Annotation Declarations, FoLiA allows for a <provenance> block containing this information. It is not mandatory, but it is strongly recommended.

The <provenance> block defines one or more processors. Processors are processes or entities that have processed the document and often performed some kind of manipulation of it, such as adding annotations. The processors are listed in the order they were invoked. The Annotation Declarations in turn link to these processors to tie a particular annotation type and set to one or more processors.

A <processor> carries the following attributes:
  • xml:id (mandatory) – The ID of the processor. This is how it is referred to from the <annotator processor=".." /> element in the Annotation Declarations and from the processor attribute (part of the common FoLiA attributes) on individual annotations.
  • name (mandatory) – The name identifies the actual tool or human annotator.
  • type – The type of the processor, one of:
    • auto - (default) - The processor is an automated tool that provided annotations
    • manual - The processor refers to a manual annotator
    • generator - The processor indicates the FoLiA library used by the parent and sibling processors (unless sibling processors specify another generator in their scope)
    • datasource - The processor is a reference to a particular data source that was used by the parent processor. If there is no parent processor and it is instead directly part of the provenance chain, often as the very first element, then you can interpret it as the original data source from which the document sprang.
  • version (optional but strongly recommended) – The version of the processor, i.e. of the tool.
  • document_version (optional) – The version of the document. This can be any label the user desires to indicate a version of the document, so the format is not predetermined and need not be numeric.
  • command (optional) – The exact command that was run
  • host (optional) – The host on which the processor ran. This identifies individual systems on a network or cluster.
  • user (optional) – The user/executor who ran the processor. This identifies who ran an automated process, not who the annotator was.
  • src (optional) – The source of the processor: a URL to the tool itself if the software is an online tool, or to its website or source code repository if not. If the processor is of the datasource type, then this attribute should point to that data set or to a website describing it. The format attribute can be used to further specify the type of source.
  • format (optional) – MIME type describing the kind of resource pointed to by src. Use text/html for websites. Especially useful for processors of type datasource.
  • folia_version (optional) – The FoLiA version that was written.
  • begindatetime (optional) – Specifies when the process started. The format is YYYY-MM-DDThh:mm:ss (note the literal T in the middle to separate date from time), as per the XSD dateTime data type.
  • enddatetime (optional) – Specifies when the process finished. The format is YYYY-MM-DDThh:mm:ss (note the literal T in the middle to separate date from time), as per the XSD dateTime data type.
  • resourcelink (optional) – The URI of any RDF resource describing this processor. This allows linking to the external world of linked open data from the provenance chain in FoLiA.
  • Additional custom metadata is allowed in the form of <meta> elements (just like with FoLiA’s native metadata) inside the scope of a processor. FoLiA does not define the semantics of any such metadata; they are tool- or application-specific and could, for instance, be used to specify the tool parameters used.
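
As a rough illustration of the attributes above, the following Python sketch builds a <processor> element using only the standard library. The helper name make_processor is hypothetical and not part of any FoLiA library, and for brevity the element is created without the FoLiA XML namespace. It mainly shows that datetime.isoformat(timespec="seconds") yields the YYYY-MM-DDThh:mm:ss format described for begindatetime and enddatetime.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace URI behind the reserved "xml" prefix (needed to set xml:id).
XML_NS = "http://www.w3.org/XML/1998/namespace"


def make_processor(proc_id, name, begin, end, **attribs):
    """Hypothetical helper: build a <processor> element (no FoLiA namespace)."""
    el = ET.Element("processor")
    el.set(f"{{{XML_NS}}}id", proc_id)  # serialised as xml:id
    el.set("name", name)
    # isoformat(timespec="seconds") gives YYYY-MM-DDThh:mm:ss for naive datetimes
    el.set("begindatetime", begin.isoformat(timespec="seconds"))
    el.set("enddatetime", end.isoformat(timespec="seconds"))
    for key, value in attribs.items():  # optional attributes such as version
        el.set(key, value)
    return el


proc = make_processor(
    "p0",
    "ucto",
    datetime(2018, 9, 12, 0, 0, 0),
    datetime(2018, 9, 12, 0, 0, 10),
    version="0.15",
    command="ucto -Lnld",
)
print(ET.tostring(proc, encoding="unicode"))
```

A real tool would normally rely on a FoLiA library to write this block; the sketch only makes the attribute formats concrete.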

First consider a fairly minimal example. Note that we include the Annotation Declarations as well, with a link to the processor:

<annotations>
  <token-annotation set="tokconfig-nl">
      <annotator processor="p0" />
  </token-annotation>
</annotations>
<provenance>
    <processor xml:id="p0" name="ucto" version="0.15" folia_version="2.0" command="ucto -Lnld" host="mhysa" user="proycon" begindatetime="2018-09-12T00:00:00" enddatetime="2018-09-12T00:00:10" document_version="1" />
</provenance>

Individual annotations in the document can refer to this processor using the processor attribute:

<w class="PUNCTUATION" processor="p0">
 <t>.</t>
</w>

If there is only one <annotator> defined for a certain annotation type and set in the Annotation Declarations, then it is the default and no processor attribute is necessary.
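
The defaulting rule just described can be sketched in a few lines of Python. The function resolve_processor is a hypothetical name used for illustration only, and the XML fragments are simplified (no FoLiA namespace). An explicit processor attribute wins; otherwise the single declared annotator, if there is exactly one, is used.

```python
import xml.etree.ElementTree as ET

# Simplified declarations with a single annotator for token annotation.
DECLARATIONS = """
<annotations>
  <token-annotation set="tokconfig-nl">
    <annotator processor="p0" />
  </token-annotation>
</annotations>
"""


def resolve_processor(annotation, declaration):
    """Return the processor ID that applies to an annotation, or None."""
    explicit = annotation.get("processor")
    if explicit is not None:
        return explicit  # an explicit reference always takes precedence
    annotators = declaration.findall("annotator")
    if len(annotators) == 1:
        return annotators[0].get("processor")  # the sole annotator is the default
    return None  # ambiguous or undeclared: no default can be derived


declaration = ET.fromstring(DECLARATIONS).find("token-annotation")
word = ET.fromstring('<w class="PUNCTUATION"><t>.</t></w>')
print(resolve_processor(word, declaration))
```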

One of the powerful features of processors is that they can be nested. This creates subprocessors and captures situations where one processor invokes others as part of its operation. Subprocessors can also provide extra information about their parent processor: they can, for example, state which FoLiA library was used (type="generator") or which data sources were used by the processor (type="datasource"). Moreover, arbitrary metadata can be added to any processor in the form of <meta> elements (just like with FoLiA’s native metadata). FoLiA does not define the semantics of any such metadata; they are tool- or application-specific and could, for instance, be used to specify the tool parameters used. Note that whereas the order of the processors in the <provenance> block is strictly significant, the order of subprocessors is not.

With all this in mind, we can expand our previous example:

<provenance>
    <processor xml:id="p0" name="ucto" version="0.15" folia_version="2.0" command="ucto -Lnld" host="mhysa" user="proycon" begindatetime="2018-09-12T00:00:00" enddatetime="2018-09-12T00:00:10" document_version="1">
        <meta id="config">tokconfig-nld</meta>
        <meta id="language">nld</meta>
        <processor xml:id="p0.1" name="libfolia" version="2.0" folia_version="2.0" type="generator" />
        <processor xml:id="p0.2" name="tokconfig-nld" version="2.0" folia_version="2.0" type="datasource" />
    </processor>
</provenance>

Or consider the following example, in which the tool is an annotation environment where human annotators edit a FoLiA document and add or edit annotations:

<provenance>
    <processor xml:id="p2" name="flat" version="0.8" folia_version="2.0" host="flat.science.ru.nl" begindatetime="2018-09-12T00:10:00" enddatetime="2018-09-12T00:20:00" document_version="3">
        <processor xml:id="p2.0" name="foliapy" version="2.0" folia_version="2.0" type="generator" />
        <processor xml:id="p2.1" name="proycon" type="manual" />
        <processor xml:id="p2.2" name="ko" type="manual" />
    </processor>
</provenance>
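
To make the nesting concrete, here is a small Python sketch that walks a provenance block recursively and prints each processor with its depth. It uses only the standard library; the embedded XML is a shortened, namespace-free variant of the example above, and the function name walk is an arbitrary choice. A missing type attribute is treated as auto, in line with the default stated earlier.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

PROVENANCE = """
<provenance>
    <processor xml:id="p2" name="flat" version="0.8">
        <processor xml:id="p2.0" name="foliapy" version="2.0" type="generator" />
        <processor xml:id="p2.1" name="proycon" type="manual" />
        <processor xml:id="p2.2" name="ko" type="manual" />
    </processor>
</provenance>
"""


def walk(processor, depth=0):
    """Yield (depth, id, name, type) for a processor and all its subprocessors."""
    yield depth, processor.get(XML_ID), processor.get("name"), processor.get("type", "auto")
    for child in processor.findall("processor"):  # direct children only
        yield from walk(child, depth + 1)


root = ET.fromstring(PROVENANCE)
rows = [row for top in root.findall("processor") for row in walk(top)]
for depth, proc_id, name, proc_type in rows:
    print("  " * depth + f"{proc_id} {name} ({proc_type})")
```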

From the Annotation Declarations, we can then also refer directly to subprocessors. Moreover, a processor can be referred to from multiple annotation types/sets:

<annotations>
  ...
  <pos-annotation set="...">
      <annotator processor="p2.1" />
      <annotator processor="p2.2" />
  </pos-annotation>
  <lemma-annotation set="...">
      <annotator processor="p2.1" />
  </lemma-annotation>
  ...
</annotations>
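
Since declarations refer to processors by ID, a simple consistency check is to verify that every <annotator> reference resolves to a declared processor or subprocessor. The Python sketch below shows one way to do this, under the assumption of a simplified, namespace-free metadata fragment. The set names example-pos-set and example-lemma-set are placeholders, and dangling_references is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified metadata fragment; the set names are placeholders.
METADATA = """
<metadata>
  <annotations>
    <pos-annotation set="example-pos-set">
      <annotator processor="p2.1" />
      <annotator processor="p2.2" />
    </pos-annotation>
    <lemma-annotation set="example-lemma-set">
      <annotator processor="p2.1" />
    </lemma-annotation>
  </annotations>
  <provenance>
    <processor xml:id="p2" name="flat">
      <processor xml:id="p2.1" name="proycon" type="manual" />
      <processor xml:id="p2.2" name="ko" type="manual" />
    </processor>
  </provenance>
</metadata>
"""


def dangling_references(metadata):
    """Return annotator references that match no processor ID, sorted."""
    # iter() descends through all levels, so subprocessors are included.
    declared = {p.get(XML_ID) for p in metadata.iter("processor")}
    referenced = {a.get("processor") for a in metadata.iter("annotator")}
    return sorted(referenced - declared)


metadata = ET.fromstring(METADATA)
print(dangling_references(metadata))
```

An empty list means that all references resolve; any IDs listed point to processors missing from the provenance chain.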

Of course, providing all this is not mandatory and requires the specific tool to actually supply this provenance data. It is still possible to have FoLiA documents without provenance data at all.

The following example provides a small but complete FoLiA document with provenance data:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="untitled" generator="manual" version="2.0.0">
  <metadata type="native">
    <annotations>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <paragraph-annotation />
      <sentence-annotation />
      <token-annotation set="tokconfig-nl">
          <annotator processor="p0" />
      </token-annotation>
      <pos-annotation set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn">
          <!-- There are multiple annotators, this means that each pos annotation should explicitly refer to one of them using the @processor attribute -->
          <annotator processor="p1.1" />
          <annotator processor="p2.1" />
          <annotator processor="p2.2" />
      </pos-annotation>
      <lemma-annotation set="http://ilk.uvt.nl/folia/sets/frog-mblem-nl">
          <!-- There is only one annotator so this will be the default, no need to explicitly refer to it from lemma annotations using the @processor attribute -->
          <annotator processor="p1.2" />
      </lemma-annotation>
    </annotations>
    <provenance>
        <processor xml:id="p0" name="ucto" version="0.15" folia_version="2.0" command="ucto -Lnld" host="mhysa" user="proycon" src="https://github.com/LanguageMachines/ucto" begindatetime="2018-09-12T00:00:00" enddatetime="2018-09-12T00:00:00" document_version="1">
            <!-- We can add arbitrary meta fields to any processor, they are not defined by FoLiA but application-specific  -->
            <meta id="config">tokconfig-nld</meta>
            <meta id="language">nld</meta>
            <processor xml:id="p0.1" name="libfolia" version="2.0" folia_version="2.0" type="generator" />
        </processor>
        <processor xml:id="p1" name="frog" version="0.16" folia_version="2.0" command="frog --skip=pn" host="mhysa" user="proycon" src="https://github.com/LanguageMachines/frog" begindatetime="2018-09-12T00:01:00" enddatetime="2018-09-12T00:02:00" document_version="2">
            <processor xml:id="p1.0" name="libfolia" version="2.0" folia_version="2.0" type="generator" />
            <processor xml:id="p1.1" name="mbpos" version="0.16">
                  <processor xml:id="p1.1.1" type="datasource" name="CGN Corpus" version="unknown" />
                  <processor xml:id="p1.1.2" type="datasource" name="WOTAN Corpus" version="unknown" />
                  <processor xml:id="p1.1.3" type="datasource" name="DCOI Corpus" version="unknown" />
                  <processor xml:id="p1.1.4" type="datasource" name="Lassy Klein Corpus" version="unknown" />
            </processor>
            <processor xml:id="p1.2" name="mblem" />
        </processor>
        <processor xml:id="p2" name="flat" version="0.8" folia_version="2.0" host="flat.science.ru.nl" src="https://flat.science.ru.nl" begindatetime="2018-09-12T00:10:00" enddatetime="2018-09-12T00:20:00" document_version="3">
            <processor xml:id="p2.0" name="foliapy" version="2.0" folia_version="2.0" type="generator" src="https://github.com/proycon/foliapy" />
            <processor xml:id="p2.1" name="proycon" type="manual" />
            <processor xml:id="p2.2" name="ko" type="manual" />
        </processor>
    </provenance>
  </metadata>
  <text xml:id="untitled.text">
    <p xml:id="untitled.p.1">
      <s xml:id="untitled.p.1.s.1">
        <t>De belastingdienst doet aangifte tegen frauderende mensen.</t>
        <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
          <t>De</t>
          <pos class="LID(bep,stan,rest)" confidence="0.999701" head="LID" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1">
            <feat class="bep" subset="lwtype"/>
            <feat class="stan" subset="naamval"/>
            <feat class="rest" subset="npagr"/>
          </pos>
          <lemma class="de"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.2" class="WORD">
          <t>belastingdienst</t>
          <pos class="N(soort,ev,basis,zijd,stan)" confidence="0.998836" head="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p2.1">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="zijd" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="belastingdienst"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.3" class="WORD">
          <t>doet</t>
          <pos class="WW(pv,tgw,met-t)" confidence="0.999262" head="WW" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1">
            <feat class="pv" subset="wvorm"/>
            <feat class="tgw" subset="pvtijd"/>
            <feat class="met-t" subset="pvagr"/>
          </pos>
          <lemma class="doen"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.4" class="WORD">
          <t>aangifte</t>
          <pos class="N(soort,ev,basis,zijd,stan)" confidence="0.998701" head="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p2.2">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="zijd" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="aangifte"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.5" class="WORD">
          <t>tegen</t>
          <pos class="VZ(init)" confidence="0.854093" head="VZ" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1">
            <feat class="init" subset="vztype"/>
          </pos>
          <lemma class="tegen"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.6" class="WORD">
          <t>frauderende</t>
          <pos class="WW(od,prenom,met-e)" confidence="0.96" head="WW" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1">
            <feat class="od" subset="wvorm"/>
            <feat class="prenom" subset="positie"/>
            <feat class="met-e" subset="buiging"/>
          </pos>
          <lemma class="frauderen"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.7" class="WORD" space="no">
          <t>mensen</t>
          <pos class="N(soort,mv,basis)" confidence="0.999865" head="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1">
            <feat class="soort" subset="ntype"/>
            <feat class="mv" subset="getal"/>
            <feat class="basis" subset="graad"/>
          </pos>
          <lemma class="mens"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.8" class="PUNCTUATION">
          <t>.</t>
          <pos class="LET()" confidence="1" head="LET" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1"/>
          <lemma class="."/>
        </w>
      </s>
    </p>
  </text>
</FoLiA>
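
Real FoLiA documents such as the one above live in the FoLiA XML namespace, so generic XML tools need to be namespace-aware when reading them. As a hedged sketch, the following Python snippet counts part-of-speech annotations per processor using the standard library. The embedded document is a heavily trimmed fragment for illustration only (it omits the metadata and is not a complete, valid FoLiA document), and the prefix folia is an arbitrary local choice.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Map a local prefix to the FoLiA namespace used by the document.
NS = {"folia": "http://ilk.uvt.nl/folia"}

# Trimmed illustration only: not a complete, valid FoLiA document.
DOC = """
<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="untitled" version="2.0.0">
  <text xml:id="untitled.text">
    <s xml:id="untitled.s.1">
      <w xml:id="untitled.s.1.w.1" class="WORD">
        <t>De</t>
        <pos class="LID(bep,stan,rest)" processor="p1.1" />
      </w>
      <w xml:id="untitled.s.1.w.2" class="WORD">
        <t>belastingdienst</t>
        <pos class="N(soort,ev,basis,zijd,stan)" processor="p2.1" />
      </w>
      <w xml:id="untitled.s.1.w.3" class="WORD">
        <t>doet</t>
        <pos class="WW(pv,tgw,met-t)" processor="p1.1" />
      </w>
    </s>
  </text>
</FoLiA>
"""

root = ET.fromstring(DOC)
# Count <pos> elements by the processor they explicitly refer to.
counts = Counter(pos.get("processor") for pos in root.iterfind(".//folia:pos", NS))
for processor_id, total in sorted(counts.items()):
    print(processor_id, total)
```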

And another more real-life example:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="example.deep" generator="libfolia-v1.5" version="2.0.0">
  <metadata type="native">
    <annotations>
      <text-annotation>
          <annotator processor="p1" />
      </text-annotation>
      <sentence-annotation>
          <annotator processor="p1" />
      </sentence-annotation>
      <token-annotation set="https://raw.githubusercontent.com/LanguageMachines/uctodata/folia1.4/setdefinitions/tokconfig-nld.foliaset.ttl">
          <annotator processor="p2" />
      </token-annotation>
      <pos-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn">
          <annotator processor="p3.1" />
      </pos-annotation>
      <lemma-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mblem-nl">
          <annotator processor="p3.2" />
      </lemma-annotation>
    </annotations>
    <provenance>
       <processor xml:id="p1" name="proycon" type="manual" />
       <processor xml:id="p2" name="ucto" version="0.14" />
       <processor xml:id="p3" name="frog" version="0.16" begindatetime="2016-11-15T15:12:00">
           <processor xml:id="p3.0" name="libfolia" version="1.14" type="generator" />
           <processor xml:id="p3.1" name="mbpos" version="1.0" />
           <processor xml:id="p3.2" name="mblem" version="1.1" />
       </processor>
    </provenance>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="example.deep.text">
      <s xml:id="example.deep.p.1.s.1">
        <t>De Russen kennen Nova Zembla sinds de 11e of 12e eeuw, toen handelaars van Novgorod het eiland al aandeden.</t>
        <w xml:id="example.deep.p.1.s.1.w.1" class="WORD">
          <t>De</t>
          <pos class="LID(bep,stan,rest)" confidence="0.779762" head="LID">
            <feat class="bep" subset="lwtype"/>
            <feat class="stan" subset="naamval"/>
            <feat class="rest" subset="npagr"/>
          </pos>
          <lemma class="de"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.2" class="WORD">
          <t>Russen</t>
          <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
            <feat class="deeleigen" subset="spectype"/>
          </pos>
          <lemma class="Russen"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.3" class="WORD">
          <t>kennen</t>
          <pos class="WW(pv,tgw,mv)" confidence="0.833333" head="WW">
            <feat class="pv" subset="wvorm"/>
            <feat class="tgw" subset="pvtijd"/>
            <feat class="mv" subset="pvagr"/>
          </pos>
          <lemma class="kennen"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.4" class="WORD">
          <t>Nova</t>
          <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
            <feat class="deeleigen" subset="spectype"/>
          </pos>
          <lemma class="Nova"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.5" class="WORD">
          <t>Zembla</t>
          <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
            <feat class="deeleigen" subset="spectype"/>
          </pos>
          <lemma class="Zembla"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.6" class="WORD">
          <t>sinds</t>
          <pos class="VZ(init)" confidence="0.999078" head="VZ">
            <feat class="init" subset="vztype"/>
          </pos>
          <lemma class="sinds"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.7" class="WORD">
          <t>de</t>
          <pos class="LID(bep,stan,rest)" confidence="0.981886" head="LID">
            <feat class="bep" subset="lwtype"/>
            <feat class="stan" subset="naamval"/>
            <feat class="rest" subset="npagr"/>
          </pos>
          <lemma class="de"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.8" class="NUMBER-ORDINAL">
          <t>11e</t>
          <pos class="TW(rang,prenom,stan)" confidence="0.990632" head="TW">
            <feat class="rang" subset="numtype"/>
            <feat class="prenom" subset="positie"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="11"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.9" class="WORD">
          <t>of</t>
          <pos class="VG(neven)" confidence="0.855677" head="VG">
            <feat class="neven" subset="conjtype"/>
          </pos>
          <lemma class="of"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.10" class="NUMBER-ORDINAL">
          <t>12e</t>
          <pos class="TW(rang,prenom,stan)" confidence="0.990632" head="TW">
            <feat class="rang" subset="numtype"/>
            <feat class="prenom" subset="positie"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="12"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.11" class="WORD" space="no">
          <t>eeuw</t>
          <pos class="N(soort,ev,basis,zijd,stan)" confidence="0.999633" head="N">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="zijd" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="eeuw"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.12" class="PUNCTUATION">
          <t>,</t>
          <pos class="LET()" confidence="1" head="LET"/>
          <lemma class=","/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.13" class="WORD">
          <t>toen</t>
          <pos class="VG(onder)" confidence="0.571429" head="VG">
            <feat class="onder" subset="conjtype"/>
          </pos>
          <lemma class="toen"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.14" class="WORD">
          <t>handelaars</t>
          <pos class="N(soort,mv,basis)" confidence="0.99944" head="N">
            <feat class="soort" subset="ntype"/>
            <feat class="mv" subset="getal"/>
            <feat class="basis" subset="graad"/>
          </pos>
          <lemma class="handelaar"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.15" class="WORD">
          <t>van</t>
          <pos class="VZ(init)" confidence="0.999469" head="VZ">
            <feat class="init" subset="vztype"/>
          </pos>
          <lemma class="van"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.16" class="WORD">
          <t>Novgorod</t>
          <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
            <feat class="deeleigen" subset="spectype"/>
          </pos>
          <lemma class="Novgorod"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.17" class="WORD">
          <t>het</t>
          <pos class="LID(bep,stan,evon)" confidence="0.996855" head="LID">
            <feat class="bep" subset="lwtype"/>
            <feat class="stan" subset="naamval"/>
            <feat class="evon" subset="npagr"/>
          </pos>
          <lemma class="het"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.18" class="WORD">
          <t>eiland</t>
          <pos class="N(soort,ev,basis,onz,stan)" confidence="0.996804" head="N">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="onz" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="eiland"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.19" class="WORD">
          <t>al</t>
          <pos class="BW()" confidence="0.90383" head="BW"/>
          <lemma class="al"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.20" class="WORD" space="no">
          <t>aandeden</t>
          <pos class="WW(pv,verl,mv)" confidence="0.999559" head="WW">
            <feat class="pv" subset="wvorm"/>
            <feat class="verl" subset="pvtijd"/>
            <feat class="mv" subset="pvagr"/>
          </pos>
          <lemma class="aandoen"/>
        </w>
        <w xml:id="example.deep.p.1.s.1.w.21" class="PUNCTUATION">
          <t>.</t>
          <pos class="LET()" confidence="1" head="LET"/>
          <lemma class="."/>
        </w>
      </s>
  </text>
</FoLiA>

Another example with many annotation types and extensive provenance data: