Provenance Data
It is often desirable to know exactly which tools (in which versions, and even with which parameters) were invoked,
and in which order, to produce a FoLiA document. This is called provenance data. In the metadata section, right after the
Annotation Declarations, FoLiA allows for a <provenance> block containing this information. It is not
mandatory, but it is strongly recommended.
The <provenance> block defines one or more processors. Processors are processes or entities that have processed
the document and often performed some kind of manipulation of it, such as adding annotations. The processors are listed in
the order they were invoked. The Annotation Declarations in turn link to these processors to tie a particular
annotation type and set to one or more processors.
- A <processor> carries the following attributes:
  - xml:id (mandatory) – The ID of the processor. This is how it is referred to from the <annotator processor=".." /> element in the Annotation Declarations and from the processor attribute (part of the common FoLiA attributes) on individual annotations.
  - name (mandatory) – The name identifies the actual tool or human annotator.
  - type – Each processor has a type:
    - auto (default) - The processor is an automated tool that provided annotations.
    - manual - The processor refers to a manual annotator.
    - generator - The processor indicates the FoLiA library used by the parent and sibling processors (unless sibling processors specify another generator in their scope).
    - datasource - The processor is a reference to a particular data source that was used by the parent processor. If there is no parent processor and it is instead directly part of the provenance chain, often as the very first element, then you can interpret this to be the original data source from which the document sprang.
  - version (optional but strongly recommended) – The version of the processor, i.e. of the tool.
  - document_version (optional) – The version of the document. This refers to any label the user desires to indicate a version of the document, so the format is not predetermined and need not be numeric.
  - command (optional) – The exact command that was run.
  - host (optional) – The host on which the processor ran. This identifies individual systems on a network/cluster.
  - user (optional) – The user/executor who ran the processor. This identifies who ran an automated process rather than who the annotator was!
  - src (optional) – The source of the processor: a URL to the tool itself in case the software is an online tool, or to its website or source code repository if not. If the processor is of the datasource type, then this attribute should point to that data set or to a website describing it. The format attribute can be used to further specify the type of source.
  - format (optional) – MIME type describing the kind of resource pointed to by src. Use text/html for websites. Especially useful for processors of type datasource.
  - folia_version (optional) - The FoLiA version that was written.
  - begindatetime (optional) – Specifies when the process started. The format is YYYY-MM-DDThh:mm:ss (note the literal T in the middle to separate date from time), as per the XSD Datetime data type.
  - enddatetime (optional) – Specifies when the process finished. The format is YYYY-MM-DDThh:mm:ss (note the literal T in the middle to separate date from time), as per the XSD Datetime data type.
  - resourcelink (optional) - The URI of any RDF resource describing this processor. This allows linking to the external world of linked open data from the provenance chain in FoLiA.
- Additional custom metadata is allowed in the form of <meta> elements (just like with FoLiA's native metadata) inside the scope of a processor. FoLiA does not define the semantics of any such metadata, i.e. they are tool/application-specific and could for instance be used to specify the tool parameters used.
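The timestamp layout used by begindatetime and enddatetime can be checked with a few lines of Python. The following is only a minimal sketch and not part of FoLiA or of any FoLiA library: the helper name processor_duration_seconds is hypothetical, and the sketch accepts only the plain YYYY-MM-DDThh:mm:ss layout described above (the XSD type also permits variants such as time zone suffixes, which this sketch rejects).

```python
from datetime import datetime
from typing import Optional

# Layout described for begindatetime/enddatetime: YYYY-MM-DDThh:mm:ss,
# with a literal "T" separating date from time.
FOLIA_DATETIME = "%Y-%m-%dT%H:%M:%S"


def processor_duration_seconds(begin: Optional[str], end: Optional[str]) -> Optional[float]:
    """Return the elapsed seconds between two timestamps.

    Both attributes are optional, so None is returned when either is missing.
    A ValueError is raised when a value does not follow the expected layout.
    """
    if begin is None or end is None:
        return None
    started = datetime.strptime(begin, FOLIA_DATETIME)
    finished = datetime.strptime(end, FOLIA_DATETIME)
    return (finished - started).total_seconds()


# Hypothetical usage with illustrative timestamps.
print(processor_duration_seconds("2018-09-12T00:00:00", "2018-09-12T00:00:10"))
```

A value written with a space instead of the literal T, such as "2018-09-12 00:00:00", fails to parse in this sketch, which makes it usable as a quick sanity check on tool output.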
First consider a fairly minimal example. Note that we also include the Annotation Declarations, with a link to the processor:
<annotations>
<token-annotation set="tokconfig-nl">
<annotator processor="p0" />
</token-annotation>
</annotations>
<provenance>
<processor xml:id="p0" name="ucto" version="0.15" folia_version="2.0" command="ucto -Lnld" host="mhysa" user="proycon" begindatetime="2018-09-12T00:00:00" enddatetime="2018-09-12T00:00:10" document_version="1" />
</provenance>
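As an illustration of how a tool might read such a block, the sketch below lists the top-level processors in the order in which they appear, using only Python's standard xml.etree.ElementTree module. It is an assumption-laden sketch rather than the API of any FoLiA library: the function name top_level_processors is hypothetical, and the fragment carries no namespace declaration, whereas in a complete FoLiA document the elements live in the http://ilk.uvt.nl/folia namespace and the tag lookups would have to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# xml:id lives in the predefined XML namespace, so ElementTree exposes it under this key.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Fragment modelled on the minimal example; it has no FoLiA namespace declaration.
PROVENANCE = """<provenance>
  <processor xml:id="p0" name="ucto" version="0.15" folia_version="2.0" command="ucto -Lnld" host="mhysa" user="proycon" />
</provenance>"""


def top_level_processors(provenance):
    """Return (id, name, version) for each top-level processor, in document order."""
    return [
        (proc.get(XML_ID), proc.get("name"), proc.get("version"))
        for proc in provenance.findall("processor")  # direct children only
    ]


chain = top_level_processors(ET.fromstring(PROVENANCE))
print(chain)  # [('p0', 'ucto', '0.15')]
```

Because findall with a plain tag name only looks at direct children, nested subprocessors would not appear in this list; document order of the top-level processors is preserved, which matters since that order reflects the order of invocation.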
Individual annotations in the document can refer to this processor using the processor
attribute:
<w class="PUNCTUATION" processor="p0">
<t>.</t>
</w>
If there is only one <annotator> defined for a certain annotation type and set in the
Annotation Declarations, then it is the default and no processor attribute is necessary.
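The defaulting rule can be expressed as a small lookup. The sketch below is one possible reading of the rule, not an excerpt from a FoLiA library: the function name effective_processor and the set name example-pos-set are hypothetical, and the fragments omit the FoLiA namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Declarations modelled on the examples in this section; "example-pos-set" is a made-up set name.
DECLARATIONS = ET.fromstring("""<annotations>
  <token-annotation set="tokconfig-nl">
    <annotator processor="p0" />
  </token-annotation>
  <pos-annotation set="example-pos-set">
    <annotator processor="p2.1" />
    <annotator processor="p2.2" />
  </pos-annotation>
</annotations>""")


def effective_processor(annotation, declaration):
    """Return the processor ID that applies to an annotation, or None if undetermined.

    An explicit processor attribute always wins. Otherwise, a declaration with
    exactly one <annotator> supplies the default. With several annotators and no
    explicit attribute, the processor cannot be determined.
    """
    explicit = annotation.get("processor")
    if explicit is not None:
        return explicit
    annotators = declaration.findall("annotator")
    if len(annotators) == 1:
        return annotators[0].get("processor")
    return None


token_declaration = DECLARATIONS.find("token-annotation")
pos_declaration = DECLARATIONS.find("pos-annotation")

word = ET.fromstring('<w class="PUNCTUATION"><t>.</t></w>')
tagged = ET.fromstring('<pos class="LET()" processor="p2.2" />')
untagged = ET.fromstring('<pos class="LET()" />')

print(effective_processor(word, token_declaration))
print(effective_processor(tagged, pos_declaration))
print(effective_processor(untagged, pos_declaration))
```

Returning None for the ambiguous case is a design choice of this sketch; a stricter tool might prefer to raise an error when several annotators are declared and an annotation names none of them.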
One of the powerful features of processors is that they can be nested. This creates subprocessors and captures
situations where one processor invokes others as part of its operation. Subprocessors can also provide extra
information on their parent processor: they can, for example, state which FoLiA library was used (type="generator")
or which data sources were used by the processor (type="datasource"). Moreover, arbitrary metadata can be added to
any processor in the form of <meta> elements (just like with FoLiA's native Metadata). FoLiA does not define
the semantics of any such metadata, i.e. they are tool/application-specific and could for instance be used to specify
the tool parameters used. Note that whereas the order of the processors in the <provenance> block is strictly significant,
the order of subprocessors is not.
With all this in mind, we can expand our previous example:
<provenance>
<processor xml:id="p0" name="ucto" version="0.15" folia_version="2.0" command="ucto -Lnld" host="mhysa" user="proycon" begindatetime="2018-09-12T00:00:00" enddatetime="2018-09-12T00:00:10" document_version="1">
<meta id="config">tokconfig-nld</meta>
<meta id="language">nld</meta>
<processor xml:id="p0.1" name="libfolia" version="2.0" folia_version="2.0" type="generator" />
<processor xml:id="p0.2" name="tokconfig-nld" version="2.0" folia_version="2.0" type="datasource" />
</processor>
</provenance>
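Nesting lends itself to simple tree traversal. The following sketch, again using only Python's standard library and not any FoLiA-specific API, shows how a consumer might pick out the generator and data sources of a processor and collect all processor IDs to check that they are unique. The helper names are hypothetical, the subprocessor IDs are chosen for illustration, and the fragment is trimmed and namespace-free.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed fragment modelled on the expanded example; the IDs are illustrative.
PROVENANCE = ET.fromstring("""<provenance>
  <processor xml:id="p0" name="ucto" version="0.15">
    <meta id="config">tokconfig-nld</meta>
    <processor xml:id="p0.1" name="libfolia" version="2.0" type="generator" />
    <processor xml:id="p0.2" name="tokconfig-nld" version="2.0" type="datasource" />
  </processor>
</provenance>""")


def processor_type(processor):
    """Return the processor type, falling back to the default 'auto'."""
    return processor.get("type", "auto")


def subprocessors_of_type(processor, wanted):
    """Names of the direct subprocessors that have the requested type."""
    return [
        child.get("name")
        for child in processor.findall("processor")
        if processor_type(child) == wanted
    ]


def all_processor_ids(element):
    """IDs of every processor below element, at any depth, in document order."""
    return [proc.get(XML_ID) for proc in element.iter("processor")]


ucto = PROVENANCE.find("processor")
ids = all_processor_ids(PROVENANCE)

print(processor_type(ucto))
print(subprocessors_of_type(ucto, "generator"))
print(subprocessors_of_type(ucto, "datasource"))
print(ids)
```

Comparing len(ids) with len(set(ids)) is a cheap way to detect a duplicated xml:id, which would make references from the Annotation Declarations ambiguous.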
Or consider the following example, in which the tool is an annotation environment where human annotators edit a FoLiA document and add or edit annotations:
<provenance>
<processor xml:id="p2" name="flat" version="0.8" folia_version="2.0" host="flat.science.ru.nl" begindatetime="2018-09-12T00:10:00" enddatetime="2018-09-12T00:20:00" document_version="3">
<processor xml:id="p2.0" name="foliapy" version="2.0" folia_version="2.0" type="generator" />
<processor xml:id="p2.1" name="proycon" type="manual" />
<processor xml:id="p2.2" name="ko" type="manual" />
</processor>
</provenance>
From the Annotation Declarations, we can then also refer directly to subprocessors. Moreover, a processor can be referred to from multiple annotation types/sets:
<annotations>
...
<pos-annotation set="...">
<annotator processor="p2.1" />
<annotator processor="p2.2" />
</pos-annotation>
<lemma-annotation set="...">
<annotator processor="p2.1" />
</lemma-annotation>
...
</annotations>
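The relation between declarations and processors can also be read in the other direction: for each processor, which annotation types refer to it. The sketch below inverts the declarations from the example using Python's standard library. The function name annotation_types_by_processor is hypothetical, the elided set names are left as they are, and the FoLiA namespace is omitted.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Declarations copied from the example; the set names remain elided ("...").
ANNOTATIONS = ET.fromstring("""<annotations>
  <pos-annotation set="...">
    <annotator processor="p2.1" />
    <annotator processor="p2.2" />
  </pos-annotation>
  <lemma-annotation set="...">
    <annotator processor="p2.1" />
  </lemma-annotation>
</annotations>""")


def annotation_types_by_processor(annotations):
    """Map each processor ID to the declaration tags that reference it."""
    usage = defaultdict(list)
    for declaration in annotations:  # iterates over the direct child elements
        for annotator in declaration.findall("annotator"):
            usage[annotator.get("processor")].append(declaration.tag)
    return dict(usage)


usage = annotation_types_by_processor(ANNOTATIONS)
print(usage)
```

Here the processor p2.1 ends up associated with both the part-of-speech and the lemma declarations, while p2.2 is associated with the part-of-speech declaration only.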
Of course, providing all this is not mandatory and requires the specific tool to actually supply this provenance data. It is still possible to have FoLiA documents without provenance data at all.
The following example provides a small but complete FoLiA document with provenance data:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="untitled" generator="manual" version="2.0.0">
  <metadata type="native">
    <annotations>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <paragraph-annotation />
      <sentence-annotation />
      <token-annotation set="tokconfig-nl">
        <annotator processor="p0" />
      </token-annotation>
      <pos-annotation set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn">
        <!-- There are multiple annotators, this means that each pos annotation should explicitly refer to one of them using the @processor attribute -->
        <annotator processor="p1.1" />
        <annotator processor="p2.1" />
        <annotator processor="p2.2" />
      </pos-annotation>
      <lemma-annotation set="http://ilk.uvt.nl/folia/sets/frog-mblem-nl">
        <!-- There is only one annotator so this will be the default, no need to explicitly refer to it from lemma annotations using the @processor attribute -->
        <annotator processor="p1.2" />
      </lemma-annotation>
    </annotations>
    <provenance>
      <processor xml:id="p0" name="ucto" version="0.15" folia_version="2.0" command="ucto -Lnld" host="mhysa" user="proycon" src="https://github.com/LanguageMachines/ucto" begindatetime="2018-09-12T00:00:00" enddatetime="2018-09-12T00:00:00" document_version="1">
        <!-- We can add arbitrary meta fields to any processor, they are not defined by FoLiA but application-specific -->
        <meta id="config">tokconfig-nld</meta>
        <meta id="language">nld</meta>
        <processor xml:id="p0.1" name="libfolia" version="2.0" folia_version="2.0" type="generator" />
      </processor>
      <processor xml:id="p1" name="frog" version="0.16" folia_version="2.0" command="frog --skip=pn" host="mhysa" user="proycon" src="https://github.com/LanguageMachines/frog" begindatetime="2018-09-12T00:01:00" enddatetime="2018-09-12T00:02:00" document_version="2">
        <processor xml:id="p1.0" name="libfolia" version="2.0" folia_version="2.0" type="generator" />
        <processor xml:id="p1.1" name="mbpos" version="0.16">
          <processor xml:id="p1.1.1" type="datasource" name="CGN Corpus" version="unknown" />
          <processor xml:id="p1.1.2" type="datasource" name="WOTAN Corpus" version="unknown" />
          <processor xml:id="p1.1.3" type="datasource" name="DCOI Corpus" version="unknown" />
          <processor xml:id="p1.1.4" type="datasource" name="Lassy Klein Corpus" version="unknown" />
        </processor>
        <processor xml:id="p1.2" name="mblem" />
      </processor>
      <processor xml:id="p2" name="flat" version="0.8" folia_version="2.0" host="flat.science.ru.nl" src="https://flat.science.ru.nl" begindatetime="2018-09-12T00:10:00" enddatetime="2018-09-12T00:20:00" document_version="3">
        <processor xml:id="p2.0" name="foliapy" version="2.0" folia_version="2.0" type="generator" src="https://github.com/proycon/foliapy" />
        <processor xml:id="p2.1" name="proycon" type="manual" />
        <processor xml:id="p2.2" name="ko" type="manual" />
      </processor>
    </provenance>
  </metadata>
  <text xml:id="untitled.text">
    <p xml:id="untitled.p.1">
      <s xml:id="untitled.p.1.s.1">
        <t>De belastingdienst doet aangifte tegen frauderende mensen.</t>
        <w xml:id="untitled.p.1.s.1.w.1" class="WORD">
          <t>De</t>
          <pos class="LID(bep,stan,rest)" confidence="0.999701" head="LID" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1">
            <feat class="bep" subset="lwtype"/>
            <feat class="stan" subset="naamval"/>
            <feat class="rest" subset="npagr"/>
          </pos>
          <lemma class="de"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.2" class="WORD">
          <t>belastingdienst</t>
          <pos class="N(soort,ev,basis,zijd,stan)" confidence="0.998836" head="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p2.1">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="zijd" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="belastingdienst"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.3" class="WORD">
          <t>doet</t>
          <pos class="WW(pv,tgw,met-t)" confidence="0.999262" head="WW" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1">
            <feat class="pv" subset="wvorm"/>
            <feat class="tgw" subset="pvtijd"/>
            <feat class="met-t" subset="pvagr"/>
          </pos>
          <lemma class="doen"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.4" class="WORD">
          <t>aangifte</t>
          <pos class="N(soort,ev,basis,zijd,stan)" confidence="0.998701" head="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p2.2">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="zijd" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="aangifte"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.5" class="WORD">
          <t>tegen</t>
          <pos class="VZ(init)" confidence="0.854093" head="VZ" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1">
            <feat class="init" subset="vztype"/>
          </pos>
          <lemma class="tegen"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.6" class="WORD">
          <t>frauderende</t>
          <pos class="WW(od,prenom,met-e)" confidence="0.96" head="WW" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1">
            <feat class="od" subset="wvorm"/>
            <feat class="prenom" subset="positie"/>
            <feat class="met-e" subset="buiging"/>
          </pos>
          <lemma class="frauderen"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.7" class="WORD" space="no">
          <t>mensen</t>
          <pos class="N(soort,mv,basis)" confidence="0.999865" head="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1">
            <feat class="soort" subset="ntype"/>
            <feat class="mv" subset="getal"/>
            <feat class="basis" subset="graad"/>
          </pos>
          <lemma class="mens"/>
        </w>
        <w xml:id="untitled.p.1.s.1.w.8" class="PUNCTUATION">
          <t>.</t>
          <pos class="LET()" confidence="1" head="LET" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" processor="p1.1"/>
          <lemma class="."/>
        </w>
      </s>
    </p>
  </text>
</FoLiA>
And another more real-life example:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="example.deep" generator="libfolia-v1.5" version="2.0.0">
  <metadata type="native">
    <annotations>
      <text-annotation>
        <annotator processor="p1" />
      </text-annotation>
      <sentence-annotation>
        <annotator processor="p1" />
      </sentence-annotation>
      <token-annotation set="https://raw.githubusercontent.com/LanguageMachines/uctodata/folia1.4/setdefinitions/tokconfig-nld.foliaset.ttl">
        <annotator processor="p2" />
      </token-annotation>
      <pos-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn">
        <annotator processor="p3.1" />
      </pos-annotation>
      <lemma-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mblem-nl">
        <annotator processor="p3.2" />
      </lemma-annotation>
    </annotations>
    <provenance>
      <processor xml:id="p1" name="proycon" type="manual" />
      <processor xml:id="p2" name="ucto" version="0.14" />
      <processor xml:id="p3" name="frog" version="0.16" begindatetime="2016-11-15T15:12:00">
        <processor xml:id="p3.0" name="libfolia" version="1.14" type="generator" />
        <processor xml:id="p3.1" name="mbpos" version="1.0" />
        <processor xml:id="p3.2" name="mblem" version="1.1" />
      </processor>
    </provenance>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="example.deep.text">
    <s xml:id="example.deep.p.1.s.1">
      <t>De Russen kennen Nova Zembla sinds de 11e of 12e eeuw, toen handelaars van Novgorod het eiland al aandeden.</t>
      <w xml:id="example.deep.p.1.s.1.w.1" class="WORD">
        <t>De</t>
        <pos class="LID(bep,stan,rest)" confidence="0.779762" head="LID">
          <feat class="bep" subset="lwtype"/>
          <feat class="stan" subset="naamval"/>
          <feat class="rest" subset="npagr"/>
        </pos>
        <lemma class="de"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.2" class="WORD">
        <t>Russen</t>
        <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
          <feat class="deeleigen" subset="spectype"/>
        </pos>
        <lemma class="Russen"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.3" class="WORD">
        <t>kennen</t>
        <pos class="WW(pv,tgw,mv)" confidence="0.833333" head="WW">
          <feat class="pv" subset="wvorm"/>
          <feat class="tgw" subset="pvtijd"/>
          <feat class="mv" subset="pvagr"/>
        </pos>
        <lemma class="kennen"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.4" class="WORD">
        <t>Nova</t>
        <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
          <feat class="deeleigen" subset="spectype"/>
        </pos>
        <lemma class="Nova"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.5" class="WORD">
        <t>Zembla</t>
        <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
          <feat class="deeleigen" subset="spectype"/>
        </pos>
        <lemma class="Zembla"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.6" class="WORD">
        <t>sinds</t>
        <pos class="VZ(init)" confidence="0.999078" head="VZ">
          <feat class="init" subset="vztype"/>
        </pos>
        <lemma class="sinds"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.7" class="WORD">
        <t>de</t>
        <pos class="LID(bep,stan,rest)" confidence="0.981886" head="LID">
          <feat class="bep" subset="lwtype"/>
          <feat class="stan" subset="naamval"/>
          <feat class="rest" subset="npagr"/>
        </pos>
        <lemma class="de"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.8" class="NUMBER-ORDINAL">
        <t>11e</t>
        <pos class="TW(rang,prenom,stan)" confidence="0.990632" head="TW">
          <feat class="rang" subset="numtype"/>
          <feat class="prenom" subset="positie"/>
          <feat class="stan" subset="naamval"/>
        </pos>
        <lemma class="11"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.9" class="WORD">
        <t>of</t>
        <pos class="VG(neven)" confidence="0.855677" head="VG">
          <feat class="neven" subset="conjtype"/>
        </pos>
        <lemma class="of"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.10" class="NUMBER-ORDINAL">
        <t>12e</t>
        <pos class="TW(rang,prenom,stan)" confidence="0.990632" head="TW">
          <feat class="rang" subset="numtype"/>
          <feat class="prenom" subset="positie"/>
          <feat class="stan" subset="naamval"/>
        </pos>
        <lemma class="12"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.11" class="WORD" space="no">
        <t>eeuw</t>
        <pos class="N(soort,ev,basis,zijd,stan)" confidence="0.999633" head="N">
          <feat class="soort" subset="ntype"/>
          <feat class="ev" subset="getal"/>
          <feat class="basis" subset="graad"/>
          <feat class="zijd" subset="genus"/>
          <feat class="stan" subset="naamval"/>
        </pos>
        <lemma class="eeuw"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.12" class="PUNCTUATION">
        <t>,</t>
        <pos class="LET()" confidence="1" head="LET"/>
        <lemma class=","/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.13" class="WORD">
        <t>toen</t>
        <pos class="VG(onder)" confidence="0.571429" head="VG">
          <feat class="onder" subset="conjtype"/>
        </pos>
        <lemma class="toen"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.14" class="WORD">
        <t>handelaars</t>
        <pos class="N(soort,mv,basis)" confidence="0.99944" head="N">
          <feat class="soort" subset="ntype"/>
          <feat class="mv" subset="getal"/>
          <feat class="basis" subset="graad"/>
        </pos>
        <lemma class="handelaar"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.15" class="WORD">
        <t>van</t>
        <pos class="VZ(init)" confidence="0.999469" head="VZ">
          <feat class="init" subset="vztype"/>
        </pos>
        <lemma class="van"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.16" class="WORD">
        <t>Novgorod</t>
        <pos class="SPEC(deeleigen)" confidence="1" head="SPEC">
          <feat class="deeleigen" subset="spectype"/>
        </pos>
        <lemma class="Novgorod"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.17" class="WORD">
        <t>het</t>
        <pos class="LID(bep,stan,evon)" confidence="0.996855" head="LID">
          <feat class="bep" subset="lwtype"/>
          <feat class="stan" subset="naamval"/>
          <feat class="evon" subset="npagr"/>
        </pos>
        <lemma class="het"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.18" class="WORD">
        <t>eiland</t>
        <pos class="N(soort,ev,basis,onz,stan)" confidence="0.996804" head="N">
          <feat class="soort" subset="ntype"/>
          <feat class="ev" subset="getal"/>
          <feat class="basis" subset="graad"/>
          <feat class="onz" subset="genus"/>
          <feat class="stan" subset="naamval"/>
        </pos>
        <lemma class="eiland"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.19" class="WORD">
        <t>al</t>
        <pos class="BW()" confidence="0.90383" head="BW"/>
        <lemma class="al"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.20" class="WORD" space="no">
        <t>aandeden</t>
        <pos class="WW(pv,verl,mv)" confidence="0.999559" head="WW">
          <feat class="pv" subset="wvorm"/>
          <feat class="verl" subset="pvtijd"/>
          <feat class="mv" subset="pvagr"/>
        </pos>
        <lemma class="aandoen"/>
      </w>
      <w xml:id="example.deep.p.1.s.1.w.21" class="PUNCTUATION">
        <t>.</t>
        <pos class="LET()" confidence="1" head="LET"/>
        <lemma class="."/>
      </w>
    </s>
  </text>
</FoLiA>
Another example with many annotation types and extensive provenance data: