Gap Annotation

Sometimes there are parts of a document you want to skip and not annotate at all, but still include as is. This is where gap annotation comes in; a user-defined set may indicate the kind of gap. Common omissions in books are, for example, front matter and back matter, such as the cover.

Specification

Structure Element

Annotation Category:
 

Higher-order Annotation

Declaration:

<gap-annotation set="..."> (note: set is optional for this annotation type; if you declare this annotation type to be setless, you cannot assign classes)

Version History:
 

Since the beginning

Element:

<gap>

API Class:

Gap (FoLiApy API Reference)

Required Attributes:
 
Optional Attributes:
 
  • xml:id – The ID of the element; this has to be unique in the entire document or collection of documents (corpus). All identifiers in FoLiA are of the XML NCName datatype, which roughly means it is a unique string that has to start with a letter (not a number or symbol), may contain numbers, but may never contain colons or spaces. FoLiA does not define any naming convention for IDs.
  • set – The set of the element, ideally a URI linking to a set definition (see Set Definitions (Vocabulary)) or otherwise a uniquely identifying string. The set must be referred to also in the Annotation Declarations for this annotation type.
  • class – The class of the annotation, i.e. the annotation tag in the vocabulary defined by set.
  • processor – This refers to the ID of a processor in the Provenance Data. The processor in turn defines exactly who or what was the annotator of the annotation.
  • annotator – This is an older alternative to the processor attribute, without support for full provenance. The annotator attribute simply refers to the name or ID of the system or human annotator that made the annotation.
  • annotatortype – This is an older alternative to the processor attribute, without support for full provenance. It is used together with annotator and specifies the type of the annotator, either manual for human annotators or auto for automated systems.
  • datetime – The date and time when this annotation was recorded; the format is YYYY-MM-DDThh:mm:ss (note the literal T in the middle to separate date from time), as per the XSD Datetime data type.
  • n – A number in a sequence, corresponding to a number in the original document, for example chapter numbers, section numbers, list item numbers. This does not have to be an actual number; other sequence identifiers are also possible (think alphanumeric characters or roman numerals).
  • src – Points to a file or full URL of a sound or video file. This attribute is inheritable.
  • begintime – A timestamp in HH:MM:SS.MMM format, indicating the begin time of the speech. If a sound clip is specified (src), the timestamp refers to a location in the sound clip.
  • endtime – A timestamp in HH:MM:SS.MMM format, indicating the end time of the speech. If a sound clip is specified (src), the timestamp refers to a location in the sound clip.
  • tag – Contains a space-separated list of processing tags associated with the element. A processing tag carries arbitrary user-defined information that may aid in processing a document. It may carry cues on how a specific tool should treat a specific element. The tag vocabulary is specific to the tool that processes the document. Tags carry no intrinsic meaning for the data representation and should not be used except to inform/aid processors in their task. Processors are encouraged to clean up the tags they use. Ideally, published FoLiA documents at the end of a processing pipeline carry no further tags. For encoding actual data, use class and optionally features instead.
Accepted Data:

<comment> (Comment Annotation), <content> (Raw Content), <desc> (Description Annotation), <metric> (Metric Annotation), <part> (Part Annotation)

Valid Context:

<div> (Division Annotation), <event> (Event Annotation), <head> (Head Annotation), <p> (Paragraph Annotation), <quote> (Quote Annotation), <s> (Sentence Annotation), <term> (Term Annotation), <utt> (Utterance Annotation)

Text markup Element

Element:

<t-gap>

API Class:

TextMarkupGap (FoLiApy API Reference)

Required Attributes:
 
Optional Attributes:
 
  • xml:id – The ID of the element; this has to be unique in the entire document or collection of documents (corpus). All identifiers in FoLiA are of the XML NCName datatype, which roughly means it is a unique string that has to start with a letter (not a number or symbol), may contain numbers, but may never contain colons or spaces. FoLiA does not define any naming convention for IDs.
  • set – The set of the element, ideally a URI linking to a set definition (see Set Definitions (Vocabulary)) or otherwise a uniquely identifying string. The set must be referred to also in the Annotation Declarations for this annotation type.
  • class – The class of the annotation, i.e. the annotation tag in the vocabulary defined by set.
  • processor – This refers to the ID of a processor in the Provenance Data. The processor in turn defines exactly who or what was the annotator of the annotation.
  • annotator – This is an older alternative to the processor attribute, without support for full provenance. The annotator attribute simply refers to the name or ID of the system or human annotator that made the annotation.
  • annotatortype – This is an older alternative to the processor attribute, without support for full provenance. It is used together with annotator and specifies the type of the annotator, either manual for human annotators or auto for automated systems.
  • confidence – A floating point value between zero and one; expresses the confidence the annotator places in the annotation.
  • datetime – The date and time when this annotation was recorded; the format is YYYY-MM-DDThh:mm:ss (note the literal T in the middle to separate date from time), as per the XSD Datetime data type.
  • n – A number in a sequence, corresponding to a number in the original document, for example chapter numbers, section numbers, list item numbers. This does not have to be an actual number; other sequence identifiers are also possible (think alphanumeric characters or roman numerals).
  • src – Points to a file or full URL of a sound or video file. This attribute is inheritable.
  • begintime – A timestamp in HH:MM:SS.MMM format, indicating the begin time of the speech. If a sound clip is specified (src), the timestamp refers to a location in the sound clip.
  • endtime – A timestamp in HH:MM:SS.MMM format, indicating the end time of the speech. If a sound clip is specified (src), the timestamp refers to a location in the sound clip.
  • speaker – A string identifying the speaker. This attribute is inheritable. Multiple speakers are not allowed; simply do not specify a speaker on a certain level if you are unable to link the speech to a specific (single) speaker.
  • tag – Contains a space-separated list of processing tags associated with the element. A processing tag carries arbitrary user-defined information that may aid in processing a document. It may carry cues on how a specific tool should treat a specific element. The tag vocabulary is specific to the tool that processes the document. Tags carry no intrinsic meaning for the data representation and should not be used except to inform/aid processors in their task. Processors are encouraged to clean up the tags they use. Ideally, published FoLiA documents at the end of a processing pipeline carry no further tags. For encoding actual data, use class and optionally features instead.
  • xlink:href – Turns this element into a hyperlink to the specified URL.
  • xlink:type – The type of link (you’ll want to use simple in almost all cases).
Accepted Data:

<comment> (Comment Annotation), <desc> (Description Annotation), <br> (Linebreak)

Valid Context:

Explanation

Sometimes there are parts of a document you want to skip and not annotate, but still include as is. For this purpose the <gap> element should be used. A gap may carry a class indicating what kind of gap it is; the classes are defined by a user-defined set. Common omissions are, for example, front matter and back matter, or text that is illegible, inaudible, or in a foreign language. Again, the semantics depend on your set.

Although a gap skips over content, you may still want to explicitly add the raw content; this is done with the <content> element (see Raw Content). As this concerns raw content, it cannot be annotated any further, and an XML CDATA section is used to include it verbatim.

The following example shows the use of <gap>:

<?xml version="1.0" encoding="utf-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0" xml:id="example">
  <metadata>
      <annotations>
          <text-annotation>
              <annotator processor="p1" />
          </text-annotation>
          <division-annotation set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/divisions.foliaset.xml">
              <annotator processor="p1" />
          </division-annotation>
          <gap-annotation set="adhoc">
              <annotator processor="p1" />
          </gap-annotation>
          <rawcontent-annotation>
              <annotator processor="p1" />
          </rawcontent-annotation>
          <description-annotation>
              <annotator processor="p1" />
          </description-annotation>
          <paragraph-annotation>
              <annotator processor="p1" />
          </paragraph-annotation>
      </annotations>
      <provenance>
         <processor xml:id="p1" name="proycon" type="manual" />
      </provenance>
  </metadata>
  <text xml:id="example.text">
     <gap class="frontmatter">
        <desc>This is the cover of the book</desc>
        <content>
<![CDATA[

            SNOW WHITE AND THE SEVEN DWARFS


                by the Brothers Grimm

                    first edition


            Copyright(c) blah blah
]]>
        </content>
     </gap>
     <div xml:id="example.div.1" class="chapter" n="1">
         <t>In the <t-gap class="illegible" /> there was a princess...</t>
     </div>
  </text>
</FoLiA>
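
As a rough illustration of how such a document could be read back, the following Python sketch lists the structural gaps with their description and raw content, and collects the classes of any <t-gap> elements. It deliberately uses only the standard library's xml.etree.ElementTree rather than the FoLiApy API. The SAMPLE string is a trimmed-down, hypothetical document (declarations omitted), and list_gaps and list_text_gap_classes are illustrative helper names, not part of any FoLiA library.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
NS = {"f": FOLIA_NS}

# Hypothetical, trimmed-down document used only for this illustration;
# the metadata and annotation declarations are left out for brevity.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0" xml:id="example">
  <text xml:id="example.text">
    <gap class="frontmatter">
      <desc>This is the cover of the book</desc>
      <content><![CDATA[SNOW WHITE AND THE SEVEN DWARFS]]></content>
    </gap>
    <div xml:id="example.div.1" class="chapter" n="1">
      <t>In the <t-gap class="illegible" /> there was a princess...</t>
    </div>
  </text>
</FoLiA>"""


def list_gaps(xml_text):
    """Return (class, description, raw content) for every structural <gap>."""
    root = ET.fromstring(xml_text)
    gaps = []
    for gap in root.iter(f"{{{FOLIA_NS}}}gap"):
        desc = gap.findtext("f:desc", default="", namespaces=NS)
        content = gap.findtext("f:content", default="", namespaces=NS)
        # Strip the whitespace that surrounds the CDATA section.
        gaps.append((gap.get("class"), desc, content.strip()))
    return gaps


def list_text_gap_classes(xml_text):
    """Return the class of every <t-gap> text markup element."""
    root = ET.fromstring(xml_text)
    return [tg.get("class") for tg in root.iter(f"{{{FOLIA_NS}}}t-gap")]


if __name__ == "__main__":
    for gap_class, desc, content in list_gaps(SAMPLE):
        print(gap_class, "|", desc, "|", content)
    print(list_text_gap_classes(SAMPLE))
```

Because the class is just an attribute value, its interpretation still depends on the declared set; the sketch only reports what is present in the XML.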

The gap element comes in two flavours: besides the structural element described above, there is also a text markup element (see Text Markup Annotation), <t-gap>, which offers a more fine-grained variant for use in untokenised text. It indicates a gap in the textual content and is also shown in the example above. Either the text is not available, or there is a deliberate blank, for example in fill-in exercises. It is recommended to provide a textual value when possible, but this is not required.

If you find that you want to mark your whole text content as being a <t-gap>, then this is a sure sign you should use the structural element <gap> instead.
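
The rule of thumb above can be checked mechanically. The following Python sketch, using only the standard library, flags a <t> element whose entire content is a single <t-gap>. The helper name is_only_text_gap is hypothetical, and the FoLiA namespace is omitted from the two small fragments to keep the illustration short; real documents would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET


def is_only_text_gap(t_elem):
    """True if the <t> element holds nothing but a single <t-gap>."""
    children = list(t_elem)
    if len(children) != 1 or children[0].tag != "t-gap":
        return False
    # Any non-whitespace text before or after the <t-gap> means the gap
    # covers only part of the text content.
    leading = (t_elem.text or "").strip()
    trailing = (children[0].tail or "").strip()
    return not leading and not trailing


# Namespace-free fragments, assumed here purely for illustration.
whole = ET.fromstring('<t><t-gap class="illegible" /></t>')
partial = ET.fromstring(
    '<t>In the <t-gap class="illegible" /> there was a princess...</t>'
)

if __name__ == "__main__":
    print(is_only_text_gap(whole))
    print(is_only_text_gap(partial))
```

A <t> element flagged this way is a candidate for being replaced by a structural <gap>.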

Note

Both elements belong to the same annotation type and therefore share the same declaration.

Text Redundancy

In cases of text redundancy (see Text Annotation), the <t-gap> element may take an id attribute that references the xml:id of a <gap> element, as shown in the following example:

<s>
  <t>to <t-gap id="gap.1" class="fillin">be</t-gap> or not to be</t>
  <w><t>to</t></w>
  <gap xml:id="gap.1" class="fillin"><content>be</content></gap>
  <w><t>or</t></w>
  <w><t>not</t></w>
  <w><t>to</t></w>
  <w><t>be</t></w>
</s>
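
To make the reference mechanism concrete, the following Python sketch pairs each <t-gap> with the <gap> element its id attribute points to. It relies only on the standard library; resolve_text_gaps is a hypothetical helper name, and the sentence is parsed without the FoLiA namespace, which is an assumption made to keep the fragment self-contained.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The sentence from the example, without a FoLiA namespace declaration.
SENTENCE = """<s>
  <t>to <t-gap id="gap.1" class="fillin">be</t-gap> or not to be</t>
  <w><t>to</t></w>
  <gap xml:id="gap.1" class="fillin"><content>be</content></gap>
  <w><t>or</t></w>
  <w><t>not</t></w>
  <w><t>to</t></w>
  <w><t>be</t></w>
</s>"""


def resolve_text_gaps(sentence_xml):
    """Return (reference, t-gap text, referenced gap content) per <t-gap>."""
    root = ET.fromstring(sentence_xml)
    # Index the structural gaps by their xml:id.
    gaps_by_id = {g.get(XML_ID): g for g in root.iter("gap") if g.get(XML_ID)}
    pairs = []
    for tgap in root.iter("t-gap"):
        ref = tgap.get("id")
        target = gaps_by_id.get(ref) if ref else None
        content = target.findtext("content") if target is not None else None
        pairs.append((ref, tgap.text, content))
    return pairs


if __name__ == "__main__":
    for ref, text, content in resolve_text_gaps(SENTENCE):
        print(ref, text, content)
```

A reference that does not resolve to any <gap> yields None as the third item, which a validation step could report as a dangling reference.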