String Annotation
String annotation is a form of higher-order annotation for selecting an arbitrary substring of a text, even an untokenised one, and allows further forms of higher-order annotation on that substring. It is also tied to a form of text markup annotation.
Specification
Annotation Category: | Higher-order Annotation |
---|---|
Declaration: | <string-annotation> |
Version History: | since v0.9.1 |
Element: | <str> |
API Class: | |
Required Attributes: | |
Optional Attributes: | |
Accepted Data: | |
Valid Context: | |
Explanation
The <str> element is available in FoLiA to allow annotations on untokenised substrings. It is a higher-order annotation element that refers to a substring of the text-content (<t>) element on the same level, but is specified outside of it.
Explicitly denoting substrings in this fashion is needed when you want to associate further annotations with a substring. Consider the following example:
<p xml:id="example.p.1">
  <t>Hello. This is a sentence. Bye!</t>
  <str xml:id="example.p.1.str.1">
    <t offset="0">Hello</t>
    <desc>This is a word of greeting</desc>
  </str>
</p>
In substrings, using an offset attribute on the text-content element enables substrings to be properly positioned with respect to their parent text.
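The positioning rule can be checked mechanically. The following is a minimal sketch using only Python's standard library, not FoLiA's own tooling. It assumes that the offset is a zero-based character index into the parent's text content, and it omits the FoLiA namespace for brevity; the function name `check_offsets` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The paragraph from the example above (FoLiA namespace omitted for brevity).
SNIPPET = """
<p xml:id="example.p.1">
  <t>Hello. This is a sentence. Bye!</t>
  <str xml:id="example.p.1.str.1">
    <t offset="0">Hello</t>
    <desc>This is a word of greeting</desc>
  </str>
</p>
"""

def check_offsets(parent):
    """Return (substring, offset, matches) for every <str> directly under parent."""
    parent_text = "".join(parent.find("t").itertext())
    results = []
    for s in parent.findall("str"):
        t = s.find("t")
        sub = "".join(t.itertext())
        offset = int(t.get("offset"))
        # Assumption: the substring must occur in the parent text exactly at the offset.
        results.append((sub, offset, parent_text.startswith(sub, offset)))
    return results

results = check_offsets(ET.fromstring(SNIPPET))
print(results)
```

A substring whose text does not occur at the stated offset would be reported with `False` in the last position.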
The <str> element has a text markup (Text Markup Annotation) counterpart called <t-str>. Both share the same declaration. The text markup variant can be used within the scope of the text content itself and may be more intuitive, but it is also less flexible: unlike <str>, it does not allow further annotations in its scope and cannot be used when substrings overlap. Consider the following example:
<p xml:id="example.p.1">
  <t><t-str id="example.p.1.str.1">Hello</t-str>. This is a sentence. Bye!</t>
  <str xml:id="example.p.1.str.1">
    <t offset="0">Hello</t>
    <desc>This is a word of greeting</desc>
  </str>
</p>
In the above example, the id attribute (distinct from xml:id!) on <t-str> is a reference to the <str> element, showing how the two elements can be used in combination.
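To illustrate how such a reference can be followed, here is a small sketch in plain Python (standard library only, FoLiA namespace omitted). It looks up the <str> whose xml:id equals the id on each <t-str>. The helper name `resolve_markup` is hypothetical and not part of any FoLiA library.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SNIPPET = """
<p xml:id="example.p.1">
  <t><t-str id="example.p.1.str.1">Hello</t-str>. This is a sentence. Bye!</t>
  <str xml:id="example.p.1.str.1">
    <t offset="0">Hello</t>
    <desc>This is a word of greeting</desc>
  </str>
</p>
"""

def resolve_markup(parent):
    """Map the text of each <t-str> to the description of the <str> it references."""
    strings = {s.get(XML_ID): s for s in parent.findall("str")}
    resolved = {}
    for markup in parent.iter("t-str"):
        target = strings.get(markup.get("id"))
        if target is not None:  # skip references that do not resolve
            resolved[markup.text] = target.findtext("desc")
    return resolved

resolved = resolve_markup(ET.fromstring(SNIPPET))
print(resolved)
```

The lookup deliberately compares the plain id on <t-str> with the xml:id on <str>, mirroring the distinction made above.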
One of the features of <str> is that you can put Inline Annotation in its scope, so you can associate e.g. PoS tags and lemmas with substrings in the special cases where you need to. Do note that this is NOT a substitute or alternative for proper tokenisation (Token Annotation), nor for Morphological Annotation!
String elements are a form of higher-order annotation. They are similar to structure annotation but carry several distinct properties: unlike structure elements, substring order does not matter and substrings may overlap. The difference between Token Annotation (<w>) and string annotation (<str>) has to be clearly understood: the former refers to actual tokens and supports further token annotation, the latter to untokenised or differently tokenised substrings.
Of course, the <str> elements themselves may carry a class, associated with a user-defined set.
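Because substrings may overlap, a consumer sometimes needs to know which ones do. The sketch below is an illustration under stated assumptions rather than FoLiA functionality: the paragraph is a made-up variant of the earlier example with three substrings, offsets are taken to be zero-based character positions, and the names `spans` and `overlapping_pairs` are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical paragraph: the second and third substrings overlap on "is a".
SNIPPET = """
<p xml:id="example.p.1">
  <t>Hello. This is a sentence. Bye!</t>
  <str xml:id="example.p.1.str.1"><t offset="0">Hello</t></str>
  <str xml:id="example.p.1.str.2"><t offset="7">This is a</t></str>
  <str xml:id="example.p.1.str.3"><t offset="12">is a sentence</t></str>
</p>
"""

def spans(parent):
    """Yield (xml:id, start, end) for each <str>, with end exclusive."""
    for s in parent.findall("str"):
        t = s.find("t")
        start = int(t.get("offset"))
        yield s.get(XML_ID), start, start + len("".join(t.itertext()))

def overlapping_pairs(parent):
    """Return the id pairs of substrings whose character ranges intersect."""
    return [
        (a_id, b_id)
        for (a_id, a_start, a_end), (b_id, b_start, b_end)
        in combinations(spans(parent), 2)
        if a_start < b_end and b_start < a_end
    ]

pairs = overlapping_pairs(ET.fromstring(SNIPPET))
print(pairs)
```

The result does not depend on the order in which the <str> elements appear, which matches the statement that substring order does not matter.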
Text classes (advanced)
If you are familiar with Text classes (advanced), it is good to know that this principle extends to substrings as well. Consider the following example with three text layers; the same substring has been extracted from each of them:
<p xml:id="example.p.1">
  <t>Hello. This is a sentence. Bye!</t>
  <t class="normalised">Hello. This iz a sentence. Bye!</t>
  <t class="ocroutput">Hell0 Th1s iz a sentence, Bye1</t>
  <str xml:id="example.p.1.str.1">
    <t class="ocroutput" offset="0">Hell0</t>
  </str>
  <str xml:id="example.p.1.str.2">
    <t class="normalised" offset="0">Hello.</t>
  </str>
  <str xml:id="example.p.1.str.3">
    <t offset="0">Hello.</t>
  </str>
</p>
Instead of three separate substrings, we can also opt for a single one. Which solution is right for you depends on your own use case:
<p xml:id="example.p.1">
  <t>Hello. This is a sentence. Bye!</t>
  <t class="normalised">Hello. This iz a sentence. Bye!</t>
  <t class="ocroutput">Hell0 Th1s iz a sentence, Bye1</t>
  <str xml:id="example.p.1.str.1">
    <t class="ocroutput" offset="0">Hell0</t>
    <t class="normalised" offset="0">Hello.</t>
    <t offset="0">Hello.</t>
  </str>
</p>
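When a substring carries several text classes, each one presumably has to be read against the parent text layer of the same class. The following sketch, modelled on the single-substring variant, makes that pairing explicit. It is an assumption-laden illustration in plain Python: a missing class attribute is represented as `None`, the FoLiA namespace is omitted, and `check_by_class` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SNIPPET = """
<p xml:id="example.p.1">
  <t>Hello. This is a sentence. Bye!</t>
  <t class="normalised">Hello. This iz a sentence. Bye!</t>
  <t class="ocroutput">Hell0 Th1s iz a sentence, Bye1</t>
  <str xml:id="example.p.1.str.1">
    <t class="ocroutput" offset="0">Hell0</t>
    <t class="normalised" offset="0">Hello.</t>
    <t offset="0">Hello.</t>
  </str>
</p>
"""

def check_by_class(parent):
    """For each (str id, text class), check the substring against the same-class layer."""
    # Parent text layers keyed by class; None stands for "no class attribute".
    layers = {t.get("class"): "".join(t.itertext()) for t in parent.findall("t")}
    report = {}
    for s in parent.findall("str"):
        for t in s.findall("t"):
            cls = t.get("class")
            sub = "".join(t.itertext())
            layer = layers.get(cls, "")  # empty string if the parent lacks this class
            report[(s.get(XML_ID), cls)] = layer.startswith(sub, int(t.get("offset")))
    return report

report = check_by_class(ET.fromstring(SNIPPET))
print(report)
```

The same function also works for the variant with three separate <str> elements, since it keys the report on both the substring's xml:id and the text class.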
If you do want separate strings but also want to make the relation between them explicit, you can resort to Relation Annotation, as shown in the next example:
<p xml:id="example.p.1">
  <t>Hello. This is a sentence. Bye!</t>
  <t class="ocroutput">Hell0 Th1s iz a sentence, Bye1</t>
  <str xml:id="example.p.1.str.1">
    <t class="ocroutput" offset="0">Hell0</t>
    <relation>
      <xref id="example.p.1.str.2" type="str" />
    </relation>
  </str>
  <str xml:id="example.p.1.str.2">
    <t offset="0">Hello.</t>
    <relation>
      <xref id="example.p.1.str.1" type="str" />
    </relation>
  </str>
</p>
The <str> element is powerful when combined with relations, as this allows the user to relate multiple alternative (pseudo-)tokenisations. This is also the limit of what you can do with differing tokenisations in FoLiA, as FoLiA only supports one authoritative tokenisation.
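As a rough illustration of how related substrings might be read back, the sketch below follows each reference inside a <str> to the <str> it points at and pairs their texts. It assumes the <relation> and <xref> element names used in the full example in the Example section, omits the FoLiA namespace, and uses the hypothetical helper name `related_texts`.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SNIPPET = """
<p xml:id="example.p.1">
  <t>Hello. This is a sentence. Bye!</t>
  <t class="ocroutput">Hell0 Th1s iz a sentence, Bye1</t>
  <str xml:id="example.p.1.str.1">
    <t offset="0">Hello.</t>
    <relation><xref id="example.p.1.str.2" type="str" /></relation>
  </str>
  <str xml:id="example.p.1.str.2">
    <t class="ocroutput" offset="0">Hell0</t>
    <relation><xref id="example.p.1.str.1" type="str" /></relation>
  </str>
</p>
"""

def text_of(string_element):
    """Return the text content of the first <t> inside a <str>."""
    return "".join(string_element.find("t").itertext())

def related_texts(parent):
    """Pair the text of each <str> with the text of every <str> it links to."""
    by_id = {s.get(XML_ID): s for s in parent.findall("str")}
    pairs = []
    for source in by_id.values():
        for ref in source.iter("xref"):
            target = by_id.get(ref.get("id"))
            if target is not None:  # ignore references outside this paragraph
                pairs.append((text_of(source), text_of(target)))
    return pairs

links = related_texts(ET.fromstring(SNIPPET))
print(links)
```

Here both substrings point at each other, so each pairing shows up once per direction.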
Example
The following example combines various aspects discussed in this section:
<?xml version="1.0" encoding="utf-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0" xml:id="example">
  <metadata>
    <annotations>
      <text-annotation>
        <annotator processor="p1" />
      </text-annotation>
      <paragraph-annotation>
        <annotator processor="p1" />
      </paragraph-annotation>
      <string-annotation>
        <annotator processor="p1" />
      </string-annotation>
      <relation-annotation>
        <annotator processor="p1" />
      </relation-annotation>
    </annotations>
    <provenance>
      <processor xml:id="p1" name="proycon" type="manual" />
    </provenance>
  </metadata>
  <text xml:id="example.text">
    <p xml:id="example.p.1">
      <t><t-str id="example.p.1.str.1">Hello.</t-str> This is a sentence. Bye!</t>
      <t class="ocroutput"><t-str id="example.p.1.str.2">Hell0</t-str> Th1s iz a sentence, Bye1</t>
      <str xml:id="example.p.1.str.1">
        <t offset="0">Hello.</t>
        <relation>
          <xref id="example.p.1.str.2" type="str" />
        </relation>
      </str>
      <str xml:id="example.p.1.str.2">
        <t class="ocroutput" offset="0">Hell0</t>
        <relation>
          <xref id="example.p.1.str.1" type="str" />
        </relation>
      </str>
    </p>
  </text>
</FoLiA>