Speech

FoLiA is also suited for annotation of speech data. The following additional generic FoLiA attributes are available for all structure annotation elements in a speech context:

  • src – Points to a file or full URL of a sound or video file. This attribute is inheritable.
  • begintime – A timestamp in HH:MM:SS.MMM format, indicating the begin time of the speech. If a sound clip is specified (src), the timestamp refers to a position in that sound clip.
  • endtime – A timestamp in HH:MM:SS.MMM format, indicating the end time of the speech. If a sound clip is specified (src), the timestamp refers to a position in that sound clip.
  • speaker – A string identifying the speaker. This attribute is inheritable. Multiple speakers are not allowed; simply do not specify a speaker at a given level if you cannot attribute the speech to a specific (single) speaker.
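
To illustrate how these attributes combine, here is a minimal sketch; the file name, timestamps and speaker label are purely illustrative. The src and speaker attributes set on the utterance are inherited by the words it contains:

<utt xml:id="example.utt.1" src="interview.mp3" speaker="interviewer"
     begintime="00:00:00.000" endtime="00:00:02.500">
    <w xml:id="example.utt.1.w.1" begintime="00:00:00.000" endtime="00:00:01.200">
        <t>hello</t>
    </w>
    <w xml:id="example.utt.1.w.2" begintime="00:00:01.200" endtime="00:00:02.500">
        <t>there</t>
    </w>
</utt>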

Speech generally calls for a different document structure than text. The top-level element for speech-centred resources is speech, rather than text. Most elements described in the section on text structure, such as Division Annotation, Sentence Annotation and Token Annotation, may be used under speech as well. Notions such as paragraphs, tables and figures make less sense in a speech context.

In a speech context, you can use Utterance Annotation as an alternative or complement to Sentence Annotation, as it is often more natural to segment speech into utterances than into grammatically well-formed sentences.
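
For instance, a short dialogue could be segmented into utterances as follows; the speaker labels, timestamps and text are purely illustrative:

<utt xml:id="example.utt.1" speaker="alice" begintime="00:00:00.000" endtime="00:00:01.500">
    <t>hello there</t>
</utt>
<utt xml:id="example.utt.2" speaker="bob" begintime="00:00:01.500" endtime="00:00:02.300">
    <t>hi</t>
</utt>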

For non-speech events, you can use Event Annotation. Consider the following small example, carrying the speech-context attributes:

<event class="cough" src="soundclip.mp3" begintime="..." endtime="..." />

If you want to associate timing information and begintime and endtime on structural elements are insufficient for your needs, look into Time Segmentation.
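
As a rough impression, the sketch below assumes FoLiA's usual span-annotation pattern, with a timing layer holding timesegment elements that refer to words by their IDs; the IDs and timestamps are illustrative and the Time Segmentation section remains authoritative:

<timing>
    <timesegment begintime="00:00:00.000" endtime="00:00:01.000">
        <wref id="example.utt.1.w.1" />
    </timesegment>
    <timesegment begintime="00:00:01.000" endtime="00:00:02.000">
        <wref id="example.utt.1.w.2" />
    </timesegment>
</timing>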

Speech has its counterpart to text in the form of a phonetic or phonological transcription, i.e. a representation of the speech as it was actually pronounced/recorded. FoLiA has a separate content element for this; see Phonetic Annotation/Content. For a normal textual transcription of the speech, you should still use the regular Text Annotation.
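
Both forms of transcription can coexist on the same element. In the sketch below, a single word carries both a textual transcription (t) and a phonetic one (ph); using t additionally requires a text-annotation declaration, analogous to the phon-annotation declaration in the example further below:

<w xml:id="example.utt.1.w.2">
    <t>world</t>
    <ph>wɝːld</ph>
</w>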

For further segmentation of speech into phonemes, you can use Phonological Annotation.
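
The sketch below gives an impression of how this could look, assuming a phonology layer holding phoneme elements, analogous to FoLiA's morphology layer; the classes and segmentation are illustrative and the Phonological Annotation section remains authoritative:

<w xml:id="example.utt.1.w.1">
    <ph>helˈoʊ</ph>
    <phonology>
        <phoneme class="h">
            <ph>h</ph>
        </phoneme>
        <phoneme class="ɛ">
            <ph>ɛ</ph>
        </phoneme>
        <!-- remaining phonemes omitted for brevity -->
    </phonology>
</w>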

Example

An example of a simple speech document:

<?xml version="1.0" encoding="utf-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0" xml:id="example">
  <metadata>
      <annotations>
          <phon-annotation>
             <annotator processor="p1" />
          </phon-annotation>
          <utterance-annotation>
             <annotator processor="p1" />
          </utterance-annotation>
          <token-annotation>
             <annotator processor="p1" />
          </token-annotation>
      </annotations>
      <provenance>
         <processor xml:id="p1" name="proycon" type="manual" />
      </provenance>
  </metadata>
  <speech xml:id="example.speech">
    <utt xml:id="example.utt.1" src="helloworld.mp3"  begintime="00:00:01.000" endtime="00:00:02:000">
        <ph>helˈoʊ wɝːld</ph>
        <w xml:id="example.utt.1.w.1" begintime="00:00:00.000" endtime="00:00:01.000">
            <ph>helˈoʊ</ph>
        </w>
        <w xml:id="example.utt.1.w.2" begintime="00:00:01.000" endtime="00:00:02.000">
            <ph>wɝːld</ph>
        </w>
    </utt>
  </speech>
</FoLiA>