This section collects guidelines, tips, do’s and don’ts, and conventions for working with FoLiA documents.
For data creators/publishers
- Always validate all FoLiA documents you create and intend to publish! Use one of the official validation tools (foliavalidator or folialint). See Validation. This will already catch most of the issues that could arise from not following these guidelines.
- Never invent custom XML elements and attributes. If you really must, make sure they are in a different XML namespace. See Foreign Annotation.
- If you want to encode something and FoLiA does not seem to offer a good solution yet, or if you are simply unsure whether the solution you want to use is appropriate, contact the FoLiA developers via our issue tracker. FoLiA can be extended in collaboration. Do not simply add your own elements or attributes.
- Mind the sets you use. Creating and publishing set definitions is recommended but not strictly mandatory for most uses. See Set Definitions (Vocabulary).
- Identifiers should never change. Once you have assigned an identifier to something and published your data, do not change any identifier that is in use.
- All annotation types you use must be declared, see Annotation Declarations. Take care not to declare annotation types that you don’t actually use in your document unless you have good reason to believe the annotation type will be added soon.
For developers
- Using a high-level FoLiA programming library, if available for your programming language, is strongly recommended over parsing, writing, or querying the XML yourself; it makes things much easier and saves a lot of work.
- Always use the latest version of FoLiA and its libraries.
- Mind the sets you use. Actively check whether the sets used in a document are in fact the ones your software handles, i.e. check the declarations (see Annotation Declarations). For example, do not blindly assume any part-of-speech tag will be in your intended vocabulary. See Set Definitions (Vocabulary).
- Considering that FoLiA is vast, it is fine to support only a subset of annotation types in your software, or not to support certain complexities such as Correction Annotation. Just make sure to check the declarations, on the basis of which you can refuse to process a document.
- The structure of a text as represented in FoLiA documents can differ greatly between documents, as different types of documents (books, articles, papers, poetry, etc.) are structured differently. The annotation declarations in the metadata tell you what structural types you can encounter, but they don’t convey precisely how these structures are nested. Unless you have very good reason to do so, do NOT assume your documents are neatly subdivided into e.g. only paragraphs and sentences. There may be lists, figures, divisions. Generally speaking, you’ll often want to descend into the deepest structural nodes that have text. The FoLiA libraries provide a high-level API for you to do this.
- If you don’t use a FoLiA library, you may want to consider accepting only FoLiA documents in so-called explicit form (see Form). Explicit form does not use any implicit defaults but makes everything explicit in the XML, so the logic in your parser can be kept simpler. You can turn any explicit form document into a normal form one and vice versa, without loss. If you get a normal form document (which is the norm), run an external tool like foliavalidator --explicit to turn it into explicit form before parsing it. It’s strongly recommended not to shift this burden to the user, who may be confused by it.
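For readers who cannot use a FoLiA library, the following is a minimal sketch of two of the guidelines above: checking the declarations before processing, and descending to the deepest elements that carry text. It uses only Python's standard library. The names SUPPORTED_SETS, REJECTED_TYPES, declarations, problems and deepest_text_elements, the example.org set URL, and the sample document are all assumptions made for illustration; the sample is hand-written and has not been run through a validator. The descent ignores complexities such as corrections and alternatives, which is why the sketch refuses documents that declare them rather than trying to handle them.

```python
import xml.etree.ElementTree as ET

NS = "http://ilk.uvt.nl/folia"  # the FoLiA XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id

# Hypothetical configuration of an imaginary tool: the sets it understands
# and the annotation types it refuses to process.
SUPPORTED_SETS = {"pos": {"https://example.org/my-pos-set"}}
REJECTED_TYPES = {"correction", "alternative"}

# Hand-written illustrative document (an assumption, not validated).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example" version="2.0">
  <metadata>
    <annotations>
      <pos-annotation set="https://example.org/my-pos-set"/>
    </annotations>
  </metadata>
  <text xml:id="example.text">
    <p xml:id="example.p.1">
      <t>Hello world</t>
      <s xml:id="example.p.1.s.1">
        <w xml:id="example.p.1.s.1.w.1"><t>Hello</t></w>
        <w xml:id="example.p.1.s.1.w.2"><t>world</t></w>
      </s>
    </p>
    <p xml:id="example.p.2"><t>Untokenised paragraph.</t></p>
  </text>
</FoLiA>"""


def declarations(root):
    """Return (annotation type, set) pairs found under <metadata><annotations>."""
    found = []
    annotations = root.find(f"{{{NS}}}metadata/{{{NS}}}annotations")
    if annotations is not None:
        for decl in annotations:
            name = decl.tag.rsplit("}", 1)[-1]  # drop the namespace part of the tag
            if name.endswith("-annotation"):
                found.append((name[: -len("-annotation")], decl.get("set")))
    return found


def problems(root):
    """List reasons to refuse the document; an empty list means it is acceptable."""
    reasons = []
    for atype, aset in declarations(root):
        if atype in REJECTED_TYPES:
            reasons.append(f"unsupported annotation type: {atype}")
        elif atype in SUPPORTED_SETS and aset not in SUPPORTED_SETS[atype]:
            reasons.append(f"unsupported set for {atype}: {aset}")
    return reasons


def deepest_text_elements(root):
    """Collect elements that have a <t> child but no text-bearing descendants."""
    t_tag = f"{{{NS}}}t"
    result = []

    def visit(elem):
        deeper = False
        for child in elem:
            # Do not descend into <t> itself; it holds the text, not structure.
            if child.tag != t_tag and visit(child):
                deeper = True
        has_text = elem.find(t_tag) is not None  # direct <t> child only
        if has_text and not deeper:
            result.append(elem)
        return has_text or deeper

    visit(root)
    return result


if __name__ == "__main__":
    doc = ET.fromstring(SAMPLE)
    issues = problems(doc)
    if issues:
        raise SystemExit("Refusing document: " + "; ".join(issues))
    for element in deepest_text_elements(doc):
        print(element.get(XML_ID), element.find(f"{{{NS}}}t").text)
```

In the sample, the first paragraph carries text at both paragraph and word level, so only its words are returned, while the second paragraph is returned as a whole because nothing deeper has text. A real consumer would still be better served by a FoLiA library, which handles the cases this sketch refuses.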
Conventions
Conventions are good practices that you will encounter and are encouraged to follow, but they remain just conventions rather than strict guidelines.
- Most FoLiA software assigns verbose identifiers to all elements. We use the ID of the FoLiA document as the base identifier and then append the element type and sequence number, all delimited by dots. The IDs are cumulative in nature, so we get for instance example.p.1.s.2.w.3 for the third word in the second sentence in the first paragraph of the document with ID example. See Identifiers.
- Adding metadata to your document is always encouraged.
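The cumulative identifier convention described above can be sketched in a few lines of Python. The helper names make_id and child_id are hypothetical and not part of any FoLiA library; the sketch only illustrates how the dotted pieces are assembled.

```python
def make_id(doc_id, *path):
    """Build a cumulative ID from a document ID and (element type, number) pairs.

    The pairs are given outermost first, e.g. paragraph, then sentence, then word.
    """
    parts = [doc_id]
    for elem_type, seq in path:
        parts.extend((elem_type, str(seq)))
    return ".".join(parts)


def child_id(parent_id, elem_type, seq):
    """Derive a child's ID by appending its type and sequence number to the parent's ID."""
    return f"{parent_id}.{elem_type}.{seq}"


if __name__ == "__main__":
    # Third word in the second sentence of the first paragraph of document "example".
    print(make_id("example", ("p", 1), ("s", 2), ("w", 3)))
```

Deriving each ID from its parent's ID keeps identifiers unique within the document and makes an element's position readable from the ID alone, though nothing should rely on parsing IDs to recover structure.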