Tokenization of published texts
This document is a placeholder for more detailed documentation.
The automated build process that compiles archival editions of texts for publication also tokenizes each text. Each token is classified as one of “lexical”, “numeric”, “named entity”, “literal string”, or “label”, and can then be cited with a URN value reflecting its class. The build process creates an index of token occurrences, published as RDF statements that cite tokens by their CITE Collection URNs and the passages where they occur by CTS URNs.
The HMT project uses the hocus-pocus library’s HmtGreekPoetry class to tokenize texts for morphological analysis. This class considers XML markup as well as character values in order to tokenize texts. It walks the XML elements recursively and takes account of markup that determines either the classification of tokens (e.g., the contents of TEI num elements are classified as “numeric” tokens) or the grouping of tokens (e.g., a TEI w element groups its contents into a single token, no matter what further interior markup it contains). It further tokenizes text nodes on white space to determine a CTS URN (including a subreference with an explicit index number) identifying a unique passage of text containing a single token; it then eliminates punctuation.
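The steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HmtGreekPoetry implementation: the function names, the punctuation set, the default class of “lexical”, and the subreference form passage@token[n] are all assumptions made for the example, and only the num and w cases mentioned above are handled.

```python
import string
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: ASCII punctuation plus the Greek ano teleia and question mark.
PUNCTUATION = string.punctuation + "\u00b7\u037e"


def local_name(tag):
    """Return an element name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def emit_words(text, token_class, out):
    """Split a text node on white space and drop surrounding punctuation."""
    for chunk in (text or "").split():
        word = chunk.strip(PUNCTUATION)
        if word:
            out.append((word, token_class))


def walk(node, token_class, out):
    """Recursively collect (token, class) pairs from an element."""
    name = local_name(node.tag)
    if name == "w":
        # A w element is one token, whatever markup it contains.
        joined = "".join("".join(node.itertext()).split())
        emit_words(joined, token_class, out)
        return
    if name == "num":
        # Markup determines the class of everything inside it.
        token_class = "numeric"
    emit_words(node.text, token_class, out)
    for child in node:
        walk(child, token_class, out)
        # Text following a child belongs to the current element.
        emit_words(child.tail, token_class, out)


def tokenize(xml_text, passage_urn):
    """Return (cts_urn, token, class) triples for one citable passage."""
    pairs = []
    walk(ET.fromstring(xml_text), "lexical", pairs)  # assumed default class
    seen = Counter()
    tokens = []
    for word, token_class in pairs:
        seen[word] += 1  # index number distinguishes repeated tokens
        urn = f"{passage_urn}@{word}[{seen[word]}]"
        tokens.append((urn, word, token_class))
    return tokens
```

Calling tokenize with the XML of one citable passage and that passage's CTS URN returns one triple per token. A token that occurs twice in the passage receives the index numbers 1 and 2, so each occurrence is identified uniquely.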
Citation of tokens
Each token belongs to a CITE Collection. Its unique String value, as determined by the algorithm described above, is the object identifier; the namespace and collection identifiers for each class of token are:
- lexical tokens:
- numeric tokens:
- named entity tokens:
- literal string tokens:
- label tokens:
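As a rough illustration of how such citations and the index of occurrences fit together, the sketch below composes a CITE Collection URN from a namespace, a collection identifier, and the token string, then groups CTS URNs by token. The function names are hypothetical, the namespace and collection identifiers are passed in as parameters rather than taken from the list above, and any escaping of characters in the object identifier is left out.

```python
from collections import defaultdict


def cite_urn(namespace, collection, token):
    """Compose a CITE Collection URN with the token string as object identifier."""
    return f"urn:cite:{namespace}:{collection}.{token}"


def index_occurrences(occurrences, namespace, collections):
    """Map each token's CITE URN to the CTS URNs of the passages where it occurs.

    occurrences: iterable of (token, token_class, cts_urn) triples
    collections: dict from token class to collection identifier (caller-supplied)
    """
    index = defaultdict(list)
    for token, token_class, cts_urn in occurrences:
        key = cite_urn(namespace, collections[token_class], token)
        index[key].append(cts_urn)
    return dict(index)
```

Each key of the resulting dictionary paired with each CTS URN in its list corresponds to one statement of the kind the published RDF index records.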
Beginning with release 2012.8.12, the publication of archival texts automatically includes a zip file containing a brief README file and the RDF index of tokens. You can find the latest release here, or use the Maven coordinates