Some Thoughts on Implementing UEB Capitalization

(This article is based on my own understanding of Section 8: Capitalisation of the Rules of Unified English Braille and has not been approved by any braille authority.)

Introduction. I’ve been following the “word_reset as opcode” thread on the liblouis list and was interested in a post explaining that “markEmphases() is call[ed] before the main translation loop is started. It goes through the input and the typebuf and marks the beginning and ending of all the emphases (this includes capitalization).”

This explanation got me thinking about how I would approach the much simpler problem of marking a non-technical print source by hand so as to support UEB capitalization. When this project turned out to be considerably more complex than I’d anticipated, I decided to write up my approach in case the information turns out to be useful to someone else dealing with these rules. One of the complexities, which is detailed below, is the need to find out whether certain items are part of a capitals passage before being able to determine which capitalization method should be applied to them.

Text Analysis. Text analysis of a print source requires the identification of items and of separators. An item is a sequence of characters that does not include separators but is delimited by them. (I believe that my term item has basically the same meaning for print as what the UEB Rules refer to as a “symbols-sequence” for braille.) Typically only two kinds of separators are common in English text: whitespace and the dash character. Whitespace refers to the standard computer definition which includes actual spaces, tabs, the start and end of a file, etc. The dash character is the Unicode em dash which has the Unicode character code hex 2014.

An item has an optional prefix and an optional postfix and is thus not necessarily identical to a word, number, alphanumeric, etc. Prefixes and postfixes in marked-up print text generally consist of punctuation and/or tags indicating style changes. This article uses the term main portion to reference that part of an item that doesn’t include any prefix or postfix.
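The item and main-portion definitions above can be illustrated with a minimal Python sketch. It rests on assumptions of my own: the function names split_items and split_item are hypothetical, prefixes and postfixes are taken to be runs of Unicode punctuation at either end of an item, and style tags are not handled at all.

```python
import re
import unicodedata

EM_DASH = "\u2014"  # Unicode em dash, hex 2014


def split_items(text):
    """Split text into items, treating whitespace and the em dash as separators."""
    return [item for item in re.split(rf"[\s{EM_DASH}]+", text) if item]


def _is_punct(ch):
    # Assumption: any character in a Unicode "P*" category counts as punctuation.
    return unicodedata.category(ch).startswith("P")


def split_item(item):
    """Return (prefix, main portion, postfix) for a single item."""
    start, end = 0, len(item)
    while start < end and _is_punct(item[start]):
        start += 1
    while end > start and _is_punct(item[end - 1]):
        end -= 1
    return item[:start], item[start:end], item[end:]
```

For instance, split_item("(NASA),") yields the prefix "(", the main portion "NASA" and the postfix "),". Punctuation inside the main portion, such as the apostrophe in "don't", is left alone.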

Some relevant background may be found in write-ups for the Java BreakIterator class which incorporates a locale-dependent implementation of processes for separating words from their prefixes and postfixes in natural language text. (This class is not appropriate for marked-up text such as XML files.) Here’s a link to one such write-up: Java tutorial for isolating words.

Further discussion of processes for word separation in print-to-braille translators for marked-up text is outside the scope of this article. An automated print-to-braille translator would require some means of distinguishing an item’s prefix and postfix since that is a prerequisite to handling UEB capitalization and other translation issues.

General Discussion. My goal was simply to think about how to analyze and manually mark a print text so as to show where its UEB braille translation would require the use of UEB capitalization and also to show which UEB capitalization method would be required in a particular context. I did not consider the need for any other indicators.

The analysis of a single text element (or block-level element) at a time is sufficient for applying capitalization in most situations. (UEB Rule 8.5.5 defines text element as either a paragraph or list element.) However, per the example for Rule 8.5.5 along with the following two Rules, which apply to “a capitalized passage which extends over more than one text element,” the possibility of an extended capitalized passage requires additional analysis. This issue is discussed below in the section titled Identifying Continuous Capitalized Passages.

Where the Capitals Indicators Are Located. There are several UEB Rules that affect where the capitals passage indicators should be inserted. Per Rule 8.6.2 “it is best” that, where paired indicators such as the capitals passage indicator and its terminator are associated with other paired indicators and/or with paired characters, these items be nested such that the “close punctuation and indicators [are] in reverse order of opening.”

According to Rule 8.7.1 in the UEB rule book a dot 6 prefix, a capitals word indicator, and a capitals passage indicator are always “placed immediately before” the braille translation of the first affected print letter. (This means that the prefix or indicator is placed before any modifier or ligature indicator needed to translate the print letter to its braille equivalent but after the translation of any prefix that precedes the print letter.) Note that while this prescription must always be followed, per Rule 8.6.2 there is some flexibility in the placement of a capitals passage terminator, as there is in the case of Rule 8.5.5 for continuous capitalized passages, where the terminator is placed "only at the end of the final text element".

See UEB rule book section 8.8 Choice of capitalised indicators for additional information.

Pass 1. Preliminary Identification. Since some of the UEB capitalization rules depend on context, implementation of these rules may require more than one pass. My idea is that the first pass should identify seven different types of print items based on those characteristics of their main portions that need to be taken into account in order to implement the capitalization rules for a single text element. The first three types are never included in a capitals passage but the others can be included in appropriate contexts. (Identification of additional characteristics required to address the need for numerical, grade 1 and/or other indicators is not addressed. However, such identification would generally need to be done in conjunction with that for capitalization.)

LC or lowercase. An item where the main portion contains at least one letter and all of its letters are lower case. (Note that this particular type doesn’t need to be explicitly identified if it is taken as the default.)
TC or title case. An item where the first character of the main portion is an uppercase letter, the main portion includes at least one lower case letter and any other letters are also lower case. The main portion may include one or more characters which are not letters. (Title case items don’t require an additional pass prior to applying capitalization.)
MC or mixed case. An item where the main portion has at least one lower case and one upper case letter but doesn’t meet the definition of TC. The main portion may include one or more characters which are not letters. (Mixed case items don’t require an additional pass prior to applying capitalization.)
UCX or upper-case compatible. An item where none of the characters are letters.
UCN1 or upper-case plus non-alphabetic symbols. An item where all of the main portion’s letters are uppercase. The main portion must include one or more characters which are not letters, i.e. which are “non-alphabetic symbols”. The first character of the main portion must be an upper-case letter.
UCN or upper-case plus non-alphabetic symbols. An item where all of the main portion’s letters are uppercase but which doesn’t meet the definition of UCN1. (The distinction of UCN1 from UCN is necessary to avoid a capitals passage starting with a number or other non-alphabetic character.) The main portion must include one or more characters which are not letters, i.e. which are “non-alphabetic symbols”.
UC or upper case. An item where all of the main portion’s characters are upper-case letters.
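The seven definitions above can be turned into a small classifier. The following Python sketch is only my reading of those definitions: the function name classify is hypothetical, an empty main portion is treated as UCX, and letters are assumed to be ordinary cased letters as in English text.

```python
def classify(main):
    """Return LC, TC, MC, UCX, UCN1, UCN or UC for an item's main portion."""
    letters = [ch for ch in main if ch.isalpha()]
    if not letters:
        return "UCX"  # no letters at all (also covers an empty main portion)
    uppers = [ch for ch in letters if ch.isupper()]
    if not uppers:
        return "LC"  # at least one letter, none of them upper case
    if len(uppers) == len(letters):
        # Every letter is upper case.
        if len(letters) == len(main):
            return "UC"  # nothing but upper-case letters
        # Upper-case letters plus non-alphabetic symbols.
        return "UCN1" if main[0].isalpha() else "UCN"
    # From here on there is at least one upper-case and one lower-case letter.
    if main[0].isupper() and len(uppers) == 1:
        return "TC"  # only the first character is upper case
    return "MC"
```

Under these assumptions "Hello" is TC, "iPhone" is MC, "1984" is UCX, "COVID-19" is UCN1, "3D" is UCN and "NASA" is UC.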

Pass 2. Marking A Capitals Passage. Capitals passages need to be identified first if there are any UC or UCN portions in a text element since their associated items must be included in a capitals passage wherever possible.

According to UEB Rule 8.5.2 a capitalized passage “is three or more symbols-sequences and it may include non-alphabetic symbols.” In considering the examples in the UEB rule book, it appears that the word-like portion of the first item of a passage should be either UC or UCN1 and that of the last item (or symbols-sequence) of a passage should be one of UC, UCN1, or UCN. The remaining ones can be any of UC, UCN1, UCN or UCX. [My interpretation of the rule-book examples is that the first portion in a passage must start with an upper-case letter and the last item in a passage must have at least one upper-case letter.] This prescription supplies the necessary information for marking the insertion points in the print source of a non-continuous capitalized passage. However, for later convenience in implementing the special placement of the capitals terminator per Rules 8.5.5-8.5.7 for continuous capitalized passages, it is useful to determine during the first pass whether or not an identified capitalized passage comprises an entire text element and could thus possibly turn out to be part of a continuous capitalized passage. Note that per 8.5.6 this will never be the case when a text element is a heading or similar item.
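As a rough illustration of the prescription above, here is a Python sketch that finds candidate passages in a list of Pass 1 type labels for one text element. The name find_passages and the greedy longest-run strategy are my own assumptions; the sketch only counts items and does not check nesting with paired punctuation or emphasis.

```python
STARTERS = {"UC", "UCN1"}         # types allowed for the first item of a passage
ENDERS = {"UC", "UCN1", "UCN"}    # types allowed for the last item
MEMBERS = ENDERS | {"UCX"}        # types allowed anywhere in between


def find_passages(types):
    """Return (start, end, whole_element) triples; end is exclusive."""
    passages = []
    i, n = 0, len(types)
    while i < n:
        if types[i] not in STARTERS:
            i += 1
            continue
        # Extend the run as far as passage-compatible items continue.
        j = i
        while j < n and types[j] in MEMBERS:
            j += 1
        # Trim trailing items that cannot end a passage (UCX).
        end = j
        while end > i and types[end - 1] not in ENDERS:
            end -= 1
        if end - i >= 3:  # three or more symbols-sequences
            passages.append((i, end, i == 0 and end == n))
        i = j
    return passages
```

The third member of each triple records whether the passage comprises the entire text element, which is the information needed later when checking for a continuous capitalized passage.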

If it is desired to automate the Rule 8.6.2 “best practice” nesting, which was described earlier, then the passage algorithm needs to do more than count the number of symbols-sequences. For example, if a capitalized passage includes a paired punctuation symbol it should include both members of the pair. (See the first example in Rule 8.5.4.) Care must also be taken when the source includes emphasis. For example, were the tagged sample <b>BOLD CAPS EMPHASIS</b>, <i>ITALICS CAPS EMPHASIS</i> to be considered as a single capitalized passage, its capitalized passage indicator and typeform passage indicators could not be properly nested. In addition to ensuring nesting, the UEB rule book suggests that a capitalized passage should be comprised of only those items that are naturally part of a single passage. See Rules 8.5.4 and 8.5.6 for more information.

Identifying Continuous Capitalized Passages. The distinction between a non-continuous capitalized passage and two or more continuous capitalized passages cannot generally be determined until sufficient text has been analyzed. However, this determination has to be completed prior to insertion of any capitalization terminators since per UEB Rule 8.5.5 the use and placement of the capitals termination indicator differ in the two situations.
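One way to picture this look-ahead is the following Python sketch. It assumes each text element has already been reduced to a pair (whole, heading), where whole means the element consists entirely of one capitalized passage and heading means it is a heading or similar item. The name mark_continuations is hypothetical, and the sketch simply assumes that adjacent fully capitalized non-heading elements belong to one continuous passage, which in practice is a judgment call.

```python
def mark_continuations(elements):
    """Label each text element "none", "open" or "close".

    "none":  the element is not a whole-element capitalized passage.
    "open":  passage indicator at the start, no terminator at the end,
             because the passage continues into the next element.
    "close": passage indicator at the start, terminator at the end.
    """
    marks = []
    for k, (whole, heading) in enumerate(elements):
        if not whole:
            marks.append("none")
            continue
        # Look ahead one element; past the end nothing can continue.
        nxt_whole, nxt_heading = elements[k + 1] if k + 1 < len(elements) else (False, False)
        continues = not heading and nxt_whole and not nxt_heading
        marks.append("open" if continues else "close")
    return marks
```

The point of the sketch is that the label for an element cannot be settled until the following element has been examined, which is why terminator insertion has to wait.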

Last Pass. Marking the Use of Capital Letters and of the Capitals Word Indicator. (Note that, as pointed out earlier, some of these items may also require numeric and/or Grade 1 indicators.)

LC items and, of course, any UCX items that were not found to be part of a capitalized passage don’t require capitalization.

The main portion of a TC item requires a leading braille capital letter.

The main portions of UC items which were not found to be part of a capitalized passage require either a leading braille capital letter or capitals word indicator. (Note that most single letters also require a Grade 1 letter indicator.)

The main portions of MC items and of any UCN1 or UCN items which were not found to be part of a capitalized passage must be addressed case-by-case according to the rules for the use of capital letters and the capitals word indicator. In general an isolated capital letter is translated as a capital letter whereas a sequence of two or more capital letters is preceded by a capitals word indicator. The capitals word indicator usually doesn’t require an explicit terminator in the main portions of these mixed items as it is terminated implicitly by any non-letter character, by whitespace, or by a single capital letter. Rule 8.4 includes the example “TVOntario” where the capital letter O has been translated as a single capital letter in order to make the braille translation more readable and also to avoid the need for an explicit capitals terminator. (See UEB rule book 8.8 Choice of capitalised indicators for additional information on handling mixed items).
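The decisions in this last pass can be summarized as a small dispatch function. This Python sketch uses a hypothetical name, capitalization_action, and made-up action labels; it deliberately leaves MC, UCN1 and UCN items as "case by case" rather than guessing at rules that depend on context.

```python
def capitalization_action(item_type, main, in_passage):
    """Return a label describing how an item's main portion is capitalized."""
    if in_passage:
        # Already covered by a capitals passage indicator from the earlier pass.
        return "passage"
    if item_type in ("LC", "UCX"):
        return "none"
    if item_type == "TC":
        return "capital letter"
    if item_type == "UC":
        # A single capital takes a capital letter; two or more capitals
        # take the capitals word indicator.
        return "capital letter" if len(main) == 1 else "capitals word indicator"
    # MC, UCN1 and UCN outside a passage need individual treatment.
    return "case by case"
```

For example, capitalization_action("UC", "NASA", False) gives "capitals word indicator", while an item such as "TVOntario" (type MC) is reported as "case by case".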

Conclusion. Applying the UEB capitalization rules seems complex to me. This project made me think that automating these rules along with the related ones for UEB typeforms and for numerical and grade 1 modes would be difficult even in a one-off implementation restricted to UEB.