Executive Summary. There are no current applications that provide fully automated conversion of print to braille. Converting print to braille requires both formatting and translating. However, the major unsolved issue for automated conversion of non-technical print to braille is formatting. Braille-in-DAISY is most likely to succeed if it
The first commercial software for converting or transcribing printed documents to braille was developed for the IBM 709 in 1965. (The IBM 709 had a 32K 36-bit word memory.) This software translated print to contracted American English braille and even produced printed prooflists by "retranslating" the braille to print. Nonetheless, despite the many advances in text processing that have occurred in the ensuing 40-some years, there are still no software applications capable of providing fully-automated conversion of general print documents to properly-formatted braille.
Now the new Braille-in-Daisy project has the goal of solving the problem of automated conversion of print to braille.
This project aims to improve the efficiency of Braille print production from DTBook by:
- identifying and documenting the editing requirements of Braille
- providing a vocabulary for Braille editing in DTBook
- creating a printer independent, non-proprietary and universal "embosser ready" Braille format
- providing a framework for, and a basic implementation of, a fully automated conversion between DTBook (including the Braille editing vocabulary) and the "embosser ready" Braille format
This should be accomplished by the end of 2007.
It may help the Braille-in-Daisy project to succeed if we consider why fully automated conversion of print to braille is not already a reality.
Many proprietary and non-proprietary braille transcribing applications have been developed in the last 40 years. Their developers have clearly had the goal of producing accurate braille with greater automation. It is appropriate for us to look back and consider this history before moving forward with yet another braille transcribing application.
The following is my understanding of the five main problems facing automated braille transcription today:
Addressing these problems in turn we note:
Conclusion. The biggest areas of risk are items 1. and 4. The Braille-in-DAISY project will succeed to the extent that its members understand the unique difficulties of braille formatting and are able to make use of the new approaches made possible by current hardware and software environments.
Formatting refers to issues in braille transcribing related to the representation of the layout and styling of print documents. There are two types of braille formatting: direct document formatting and indirect character formatting.
Document formatting refers to any formatting of braille documents that is implemented directly by rendering, including centering, indenting, line breaks, and table layout.
Braille has special formats for many different types of documents and document elements including cartoons, cross references, columnar material, directions, exercises, footnotes, glossaries, hyperlinks, indexes, lists, plays, poems, stage directions, tables, and tables of contents.
Some aspects of braille document formatting are simply alternate renderings of their print analogues. In these cases, the correct braille formatting can be generated automatically if the print documents use semantic tagging for document elements. However, proper braille document formatting also requires addressing difficult braille-specific issues.
A detailed description of braille formatting rules is beyond the scope of this article. The formatting rules for American English Braille are online at brl.org. However, two examples are included to illustrate the general nature of the difficulties. One problem area is the need to include formatting information that reflects the structure of the rendered print document being transcribed. A second problem area is in the handling of tables.
When a print document is transcribed to braille, there is often a requirement that the formatting of the braille document reflect certain aspects of the print document. For example, print page numbers as well as braille page numbers must often be included.
A second case where the reference print document affects the formatting of the braille occurs with poetry. Here, not surprisingly, the relative indentation of the poetic lines in braille is the same as that of the print. However, braille lines typically contain less information than print lines so a poetic line in braille may have to be run over to the next line even though the print is a single line. There is a special rule for such runovers: if a line of poetry runs over in braille, it must be indented "two cells to the right of the beginning of the farthest indented poetic line in the entire poem." In other words, the indentation of a runover line in a braille transcription of a poem requires knowledge of the indentation structure of the entire printed poem.
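The runover rule for poetry can be sketched in a few lines. This is a minimal illustration only, under several assumptions that are not in the rules themselves: the function name `format_poem`, the representation of a poem as `(indent, text)` pairs with a 0-based cell offset, the default 40-cell line width, and the treatment of each character as one braille cell are all hypothetical. The point it shows is that the runover position must be computed from the whole poem before any single line can be laid out.

```python
def format_poem(lines, width=40):
    """Lay out poetic lines, wrapping long ones to a poem-wide runover position.

    lines: list of (indent, text) pairs; indent is a 0-based cell offset
    (an assumption of this sketch). Each character counts as one cell.
    """
    # The runover position depends on the farthest-indented line in the
    # entire poem, so it has to be known before formatting any line.
    runover = max(indent for indent, _ in lines) + 2
    out = []
    for indent, text in lines:
        current_indent = indent
        current = ""
        for word in text.split():
            candidate = word if not current else current + " " + word
            if current and current_indent + len(candidate) > width:
                # Line is full: emit it and continue at the runover position.
                out.append(" " * current_indent + current)
                current_indent = runover
                current = word
            else:
                current = candidate
        out.append(" " * current_indent + current)
    return out


poem = [(0, "aaaa bbbb cccc"), (2, "dd")]
formatted = format_poem(poem, width=10)
```

Here the farthest-indented line starts at offset 2, so the runover of the first line starts at offset 4 even though that line itself is not indented.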
The formatting of tables in braille must be done according to strict rules so as to ensure readability. These rules include changes to the table structure as necessary to accommodate the limited width of a braille page.
The American English braille rules for formatting tables are available online. Some of the major considerations for transcribing tables include the division of long tables, reformatting using a stairstep model and/or inverting rows and columns, and abbreviating the text of column and row headers to save space. Some of the other differences from print include the placement of captions, the omission of outer lines, and the use of guide dots. Many of these changes require documentation in the form of transcriber's notes.
Tables can present additional difficulties for contracted braille codes in cases where generating the layout requires knowing the size of table entries. This is because the number of braille cells required to translate a print item to contracted braille is often different from the number of characters in the print item.
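The dependence of table layout on translated length can be shown with a small sketch. Everything named here is hypothetical: `translate` is a toy translator that contracts only two whole words, the one-character stand-ins for the contracted cells are placeholders, and `column_widths` is just one plausible way to size columns. It is not a description of any existing transcribing application.

```python
# Hypothetical whole-word contractions; each maps to a one-cell stand-in symbol.
WHOLE_WORD_CONTRACTIONS = {"the": "!", "and": "&"}


def translate(text):
    """Toy translator: contracts only the whole words listed above."""
    return " ".join(WHOLE_WORD_CONTRACTIONS.get(w, w) for w in text.split())


def column_widths(rows, measure=len):
    """Width of each column as the largest measured entry in that column."""
    return [max(measure(cell) for cell in column) for column in zip(*rows)]


rows = [["the", "hand"], ["and", "it"]]
print_widths = column_widths(rows)
braille_widths = column_widths(rows, lambda cell: len(translate(cell)))
```

The first column needs three characters in print but only one cell in this toy contracted form, so the layout cannot be fixed until the entries have been translated.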
Character formatting refers to inline formatting or styling such as italics that is implemented indirectly with the use of embedded markup. Braille symbols used as markup are typically referred to as composition indicators.
Transcribing of character formatting presents a number of interesting problems over and above simply translating certain XML markup tags to braille indicators.
Scoping rules can be difficult to implement. The scope of braille composition indicators is typically implicit: explicit end tags are used only when the implicit scope is not what is intended. Also, the rules for the use of certain indicators depend on the number of affected items. For example, if three or fewer words are italicized, each word is italicized with its own composition indicator. However, a different mechanism is used to indicate italicized passages of more than three words.
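The count-dependent italics rule can be sketched as follows. The marker strings are hypothetical placeholders rather than real braille indicators, and the passage mechanism is modelled here as an opening marker before the first word plus a marker before the last word; that modelling is an assumption of this sketch, since the text above says only that a different mechanism is used.

```python
# Hypothetical placeholder markers, not actual braille cells.
WORD_MARK = "<i>"        # placed before each word of a short italic run
PASSAGE_OPEN = "<ii>"    # assumed marker before the first word of a passage
PASSAGE_LAST = "<i>"     # assumed marker before the last word of a passage


def mark_italics(words):
    """Return the words of one italicized run with composition markers added."""
    if len(words) <= 3:
        # Short run: every word carries its own indicator.
        return [WORD_MARK + w for w in words]
    # Longer passage: mark only the first and last words; scope is implicit.
    marked = list(words)
    marked[0] = PASSAGE_OPEN + marked[0]
    marked[-1] = PASSAGE_LAST + marked[-1]
    return marked


short_run = mark_italics(["War", "and", "Peace"])
long_run = mark_italics(["a", "tale", "of", "two", "cities"])
```

Note that the function needs the whole italicized run before it can emit the first marker, which is why a tag-by-tag translation of markup to indicators is not sufficient.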
Braille character formatting sometimes follows print rendering and sometimes follows print semantics. As an example of character formatting that depends on semantics, braille uses different composition indicators for titles and for symbols even though both items are typically styled similarly in print.
Braille has many fewer styling alternatives than print and different styles of print highlighting are not usually distinguished. In fact, print styling that is primarily decorative is often not indicated at all in braille. However, difficulties can occur such as when the print text makes explicit reference to a particular rendering, e.g. "Students should memorize all definitions shown in blue type."
Translating refers to replacing print characters or sequences of print characters with the braille cells that represent them, and adding special braille indicators as necessary to mark changes in the semantics or meanings of the braille cells. There are two types of braille systems: uncontracted and contracted.
Uncontracted braille replaces each individual print character with the corresponding single-cell or multi-cell braille symbol. The replacement rules for uncontracted (Grade 1) braille are context-free. Automated translation of print to uncontracted braille is straightforward and is not further discussed in this article.
Contracted braille extends uncontracted braille's replacement rules for print characters with additional rules for replacing common print words and other common sequences of print letters with special braille symbols known as contractions. Many of the replacement rules for contracted (Grade 2) braille are context-sensitive. Contracted English braille has approximately 300 replacement rules which are described in the next section. Automated translation of print text to contracted braille is complicated by the language-dependent restrictions for the use and non-use of contractions which are also described in the next section.
Following the description of contracted braille are discussions of two alternative approaches for automated translation of print to contracted braille: dictionary-based and rule-based.
[In addition to the particular and general rules, braille systems also have requirements on the use of certain contractions that depend on either the semantics or syntax of the item being translated. For example, some part-word contractions are not allowed when translating proper names. (This is one of the issues that the forthcoming XTrans print-to-braille translator will address.)]
Particular rules are replacement rules applicable to the use of particular contractions. Particular rules can include both position-dependent and language-dependent restrictions. An example of a particular replacement rule with both types of restriction is the rule that the print sequence dis may be replaced by the one-cell braille contraction representing this sequence only when the sequence is at the beginning of a word and only when it constitutes an entire syllable.
General rules are language-dependent rules for the use of contractions that are intended to make braille more readable. General rules over-ride particular ones. Here are two of the several dozen general rules for contracted English braille:
By the way, reasonable people can and often do disagree on the application of the rules to certain words. Moreover, the same rules for English braille are often interpreted differently in the US and UK.
Translation of print to braille can be done in a context-free and efficient manner on modern computers by using tables or dictionaries of the desired braille translations of whole words. This is in contrast to an older approach which uses a rule-based algorithm for translating parts of words as well as whole words.
The use of print-to-braille dictionaries avoids the need to implement the language-dependent restrictions on the use of contractions. (Words not affected by these restrictions can either be included in the dictionary or be translated automatically using a simple algorithm that doesn't take into account these restrictions.) This approach has a number of advantages in comparison with the older rules-based approach:
Dictionary-based translation is very similar to how humans translate print to braille. Human translators examine an entire word and if they can't remember the braille translation and think there is any possibility that it is a special case, they locate the translation in a resource such as The Braille Enthusiast's Dictionary.
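A minimal sketch of the dictionary-based approach might look like the following. The names `WHOLE_WORDS`, `naive_contract`, and `translate_word` are hypothetical, the bracket notation such as `(dis)` is only a readable stand-in for a contraction, and the dictionary entries are illustrative placeholders that have not been checked against any braille authority.

```python
# Illustrative whole-word entries for words affected by language-dependent
# restrictions. Parentheses mark a contraction; entries are placeholders.
WHOLE_WORDS = {
    "dishes": "di(sh)es",
    "dishpan": "di(sh)pan",
}


def naive_contract(word):
    """Simple fallback that ignores language-dependent restrictions."""
    if word.startswith("dis"):
        return "(dis)" + word[3:]
    return word


def translate_word(word, dictionary=WHOLE_WORDS, fallback=naive_contract):
    """Look the whole word up first; use the simple algorithm otherwise."""
    key = word.lower()
    if key in dictionary:
        return dictionary[key]
    return fallback(key)


results = [translate_word(w) for w in ["dishes", "dislike", "cat"]]
```

The lookup is context-free: no syllabification or pronunciation logic is needed at translation time, because any word that needs special handling is resolved by its dictionary entry.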
Rules-based translation is a well-known approximate algorithm for translating print to contracted braille. The defining characteristic of rules-based translation is the approximation of the language-dependent rules by various rules of thumb.
Given that braille rules not only entail syllabification and pronunciation but are open to interpretation, the most straightforward way of implementing a computer-based translation of print to contracted braille is the previously described use of a print-to-braille dictionary. However, the first braille transcribing applications were developed for the limited computer hardware of the early 1970s. Internal computer memories of that era could not accommodate a sufficiently large dictionary and repeated access of external storage media would have been too time-consuming. The rules-based approximation was an original and effective solution to these hardware limitations.
The rules-based approach implements the language-based rules of contracted braille codes by using additional ad hoc replacement rules which have been derived from a detailed practical analysis of the target language. (This type of approach is sometimes called Natural Language Processing or NLP.) The ad hoc rules are merged with the official position-dependent replacement rules of the braille system to form a single undifferentiated set of translation rules.
It is important to appreciate the ramifications of the previous statement. Many linguists and computer scientists who've tried to gain familiarity with braille translation have been misled when examining one of these undifferentiated rule sets and have unfortunately come to the mistaken understanding that all of the rules are part of an official braille system when, in fact, the large majority of the rules in such sets are simply the ad hoc rules that have been devised by the software implementors. This situation is particularly unfortunate when the mistaken understanding has been the basis of an effort to improve braille software, since work that treats ad hoc rules as official ones does not make the software more accurate.
Highly-tuned rule sets used in commercial software are proprietary so it is difficult to know how many ad hoc language-based rules are required to attain reasonable accuracy. An examination of several open source braille translation programs suggests that roughly 800 to 1,000 such rules are the minimum needed to provide reasonably accurate translations to contracted English braille.
It is probably easiest to understand the rules-based approach by way of example. The following rules, which are for the use (and non-use) of the dis contraction in English braille, are just a small portion of a complete rule set. These rules are arranged in order of decreasing priority. The rules-based translation procedure always selects the highest-priority rule applicable to the particular item being translated.
In this example, rules 1-4 are the ad hoc rules for the pronunciation and syllabification of English words which start with the sequence dis while rule 5 implements the English braille system's default positional restriction on the use of the braille contraction for dis.
Rule 1 implements a specific case of the previously mentioned general rule concerning pronunciation. Rules 2-4 are rules of thumb for determining whether or not dis at the start of a word constitutes a syllable as required by both the specific rule for this contraction and the general rule prohibiting any contraction's overlapping a major syllable division. Note that rule 4 implements exceptions to rule 5; rule 3 implements exceptions to rule 4; and rule 2 implements exceptions to rule 3.
By the way, this set of rules for using the dis contraction is intended to be illustrative and is not complete. It would, for example, result in mistranslations of both disaccharide and disulfide.
The invention of the rules-based approach around 1970 was a breakthrough that made it possible to develop reasonably accurate contracted braille translating applications suitable for older computer hardware with small memories. However, it was soon realized*, as this example illustrates, that maintaining a rule set can be problematic due to the ad hoc nature of the rules and, especially, to the potential for interactions among them. Changes to rule 4 in the preceding example could necessitate changes to rule 3, etc.
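The priority-ordered selection mechanism, and the way exceptions must sit ahead of the defaults they override, can be sketched as follows. The two rules shown are hypothetical and are not the numbered rules discussed in this article; the name `apply_first_rule` and the bracket notation for a contraction are also assumptions of the sketch, which handles only the start of a word and leaves the remainder untranslated.

```python
import re

# Rules in decreasing priority: (pattern at start of word, replacement).
# Both rules are hypothetical illustrations.
RULES = [
    # Ad hoc exception: in words starting "dish", "dis" is not a syllable.
    (re.compile(r"^dish"), "dish"),
    # Default positional rule: contract "dis" at the beginning of a word.
    (re.compile(r"^dis"), "(dis)"),
]


def apply_first_rule(word, rules=RULES):
    """Apply the highest-priority rule that matches the start of the word."""
    for pattern, replacement in rules:
        match = pattern.match(word)
        if match:
            # Replace the matched prefix; the rest is left as print here.
            return replacement + word[match.end():]
    return word


outputs = {w: apply_first_rule(w) for w in ["dishes", "dislike", "disulfide"]}
```

In this toy rule set, a word with no matching exception, such as disulfide, falls through to the default and receives the contraction. Fixing that requires adding yet another exception ahead of the default, and each added exception has to be checked against the rules already above and below it.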
A second, and more serious, difficulty with the rules-based approach is the lack of any natural way of dealing with the numerous braille rules that depend on the nature of the word undergoing translation. Contracted braille systems use different rules for ordinary words, for proper names, for stammered words, for letter words, for compound words, etc.
The time has come to retire the rules-based approach.
The changes in computer hardware and computer software since the development of the first braille transcribing applications are almost unimaginable. The Braille-in-DAISY project has the potential of finally achieving the long-held dream of instant and accurate automatic conversion of print to braille.
We can be reminded of how far we've come by reflecting on the image to the above left, which is a black-and-white photograph of the Keypunching Department at De Nederlandsche Blindenbibliotheek that was published in a 1976 document titled Automated Braille Production from Compositors Tapes. The photograph is a view from behind of three people sitting in front of their keypunch equipment. There is a nest of compositors tape on the floor. At the time of the publication, the Dutch Library employed 250 people with 10% of their yearly production of 1,000,000 braille pages produced manually by volunteers, 10% by automated braille printers, 16% by employees, and the remainder from zinc plates. The only computer was a 12 bit PDP 8-E with 8K of core memory.
*"However, preparation and modification of the [rules] tables can be done only by someone with a good knowledge of the way in which the translation algorithm works. It has been found in practice that anomalies in the translation can emerge over a long period of usage of the system, and minor changes to the tables must be made to correct these. If not done sufficiently skillfully, these changes may themselves introduce new miscontractions." J. B. Humphreys, An Adaptive Braille Transcription System -- Progress Report, Braille Research Newsletter, No. 11, (Warwick Research Unit for the Blind) August 1980.
First posted November 13, 2006
Slightly revised January 15, 2008
Revised version posted February 17, 2008
Revised version posted April 14, 2008
Contact author: info at dotlessbraille dot org