Issues for the Braille-in-DAISY Project

Click here for the updated (May 2011) version of this article

Current Issues for Automated Conversion of Print to Braille:
The Braille-in-DAISY Project

Executive Summary. There are no current applications that provide fully automated conversion of print to braille. Converting print to braille requires both formatting and translating. However, the major unsolved issue for automated conversion of non-technical print to braille is formatting. Braille-in-DAISY is most likely to succeed if it

learns from past efforts,

focuses most of its effort on formatting, and

makes effective use of modern computer hardware and modern approaches to text processing.

Introduction
Braille-in-DAISY
Learning from the Past
Formatting Braille

Document Formatting

Reference Print Documents
Tables

Character Formatting

Translating Print to Braille

Contracted Braille
- Particular Rules
- General Rules
Dictionary-based Translation
Rules-based Translation

Original Motivation for the Rules-Based Approach
How the Rules-Based Approach Works
An Example
Conclusion

Summary

Introduction

IBM 709 braille translation prooflist output from line printer: print interlined with simulated braille constructed from print period characters

The first commercial software for converting or transcribing printed documents to braille was developed for the IBM 709 in 1965. (The IBM 709 had a 32K 36-bit word memory.) This software translated print to contracted American English braille and even produced printed prooflists by "retranslating" the braille to print. Nonetheless, despite the many advances in text processing that have occurred in the ensuing 40-some years, there are still no software applications capable of providing fully automated conversion of general print documents to properly formatted braille.

Braille-in-DAISY

Now the new Braille-in-DAISY project has the goal of solving the problem of automated conversion of print to braille.

This project aims to improve the efficiency of Braille print production from DTBook by:

identifying and documenting the editing requirements of Braille

providing a vocabulary for Braille editing in DTBook

creating a printer independent, non-proprietary and universal "embosser ready" Braille format

providing a framework for, and a basic implementation of, a fully automated conversion between DTBook (including the Braille editing vocabulary) and the "embosser ready" Braille format

This should be accomplished by the end of 2007.

It may help the Braille-in-DAISY project to succeed if we consider why fully automated conversion of print to braille is not already a reality.

Learning from the Past

Many proprietary and non-proprietary braille transcribing applications have been developed in the last 40 years. Their developers have clearly had the goal of producing accurate braille with greater automation. It is appropriate for us to look back and consider this history before moving forward with yet another braille transcribing application.

The following is my understanding of the five main problems facing automated braille transcription today:

1. Braille formatting rules are complicated and can be difficult to automate.
2. Electronic publishing has created a "moving target": the variety of print formats that must be adapted to braille keeps increasing.
3. A significant number of different electronic document formats are used in the print world.
4. The developers of early braille transcribing applications necessarily spent much of their effort devising ad hoc rules-based algorithms for translating print to contracted braille within the constraints of older computer hardware with limited memory capacity.
5. The small market for braille transcribing software, coupled with standard software issues (maintenance of legacy code, requirements for upward compatibility, expensive customer support, etc.), has not provided an opportunity for starting over from the ground up. Many improvements have been evolutionary, not revolutionary.

Addressing these problems in turn we note:

1. We likely have little control over the formatting rules specified by the national braille authorities.
2. The growing use of semantic markup and metadata makes it easier to recognize the essential nature of print formats.
3. The problem of different electronic formats is already being addressed by the DAISY Pipeline project, which is developing converters from many other formats to DAISY DTBook. The current project will provide any extensions to DTBook necessary to support conversion to braille.
4. The current software development environment is completely different from what it was even 10 years ago, let alone 40 years ago. Modern large-memory computer hardware allows much simpler algorithms for braille translating than were possible on earlier hardware. Modern software tools, new standards, and new approaches to text processing (Java, Python, XML, Unicode, i18n, XSL, etc.) have led to success in related areas.
5. Braille-in-DAISY is not constrained by past practice.

Conclusion. The biggest areas of risk are items 1 and 4. The Braille-in-DAISY project will succeed to the extent that its members understand the unique difficulties of braille formatting and are able to make use of the new approaches made possible by current hardware and software environments.

Formatting Braille

Formatting refers to issues in braille transcribing related to the representation of the layout and styling of print documents. There are two types of braille formatting: direct document formatting and indirect character formatting.

Document Formatting

Document formatting refers to any formatting of braille documents that is implemented directly by rendering, including centering, indenting, line breaks, and table layout.

Braille has special formats for many different types of documents and document elements including cartoons, cross references, columnar material, directions, exercises, footnotes, glossaries, hyperlinks, indexes, lists, plays, poems, stage directions, tables, and tables of contents.

Some aspects of braille document formatting are simply alternate renderings of their print analogues. In these cases, the correct braille formatting can be generated automatically if the print documents use semantic tagging for document elements. However, proper braille document formatting also requires addressing difficult braille-specific issues.

A detailed description of braille formatting rules is beyond the scope of this article. The formatting rules for American English Braille are online at brl.org. However, two examples are included to illustrate the general nature of the difficulties. One problem area is the need to include formatting information that reflects the structure of the rendered print document being transcribed. A second problem area is in the handling of tables.

Reference Print Documents

When a print document is transcribed to braille, there is often a requirement that the formatting of the braille document reflect certain aspects of the print document. For example, print page numbers as well as braille page numbers must often be included.

A second case where the reference print document affects the formatting of the braille occurs with poetry. Here, not surprisingly, the relative indentation of the poetic lines in braille is the same as that of the print. However, braille lines typically contain less information than print lines so a poetic line in braille may have to be run over to the next line even though the print is a single line. There is a special rule for such runovers: if a line of poetry runs over in braille, it must be indented "two cells to the right of the beginning of the farthest indented poetic line in the entire poem." In other words, the indentation of a runover line in a braille transcription of a poem requires knowledge of the indentation structure of the entire printed poem.
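The runover rule can be sketched in a few lines of Python. This is only a minimal illustration, not a complete implementation of braille poetry format. The function name `format_poem`, the 40-cell line width, and the input shape (each poetic line given as an `(indent, text)` pair whose text is already translated, one character per braille cell) are all assumptions made for the example.

```python
def format_poem(lines, width=40):
    """Lay out poetic lines, indenting runovers per the rule quoted above.

    lines: list of (indent, text) pairs; indent is in braille cells and
    text is assumed to be already translated (one character per cell).
    """
    # The runover indent depends on the whole poem, not on the current line:
    # two cells to the right of the farthest indented poetic line.
    runover_indent = max(indent for indent, _ in lines) + 2
    out = []
    for indent, text in lines:
        current = " " * indent
        first_word = True
        for word in text.split():
            if not first_word and len(current) + 1 + len(word) > width:
                # The poetic line does not fit: start a runover line.
                out.append(current)
                current = " " * runover_indent + word
            else:
                current += ("" if first_word else " ") + word
            first_word = False
        out.append(current)
    return out


poem = [(0, "aaa bbb ccc ddd"), (2, "ee ff")]
for braille_line in format_poem(poem, width=10):
    print(repr(braille_line))
```

The point of the sketch is the first statement of the function: the indent used for every runover cannot be computed until all the lines of the poem have been seen. The sketch does not split words longer than the line width.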

Tables

The formatting of tables in braille must be done according to strict rules so as to ensure readability. These rules include changes to the table structure as necessary to accommodate the limited width of a braille page.

The American English braille rules for formatting tables are available online. Some of the major considerations for transcribing tables include the division of long tables, reformatting using a stairstep model and/or inverting rows and columns, and abbreviating the text of column and row headers to save space. Some of the other differences from print include the placement of captions, the omission of outer lines, and the use of guide dots. Many of these changes require documentation in the form of transcriber's notes.

Tables can present additional difficulties for contracted braille codes in cases where generating the layout requires knowing the size of table entries. This is because the number of braille cells required to translate a print item to contracted braille is often different from the number of characters in the print item.
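A small Python sketch shows why column widths have to be measured after translation. Everything here is hypothetical: `toy_translate` is a stand-in for a real contracted-braille translator, and the characters `!` and `+` are simply placeholders for one-cell symbols replacing the letter groups "the" and "ing". The toy ignores all of the restrictions on contractions discussed later in this article.

```python
def toy_translate(text):
    # Stand-in for a real translator: "!" and "+" are placeholder one-cell
    # symbols for the letter groups "the" and "ing" (an assumption made
    # purely for illustration).
    return text.replace("the", "!").replace("ing", "+")


def column_widths(rows, translate=toy_translate):
    """Width of each column in cells, measured after translation."""
    return [max(len(translate(cell)) for cell in column)
            for column in zip(*rows)]


rows = [["thing", "other"], ["a", "bathing"]]
print(column_widths(rows, translate=lambda text: text))  # widths in print characters
print(column_widths(rows))                               # widths in toy braille cells
```

A layout pass that sized the columns from the print text would reserve more cells than the translated entries need, so translation and table layout cannot be treated as fully independent steps.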

Character Formatting

Character formatting refers to inline formatting or styling such as italics that is implemented indirectly with the use of embedded markup. Braille symbols used as markup are typically referred to as composition indicators.

Transcribing of character formatting presents a number of interesting problems over and above simply translating certain XML markup tags to braille indicators.

Scoping rules can be difficult to implement. The scope of a braille composition indicator is typically implicit: explicit end tags are used only when the implicit scope is not what is intended. Also, the rules for the use of certain indicators depend on the number of affected items. For example, if three or fewer words are italicized, each word is italicized with its own composition indicator. However, a different mechanism is used to indicate italicized passages of more than three words.
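The dependence on word count can be illustrated with a short Python sketch. The marker strings are placeholders, not braille cells, and the handling of longer passages (an opening marker on the first word and a closing marker on the last word) is an assumption made for this example. Consult the applicable braille code for the actual indicators.

```python
ITALIC_WORD = "[i]"      # placeholder for the per-word italic indicator
ITALIC_OPEN = "[i][i]"   # placeholder: start of a longer passage (assumed)
ITALIC_LAST = "[i]"      # placeholder: marks the final word (assumed)


def mark_italics(words):
    """Return the words of one italicized span with indicators attached."""
    if len(words) <= 3:
        # Short span: every word carries its own indicator.
        return [ITALIC_WORD + word for word in words]
    # Longer passage: mark only the first and last words.
    return ([ITALIC_OPEN + words[0]]
            + words[1:-1]
            + [ITALIC_LAST + words[-1]])


print(" ".join(mark_italics("War and Peace".split())))
print(" ".join(mark_italics("Gone with the wind again".split())))
```

Note that the function needs the complete italicized span before it can emit the first word, which is why a simple tag-by-tag substitution of XML markup is not sufficient.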

Braille character formatting sometimes follows print rendering and sometimes follows print semantics. As an example of character formatting that depends on semantics, braille uses different composition indicators for titles and for symbols even though both items are typically styled similarly in print.

Braille has many fewer styling alternatives than print and different styles of print highlighting are not usually distinguished. In fact, print styling that is primarily decorative is often not indicated at all in braille. However, difficulties can occur such as when the print text makes explicit reference to a particular rendering, e.g. "Students should memorize all definitions shown in blue type."

Translating Print to Braille

Translating refers to replacing print characters or sequences of print characters with the braille cells that represent them and adding, where necessary, special braille indicators that mark changes in the semantics or meanings of the braille cells. There are two types of braille systems: uncontracted and contracted.

Uncontracted braille replaces each individual print character with the corresponding single-cell or multi-cell braille symbol. The replacement rules for uncontracted (Grade 1) braille are context-free. Automated translation of print to uncontracted braille is straightforward and is not further discussed in this article.

Contracted braille extends uncontracted braille's replacement rules for print characters with additional rules for replacing common print words and other common sequences of print letters with special braille symbols known as contractions. Many of the replacement rules for contracted (Grade 2) braille are context-sensitive. Contracted English braille has approximately 300 replacement rules which are described in the next section. Automated translation of print text to contracted braille is complicated by the language-dependent restrictions for the use and non-use of contractions which are also described in the next section.

Following the description of contracted braille are discussions of two alternative approaches for automated translation of print to contracted braille: dictionary-based and rules-based.

Contracted Braille

Contracted braille systems contain both position-dependent and language-dependent restrictions on the use of contractions. A position-dependent restriction is one that limits the use of a contraction to certain locations within a word although many contractions can be used in any location. A language-dependent restriction is one that depends on the syllabification or pronunciation of a word. Both types of rules can apply to particular contractions while many language-dependent rules are applicable in general.

[In addition to the particular and general rules, braille systems also have requirements on the use of certain contractions that depend on either the semantics or syntax of the item being translated. For example, some part-word contractions are not allowed when translating proper names. (This is one of the issues that the forthcoming XTrans print-to-braille translator will address.)]

Particular Rules

Particular rules are replacement rules applicable to the use of particular contractions. Particular rules can include both position-dependent and language-dependent restrictions. An example of a particular replacement rule with both types of restriction is the rule that the print sequence dis may be replaced by the one-cell braille contraction representing this sequence only when the sequence is at the beginning of a word and only when it constitutes an entire syllable.

General Rules

General rules are language-dependent rules for the use of contractions that are intended to make braille more readable. General rules over-ride particular ones. Here are two of the several dozen general rules for contracted English braille:

Contractions may be used where the print letters would overlap a minor syllable division, e.g. the contraction for in is used in tiny. However, contractions should not be used if they would overlap a major syllable division, e.g. the contraction for in is not used in binomial and that for dis is not used in dishevel.
Where a choice must be made between two consecutive contractions, preference should be given to the one which more nearly indicates correct pronunciation, e.g. the contraction for spirit rather than the one for dis is used in translating dispirited.

By the way, reasonable people can and often do disagree on the application of the rules to certain words. Moreover, the same rules for English braille are often interpreted differently in the US and UK.

Dictionary-Based Translation

Translation of print to braille can be done in a context-free and efficient manner on modern computers by using tables or dictionaries of the desired braille translations of whole words. This is in contrast to an older approach which uses a rules-based algorithm for translating parts of words as well as whole words.

The use of print-to-braille dictionaries avoids the need to implement the language-dependent restrictions on the use of contractions. (Words not affected by these restrictions can either be included in the dictionary or be translated automatically using a simple algorithm that doesn't take into account these restrictions.) This approach has a number of advantages in comparison with the older rules-based approach:

not context-sensitive
more efficient since it avoids the need to re-translate the same word numerous times
considerably simpler to implement and maintain
easier for the user to customize
possible to achieve 100% accuracy for any given document
minimizes the need for proofreading since the user only has to proof each word in the dictionary once rather than proofing every translation
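The dictionary-based approach can be sketched as follows. This is a toy illustration under several assumptions: braille is written in a readable placeholder notation in which `{x}` stands for the contraction for x, the three dictionary entries are examples taken from words discussed in this article, and `simple_translate` is a hypothetical stand-in for the "simple algorithm" that ignores the language-dependent restrictions. A real system would use a large dictionary and a complete set of contractions.

```python
import re

# Whole-word entries for words affected by language-dependent restrictions.
# {x} is placeholder notation for "the contraction for x".
DICTIONARY = {
    "dispirit": "di{spirit}",
    "dishevel": "di{sh}evel",
    "binomial": "binomial",
}


def simple_translate(word):
    # Hypothetical fallback that applies contractions with no regard to
    # syllables or pronunciation; only two contractions are modelled.
    if word.startswith("dis"):
        word = "{dis}" + word[3:]
    return word.replace("in", "{in}")


def translate(text, dictionary=DICTIONARY, fallback=simple_translate):
    def lookup(match):
        word = match.group(0)
        # A dictionary hit is used as-is; no context is consulted.
        return dictionary[word] if word in dictionary else fallback(word)
    return re.sub(r"[a-z]+", lookup, text.lower())


print(translate("Dispirit tiny binomial dislike"))
```

Correcting a mistranslation in this design means adding or editing one dictionary entry, which cannot affect the translation of any other word. The sketch lowercases its input for simplicity and so does not model capitalization.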

Dictionary-based translation is very similar to how humans translate print to braille. Human translators examine an entire word and if they can't remember the braille translation and think there is any possibility that it is a special case, they locate the translation in a resource such as The Braille Enthusiast's Dictionary.

Rules-Based Translation

Rules-based translation is a well-known approximate algorithm for translating print to contracted braille. The defining characteristic of rules-based translation is the approximation of the language-dependent rules by various rules of thumb.

Original Motivation for the Rules-Based Approach

Given that braille rules not only entail syllabification and pronunciation but are open to interpretation, the most straightforward way of implementing a computer-based translation of print to contracted braille is the previously described use of a print-to-braille dictionary. However, the first braille transcribing applications were developed for the limited computer hardware of the early 1970's. Internal computer memories of that era could not accommodate a sufficiently large dictionary and repeated access of external storage media would have been too time-consuming. The rules-based approximation was an original and effective solution to these hardware limitations.

How the Rules-Based Approach Works

The rules-based approach implements the language-based rules of contracted braille codes by using additional ad hoc replacement rules which have been derived from a detailed practical analysis of the target language. (This type of approach is sometimes called Natural Language Processing or NLP.) The ad hoc rules are merged with the official position-dependent replacement rules of the braille system to form a single undifferentiated set of translation rules.

It is important to appreciate the ramifications of the previous statement. Many linguists and computer scientists who've tried to gain familiarity with braille translation have been misled when examining one of these undifferentiated rule sets and have unfortunately come to the mistaken understanding that all of the rules are part of an official braille system when, in fact, the large majority of the rules in such sets are simply the ad hoc rules that have been devised by the software implementors. This situation is particularly unfortunate when the mistaken understanding has been the basis of an effort to improve braille software, since work that starts from this misunderstanding does not make the software more accurate.

Highly tuned rule sets used in commercial software are proprietary, so it is difficult to know how many ad hoc language-based rules are required to attain reasonable accuracy. An examination of several open-source braille translation programs suggests that something like 800-1000 such rules are the minimum needed to provide reasonably accurate translations to contracted English braille.

An Example

It is probably easiest to understand the rules-based approach by way of example. The following rules, which are for the use (and non-use) of the dis contraction in English braille, are just a small portion of a complete rule set. These rules are arranged in order of decreasing priority. The rules-based translation procedure always selects the highest-priority rule applicable to the particular item being translated.

1. the sequence dispirit at the start of a word is replaced by the braille cells for the letters d and i and by the contraction for spirit (i.e. in dispirit or dispirited)
2. the sequence dishev at the start of a word is replaced by the braille cells for the letters d and i, the braille cell for the contraction for sh, and the braille cells for the letters e and v (i.e. in dishevel or dishevelled)
3. the sequence dis at the start of a word when followed by cu, ha, he, ho is replaced by the dots-256 contraction for dis (e.g. in discus, dishabille, dishearten, and dishonest)
4. the sequence dis at the start of a word when followed by c, h, or k is replaced by the braille cells for the letters d, i, and s (e.g. in disc, dish, dishcloth, disk, or diskette)
5. the sequence dis at the start of a word is replaced by the dots-256 contraction for dis

In this example, rules 1-4 are the ad hoc rules for the pronunciation and syllabification of English words which start with the sequence dis while rule 5 implements the English braille system's default positional restriction on the use of the braille contraction for dis.

Rule 1 implements a specific case of the previously mentioned general rule concerning pronunciation. Rules 2-4 are rules of thumb for determining whether or not dis at the start of a word constitutes a syllable as required by both the specific rule for this contraction and the general rule prohibiting any contraction's overlapping a major syllable division. Note that rule 4 implements exceptions to rule 5; rule 3 implements exceptions to rule 4; and rule 2 implements exceptions to rule 3.

By the way, this set of rules for using the dis contraction is intended to be illustrative and is not complete. It would, for example, result in mistranslations of both disaccharide and disulfide.
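The five rules above can be written as a priority-ordered table and applied by a first-match search. The following Python sketch is an illustration only: it uses a placeholder notation in which `{x}` stands for the contraction for x, the names `RULES` and `translate_prefix` are invented for the example, and the remainder of the word after the matched prefix is left untranslated.

```python
# (prefix, permitted following letters or None for "any", replacement),
# listed in order of decreasing priority as in rules 1-5 above.
RULES = [
    ("dispirit", None, "di{spirit}"),
    ("dishev", None, "di{sh}ev"),
    ("dis", ("cu", "ha", "he", "ho"), "{dis}"),
    ("dis", ("c", "h", "k"), "dis"),
    ("dis", None, "{dis}"),
]


def translate_prefix(word):
    """Apply the highest-priority rule that matches the start of the word."""
    for prefix, followers, replacement in RULES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        if followers is None or rest.startswith(followers):
            return replacement + rest
    return word  # no rule applies


for word in ["dispirited", "dishevel", "dishonest", "dishcloth",
             "dislike", "disulfide"]:
    print(word, "->", translate_prefix(word))
```

Running the sketch shows disulfide falling through to rule 5 and receiving the dis contraction, which is one of the mistranslations noted above. Correcting it would mean inserting yet another ad hoc rule at the right priority and checking that it does not disturb the rules around it.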

Old photo of one woman and two men working in the Keypunching Department at De Nederlandsche Blindenbibliotheek

Conclusion

The invention of the rules-based approach around 1970 was a breakthrough that made it possible to develop reasonably accurate contracted braille translating applications suitable for older computer hardware with small memories. However, it was soon realized^* that maintaining a rule set can be problematic, as this example illustrates, due to the ad hoc nature of the rules and, especially, to the potential for interactions among them. Changes to rule 4 in the preceding example could necessitate changes to rule 3, etc.

A second and more serious difficulty with the rules-based approach is the lack of any natural way of dealing with the numerous braille rules that depend on the nature of the word undergoing translation. Contracted braille systems use different rules for ordinary words, for proper names, for stammered words, for letter words, for compound words, etc.

The time has come to retire the rules-based approach.

Summary

The changes in computer hardware and computer software since the development of the first braille transcribing applications are almost unimaginable. The Braille-in-DAISY project has the potential of finally achieving the long-held dream of instant and accurate automatic conversion of print to braille.

We can be reminded of how far we've come by reflecting on the image above, a black-and-white photograph of the Keypunching Department at De Nederlandsche Blindenbibliotheek. It was published in a 1976 document titled Automated Braille Production from Compositors Tapes. The photograph is a view from behind of three people sitting in front of their keypunch equipment. There is a nest of compositors tape on the floor. At the time of the publication, the Dutch Library employed 250 people with 10% of their yearly production of 1,000,000 braille pages produced manually by volunteers, 10% by automated braille printers, 16% by employees, and the remainder from zinc plates. The only computer was a 12-bit PDP-8/E with 8K of core memory.

^*However, preparation and modification of the [rules] tables can be done only by someone with a good knowledge of the way in which the translation algorithm works. It has been found in practice that anomalies in the translation can emerge over a long period of usage of the system, and minor changes to the tables must be made to correct these. If not done sufficiently skillfully, these changes may themselves introduce new miscontractions.
J. B. Humphreys, An Adaptive Braille Transcription System -- Progress Report, Braille Research Newsletter, No. 11, (Warwick Research Unit for the Blind) August 1980.

First posted November 13, 2006
Slightly revised January 15, 2008
Revised version posted February 17, 2008
Revised version posted April 14, 2008
Contact author: info at dotlessbraille dot org
