However it is only now that serious attempts are being made to assemble information on sources of "clean" text in digital form. In the next few years, as the printing industry gradually changes to computer-based compositing systems, the availability of braille could be significantly increased. It is also technically feasible to eliminate multiple typing of material when inkprint, large print and braille editions are required.
J.M. Gill, "The Use of Digitally-Stored Text for Braille Production," Braille Automation Newsletter, (Warwick Research Unit for the Blind) August 1976, p. 10.

Current Issues for Automated Conversion of Print to Braille:
Incorporating Braille in DAISY Pipeline 2

Executive Summary. There are no current applications that provide fully automated conversion of print to braille. Converting print to braille requires both formatting and translating. The major unsolved issue for automated conversion of non-technical print to braille is formatting. A second key issue is the lack of support for translation rules that depend on semantics. The current (2011) effort to incorporate braille production in DAISY Pipeline 2 is most likely to succeed if it

  1. learns from previous efforts
  2. applies significant attention to formatting issues
  3. identifies and addresses systematic problems in current braille software
  4. makes effective use of modern computer hardware
  5. leverages modern approaches to automated text processing.

Introduction

IBM 709 braille translation prooflist output from line printer: print interlined with simulated braille constructed from print period characters

The first commercial software for converting or transcribing printed documents to braille was developed for the IBM 709 in 1965. (The IBM 709 had a 32K 36-bit word memory.) This software translated print to contracted American English braille and even produced printed prooflists by "retranslating" the braille to print. Nonetheless, despite the many advances in text processing and the many significantly more difficult software problems that have been solved in the ensuing 45 years, there are still no software applications capable of providing automated conversion of general print documents to accurate braille.

Braille Production within the DAISY Pipeline

DAISY has a goal of incorporating braille production in DAISY Pipeline 2.

This goal is a follow-on to the earlier Braille-in-Daisy project, which was not completed for several reasons. One issue was that the specification of DTBook was inadequate for supporting some aspects of braille production, including mathematics. Meanwhile it turned out that DTBook needed significant improvements for reasons unrelated to braille production, and it has now been superseded by the greatly improved and entirely new specification known colloquially as ZedAI.

One outcome of Braille-in-Daisy was the specification of the Portable Embosser Format (PEF) and development of software support for producing braille files in this format. PEF is a printer-independent, non-proprietary and universal "embosser ready" Braille format.

The work of addressing braille production for Pipeline 2 has been divided into two phases. The first phase, being carried out by the first Braille Production Working Group, is currently (May 2011) underway. This group is attempting to quantify the current situation for braille production and to ensure that ZedAI markup includes all the features necessary to support automated braille production. It is also adding some preliminary components to the DAISY Pipeline. This first phase is scheduled to be completed by September 2011.

The second phase is planned to begin in October 2011. The work scope has not yet been quantified but is intended to complete whatever support is necessary for braille production from ZedAI.

Learning from the Past

[Note: The Braille Authority of North America (BANA) has very recently published the first part of a proposed three-part article titled The Evolution of Braille: Can the Past Help Plan the Future? which provides additional historical background and perspective on the impact on braille usage caused by the widespread use of computer technology.]

It may help the DAISY-based braille production project to succeed if we consider why fully automated conversion of print to braille is not already a reality.

Many proprietary and non-proprietary braille transcribing applications have been developed during the time since 1965. Their developers have clearly had the goal of producing accurate braille with greater automation. It is appropriate for us to look back and consider this history before moving forward with yet another approach to braille transcribing.

The following is my understanding of the five main problems facing automated braille transcription today:

  1. Braille formatting rules are complicated and can be difficult to automate.
  2. Electronic publishing has created a "moving target" as far as the increasing variety of the print formats to be adapted to braille.
  3. There are a significant number of different electronic document formats used in the print world.
  4. The developers of early braille transcribing applications necessarily spent much of their effort on devising ad hoc rules-based algorithms to address the constraints of automating the translating of print to contracted braille when using older computer hardware with limited memory capacity.
  5. The small market for braille transcribing software coupled with standard software issues—maintenance of legacy code, requirements for upward-compatibility, expensive customer support, etc.—have not provided commercial braille software providers the opportunity for starting over from the ground up. Many improvements have been evolutionary, not revolutionary.

Addressing these problems in turn we note:

  1. We likely have little control over the formatting rules specified by the national braille authorities.
  2. The growing use of semantic markup and metadata makes it easier to recognize the essential nature of print formats.
  3. The problem of different electronic formats is already being addressed by the DAISY Pipeline project which is developing converters from many other formats to DAISY DTBook and ZedAI. The current project will provide any extensions to ZedAI necessary to support conversion to braille.
  4. The current software development environment is completely different from what it was even 10 years ago, let alone when the first braille applications were designed and developed. Modern large-memory computer hardware allows simpler and more accurate algorithms for braille translating than were possible on earlier hardware. Modern software tools, new standards, and new approaches to text processing (Java, Python, XML, Unicode, i18n, XSL, etc.) have led to success in related areas.
  5. DAISY-based braille production is not constrained by past practice and can utilize open source as well as commercial tools.

Conclusion. The biggest areas of risk are items 1 and 4. The project of providing new automated braille production tools will succeed to the extent that it accurately characterizes the unique difficulties of braille production and makes effective use of the many new approaches made possible by current hardware and software technology.

Formatting Braille

Formatting refers to issues in braille transcribing related to the representation of the layout and styling of print documents. There are two types of braille formatting: direct document formatting and indirect character formatting.

Document Formatting

Document formatting refers to any formatting of braille documents that is rendered directly with whitespace; this includes centering, indenting, line breaks, and table layout.

Braille has special formats for many different types of documents and document elements including cartoons, cross references, columnar material, directions, exercises, footnotes, glossaries, hyperlinks, indexes, lists, plays, poems, stage directions, tables, and tables of contents.

Some aspects of braille document formatting are simply alternate renderings of their print analogues. In these cases, the correct braille formatting can be generated automatically if the print documents use semantic tagging for document elements. However, proper braille document formatting also requires addressing difficult braille-specific issues.

A growing problem in transcribing print documents to braille is determining the proper linear presentation (reading) order for print documents formatted with items such as text boxes, sidebars and multiple columns. A contrasting problem is the need to avoid loss of information by reflecting certain aspects of the actual layout of the print document. A third problem area is the handling of intrinsically planar arrangements such as tables.

A detailed description of braille formatting rules is beyond the scope of this article. (Note that the 1997 edition of the formatting rules for American English Braille is online at brl.org; the rules and later errata and updates may be downloaded from BANA.) However, the examples described in the next sections were chosen to illustrate the general nature of the problem. The final example is one of the unique aspects of braille formatting not typically handled by formatting applications designed for print documents.

Determination of Reading Order

Braille documents support efficient tactile reading by avoiding nonlinear layouts such as those using multiple columns. Resolving this issue is known as determining the desired reading order of a print source document. A member of the Braille Working Group recently characterized the problem this way:

[We] transcribe schoolbooks into Braille. These books have relatively little continuous text. The text is frequently interrupted by sidebars, coloured boxes, explanatory illustrations, etc.

Note that the documentation for the Acrobat Professional TouchUp Reading Order tool and related accessibility features provides an excellent introduction to this problem and to some tools for addressing it.

Reference Print Documents

When a print document is transcribed to braille, there is often a requirement that the formatting of the braille document reflect certain aspects of a particular edition of the print document. For example, print page numbers as well as braille page numbers must often be included.

A second case where the reference print document affects the formatting of the braille occurs with poetry. Here, not surprisingly, the relative indentation of the poetic lines in braille is the same as that of the print. However, braille lines typically contain less information than print lines so a poetic line in braille may have to be run over to the next line even though the print is a single line. American English braille has a special rule for such runovers: if a line of poetry runs over in braille, it must be indented "two cells to the right of the beginning of the farthest indented poetic line in the entire poem." In other words, the indentation of a runover line in a braille transcription of a poem requires knowledge of the indentation structure of the entire printed poem.
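The runover rule quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `runover_indent` and `format_poem` are hypothetical, the text is assumed to be already translated so that one character occupies one braille cell, and the 40-cell default line width is an assumption rather than something taken from the rules.

```python
import textwrap


def runover_indent(line_indents, extra_cells=2):
    """Indent (in cells) shared by every runover line of one poem.

    line_indents holds the indentation of each poetic line, taken from print.
    The rule needs the maximum over the entire poem, so it cannot be applied
    one line at a time.
    """
    if not line_indents:
        raise ValueError("poem has no lines")
    return max(line_indents) + extra_cells


def format_poem(lines, width=40):
    """Lay out a poem given as (indent, text) pairs on lines of `width` cells."""
    runover = runover_indent([indent for indent, _ in lines])
    out = []
    for indent, text in lines:
        out.extend(
            textwrap.wrap(
                text,
                width=width,
                initial_indent=" " * indent,
                subsequent_indent=" " * runover,
            )
        )
    return out


poem = [(0, "aaa bbb ccc ddd"), (2, "eee")]
for braille_line in format_poem(poem, width=10):
    print(repr(braille_line))
```

The point of the sketch is the two-pass structure: the whole poem must be scanned for its deepest indentation before the first line can be laid out.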

Tables

The formatting of tables in braille must be done according to strict rules so as to ensure readability. These rules include changes to the table structure as necessary to accommodate the fixed size of braille cells and the limited width of a braille page.

The online version of the American English braille rules for formatting tables details a number of issues. Some of the major considerations when transcribing tables include the division of long tables, reformatting using a stairstep model and/or inverting rows and columns, and abbreviating the text of column and row headers to save space. Some other possible differences from print include the placement of captions, the omission of outer lines, and the use of guide dots. Many of these changes require documentation in the form of transcriber's notes.

Tables can present additional difficulties for contracted braille codes in cases where generating the layout requires knowing the size of table entries. This is because the number of braille cells required to translate a print item to contracted braille is often different from the number of characters in the print item.

Some Unique Aspects of Braille Formatting

Embossed paper braille is bulky and some braille formatting prescriptions are intended to reduce this bulk. One example is the sometimes prescribed inclusion of body text on header or footer lines that are also used for page numbers. Print applications generally treat headers and footers separately from body text which, of course, simplifies pagination. Print-based applications that correct page number references in body text do need to make multiple passes, but even this is simpler than handling the braille problem, where the actual text that goes on a page cannot be determined in advance of pagination.

Of course one solution is to eliminate the option to require the inclusion of body text in braille headers and footers. However, given the speed of modern computer hardware and the capabilities of modern algorithm developers, there is no reason for braille software not to support this option whether or not it is seen as of much benefit.

Character Formatting

Character formatting refers to inline formatting or styling such as italics that is implemented indirectly with the use of markup intended to be interpreted by the braille reader. Braille symbols used as markup are typically referred to as composition indicators.

Transcribing of character formatting presents a number of interesting problems over and above simply translating certain print-based markup tags to the corresponding braille indicators. The source of these problems is that the rules for using braille indicators are designed to enhance human readability while those for XML or other print-based markup are designed for machine processing. Some examples of these problems are given in the following sections.

Scoping Rules

Scoping rules require special attention. The scope of a braille composition indicator is typically implicit; explicit termination indicators, which add clutter, are used only when the default implicit scope is not what is intended.

Different Treatment of Short and Long Passages

Another difference from standard markup protocols is that the rules for the use of certain indicators depend on the number of affected items. For example, if three or fewer words are italicized, each word is italicized with its own composition indicator. However, a different mechanism is used to avoid clutter in italicized passages of more than three words.
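A minimal Python sketch of this length-dependent markup follows. The bracketed markers are hypothetical stand-ins for braille indicators, and the long-passage branch (a doubled indicator before the first word and a single indicator before the last word) reflects my understanding of the American English convention; it is an assumption that should be checked against the official rules.

```python
WORD_SIGN = "[i]"      # hypothetical marker for the single italic indicator
PASSAGE_SIGN = "[ii]"  # hypothetical marker for the doubled (passage) indicator


def mark_italics(words, short_limit=3):
    """Attach italic markers to a run of consecutive italicized words."""
    if len(words) <= short_limit:
        # Short run: every word carries its own indicator.
        return [WORD_SIGN + word for word in words]
    # Long run (assumed convention): open once, then flag only the last word.
    marked = list(words)
    marked[0] = PASSAGE_SIGN + marked[0]
    marked[-1] = WORD_SIGN + marked[-1]
    return marked


print(mark_italics(["War", "and", "Peace"]))
print(mark_italics(["A", "Tale", "of", "Two", "Cities"]))
```

Unlike paired XML tags, the output depends on counting the affected words first, which is why a tag-by-tag substitution is not sufficient.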

Semantic Considerations

Braille character formatting sometimes follows print rendering and sometimes follows print semantics. As an example of character formatting that depends on semantics, braille uses different composition indicators for titles and for symbols even though both items are typically styled similarly in print. Similarly, adjacent italicized items which are italicized for different reasons, e.g. one for emphasis and the other because it is a title, are marked up separately.

Need For Human Input

Braille has many fewer styling alternatives than print and addressing this difficulty may require human judgement. For example, different styles of print highlighting are often not distinguished in braille. In fact, print styling that is primarily decorative may sometimes be ignored in braille. However, this practice can lead to difficulties such as when the print text or classroom teacher makes explicit reference to a particular rendering, e.g. "Students should memorize all definitions shown in blue type."

Translating Print to Braille

Translating refers to replacing print characters or sequences of print characters with the braille cells that represent them, and inserting the special braille indicators needed to mark changes in the semantics or meanings of the braille cells.

Procedures for inserting indicators can be quite complex and further discussion is outside the scope of this article. As for replacing print characters with braille characters, there are two types of braille systems used to translate literary or non-technical materials: uncontracted and contracted.

Uncontracted braille replaces each individual print character with the corresponding single-cell or multi-cell braille symbol. The replacement rules for uncontracted (Grade 1) braille are context-free although the rules for the use of composition indicators may depend on context. Automated translation of print to uncontracted braille is not further discussed in this article. However it should be pointed out that uncontracted braille requires the same procedures for inserting indicators as does contracted braille.

Contracted braille extends uncontracted braille's replacement rules for print characters with additional rules for replacing common print words and other common sequences of print letters with special braille symbols known as contractions. Many of the replacement rules for contracted (Grade 2) braille are context-sensitive. Contracted English braille has approximately 300 replacement rules which are described in the next section. Automated translation of print text to contracted braille is complicated by the language-dependent restrictions for the use and non-use of contractions which are also described in the next section.

Following the description of contracted braille are discussions of two alternative approaches for automated translation of print to contracted braille: dictionary-based and rule- or translation-table-based.

Overview of Contracted Braille Rules and Restrictions

Contracted braille systems can have up to four types of restrictions and rules on the use of contractions and indicators:

  1. Syntax restrictions are ones that depend on the nature or arrangement of characters.
  2. Semantic restrictions are ones that depend on the meaning or use of a word.
  3. Position-dependent restrictions are ones that limit the use of a contraction to certain relative positions or locations within a word.
  4. Language-dependent restrictions are ones that depend on the syllabification or pronunciation of a word.

Persons familiar with one or more braille systems are typically more aware of the position- and language-dependent restrictions described further below. Nonetheless, it is often the syntax- and semantics-dependent rules and requirements that cause the most translation errors and the most difficulties for automated translation.

The nature of syntax-dependent rules depends, of course, on the particular braille system. An example of a rule determined by syntax, taken from English Braille American Edition (EBAE), is the rule for the use of the letter sign and number sign indicators in alphanumeric items. A second example is the rule that prohibits the use of contractions in partially-emphasized words, which are words where some but not all of the letters are italicized, or non-titlecase words where some but not all of the letters are capitalized.

An example of a commonly-used semantics-dependent rule is the restriction on the use of certain contractions when translating proper names. EBAE has a number of semantic-dependent translation rules for hyphenated items which are described in the article titled Issues for Braille Translation of Hyphenated Items.

Automated handling of syntax- and semantics-dependent braille translation will require a new, more sophisticated, approach. This issue, which is not addressed here, is discussed in the article titled The Requirement for Multiple Translation Algorithms to Produce Accurate Braille.

Contracted braille systems include particular and general rules restricting the use of contractions, as detailed in the following two sections.

Particular Rules

Particular rules are replacement rules applicable to the use of particular contractions or types of contractions. An example of a particular replacement rule with two types of restrictions is the rule for replacing the print sequence dis with the one-cell braille contraction representing this sequence. This rule has a position-dependent restriction in that it may only be used when the sequence is at the beginning of a word. It also has a language-dependent restriction in that it may only be used when it constitutes an entire syllable.

General Rules

General rules are language-dependent rules for the use of contractions that are intended to make braille more readable. General rules over-ride particular ones. Here are two of the several dozen general rules for contracted English braille:

  1. Contractions may be used where the print letters would overlap a minor syllable division, e.g. the contraction for in is used in tiny. However, contractions should not be used if they would overlap a major syllable division, e.g. the contraction for in is not used in binomial and that for dis is not used in dishevel.
  2. Where a choice must be made between two consecutive contractions, preference should be given to the one which more nearly indicates correct pronunciation, e.g. the contraction for spirit rather than the one for dis is used in translating dispirited.

By the way, reasonable people can and often do disagree on the application of the rules to certain words. Moreover, the same rules for English braille are often interpreted differently in the US and UK.

Introduction to Automated Translation of Print to Contracted Braille

Much of the past effort in automating braille translation has focused on translating print to contracted literary braille. Less attention has been paid to automating the translation of mathematics and other technical material. It is not entirely clear why this is the case. It may be in part because translation to contracted braille was the first problem that seemed amenable to automation. Another reason may be that braille training materials and related information typically overemphasize contraction usage as being the defining characteristic of braille. In reality, as proponents of uncontracted literary braille well know, it is carefully-designed formatting conventions, indicator usage, and numerous specialized braille codes that make it possible for braille documents to accurately reflect printed information despite being limited to only 63 unstyled braille cells.

Further discussion of the automated production of braille technical materials is outside the scope of this article. However, given the significant resources devoted to automated translation to contracted braille, the remainder of this article attempts to demystify the current algorithms used for this purpose. Nonetheless the reader should be aware that any algorithm based simply on substitution cannot handle all the difficulties of braille translation. Consider, for example, the translation of the alphanumeric sequences abc123 and 123abc to literary braille. Both sequences require the insertion of a number sign indicator before the first digit but only the latter requires the insertion of a letter sign indicator before the first letter.
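The alphanumeric example can be made concrete with a short Python sketch. It models only the two behaviours stated above (a number sign before the first digit of a run, a letter sign before a letter that directly follows a digit); the function name and the ASCII markers `#` and `;` are illustrative assumptions, and the real rules cover many more cases.

```python
NUMBER_SIGN = "#"  # illustrative marker for the number sign indicator
LETTER_SIGN = ";"  # illustrative marker for the letter sign indicator


def add_indicators(item):
    """Insert number and letter sign markers into an alphanumeric item."""
    out = []
    prev = ""
    for ch in item:
        if ch.isdigit() and not prev.isdigit():
            out.append(NUMBER_SIGN)   # start of a run of digits
        elif ch.isalpha() and prev.isdigit():
            out.append(LETTER_SIGN)   # letter directly after a digit
        out.append(ch)
        prev = ch
    return "".join(out)


print(add_indicators("abc123"))  # abc#123
print(add_indicators("123abc"))  # #123;abc
```

Note that the decision for each character depends on its neighbour, not on the character alone, so a plain character-for-character substitution table cannot produce both results.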

Dictionary-Based Print-to-Braille Translation

Translation of print to braille can be done in a context-free and highly efficient manner on modern computers by searching lists or dictionaries of the desired braille translations of whole words. This is in contrast to an older and more well-known approach which uses a rule-based algorithm employing so-called "translation tables" for translating parts of words as well as whole words.

The use of print-to-braille dictionaries avoids the need to use heuristics or other special approaches to handle the language-dependent restrictions on the use of contractions. (Words not affected by these restrictions can be either included in the dictionary or translated automatically using a simple algorithm that doesn't take into account these restrictions.) This approach has a number of advantages in comparison with the older rules-based approach:

Dictionary-based translation is very similar to how humans translate print to braille. Human translators examine an entire word and if they can't remember the braille translation and think there is any possibility that it is a special case, they locate the translation in a resource such as The Braille Enthusiast's Dictionary.
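The lookup-then-fallback idea can be sketched in Python as follows. Everything here is a toy under stated assumptions: the dictionary holds three hand-picked entries, contractions are shown in parentheses rather than as braille cells, only the dis and in contractions are modelled in the fallback, and the names `WHOLE_WORDS`, `naive_translate` and `translate` are hypothetical.

```python
# Whole-word translations for words affected by language-dependent
# restrictions. A real dictionary would hold many thousands of entries.
WHOLE_WORDS = {
    "binomial": "binomial",        # "in" would overlap a major syllable division
    "dispirited": "di(spirit)ed",  # "spirit" preferred over "dis"
    "disc": "disc",                # "dis" is not a syllable here
}


def naive_translate(word):
    """Fallback that ignores the language-dependent restrictions."""
    if word.startswith("dis"):
        word = "(dis)" + word[3:]
    return word.replace("in", "(in)")


def translate(word):
    """Look the whole word up first; fall back only for unlisted words."""
    key = word.lower()
    if key in WHOLE_WORDS:
        return WHOLE_WORDS[key]
    return naive_translate(key)


for w in ["tiny", "binomial", "dislike", "dispirited"]:
    print(w, "->", translate(w))
```

The design choice worth noting is that all special-case knowledge lives in data that can be reviewed word by word, while the fallback stays deliberately simple.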

Replacement Rules- or Translation Table-Based Print-to-Braille Translation

Rules-based translation is a well-known approximate algorithm for translating print to contracted braille. The defining characteristic of rules-based translation is the approximation of the language-dependent rules by various rules of thumb.

Original Motivation for the Rules-Based Approach

Given that braille rules not only entail syllabification and pronunciation but are open to interpretation, the most straightforward way of implementing a computer-based translation of print to contracted braille is the previously described use of a print-to-braille dictionary. However, the first braille transcribing applications were developed for the limited computer hardware of the early 1970s. Internal computer memories of that era could not accommodate a sufficiently large dictionary and repeated access of external storage media would have been too time-consuming. The rules-based approximation, which was devised in 1970 by mathematician Jonathan Millen, was an original and effective solution to these hardware limitations. (Millen, Jonathan K., Finite-State-Syntax Directed Braille Translation, Technical Report MTR-1829, MITRE Corporation, Bedford, Massachusetts, July 2, 1970.)

Current Motivation for Continued Use of the Rules-Based Approach

The rules-based or "translation table" approach is still the basis of many commercial and open source braille translation applications. Given the long-standing awareness of the shortcomings of this 40-year-old approach with respect to accuracy and maintenance, it is interesting to speculate why it remains so popular. I can think of two possibilities. First, of course, is tradition. The community of braille software developers is very small and there seems to be limited commercial advantage to competing with respect to increasing accuracy. The more significant reason is likely the trade-off between accuracy and simplicity with respect to supporting internationalization and localization by simply using different input tables for different braille systems. The latest (11.1sr2) version of the Duxbury Braille Translator (DBT) supports over 130 languages in this manner.

How the Rules-Based Approach Works

The rules-based approach implements the language-based rules of contracted braille codes by using additional ad hoc replacement rules which have been derived from a detailed practical analysis of the target language. (This type of approach is sometimes called Natural Language Processing or NLP.) The ad hoc replacement rules are merged with the official position-dependent replacement rules of the braille system to form a single undifferentiated set of translation rules.

It is important to appreciate the ramifications of the previous statement. Many linguists and computer scientists who've tried to gain familiarity with braille translation have been misled when examining one of these undifferentiated rule sets and have unfortunately come to the mistaken understanding that all of the rules are part of an official braille system when, in fact, the large majority of the rules in such sets are simply the ad hoc rules that have been devised by the software implementors. This situation is particularly unfortunate when the mistaken understanding has been the basis of an effort to improve braille software since this doesn't make it more accurate.

Highly-tuned rule sets used in commercial software are proprietary, so it is difficult to know how many ad hoc language-based rules are required to attain reasonable accuracy. An examination of several open source braille translation applications suggests that something like 800-1000 such rules are the minimum needed to provide reasonably accurate translations to contracted English braille.

An Example

It is probably easiest to understand the rules-based approach by way of example. The following rules, which are for the use (and non-use) of the dis contraction in English braille, are just a small portion of a complete rule set. These rules are arranged in order of decreasing priority. The rules-based translation procedure always selects the highest-priority rule applicable to the particular item being translated.

  1. the sequence dispirit at the start of a word is replaced by the braille cells for the letters d and i and by the contraction for spirit (i.e. in dispirit or dispirited)
  2. the sequence dishev at the start of a word is replaced by the braille cells for the letters d and i, the braille cell for the contraction for sh, and the braille cells for the letters e and v (i.e. in dishevel or dishevelled)
  3. the sequence dis at the start of a word when followed by cu, ha, he, ho is replaced by the dots-256 contraction for dis (e.g. in discus, dishabille, dishearten, and dishonest)
  4. the sequence dis at the start of a word when followed by c, h, or k is replaced by the braille cells for the letters d, i, and s (e.g. in disc, dish, dishcloth, disk, or diskette)
  5. the sequence dis at the start of a word is replaced by the dots-256 contraction for dis

In this example, rules 1-4 are the ad hoc rules for the pronunciation and syllabification of English words which start with the sequence dis while rule 5 implements the English braille system's official default positional restriction on the use of the braille contraction for dis.

Rule 1 implements a specific case of the previously mentioned general rule concerning pronunciation. Rules 2-4 are rules of thumb for determining whether or not dis at the start of a word constitutes a syllable as required by both the specific rule for this contraction and the general rule prohibiting any contraction's overlapping a major syllable division. Note that rule 4 implements exceptions to rule 5; rule 3 implements exceptions to rule 4; and rule 2 implements exceptions to rule 3.

By the way, this set of rules for using the dis contraction is intended to be illustrative and is not complete. It would, for example, result in mistranslations of both disaccharide and disulfide.
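The five rules above can be expressed as a small priority-ordered table in Python. This is a sketch, not any particular product's table format: contractions are written in parentheses instead of braille cells, only the start of the word is translated, and the names `RULES` and `translate_prefix` are hypothetical.

```python
# Each rule: (print sequence at start of word, required following letters
# or None, replacement). Rules are listed in decreasing priority.
RULES = [
    ("dispirit", None, "di(spirit)"),          # rule 1
    ("dishev", None, "di(sh)ev"),              # rule 2
    ("dis", ("cu", "ha", "he", "ho"), "(dis)"),  # rule 3
    ("dis", ("c", "h", "k"), "dis"),           # rule 4
    ("dis", None, "(dis)"),                    # rule 5, the official default
]


def translate_prefix(word):
    """Apply the highest-priority matching rule to the start of a word."""
    for prefix, followers, replacement in RULES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        if followers is None or rest.startswith(followers):
            return replacement + rest
    return word  # no rule applies; the word is left unchanged


for w in ["dispirited", "dishevel", "dishonest", "dishcloth", "dislike", "disulfide"]:
    print(w, "->", translate_prefix(w))
```

Running the sketch on disulfide yields `(dis)ulfide`, which reproduces the mistranslation noted above: nothing in the table knows that the word divides as di-sulfide, so the default rule fires.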

Conclusion

The invention of the rules-based approach around 1970 was a breakthrough that made it possible to develop reasonably accurate contracted braille translating applications suitable for older computer hardware with small memories. However, it was soon realized*, as this example illustrates, that maintaining a rule set can be problematic due to the ad hoc nature of the rules and, especially, to the potential for interactions among them. Changes to rule 4 in the preceding example could necessitate changes to rule 3, etc.

A second, and more serious, difficulty with the rules-based approach is the lack of any natural way of dealing with the numerous braille rules that depend on the semantics or nature of the word undergoing translation. Contracted braille systems use different rules for ordinary words, for proper names, for stammered words, for letter words, for compound words, etc.

The time has come to retire the rules-based approach.

Summary

Old photo of one woman and two men working in the Keypunching Department at De Nederlandsche Blindenbibliotheek

The changes in computer hardware and computer software since the development of the first braille transcribing applications are almost unimaginable. The project to incorporate braille production in DAISY Pipeline 2 has the potential of finally achieving the long-held dream of instant and accurate automatic conversion of print to braille.

We can be reminded of how far we've come with respect to hardware by reflecting on the image to the left. This is a copy of a black-and-white photograph of the Keypunching Department at De Nederlandsche Blindenbibliotheek which was published in a 1976 document titled Automated Braille Production from Compositors Tapes. The photograph is a view from behind of three people sitting in front of their keypunch equipment. There is a nest of compositors tape on the floor. At the time of the publication, the Dutch Library employed 250 people with 10% of their yearly production of 1,000,000 braille pages produced manually by volunteers, 10% by automated braille printers, 16% by employees, and the remainder from zinc plates. The only computer was a 12-bit PDP 8-E with 8K of core memory.

We can, unfortunately, also be reminded of how far we have to go with respect to software when we consider the current cost of producing university-level braille textbooks or other braille documents where accuracy is essential. As an example, the Alternate Text Production Center associated with California community colleges charges approximately US$3.62 per embossed braille page for converting print to literary braille and US$6.67 per embossed braille page for converting mathematics or science, with extra charges for handling graphics. (The cost of embossing alone is approximately US$0.60 per braille page.) Using the rule of thumb of four braille pages per print page, the cost for transcribing and embossing even a short 300-page mathematics textbook could easily exceed US$8000.
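The arithmetic behind this estimate can be checked with a few lines of Python. The sketch assumes that the quoted per-page rate is the whole charge (graphics surcharges are ignored) and works in whole cents to avoid floating-point rounding; the function name is hypothetical.

```python
def transcription_cost_cents(print_pages, cents_per_braille_page, braille_per_print=4):
    """Estimated charge in US cents for transcribing a print book."""
    braille_pages = print_pages * braille_per_print
    return braille_pages * cents_per_braille_page


math_cost = transcription_cost_cents(300, 667)      # US$6.67 per braille page
literary_cost = transcription_cost_cents(300, 362)  # US$3.62 per braille page
print(f"mathematics: US${math_cost / 100:,.2f}")
print(f"literary:    US${literary_cost / 100:,.2f}")
```

With these figures a 300-page mathematics book becomes 1,200 braille pages and costs about US$8,004, while the same length of literary text comes to about US$4,344.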


*However, preparation and modification of the [translation] tables can be done only by someone with a good knowledge of the way in which the translation algorithm works. It has been found in practice that anomalies in the translation can emerge over a long period of usage of the system, and minor changes to the tables must be made to correct these. If not done sufficiently skillfully, these changes may themselves introduce new miscontractions.

J. B. Humphreys, An Adaptive Braille Transcription System -- Progress Report, Braille Research Newsletter, No. 11, (Warwick Research Unit for the Blind) August 1980, p. 74.


This article, which is a revised and updated version of an article related to the earlier Braille-in-Daisy project, was first posted May 4, 2011.
Slightly edited version posted May 6, 2011.

Contact author: info at dotlessbraille dot org