Multimedia titles use technology which offers a good basis for use by print-disabled people. However, the lack of appropriate document structures and difficulties in navigating multimedia titles limit the readership. We propose a new approach to the design of multimedia documents which separates contents from interaction techniques while using XML standards. From these XML-based documents and their mapping, the final multimedia title can be generated using one of the industry standards.
Electronic documents are becoming more and more omnipresent as the distribution of books, newspapers, manuals, dictionaries, encyclopaedias and edutainment software on CD-ROMs and DVDs grows. The immense amount of storage that affordable disks can hold contrasts with the demand for efficient and effective access, by everyone, to information stored in electronic form. The term eBook widens the concept of letters arranged into sentences and chapters towards a rich presentation of contents which is provided - also for ease of reading - through a multiplicity of presentation media. Figure 1 compares current uses of multimedia applications.

Figure 1: Multimedia applications
eBooks are to be read on eBook readers - either PC-based software or dedicated portable devices such as personal digital assistants (PDAs). Pioneered by multimedia computer games on PCs and "playstations", hardware for audio playback and for 3D rendering of sound, speech and images can now be packed into portable and even wearable devices. A watch with an MP3 audio player can be used for listening to a recipe, and a mobile phone with a video player can be a versatile device for reading a gourmet guide. But selecting a title from a set of audio tracks or selecting restaurant tips on a mobile phone remains time-consuming and requires either visual or auditory control, and sometimes even both. Lack of control over multimedia documents therefore imposes a barrier on people who are blind or deaf. In the following we propose mechanisms and a new design method for the information stored in multimedia titles in order to facilitate the integration of print-disabled people into the information society.
A good example of the implications of a lack of access to audio or graphics is an encyclopedia such as MS Encarta (Microsoft, 1999), which is otherwise a very successful multimedia title demonstrating the added value of multimedia. Encarta contains articles on subjects enhanced by pictorial material, films, and even audio. Many words in Encarta can be looked up for spelling and pronunciation in a dictionary.
For example, the article on John F. Kennedy describes his life and illustrates it with excerpts from his inaugural speech as well as a video of his speech on the Cuban missile crisis (Figure 2). In this entry of the encyclopedia the word "inaugural" can be looked up in the dictionary through a hypertext link.

Figure 2: Snapshot of a multimedia article
Access to the article on John F. Kennedy by deaf people requires spoken text to be transcribed at least into captions, and preferably into sign language. Access to Encarta's text by blind people is partially possible (see Figure 3) through a screen reader for graphical user interfaces (Mynatt and Weber, 1994). Additionally, it requires verbalizing the image of Mr. Kennedy and either transcribing the phonetic encoding of the word "inaugural" into Braille code or, alternatively, verbalizing the phonetic symbols for speech output.

Figure 3: Braille view of Figure 1’s highlighted section
The example of an encyclopedia article demonstrates the need for enriched multimedia titles in order to address as many users as possible. An enriched multimedia title consists of redundant contents from which appropriate combinations can be selected to suit the viewer’s needs. A current digital versatile disk (DVD) can be considered an enriched multimedia title, as it includes, besides the movie, audio tracks for all supported languages as well as subtitles written in each of these languages. DVDs may include more documents than the film alone, such as a description of the making of the film. We call a DVD an enriched multimedia title if redundancy of media is provided for all parts of the production.
Enrichment of multimedia titles takes into account not only the user’s language. For blind people the addition of a braille version of the text is necessary, as well as a spoken version of all text. For deaf people a movie of a signer is necessary to accompany the verbalized presentation.
For multimedia titles, services that transcribe print into Braille or audio books and spoken language into sign language should no longer be seen as independent of each other. Print-disabled people have been described as a group excluded from access to written and printed material. Various services such as newspapers for the blind (Engelen et al., 1995) or accessible browsers have been developed. Encarta has acknowledged this partially by allowing for three font sizes. But these otherwise hardly comparable user groups demonstrate the need for a new round in defining print-disability by identifying special needs in interactive systems, namely eBook readers, and the need for a more flexible design and encoding of multimedia titles.
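As a minimal sketch of how appropriate combinations could be selected from redundant contents, the following Python fragment models one unit of an enriched title as a set of alternative renderings and picks those suited to a reader. The names `UNIT`, `PROFILES` and `select_presentations`, the file names and the profile labels are assumptions made for illustration only; they do not describe any existing DVD or eBook format.

```python
# Hypothetical model: one unit of an enriched title with redundant renderings.
UNIT = {
    "video": "speech.mpg",
    "audio": "speech.wav",
    "captions": "speech.txt",
    "signing": "speech_signer.mpg",
    "braille": "speech.brl",
    "audio_description": "speech_described.wav",
}

# Assumed reader profiles and the media each of them can use.
PROFILES = {
    "sighted_hearing": ["video", "audio"],
    "deaf": ["video", "captions", "signing"],
    "blind": ["audio", "audio_description", "braille"],
}


def select_presentations(unit, profile):
    """Return the renderings of `unit` that suit the given reader profile."""
    wanted = PROFILES[profile]
    missing = [m for m in wanted if m not in unit]
    if missing:
        # The unit is not sufficiently enriched for this group of readers.
        raise ValueError(f"unit is not enriched for {profile}: lacks {missing}")
    return {m: unit[m] for m in wanted}
```

A unit that lacks a required rendering raises an error here, which mirrors the requirement that redundancy be provided for all parts of the production.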
The term braille itself depends on the user’s language and, moreover, may refer to either contracted braille (grade 2) or uncontracted braille (grade 1). Some braille readers even expect computer braille, which is a one-to-one mapping between the characters shown on the screen and braille characters.
A common technique to avoid editing a braille version by hand is to generate braille from a text version through appropriate conversion software. For a simple phrase-based form of contracted braille or for computer braille, generation of braille through screen readers is an accepted approach. More successfully, a text version that includes sufficient mark-up information can be used to automatically generate well-formed braille text using language-specific converters.
Limitations exist for mark-up languages such as HTML, as documents may include expressions written in the programming language JavaScript. While HTML is a context-free language, JavaScript adds context-sensitive elements such as "mouseover" attributes, which arbitrarily modify the presentation depending on the user’s input. Automatic transcription of such a document is limited to its HTML elements.
Auditory media consist of verbal or non-verbal expressions. As audio is a time-dependent medium, the temporal structure and granularity of speech (prosody) and of sound are important. Audio as part of movies may contain long pauses. During these pauses further descriptions of the movie can be inserted. If audio is enriched by further audio, conflicts may arise; resolving them leads to a sequential arrangement of the presentation and hence may extend the time needed to listen to the audio.
Besides speech playback, speech synthesis offers a cost-effective mechanism to generate verbal audio. Its quality can be improved considerably if speech attributes such as pitch, gender, head size, etc. are taken into account. Just as braille can be transcribed from textual mark-up information, high-quality synthesized speech can be generated from text.
Lip reading and signing are two different methods of reading used by people who are deaf. Signing is a time-dependent manual effort which depends on a particular language. The upper body of a signer and the movements of his/her limbs are shown. Expressions are formed which either correspond word by word with the text or describe the meaning of the text using syntactic elements of a sign language. Figure 4 shows a sign for "Germany" (Hamburg University, 2001).
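The idea of a one-to-one mapping between characters and braille cells can be sketched in a few lines of Python. This is a simplified illustration, not an actual computer braille table: real computer braille codes are defined per country and typically use eight dots, whereas the sketch below covers only the letters a-z with their standard six-dot patterns and writes them as characters of the Unicode braille patterns block. The names `LETTER_DOTS`, `cell` and `to_braille` are assumptions.

```python
# Dot numbers (1-6) of the braille letters a-z.
LETTER_DOTS = {
    "a": "1", "b": "12", "c": "14", "d": "145", "e": "15",
    "f": "124", "g": "1245", "h": "125", "i": "24", "j": "245",
    "k": "13", "l": "123", "m": "134", "n": "1345", "o": "135",
    "p": "1234", "q": "12345", "r": "1235", "s": "234", "t": "2345",
    "u": "136", "v": "1236", "w": "2456", "x": "1346", "y": "13456",
    "z": "1356",
}


def cell(dots):
    """Build the Unicode braille pattern for a string of dot numbers."""
    mask = 0
    for d in dots:
        mask |= 1 << (int(d) - 1)  # dot n sets bit n-1
    return chr(0x2800 + mask)


def to_braille(text):
    """Map every character to exactly one cell; unknown characters are kept."""
    out = []
    for ch in text.lower():
        if ch == " ":
            out.append(chr(0x2800))  # blank braille cell
        elif ch in LETTER_DOTS:
            out.append(cell(LETTER_DOTS[ch]))
        else:
            out.append(ch)
    return "".join(out)
```

Because every input character yields exactly one output cell, positions on the screen and on the braille display stay aligned, which is the property that distinguishes computer braille from contracted braille.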
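The restriction of automatic transcription to the HTML elements can be illustrated with Python's standard `html.parser` module. The sketch below collects only the character data of the mark-up and ignores both script bodies and event-handler attributes such as `onmouseover`; the class name `TextOnlyExtractor` and the function `transcribable_text` are assumptions, and a real converter would also keep the structural mark-up needed for well-formed braille or speech.

```python
from html.parser import HTMLParser


class TextOnlyExtractor(HTMLParser):
    """Collect character data of HTML elements, skipping script content."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Attributes (including event handlers) are deliberately ignored.
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if not self._in_script and data.strip():
            self.chunks.append(data.strip())


def transcribable_text(html):
    """Return the text that a purely mark-up based converter can reach."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Whatever the script would have changed in response to the user's input is lost in this transcription, which is exactly the limitation described above.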
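The conflict between inserted descriptions and the available pauses can be made concrete with a small calculation. The sketch assumes that each description is assigned to exactly one pause, in playback order, and that all durations are given in seconds; the function name `schedule_descriptions` is hypothetical.

```python
def schedule_descriptions(pauses, descriptions):
    """Pair each description with a pause and report the resulting overruns.

    Both arguments are lists of durations in seconds, in playback order.
    Returns the overrun per pause and the total extension of listening time.
    """
    overruns = [max(0.0, d - p) for p, d in zip(pauses, descriptions)]
    return overruns, sum(overruns)
```

With pauses of 4, 2 and 6 seconds and descriptions of 3, 5 and 6 seconds, only the second description does not fit, and the presentation is extended by 3 seconds.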

Figure 4: German sign language example for "Germany"
Synthesis of sign language is possible through avatars. An avatar is a placeholder for somebody within a virtual world. A signing avatar takes up only little screen space and is created through modern rendering algorithms for an animated human body (Figure 5). In general, avatars are less accepted by deaf people, but future research may identify better transcriptions from text to signs (Pragma, 1999).

Figure 5: Multimedia presentation of "Chile" (Colorado University, 2001 and Wen et al., 2000)
As suggested by ISO 14915, a multimedia title is to be designed by structuring it into smaller units and by indicating the parallelism and sequential constraints of each media stream. Figure 6 gives an example of such a design.

Figure 6: Designing multimedia titles (Sutclife, 1999)
The approach of ISO 14915 is well suited to rich multimedia titles, too. If alternatives have been incorporated for particular media streams, then their duration depends on the duration of another interval. Hence rich multimedia titles require parallelism for particular media.
Even if users with some need for alternative presentations as described above are addressed, the structure of the multimedia title specifies clear intervals within which all presentations are to come to a common end. If a particular medium has to extend the presentation because it is synthetically generated, it appears possible to lengthen the interval accordingly without degrading the quality of the design too much.
For example, a movie with subtitles may be accompanied by a signing video. The signing may be chosen to be just long enough to fit a scene of the movie. But more reasonably we can assume that it extends the scene’s interval in order to cover all of its contents. In this instance, the movie has to be paused (frozen) in order to ensure parallelism between the two different media streams.
Temporal dependencies of presentations are described in a standardized way through the Synchronized Multimedia Integration Language (SMIL) (W3C, 2001), which is similar to the Amsterdam Hypermedia Model (AHM, Hardman et al., 1994). Unlike HTML, SMIL concentrates less on visual details. Documents written in SMIL specify screen regions and various presentations to be shown in these regions, together with their detailed temporal dependencies. For example, a radio station may use SMIL to relate a song to a commercial advertisement for buying the corresponding audio CD. SMIL seems to be well suited to rich multimedia titles. Industry standards such as Flash, Quicktime or Windows Media Format try to achieve the same but are less well specified. Moreover, it appears possible to generate these file formats from a SMIL presentation. The recent development of SMIL 2.0 adds more scripting functionality in order to address interaction techniques. As described above for HTML, this does not ease the task of converting SMIL documents.
The discussion so far takes into account neither the method of browsing through a rich multimedia title nor the more general design of multimedia interaction objects used in edutainment titles. Figure 2 already contains such a multimedia interaction object. It consists of buttons and sliders to control a movie’s presentation.
Just as SMIL and (X)HTML are particular XML document type definitions (DTDs), there are DTDs available to describe interaction objects. XUL is being defined within the Mozilla project (Netscape, 1999), and VoiceXML is a W3C standard (W3C, 2001) for voice-input-driven interaction. These interaction objects become usable through an appropriate screen reader for all users mentioned so far.
The integration of interaction techniques within temporally dependent media such as audio, animations, etc. introduces a complexity that makes it difficult to add subtitles, signing videos, explanations of scenes, etc. to multimedia titles. Successful integration of interaction techniques with time-dependent media requires coherence among the media streams. We define:
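The lengthening of a scene's interval can be sketched as follows. The sketch assumes that each media stream of a scene has a known duration in seconds and that a shorter stream is simply frozen until the longest one has ended; the function name `align_scene` and the stream labels are hypothetical.

```python
def align_scene(stream_durations):
    """Return the common interval of a scene and the freeze time per stream.

    `stream_durations` maps a stream name to its duration in seconds. The
    interval is set by the longest stream; every other stream is frozen for
    the difference so that all streams come to a common end.
    """
    interval = max(stream_durations.values())
    freeze = {name: interval - d for name, d in stream_durations.items()}
    return interval, freeze
```

For a movie scene of 10 seconds and a signing video of 14 seconds, the interval becomes 14 seconds and the movie is frozen for 4 seconds.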
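The radio station example can be sketched by generating a small SMIL document with Python's standard `xml.etree.ElementTree` module. The element and attribute names used (`smil`, `head`, `layout`, `region`, `body`, `par`, `audio`, `img`, `src`, `dur`) are taken from SMIL; the function name `radio_smil`, the region geometry, the file names and the default duration are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def radio_smil(song_src, advert_src, duration="180s"):
    """Build a SMIL document that shows an advertisement while a song plays."""
    smil = ET.Element("smil")

    # Layout: one screen region in which the advertisement is displayed.
    head = ET.SubElement(smil, "head")
    layout = ET.SubElement(head, "layout")
    ET.SubElement(layout, "region", id="advert",
                  left="0", top="0", width="320", height="240")

    # Timing: <par> presents its children in parallel.
    body = ET.SubElement(smil, "body")
    par = ET.SubElement(body, "par")
    ET.SubElement(par, "audio", src=song_src, dur=duration)
    ET.SubElement(par, "img", src=advert_src, region="advert", dur=duration)

    return ET.tostring(smil, encoding="unicode")
```

Replacing `par` by `seq` would present the advertisement after the song instead of alongside it, which shows how the temporal dependency is kept separate from the visual layout.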
Coherence is a quality of the perception of multiple time-dependent presentations of interaction techniques, which control their temporal granularity.
Designing multimedia titles for coherence requires an explicit representation of both the media’s temporal granularity and the interaction technique’s temporal constraints. Figure 7 shows a multimedia title’s user interface which is generated from two different XML-based specifications. The complexity of the user interface is higher than the complexity which can be achieved by using an XML DTD alone. A mapping between two DTDs can possibly be context-sensitive.

Figure 7: Mapping contents and interaction techniques
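A mapping of this kind can be sketched with two small XML fragments, one for the contents and one for the interaction techniques. Both vocabularies below (`title`, `unit`, `controls`, `control` and their attributes) are invented for this illustration and do not correspond to SMIL, XUL or any other existing DTD; the function name `map_controls` is likewise an assumption. The mapping is context-sensitive in the sense that the controls bound to a unit, and the range and step of a slider, depend on the temporal properties of that unit.

```python
import xml.etree.ElementTree as ET

# Hypothetical content specification: units with optional temporal granularity.
CONTENT = """<title>
  <unit id="speech" medium="video" dur="120" step="10"/>
  <unit id="biography" medium="text"/>
</title>"""

# Hypothetical interaction specification: controls and where they apply.
INTERACTION = """<controls>
  <control type="button" action="play" applies="timed"/>
  <control type="slider" action="seek" applies="timed"/>
  <control type="button" action="next-page" applies="static"/>
</controls>"""


def map_controls(content_xml, interaction_xml):
    """Bind interaction objects to content units according to their timing."""
    content = ET.fromstring(content_xml)
    controls = ET.fromstring(interaction_xml)
    ui = {}
    for unit in content.iter("unit"):
        kind = "timed" if unit.get("dur") else "static"
        bound = []
        for control in controls.iter("control"):
            if control.get("applies") != kind:
                continue
            entry = {"type": control.get("type"), "action": control.get("action")}
            if control.get("type") == "slider":
                # The slider inherits its range and granularity from the unit.
                entry["max"] = int(unit.get("dur"))
                entry["step"] = int(unit.get("step", "1"))
            bound.append(entry)
        ui[unit.get("id")] = bound
    return ui
```

Keeping the two specifications separate means that either of them can be exchanged, for instance for a different group of readers, without editing the other.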
Using state of the art approaches technically such a mapping can be based
The implications of such an approach are wide-ranging, as technological requirements may be taken into account either while constructing the multimedia document or in a separate process.
Project MultiReader is a new European Union funded project which will investigate the possibilities of using multimedia documents for use by mainstream readers as well as print-disabled readers (blind, partially sighted, deaf and dyslexic). Therefore the project follows three approaches:
The reading systems will be developed using user-centered design and "design for all" methodologies, advised by panels consisting of readers with specific needs.
As the required interaction techniques vary with the type of reader, the navigation facilities of eBooks have to be adjusted (Petrie et al., 1999). This has been successfully demonstrated for the generation of user interfaces of an HTML web browser (Emiliani, 2000) and for kiosk systems (Vanderheiden and Henry, 2000). A new approach is needed whereby not only is the static textual content rearranged, but time-dependent media such as animation, video and audio can also be presented in navigable units. The challenge is to ensure coherence between the contents of an eBook and the transcriptions of inaccessible media (Weber, 2000).