Working Draft 13 March 2001
This version:
http://www.interface.computing.edu.au/documents/VHML/2001/WD-VHML-20010313
Latest version:
http://www.interface.computing.edu.au/documents/VHML
Previous version:
http://www.interface.computing.edu.au/documents/VHML
Editors:
Andrew Marriott
Simon Beard
John Stallo
Quoc Huynh
Copyright ©2001 Curtin University of Technology, InterFace. All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the Curtin InterFace Website.
This is the 13 March 2001 Working Draft of the "Virtual Human Markup Language Specification".
This working draft relies on several other standards - the various sub-languages of VHML use and extend these standards.
This document describes the Virtual Human Markup Language. The language is designed to accommodate the various aspects of Human-Computer Interaction with regard to Facial Animation, Body Animation, Dialogue Manager interaction, Text to Speech production and Emotional Representation, plus Hyper and Multi Media information. [Input here: am I missing any required sub-system?]
It will use / build on existing (de facto) standards such as those specified by the W3C Voice Browser Activity, and will describe new languages to accommodate functionality that is not catered for.
The language will be XML/XSL based and will consist of the following sub-systems:
DMML Dialogue Manager Markup Language (W3C Dialogue Manager or AIML)
FAML Facial Animation Markup Language [Any existing standard?]
BAML Body Animation Markup Language [Any existing standard?]
SML Speech Markup Language (SSML / Sable)
EML Emotion Markup Language
HTML HyperText Markup Language [ or subset only?]
The language will use XML Namespaces for inheritance of existing standards.
Although general in nature, the intent of this language is to facilitate the natural and realistic interaction of a Talking Head or Talking Human with a user via a Web page or application. One specific intended use can be found in the deliverables of the Interface project (http://www.ist-interface.org/).
Figure 1: The user -> Dialogue Manager -> user data flow
Table of Contents
Status of this Document
Abstract
Terminology and Design Concepts
Rendering Processes
Document Generation, Applications and Contexts
The Language Structure
Virtual Human Markup Language (VHML)
Root Element
vhml
Miscellaneous Elements
embed
Emotion Markup Language (EML)
Emotions
Emotion Default Attributes
Notes:
anger
joy == happy
neutral
sadness
fear
disgust
surprise
dazed
confused
bored
Other Virtual Human Emotional Responses
Notes:
agree
disagree
emphasis
smile
shrug
Emotional Markup Language Examples
Facial Animation Markup Language (FAML)
Emotion Default Attributes
Direction/Orientation
Notes
anger
joy == happy
neutral
sadness
fear
disgust
surprise
confused
bored
look_left
look_right
look_up
look_down
head_left
head_right
head_up
head_down
eyes_left
eyes_right
eyes_up
eyes_down
head_left_roll
head_right_roll
EyeBrows
Notes:
eyebrow_up
eyebrow_down
eyebrow_squeeze
Blinks/Winks
Notes
blink
double_blink
left_wink
right_wink
Hyper Text Markup Language (HTML)
Body Animation Markup Language (BAML)
anger
joy == happy
neutral
sadness
fear
disgust
surprise
confused
bored
Dialogue Manager Markup Language (DMML)
Dialogue Manager Response
List of DMML elements:
Recognised variable names
Speech Markup Language (SML)
Speech Markup Language default Attributes
xml:lang
anger
joy == happy
neutral
sadness
fear
disgust
surprise
confused
bored
p == paragraph
s == sentence
say-as
phoneme
voice
emphasis
break
prosody
audio
mark
emphasise_syllable == emphasize_syllable
pause
pitch
Conformance
Conforming Virtual Human Markup Document Fragments
Conforming Stand-Alone Virtual Human Markup Language Documents
Conforming Virtual Human Markup Language Processors
The Rendering
References
Acknowledgements
The design and standardization process has adopted the approach of the Speech Synthesis Markup Requirements for Voice Markup Languages published December 23, 1999 by the W3C Voice Browser Working Group.
The following items were the key design criteria.
Consistency: provide predictable control of rendering output across platforms and across speech synthesis implementations.
Interoperability: support use along with other W3C specifications, including (but not limited to) the Dialog Markup Language, Aural Cascading Style Sheets and SMIL.
Generality: support rendering output for a wide range of applications with varied graphics capability and speech content.
Internationalization: Enable speech output in a large number of languages within or across documents.
Generation and Readability: Support automatic generation and hand authoring of documents. The documents should be human-readable.
Implementable: The specification should be implementable with existing, generally available technology and the number of optional features should be minimal.
A rendering system that supports the Virtual Human Markup Language will be responsible for rendering a document as visual and spoken output and for using the information contained in the markup to render the document as intended by the author.
Document creation: A text document provided as input to the system may be produced automatically, by human authoring, or through a combination of these forms. The Virtual Human Markup Language defines the form of the document.
Document processing: The following are the nine major processing steps undertaken by a VHML system to convert marked-up text input into automatically generated output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control or direct the final rendered output of the Virtual Human.
XML Parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.
Culling of un-needed VHML tags: For example, at this stage any tags which produce audio when the final rendering device/environment does not support audio may be removed; similarly for other tags. It should be noted that since the timing synchronisation is based upon vocal production, the spoken text may need to be processed regardless of the output device's capabilities.
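As a rough illustration only (not part of this specification), the following Python sketch parses a VHML fragment with the standard xml.etree.ElementTree module and removes elements whose names appear in a hypothetical set of tags that only make sense on an audio-capable device; the tag-to-capability mapping shown is an assumption for the example.

import xml.etree.ElementTree as ET

# Hypothetical mapping: element names that are only meaningful when the
# rendering device supports audio output (illustration only).
TAGS_NEEDING_AUDIO = {"audio", "embed"}

def cull_unsupported(root, device_has_audio):
    """Remove audio-only elements when the device cannot play audio."""
    for parent in root.iter():
        for child in list(parent):
            if not device_has_audio and child.tag in TAGS_NEEDING_AUDIO:
                parent.remove(child)
    return root

doc = """<vhml><p><happy>Good news!</happy>
<audio src="fanfare.wav"/></p></vhml>"""
root = ET.fromstring(doc)
cull_unsupported(root, device_has_audio=False)
print(ET.tostring(root, encoding="unicode"))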
Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking and acting patterns associated with paragraphs and sentences.
- Markup support: Various elements defined in the VHML markup language explicitly indicate document structures that affect the visual and spoken output.
- Non-markup behavior: In documents and parts of documents where these elements are not used, the VHML system is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data. [How good could we make this?]
Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the TTS system that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on.
- Markup support: The "say-as" element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked includes dates, times, numbers, acronyms, current amounts and more. The set covers many of the common constructs that require special treatment across a wide number of languages but is not and cannot be a complete set.
- Non-markup behavior: For text content that is not marked with the say-as element the TTS system is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different systems to render the same document differently.[What is the BAP equivalent of this text normalisation?]
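As a toy illustration of text normalisation (not part of the specification), the following Python sketch expands a simple English currency pattern such as "$200" into a speakable form; a real TTS normaliser handles far more constructs, ambiguities and languages.

import re

# Minimal number-to-words helper for a few whole numbers (illustration only).
UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten"]

def small_number_to_words(n):
    if n < len(UNITS):
        return UNITS[n]
    if n < 100 and n % 10 == 0:
        tens = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty",
                60: "sixty", 70: "seventy", 80: "eighty", 90: "ninety"}
        return tens[n]
    if n % 100 == 0 and n < 1000:
        return UNITS[n // 100] + " hundred"
    return str(n)  # give up: leave the digits for the synthesizer

def normalise_currency(text):
    """Rewrite e.g. "$200" as "two hundred dollars"."""
    def repl(match):
        return small_number_to_words(int(match.group(1))) + " dollars"
    return re.sub(r"\$(\d+)", repl, text)

print(normalise_currency("The book costs $200."))
# -> The book costs two hundred dollars.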
Text-to-phoneme conversion: Once the system has determined the set of words to be spoken it must convert those words to a string of phonemes. A phoneme is the basic unit of sound in a language. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g. most US English dialects have around 45 phonemes. In many languages this conversion is ambiguous since the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book).
Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English TTS system will often have trouble determining how to speak some non-English-origin names; e.g. "Tlalpachicatl" which has a Mexican/Aztec origin.
- Markup support: The "phoneme" element allows a phonemic sequence to be provided for any word or word sequence. This provides the content creator with explicit control over pronunciations. The "say-as" element may also be used to indicate that text is a proper name that may allow a TTS system to apply special rules to determine a pronunciation.
- Non-markup behavior: In the absence of a "phoneme" element the TTS system must apply automated capabilities to determine pronunciations. This is typically achieved by looking up words in a pronunciation dictionary and applying rules to determine other pronunciations. Most TTS systems are expert at performing text-to-phoneme conversions so most words of most documents can be handled automatically.
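A minimal sketch of the dictionary-lookup approach described above, assuming a small hand-made pronunciation dictionary with ARPAbet-style strings (illustration only; the phoneme element itself uses IPA, Worldbet or X-SAMPA). Real systems combine a large lexicon with letter-to-sound rules.

# Hypothetical pronunciation dictionary (illustration only).
LEXICON = {
    "read": ["r iy d", "r eh d"],   # ambiguous: "reed" vs "red"
    "book": ["b uh k"],
}

def to_phonemes(word, preferred_index=0):
    """Look the word up; fall back to spelling it letter by letter."""
    pronunciations = LEXICON.get(word.lower())
    if pronunciations:
        return pronunciations[min(preferred_index, len(pronunciations) - 1)]
    return " ".join(word.lower())   # crude fallback, not a real letter-to-sound rule set

print(to_phonemes("read"))            # first pronunciation
print(to_phonemes("read", 1))         # alternative pronunciation
print(to_phonemes("Tlalpachicatl"))   # falls back to letters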
Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.
- Markup support: The "emphasis" element, "break" element and "prosody" element may all be used by document creators to guide the TTS system is generating appropriate prosodic features in the speech output.
- Non-markup behavior: In the absence of these elements, TTS systems are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.
Waveform production: The phonemes and prosodic information are used by the TTS system in the production of the audio waveform. There are many approaches to this processing step so there may be considerable platform-specific variation.
- Markup support: The TTS markup does not provide explicit controls over the generation of waveforms. The "voice" element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The "audio" element allows for insertion of recorded audio data into the output stream.
Facial and Body Animation production: Timing information will be used to synchronise the spoken text with facial gestures and expressions as well as with body movements and gestures.
Rendering the multiple streams (Audio, Graphics, Hyper and Multi Media) onto the output device(s). [XSL Transformation - here or in the earlier stage?]
[Need info about the FAP and BAP production in here]
There are many classes of document creator that will produce marked-up documents to be spoken by a VHML system. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.
The document creator has no access to information to mark up the text. All processing steps in the VHML system must be performed fully automatically on raw text. The document requires only the containing "vhml" element to indicate the content is to be rendered.
When marked text is generated programmatically the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, text normalization, prosody, possibly text-to-phoneme conversion, as well as facial or body gestures to gain the user's attention.
Some document creators make considerable effort to mark as many details of the document to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the visual or speech output.
The most advanced document creators may skip the higher-level markup (Emotions, Facial and body animation tags) and produce low-level VHML markup for segments of documents or for entire documents.
It is important that any XML elements or tags that are part of VHML use existing tags specified in existing (de facto) or developing standards (for example HTML or SSML). This will aid in minimising learning curves for new developers as well as maximising opportunities for the migration of legacy data.
Figure 2: The VHML Language Structure
VHML uses the languages shown in Figure 2 to facilitate the direction of a Virtual Human interacting with a user via a Web page or stand alone application. In response to a user enquiry, the Virtual Human will have to react in a realistic and humane way using appropriate words, voice, facial and body gestures. For example, a Virtual Human that has to give some bad news to the user - "I'm sorry Dave, I can't find that file you want." - may speak in a sad way, with a sorry face and with a bowed body stance. In a similar way, a different message may be delivered with a happy voice, a smiley face and with a lively body.
The following sections detail the individual XML based languages which make this possible through VHML.
The Virtual Human Markup Language is an XML application. The root element is vhml. See the section on Conformance.
<?xml version="1.0"?>
<vhml>
... the body ...
</vhml>
vhml
Description:
Root element that encapsulates all other vhml elements.
Attributes: none.
Properties: root node, can only occur once.
Example:
<vhml>
<p>
<happy>
The vhml element encapsulates all other elements
</happy>
</p>
</vhml>
Notes: Should we allow <viewset> and <view> a la <frame> and <frameset>? This would allow multiple rendered scenes plus a Virtual Human with an HTML page for hyper information.
embed
Description:
Gives the ability to embed foreign file types within a VHML document such as sound files, MML files etc., and for them to be processed appropriately.
Attributes:
Name | Description | Values
type | Specifies the type of file that is being embedded. (Required) | audio - embedded file is an audio file. mml - an mml file is embedded. [What values should we have here?]
src | Gives the path to the embedded file. (Required) | A character string.
Properties: empty.
Example:
<embed type="mml" src="songs/aaf.mml"/>
The following elements will affect the emotion shown by the Virtual Human. These elements will affect the voice, face and body.
Each element has at least 3 attributes associated with it:
Name | Description | Values | Default
intensity | This value ranges from 0 to 100 and represents a percentage value of the maximum intensity of that particular facial gesture, expression or emotion. | 0 - 100 | 100
duration | The duration value represents the time span in seconds or milliseconds that the element expression, gesture or emotion will persist in the Virtual Human animation. | A numeric value representing time (conforms to the Times attribute from the CSS specification). | Until closing element
mark | This attribute can be used to set an arbitrary mark at a given place in the text, so that, for example, an engine can report back to the calling application that it has reached the given location. | Character-string identifier for this tag. | No default - optional attribute
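Since the duration attribute in the table above follows the CSS "Times" format ("250ms", "3s"), a renderer has to convert both spellings to a single unit before scheduling the animation. A minimal sketch, assuming only the "ms" and "s" suffixes named in this document:

def parse_css_time(value):
    """Convert a CSS-style time ("250ms", "3s", "1.5s") to milliseconds."""
    value = value.strip().lower()
    if value.endswith("ms"):
        return float(value[:-2])
    if value.endswith("s"):
        return float(value[:-1]) * 1000.0
    raise ValueError("unrecognised time value: " + value)

print(parse_css_time("250ms"))  # 250.0
print(parse_css_time("3s"))     # 3000.0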
EML emotion elements can be placed in sequence to produce a seamless flow from one emotion to the other. Emotion elements can also be blended together at the same instance to produce different expressions and emotions entirely, as desired.
[How would we do this? Contribution attributes which are combined to produce 100% emotion? No contribution value means 100% of that emotion?]
OTHER EMOTIONS?????
Should the TAG names be nouns (sadness, anger) or verbs (sad, angry)?
Should we also allow subjective durations - short, medium, long - similar to the pause element?
anger
Description:
Simulates the effect of anger on the rendering (i.e. generates a Virtual Human that looks and sounds angry).
Attributes: Default EML Attributes.
Properties: Can contain other non-emotion elements.
Example:
<anger>
I would not give you the time of day
</anger>
joy == happy
Description:
Simulates the effect of happiness on the rendering (i.e. generates a Virtual Human that looks and sounds joyful).
Attributes: Default EML Attributes.
Properties: Can contain other non-emotion elements.
Example:
<joy>
I have some wonderful news for you.
</joy>
neutral
Description:
Gives a neutral intonation to the Virtual Human's appearance and sound.
Attributes: Default EML Attributes.
Properties: Can contain other non-emotion elements.
Example:
<neutral>
I can sometimes sound non-commital like this.
</neutral>
sadness
Description:
Simulates the effect of sadness on the rendering (i.e. generates a Virtual Human that looks and sounds sad).
Attributes: Default EML Attributes.
Properties: Can contain other non-emotion elements.
Example:
<sadness>
Honesty is hardly ever heard.
</sadness>
fear
Description:
Simulates the effect of fear on the rendering (i.e. generates a Virtual Human that looks and sounds afraid).
Attributes: Default EML Attributes.
Properties: Can contain other non-emotion elements.
Example:
<fear>
I am afraid of flying.
</fear>
disgust
Description:
Simulates the effect of disgust on the rendering (i.e. generates a Virtual Human that looks and sounds disgusted).
Attributes: Default EML Attributes.
Properties: Can contain other non-emotion elements.
Example:
<disgust>
How could you eat Roquefort cheese!
</disgust>
surprise
Description:
Simulates the effect of surprise on the rendering (i.e. generates a Virtual Human that looks and sounds surprised).
Attributes: Default EML Attributes.
Properties: Can contain other non-emotion elements.
Example:
<surprise>
I did not expect to find that in my lasagne!
</surprise>
dazed
Description:
Simulates the effect of being dazed on the rendering (i.e. generates a Virtual Human that looks and sounds dazed).
Attributes: Default EML Attributes.
Properties: Can contain other non-emotion elements.
Example:
<dazed>
Did you get the number of that truck?
</dazed>
confused
Description:
Simulates the effect of confusion on the rendering (i.e. generates a Virtual Human that looks and sounds confused).
Attributes: Default EML Attributes.
Properties: Can contain other non-emotion elements.
Example:
<confused>
If this is Tuesday, then this must be Linköping.
</confused>
bored
Description:
Simulates the effect of boredom on the rendering (i.e. generates a Virtual Human that looks and sounds bored).
Attributes: Default EML Attributes.
Properties: Can contain other non-emotion elements.
Example:
<bored>
Writing specifications is real fun.
</bored>
The following elements will accommodate other well known human emotional reactions. These will affect the voice, face and body of the Virtual Human.
[Should these be EML?]
1: The timing is such that the action is performed at the place where the element is (i.e. depends on what has been spoken/acted out before this element is met). This must take into account Text Normalisation differences between what the text is and what is actually spoken.
A <smile intensity="50" duration="5000"/>
little dog goes into
<head_left_roll intensity="40" duration="1200"/> <agree intensity="30" duration="1200"/>
a saloon in the Wild West, and
<head_right_roll intensity="60" duration="1000"/> <agree intensity="30" duration="1000"/>
<head_left intensity="40" duration="1000"/> beckons to the bartender.
2: These elements also have intensity and duration attributes as for the EML elements. The duration must be specified.
agree
Description:
The agree element animates a nod of the Virtual Human. The agree element animation is broken into two sections: the head raise and then the head lower.
Observations have shown that there is a raise of the head before the nod is initiated. The agree element mimics this and 10 percent of the duration for the agree element is allocated for the head raise, with an intensity of 10 percent of the authored intensity value; the other 90 percent is allocated to the head lower.
The agree element can typically be used to gesture "yes" or "agreement". Only the vertical angle of the head is altered during the element animation, the eye gaze is still focused forward.
[Body animation for this element?]
[Should % be an attribute?]
Attributes: Default EML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
That's certainly <agree duration="1000"/>right Olly.
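As a minimal sketch of the 10%/90% raise-and-lower split described above for the agree element, the following Python snippet computes the two phases from the authored duration and intensity; the keyframe-style representation is an assumption for illustration, not part of the specification.

def agree_keyframes(duration_ms, intensity):
    """Split an agree nod into its raise and lower phases.

    10 percent of the duration is the head raise at 10 percent of the
    authored intensity; the remaining 90 percent is the head lower at
    the full authored intensity.
    """
    raise_phase = {"phase": "raise",
                   "duration_ms": 0.10 * duration_ms,
                   "intensity": 0.10 * intensity}
    lower_phase = {"phase": "lower",
                   "duration_ms": 0.90 * duration_ms,
                   "intensity": intensity}
    return [raise_phase, lower_phase]

print(agree_keyframes(duration_ms=1000, intensity=50))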
disagree
Description:
The disagree element animates a shake of the head. The element animates two shakes; a single shake is considered to be a head movement from the left to the right.
The disagree element can be used as a facial gesture for "no" or "disagree".
The element only affects the horizontal displacement of the head and no other facial features are affected.
Animation involves moving first to the left, then right, repeated and then returning to the central plane.
[Body animation for this element?]
[Other attributes? - # of shakes, left or right first?]
Attributes: Default EML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
I <disagree duration="2000"/> will not have that smelly cheese on my spaghetti
emphasis
Description:
The emphasis element is very similar in animation to the agree element. The difference is that the emphasis element incorporates a lowering of the eyebrow into the nod itself, as described by Pelachaud and Prevost (1995). This serves to further emphasize or accentuate words in the spoken text.
The emphasis element similarly has raise and lower stages as found in the agree element animation. It is noted however that the eyebrows are lowered at the same rate as the nod; if a different intensity of eyebrow lowering is needed, the emphasis element can be used in conjunction with the eyebrow_down element to produce an emphasis animation with a greater or more subtle lowering of the eyebrow.
[Body animation for this element?]
Attributes: Default EML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
I <emphasis duration="500"/> will not buy this record, it is scratched.
smile
Description:
The smile element, as the name suggests, animates the expression of a smile in the Talking Head animation.
The mouth is widened and the corners pulled back towards the ears. The larger the intensity value for the smile element, the greater the intensity of the smile. However, too large a value produces a rather "cheesy" looking grin and can look disconcerting or phony. This can, however, be used to the animator's advantage if a mischievous grin or masking smile is required.
The smile element is generally used to start sentences and is used quite often when accentuating positive or cheerful words in the spoken text (Pelachaud and Prevost, 1995).
Attributes: Default EML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<smile duration="5000"/> Potatoes must be almost as good as chocolate to eat!
shrug
Description:
The shrug element animation mimics the facial and body expression "I don't know".
A facial shrug consists of the head tilting back, the corners of the mouth pulled downward and the inner eyebrow tilted upwards and squeezed together.
A body shrug consists of [INFO needed here please.]
Attributes: Default EML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<shrug duration="5000"/>I neither know nor care!
<?xml version="1.0"?>
<!DOCTYPE vhml SYSTEM "./vhml-v01.dtd">
<vhml>
<p>
<angry>Don't tell me what to do</angry>
<happy>I have some wonderful news for you</happy>
<neutral>I am saying this in a neutral voice</neutral>
<sad>I can not come to your party tomorrow</sad>
</p>
</vhml>
Each element has at least 3 attributes associated with it:
Name | Description | Values | Default
intensity | This value ranges from 0 to 100 and represents a percentage value of the maximum intensity of that particular facial gesture, expression or emotion. | 0 - 100 | 100
duration | The duration value represents the time span in milliseconds that the element expression, gesture or emotion will persist in the Virtual Human animation. | A numeric value representing time in milliseconds. | Must be specified
mark | This attribute can be used to set an arbitrary mark at a given place in the text, so that, for example, an engine can report back to the calling application that it has reached the given location. | Character-string identifier for this tag. | No default - optional attribute
The following elements affect the direction or orientation of the head and the eyes (directions are with respect to the Talking Head).
The animation of the head movement can be broken down into three main parts: pitch, yaw and roll.
The pitch affects the elevation and depression of the head in the vertical field. The yaw affects the rotational angle of the head in the horizontal field and roll affects the axial angle. The combination of these three factors allows full directional movement for the animation of the Talking Head.
1: There are 12 main elements that control and animate the direction and orientation of the Talking Head. [Should we have independent eye/head movement?]
2: It is noted that the eyes and head move at the same rate during the animation of the looking elements.
3: All combinations of the above directional elements allow the head to have full range of orientation. A combination of the <look_left/> and <look_up/> elements will enable the head to look to the top left in the animation sequence, whilst <look_right/> <look_down/> will enable the head to look to the bottom right.
4: The eyes_xxx directional elements allow four independent directions for eye movement. This entails movement in the vertical and horizontal planes. As with the head directional elements, the elements can be combined together to provide a full range of eye gaze, even directions not humanly possible. It is however noted that the eyes cannot be animated independently of each other. [Is this a problem???? We could use the which attribute of eyebrow_up]
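The pitch/yaw/roll decomposition described above can be realised as a composition of three elementary rotations. A minimal Python sketch follows; the rotation order chosen here (roll applied last) is an assumption for illustration, and an MPEG-4 renderer may compose the rotations differently.

import math

def head_rotation(pitch, yaw, roll):
    """Compose a head orientation matrix from pitch (x), yaw (y) and roll (z), in radians."""
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    cr, sr = math.cos(roll), math.sin(roll)
    rx = [[1, 0, 0], [0, cp, -sp], [0, sp, cp]]      # pitch about the x axis
    ry = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]      # yaw about the y axis
    rz = [[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]]      # roll about the z axis
    def mul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]
    return mul(rz, mul(ry, rx))   # roll applied last (assumption)

# Look up and to the left, for example: negative pitch and positive yaw.
m = head_rotation(pitch=math.radians(-15), yaw=math.radians(20), roll=0.0)
print(m[0])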
anger
Description:
Inherited from EML.
joy == happy
Description:
Inherited from EML.
neutral
Description:
Inherited from EML.
sadness
Description:
Inherited from EML.
fear
Description:
Inherited from EML.
disgust
Description:
Inherited from EML.
surprise
Description:
Inherited from EML.
dazed
Description:
Inherited from EML.
confused
Description:
Inherited from EML.
bored
Description:
Inherited from EML.
look_left
Description:
Turns both the eyes and head to look left.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<look_left duration="1000"/>Cheese to the left of me!
look_right
Description: Turns both the eyes and head to look right.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<look_right duration="800"/>Cheese to the right of me!
look_up
Description:
Turns both the eyes and head to look up.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<look_up duration="5000"/>Dear God, is there no escaping this smelly cheese?
look_down
Description:
Turns both the eyes and head to look down.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<look_down duration="1000"/>Perhaps it is just my feet!
head_left
Description:
Only the head turns left, the eyes remain looking forward.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<head_left duration="2000" intensity="30"/>What, no potatoes?
head_right
Description:
Only the head turns right, the eyes remain looking forward.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<head_right duration="100"/>Where is the chocolate?
head_up
Description:
Only the head turns upward, the eyes remain looking forward.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<head_up intensity="100" duration="1000"/>You are an insolent swine!
head_down
Description:
Only the head turns downward, the eyes remain looking forward.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<head_down duration="2500"/>Are you happy now?
eyes_left
Description:
Only the eyes turn left, the head remains looking forward.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<eyes_left duration="1000"/>There is the door, please use it.
eyes_right
Description:
Only the eyes turn right, the head remains looking forward.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<eyes_right duration="1000"/>Stand still laddie!
eyes_up
Description:
Only the eyes turn upward, the head remains looking forward.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<eyes_up intensity="75" duration="1000"/>Not that turnip!
eyes_down
Description:
Only the eyes turn downward, the head remains looking forward.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<eyes_down duration="1000"/>Sorry seems to be the hardest word.
head_left_roll
Description:
The roll element animates the roll of the Talking Head in the axial plane. Roll, although subtle in normal movement, is essential for realism.
This element allows the author to script roll movement in the Talking Head, typically in conjunction with other elements, such as nodding and head movements, to add further realism to the Talking Head.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<head_left_roll duration="1000"/>Way over yonder.
head_right_roll
Description:
The roll element animates the roll of the Talking Head in the axial plane. Roll, although subtle in normal movement, is essential for realism.
This element allows the author to script roll movement in the Talking Head, typically in conjunction with other elements, such as nodding and head movements, to add further realism to the Talking Head.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<head_right_roll duration="800"/>What a strange sight!
1: The eyebrow movement elements enable the author to script certain eyebrow movements to accentuate words or phrases. MPEG-4 separates the eyebrow into three regions: inner, middle and outer. The eyebrow elements affect all three regions of the eyebrow to animate movement.
[individual sections to be moved independently???]
[Should we mention MPEG-4?]
eyebrow_up
Description:
Vertical eyebrow movement upwards.
Attributes: Default FAML Attributes.
duration must have a value.
Name | Description | Values | Default
which | which eyebrow to move | both, right, left | both
Properties: none (Atomic element).
Example:
<eyebrow_up which="left" duration="1000"/> Fascinating Captain.
eyebrow_down
Description:
Vertical eyebrow movement downwards.
Attributes: Default FAML Attributes.
duration must have a value.
Name | Description | Values | Default
which | which eyebrow to move | both, right, left | both
Properties: none (Atomic element).
Example:
<eyebrow_down duration="1000"/>I am not happy with you!
eyebrow_squeeze
Description:
Squeezing of the eyebrows together.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
<eyebrow_squeeze duration="1000"/>Oooh, that's difficult.
blink
Description:
The blink element animates a blink of both eyes in the Talking Head animation.
The blink element only affects the upper and lower eyelid facial features of the head. By altering the intensity value, the amount of eye closure in the animation is changed. An intensity value of 50 denotes 50 percent of the maximum amplitude for the blink element, and as such the animation would only reflect a half blink where only half of the eyeball is covered.
Attributes: Default FAML Attributes.
duration must have a value.
[Attributes for left/right start time?]
Properties: none (Atomic element).
Example:
He gave a <blink intensity="10" duration="500"/> blink, then a <right_wink duration="500"/> wink and laughed.
double_blink
Description:
Not all blinks in humans are singular. Observation has shown that double blinking is quite common and can precede changes in emotion or denote sympathetic output.
Attributes: Default FAML Attributes.
duration must have a value.
[Attributes for left/right start time?]
Properties: none (Atomic element).
Example:
<double_blink duration="20"/>What a surprise!!
left_wink
Description:
Animates a wink of the left eye. The wink is not just the blinking of one eye: the head pitch, roll and yaw are affected, as well as the outer eyebrow and cheek. The combination of these animated features adds to the realism of the wink itself.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
Nudge, nudge, <left_wink duration="500"/> wink,
<left_wink duration="2000"/>wink.
right_wink
Description:
Animates a wink of the right eye. The wink is not just the blinking of one eye: the head pitch, roll and yaw are affected, as well as the outer eyebrow and cheek. The combination of these animated features adds to the realism of the wink itself.
Attributes: Default FAML Attributes.
duration must have a value.
Properties: none (Atomic element).
Example:
Nudge, nudge, <left_wink duration="500"/> wink,
<right_wink duration="2000"/>wink.
[Should we translate HTML into the ACSS as shown or only allow a minimum subset of well formed HTML?]
H1, H2, H3,
H4, H5, H6 { voice-family: paul, male; stress: 20; richness: 90 }
H1 { pitch: x-low; pitch-range: 90 }
H2 { pitch: x-low; pitch-range: 80 }
H3 { pitch: low; pitch-range: 70 }
H4 { pitch: medium; pitch-range: 60 }
H5 { pitch: medium; pitch-range: 50 }
H6 { pitch: medium; pitch-range: 40 }
LI, DT, DD { pitch: medium; richness: 60 }
DT { stress: 80 }
PRE, CODE, TT { pitch: medium; pitch-range: 0; stress: 0; richness: 80 }
EM { pitch: medium; pitch-range: 60; stress: 60; richness: 50 }
STRONG { pitch: medium; pitch-range: 60; stress: 90; richness: 90 }
DFN { pitch: high; pitch-range: 60; stress: 60 }
S, STRIKE { richness: 0 }
I { pitch: medium; pitch-range: 60; stress: 60; richness: 50 }
B { pitch: medium; pitch-range: 60; stress: 90; richness: 90 }
U { richness: 0 }
A:link { voice-family: harry, male }
A:visited { voice-family: betty, female }
A:active { voice-family: betty, female; pitch-range: 80; pitch: x-high }
[Input here please, what extra markup is needed?]
1: Movement
2: Stance
3: Uses EML
4: Gestures
anger
Description:
Inherited from EML.
joy == happy
Description:
Inherited from EML.
neutral
Description:
Inherited from EML.
sadness
Description:
Inherited from EML.
fear
Description:
Inherited from EML.
disgust
Description:
Inherited from EML.
surprise
Description:
Inherited from EML.
dazed
Description:
Inherited from EML.
confused
Description:
Inherited from EML.
bored
Description:
Inherited from EML.
This language covers the Dialogue Manager's response only, not the pattern matching or the overall Knowledge base format.
[Since this work has already begun we need to talk about a preferred subset of AIML that can be used for the DMML.]
Therefore, the AIML tags,
<alice></alice> root element of Alice
<category></category> categorization of an Alice topic.
<pattern></pattern> the user input pattern.
<template>XXXX</template> the marking of the DM's response
are not part of DMML.
The XXXX in the above is covered by DMML. For example, in the Alice fragment:
<template>
My name is <getvar name="botname"/>.
What is your name?
</template>
the DMML would handle the plain text "My name is ", the XML element "<getvar name="botname"/>" and the trailing text ". What is your name?".
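A minimal sketch of how such a Dialogue Manager response might be rendered against a variable store, using the standard xml.etree.ElementTree module; the variable store and the use of the default attribute as a fallback are illustrative assumptions, not requirements of this specification.

import xml.etree.ElementTree as ET

# Hypothetical variable store for the Dialogue Manager (illustration only).
VARIABLES = {"botname": "Mentor", "DMname": "Mentor"}

def render_template(template_xml, variables):
    """Expand <getvar name="..." default="..."/> elements into plain text."""
    root = ET.fromstring(template_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "getvar":
            name = child.get("name", "")
            parts.append(variables.get(name, child.get("default", "")))
        parts.append(child.tail or "")
    return "".join(parts)

template = """<template>
My name is <getvar name="botname"/>.
What is your name?
</template>"""
print(render_template(template, VARIABLES))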
<star/> indicates the input text fragment matching the pattern '*' or '_'.
<that></that> If the previous bot reply matches the THAT pattern, this event is fired.
<that/> = <that><star/></that>
<justbeforethat> </justbeforethat>
<justthat> </justthat>
<person2> X </person2> change X from 1st to 2nd person
<person2/> = <person2><star/></person2>
<person> X </person> exchange 1st and 3rd person
<person/> = <person><star/></person>
<srai> X </srai> calls the pattern matches recursively on X.
<sr/> =<srai><star/></srai>
<random> <li>X1</li><li>X2</li> </random> Say one of X1 or X2 randomly
<system>X</system> tag to run the shell command X
<think> X </think> The tag pair evaluates the AIML expression X, but "nullifies" or hides the result from the client reply.
<gossip> X </gossip> Save X as gossip.
<getvar name = "Name Of Variable" default="Default if no variable found"/>
and
<setvar name = "Name Of Variable"> Set it to this </setvar>
The recognised variable names are:
Preferred name | Legacy equivalent name (deprecated) | Atomic tag
DMbirthplace botbirthplace <birthplace/>
DMbirthday botbirthday <birthday/>
DMmaster botmaster <botmaster/>
DMboyfriend botboyfriend <boyfriend/>
DMband botband <favorite_band/>
DMbook botbook <favorite_book/>
DMcolor botcolor <favorite_color/>
DMfood botfood <favorite_food/>
DMmovie botmovie <favorite_movie/>
DMsong botsong <favorite_song/>
DMfun botfun <for_fun/>
DMfriends botfriends <friends/>
DMgender botgender <gender/>
DMgirlfriend botgirlfriend <girlfriend/>
DMmusic botmusic <kind_music/>
DMlooks botlooks <look_like/>
DMname botname <name/>
DMsize botsize <getsize/>
question <question/>
name <getname/>
topic <gettopic/>
age <get_age/>
gender <get_gender/>
has <get_has/>
he <get_he/>
ip <get_ip/>
it <get_it/>
location <get_location/>
she <get_she/>
they <get_they/>
we <get_we/>
dialogueManagerName
dialogueManagerwhoami
dialogueManagerGender
dialogueManagerHisHer
dialogueManagerHimHer
dialogueManagerHeShe
dialogueManagerMaster
dialogueManagerBirthPlace
dialogueManagerBirthDay
dialogueManagerAge
dialogueManagerDescription
dialogueManagerFavouriteColour
dialogueManagerFavouriteSport
dialogueManagerFavouriteFood
dialogueManagerFavouritePainter
dialogueManagerFavouriteArtist
dialogueManagerFavouriteBook
dialogueManagerFavouriteMovie
dialogueManagerFavouriteMusic
dialogueManagerFavouriteSong
dialogueManagerFavouriteAlbum
dialogueManagerPurpose
dialogueManagerHomeURL
The following list is a description of each of SML's elements. As with any XML element, all SML elements are case sensitive; therefore, all SML elements must appear in lower case, otherwise they will be ignored.
Name | Description | Values | Default
mark | This attribute can be used to set an arbitrary mark at a given place in the text, so that, for example, an engine can report back to the calling application that it has reached the given location. | Character-string identifier for this tag. | No default - optional attribute
xml:lang
Description:
Following the XML convention, languages are indicated by an xml:lang attribute on the enclosing element with the value following RFC 1766 to define language codes. Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.
Example:
<vhml xml:lang="en-US">
<paragraph>I don't speak Japanese.</paragraph>
<paragraph xml:lang="ja">
Nihongo-ga wakarimasen.
</paragraph>
</vhml>
Notes:
1: The speech output platform determines behavior in the case that a document requires speech output in a language not supported by the speech output platform. This is currently one of only two allowed exceptions to the conformance criteria.
2: There may be variation across conformant platforms in the implementation of xml:lang for different markup elements. A document author should beware that intra-sentential language changes may not be supported on all platforms.
3: A language change often necessitates a change in the voice. Where the platform does not have the same voice in both the enclosing and enclosed languages it should select a new voice with the inherited voice attributes. Any change in voice will reset the prosodic attributes to the default values for the new voice of the enclosed text. Where the xml:lang value is the same as the inherited value there is no need for any changes in the voice or prosody.
4: All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis and break element should each be rendered in a manner that is appropriate to the current language.
5: Unsupported languages on a conforming platform could be handled by specifying nothing and relying on platform behavior, issuing an event to the host environment, or by providing substitute text in the Markup Language.
[Should this be for all markups? Body Language as well?]
anger
Description:
Inherited from EML.
joy == happy
Description:
Inherited from EML.
neutral
Description:
Inherited from EML.
sadness
Description:
Inherited from EML.
fear
Description:
Inherited from EML.
disgust
Description:
Inherited from EML.
surprise
Description:
Inherited from EML.
dazed
Description:
Inherited from EML.
confused
Description:
Inherited from EML.
bored
Description:
Inherited from EML.
p == paragraph
Description:
Element used to divide text into paragraphs. Can only occur directly within a vhml element. The p element wraps emotion elements.
Attributes: none.
Properties: Can contain all other elements, except itself and vhml.
Example:
<p>
<sad>Today it's been raining all day,</sad>
<happy>
But they're calling for sunny skies tomorrow.
</happy>
</p>
Notes:
1: For brevity, the markup supports <p> as an exact equivalent of <paragraph>. (Note: XML requires that the opening and closing elements be identical, so <p> text </paragraph> is not legal.)
2: The use of paragraph elements is optional. Where text occurs without an enclosing paragraph element the speech output system should attempt to determine the structure using language-specific knowledge of the format of plain text.
s == sentence
Description:
Element used to divide text into sentences. Can only occur directly within a vhml element.
Attributes: none.
Properties: Can contain all other elements, except itself and vhml.
Example:
<p>
<sentence>Today it's been raining,</sentence>
<happy>
But they're calling for sunny skies tomorrow.
</happy>
</p>
Notes:
1: For brevity, the markup also supports <s> as an exact equivalent of <sentence>. (Note: XML requires that the opening and closing elements be identical, so <s> text </sentence> is not legal.) Also note that <s> means "strike-out" in HTML 4.0 and earlier, and in XHTML-1.0-Transitional but not in XHTML-1.0-Strict.
2: The use of the sentence element is optional. Where text occurs without an enclosing sentence element the speech output system should attempt to determine the structure using language-specific knowledge of the format of plain text.
say-as
Description:
The say-as element indicates the type of text construct contained within the element. This information is used to help specify the pronunciation of the contained text. Defining a comprehensive set of text format types is difficult because of the variety of languages that must be considered and because of the innate flexibility of written languages.
Attributes:
The say-as element has been specified with a reasonable set of format types. Text substitution may be utilized for unsupported constructs.
The type attribute is a required attribute that indicates the contained text construct. The format is a text type optionally followed by a colon and a format. The base set of type values, divided according to broad functionality, is as follows:
Pronunciation Types
acronym: contained text is an acronym. The characters in the contained text string are pronounced as individual characters.
<say-as type="acronym"> USA </say-as>
<!-- U. S. A. -->
Numerical Types
number: contained text contains integers, fractions, floating points, Roman numerals or some other textual format that can be interpreted and spoken as a number in the current language. Format values for numbers are:
"ordinal", where the contained text should be interpreted as an ordinal. The content may be a digit sequence or some other textual format that can be interpreted and spoken as an ordinal in the current language; and
"digits", where the contained text is to be read as a digit sequence, rather than as a number.
Rocky <say-as type="number"> XIII </say-as>
<!-- Rocky thirteen -->
Pope John the <say-as type="number:ordinal"> VI </say-as>
<!-- Pope John the sixth -->
Deliver to <say-as type="number:digits"> 123 </say-as> Brookwood.
<!-- Deliver to one two three Brookwood-->
Time, Date and Measure Types
date: contained text is a date. Format values for dates are:
"dmy", "mdy", "ymd" (day, month , year), (month, day, year), (year, month, day)
"ym", "my", "md" (year, month), (month, year), (month, day)
"y", "m", "d" (year), (month), (day).
time: contained text is a time of day. Format values for times are:
"hms", "hm", "h" (hours, minutes, seconds), (hours, minutes), (hours).
duration: contained text is a temporal duration. Format values for durations are:
"hms", "hm", "ms", "h", "m", "s" (hours, minutes, seconds), (hours, minutes), (minutes, seconds), (hours), (minutes), (seconds).
currency: contained text is a currency amount.
measure: contained text is a measurement.
telephone: contained text is a telephone number.
<say-as type="date:ymd"> 2000/1/20 </say-as>
<!-- January 20th two thousand -->
Proposals are due in <say-as type="date:my"> 5/2001 </say-as>
<!-- Proposals are due in May two thousand and one -->
The total is <say-as type="currency"> $20.45</say-as>
<!-- The total is twenty dollars and forty-five cents -->
When multi-field quantities are specified ("dmy", "my", etc.), it is assumed that the fields are separated by a single non-alphanumeric character.
Address, Name, Net Types
name: contained text is a proper name of a person, company etc.
net: contained text is an internet identifier. Format values for net are: "email", "uri".
address: contained text is a postal address.
<say-as type="net:email"> road.runner@acme.com </say-as>
The sub attribute is employed to indicate that the specified text replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form.
<say-as sub="World Wide Web Consortium"> W3C
</say-as>
<!-- World Wide Web Consortium -->
Notes:
1: The conversion of the various types of text and text markup to spoken forms is language and platform-dependent. For example, <say-as type="date:ymd"> 2000/1/20 </say-as> may be read as "January twentieth two thousand" or as "the twentieth of January two thousand" and so on. The markup examples above are provided for usage illustration purposes only.
2: It is assumed that pronunciations generated by the use of explicit text markup always take precedence over pronunciations produced by a lexicon.
phoneme
Description:
The phoneme element provides a phonetic pronunciation for the contained text. The phoneme element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.
Attributes:
The alphabet attribute is an optional attribute that specifies the phonetic alphabet.
ipa: The specified phonetic string is composed of symbols from the International Phonetic Alphabet (IPA).
worldbet: The specified phonetic string is composed of symbols from the Worldbet (Postscript) phonetic alphabet.
xsampa: The specified phonetic string is composed of symbols from the X-SAMPA phonetic alphabet.
The ph attribute is a required attribute that specifies the phoneme string.
Example:
<phoneme alphabet="ipa" ph="tɒmɑtoʊ"> tomato </phoneme>
<!-- This is an example of IPA using character entities -->
Notes:
1: Characters composing many of the IPA phonemes are known to display improperly on most platforms. Additional IPA limitations include the fact that IPA is difficult to understand even when using ASCII equivalents, IPA is missing symbols required for many of the world's languages, and IPA editors and fonts containing IPA characters are not widely available.
2: Entity definitions may be used for repeated pronunciations. For example:
<!ENTITY uk_tomato "tɒmɑtoʊ">
... you say <phoneme ph="&uk_tomato;"> tomato </phoneme>
I say...
3: In addition to an exhaustive set of vowel and consonant symbols, IPA supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more.
voice
Description:
The voice element is a production element that requests a change in speaking voice.
Attributes:
gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral".
age: optional attribute indicating the preferred age of the voice to speak the contained text. Acceptable values are of type (integer).
category: optional attribute indicating the preferred age category of the voice to speak the contained text. Enumerated values are: "child" , "teenager" , "adult", "elder".
variant: optional attribute indicating a preferred variant of the other voice characteristics to speak the contained text. (e.g. the second or next male child voice). Acceptable values are of type (integer).
name: optional attribute indicating a platform-specific voice name to speak the contained text. The value may be a space-separated list of names ordered from top preference down. Acceptable values are of the form (voice-name-list).
Examples:
<voice gender="female" category="child">
Mary had a little lamb,
</voice>
<!-- now request a different female child's voice -->
<voice gender="female" category="child" variant="2"> It's fleece was white as snow.
</voice>
<!-- platform-specific voice selection -->
<voice name="Mike">
I want to be like Mike.
</voice>
Notes:
1: When there is not a voice available that exactly matches the attributes specified in the document, the voice selection algorithm may be platform-specific.
2: Voice attributes are inherited down the tree including to within elements that change the language.
<voice gender="female">
Any female voice here.
<voice category="child">
A female child voice here.
<paragraph xml:lang="ja">
<!-- A female child voice in Japanese. -->
</paragraph>
</voice>
</voice>
3: A change in voice resets the prosodic parameters since different voices have different natural pitch and speaking rates. Volume is the only exception. It may be possible to preserve prosodic parameters across a voice change by employing a style sheet. Characteristics specified as "+" or "-" voice attributes with respect to absolute voice attributes would not be preserved.
4: The xml:lang attribute may be used specially to request usage of a voice with a specific dialect or other variant of the enclosing language.
<voice xml:lang="en-cockney">Try a Cockney voice
(London area).</voice>
<voice xml:lang="en-brooklyn">Try one New York
accent.</voice>
emphasis
Description:
The emphasis element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesizer determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices.
See also emphasise_syllable
Attributes:
level: the "level" attribute indicates the strength of emphasis to be applied. Defined values are "strong", "moderate", "none" and "reduced". The default level is "moderate". The meaning of "strong" and "moderate" emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The "reduced" level is effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The "none" level is used to prevent the speech synthesizer from emphasizing words that it might typically emphasize.
Examples:
That is a <emphasis> big </emphasis> car!
That is a <emphasis level="strong"> huge </emphasis>
bank account!
break
Description:
The break element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. If the element is not defined, the speech synthesizer is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a speech synthesizer.
See also pause element.
Attributes:
size: the "size" attribute is an optional attribute having one of the following relative values: "none", "small", "medium" (default value), or "large". The value "none" indicates that a normal break boundary should be used. The other three values indicate increasingly large break boundaries between words. The larger boundaries are typically accompanied by pauses.
time: the "time" attribute is an optional attribute indicating the duration of a pause in seconds or milliseconds. It follows the "Times" attribute format from the Cascading Style Sheet Specification. e.g. "250ms", "3s".
Examples:
Take a deep breath <break/> then continue.
Press 1 or wait for the tone. <break time="3s"/>
I didn't hear you!
Notes:
1: Using the size attribute is generally preferable to the time attribute within normal speech. This is because the speech synthesizer will modify the properties of the break according to the speaking rate, voice and possibly other factors. As an example, a fixed 250ms pause (placed with the time attribute) sounds much longer in fast speech than in slow speech.
prosody
Description:
The prosody element permits control of the pitch, speaking rate and volume of the speech output.
See also pitch element.
Attributes:
pitch: the baseline pitch for the contained text in Hertz, a relative change or values "high", "medium", "low", "default".
contour: sets the actual pitch contour for the contained text. The format is outlined below.
range: the pitch range (variability) for the contained text in Hertz, a relative change or values "high", "medium", "low", "default".
rate: the speaking rate for the contained text, a relative change or values "fast", "medium", "slow", "default".
duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the Times attribute format from the Cascading Style Sheet Specification. e.g. "250ms", "3s".
volume: the volume for the contained text in the range 0.0 to 100.0, a relative change or values "silent", "soft", "medium", "loud" or "default".
Note: this attribute sets only the volume, and does not change voice quality (e.g. quiet is not a whisper).
Relative values
Relative changes for any of the attributes above are specified as floating-point values: "+10", "-5.5", "+15.2%", "-8.0%". For the pitch and range attributes, relative changes in semitones are permitted: "+5st", "-2st". Since speech synthesizers are not able to apply arbitrary prosodic values, conforming speech synthesis processors may set platform-specific limits on the values. This is the second of only two exceptions allowed in the conformance criteria for a VHML processor.
Examples:
The price of XYZ is <prosody rate="-10%">
<say-as type="currency">$45</say-as></prosody>
Pitch contour
The pitch contour is defined as a set of targets at specified intervals in the speech output. The algorithm for interpolating between the targets is platform-specific. In each pair of the form (interval,target), the first value is a percentage of the period of the contained text and the second value is the value of the pitch attribute (absolute, relative, relative semitone, or descriptive values are all permitted). Interval values outside 0% to 100% are ignored. If a value is not defined for 0% or 100%, the nearest pitch target is copied.
Examples:
<prosody contour="(0%,+20)(10%,+30%)(40%,+10)">
good morning
</prosody>
Notes:
1: The descriptive values ("high", "medium" etc.) may be specific to the platform, to user preferences or to the current language and voice. As such, it is generally preferable to use the descriptive values or the relative changes over absolute values.
2: The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.
3: The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.
4: All prosodic attribute values are indicative: if a speech synthesizer is unable to accurately render a document as specified (e.g. if asked to set the pitch to 1 MHz, or the speaking rate to 1,000,000 words per minute), it will make a best effort.
audio
Description:
The audio element supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The contents may also be used when rendering the document to non-audible output and for accessibility.
Attributes:
The required attribute is src, which is the URI of a document with an appropriate MIME type.
Examples:
<!-- Empty element -->
Please say your name after the tone. <audio src="beep.wav"/>
<!-- Container element with alternative text -->
<audio src="prompt.au">What city do you want to fly from?</audio>
Notes:
1: The audio element is not intended to be a complete mechanism for synchronizing synthetic speech output with other audio output or other output media (video etc.). Instead the audio element is intended to support the common case of embedding audio files in voice output.
2: The alternative text may contain markup. The alternative text may be used when the audio file is not available, when rendering the document as non-audio output, or when the speech synthesizer does not support inclusion of audio files.
mark
Description:
A mark element is an empty element that places a marker into the output stream for asynchronous notification. When audio output of the TTS document reaches the mark, the speech synthesizer issues an event that includes the required name attribute of the element. The platform defines the destination of the event. The mark element does not affect the speech output process.
Attributes:
The required attribute is name, which is a character string.
Examples:
Go from <mark name="here"/> here, to <mark name="there"/> there!
Notes:
1: When supported by the implementation, requests can be made to pause and resume at document locations specified by the mark values.
2: The mark name is not required to be unique within a document.
emphasise_syllable == emphasize_syllable
Description:
Emphasizes a syllable within a word.
Attributes:
Name | Description | Values
target | Specifies which phoneme in the contained text is the target phoneme. If target is not specified, the default target is the first phoneme found within the contained text. | A character string representing a phoneme symbol; uses the MRPA phoneme set.
level | The strength of the emphasis (default level is weak). | weakest, weak, moderate, strong
affect | Specifies whether the element affects the contained text's phoneme pitch values, duration values, or both (default is pitch only). | p - affect pitch only; d - affect duration only; b - affect both pitch and duration
Properties: Cannot contain other elements.
Example:
I have told you <emphasise_syllable affect="b" level="moderate">so</emphasise_syllable> many times.
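As a further, non-normative sketch, the target attribute might be used to pick out a particular syllable; the phoneme symbol shown is illustrative only:
Try the <emphasise_syllable target="ei" level="strong">potato</emphasise_syllable> salad.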
pause
Description:
Inserts a pause in the utterance.
Attributes:
Name | Description | Values
length | Specifies the length of the pause using a descriptive value. | short, medium, long
msec | Specifies the length of the pause in milliseconds. | A positive number.
smooth | Specifies whether the last phonemes before this pause should be lengthened slightly. | yes, no (default = yes)
Properties: empty.
Example:
I'll take a deep breath <pause length="long"/> and try it again.
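A second sketch (the values are illustrative only) uses the msec and smooth attributes:
Wait for the tone <pause msec="500" smooth="no"/> then state your name.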
pitch
Description:
Element that changes pitch properties of contained text.
Attributes:
Name | Description | Values
middle | Increases/decreases the pitch average of the contained text by N%. | (+/-)N%, highest, high, medium, low, lowest
range | Increases/decreases the pitch range of the contained text by N%. | (+/-)N%
Properties: Can contain other non-emotion elements.
Example:
'Not I', <pitch middle="-20%">said the dog</pitch>
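For illustration only (the values are arbitrary), the range attribute can be combined with middle in the same way:
<pitch middle="+10%" range="+30%">Did you really mean that?</pitch>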
This section is Normative.
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this conformance section are to be interpreted as described in RFC 2119.
A Virtual Human markup document fragment is a Conforming XML Document Fragment if it adheres to the specification described in this document including the DTD (see Document Type Definition) and also:
(relative to XML) is well-formed.
if all non-Virtual Human namespace elements and attributes and all xmlns attributes which refer to non-Virtual Human namespace elements are removed from the given document, and if an appropriate XML declaration (i.e., <?xml...?>) is included at the top of the document, and if an appropriate document type declaration which points to the Virtual Human DTD is included immediately thereafter, the result is a valid XML document.
conforms to the following W3C Recommendations:
the XML 1.0 specification (Extensible Markup Language (XML) 1.0).
(if any namespaces other than Virtual Human markup are used in the document) Namespaces in XML.
Neither the Virtual Human Markup Language nor these conformance criteria place designated size limits on any aspect of Virtual Human markup documents. There are no maximum values for the number of elements, the amount of character data, or the number of characters in attribute values.
A file is a Conforming Stand-Alone Virtual Human Markup Language Document if:
it is an XML document.
its root element is a 'vhml' element.
it conforms to the criteria for Conforming Virtual Human Markup Language Fragments.
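As a non-normative illustration, a minimal stand-alone document might look like the following (the DTD system identifier and the element nesting are assumptions, not requirements of this specification):
<?xml version="1.0"?>
<!DOCTYPE vhml SYSTEM "vhml.dtd">
<vhml>
Good morning. <pause length="short"/>
How may I <emphasis>help</emphasis> you today?
</vhml>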
A Virtual Human Markup Language processor is a program that can parse and process Virtual Human Markup Language documents.
In a Conforming Virtual Human Markup Language Processor, the XML parser must be able to parse and process all XML constructs defined within XML 1.0 and XML Namespaces.
A Conforming Virtual Human Markup Language Processor must correctly understand and apply the command logic defined for each markup element as described by this document. Exceptions to this requirement are allowed when an xml:lang attribute is utilized to specify a language not present on a given platform, and when a non-enumerated attribute value is specified that is out-of-range for the platform. The response of the Conforming Virtual Human Markup Language Processor in both cases would be platform-dependent.
A Conforming Virtual Human Markup Language Processor should inform its hosting environment if it encounters an element, element attribute, or syntactic combination of elements or attributes that it is unable to support. A Conforming Virtual Human Markup Language Processor should also inform its hosting environment if it encounters an illegal Virtual Human document or unknown XML entity reference.
[Figure: rendering a Web page directly as XML versus rendering via FAP / BAP / TTS engines]
Normative.
Java Speech API Markup Language
http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html JSML is an XML specification for controlling text-to-speech engines. Implementations are available from IBM, from Lernout & Hauspie, in the Festival speech synthesis system, and in other implementations of the Java Speech API.
S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", RFC 2119, Harvard University, March 1997
Informative.
SABLE
http://www.research.att.com/~rws/Sable.v1_0.htm SABLE is a markup language for controlling text-to-speech engines. It evolved out of work on combining three existing text-to-speech markup languages: SSML, STML and JSML. Implementations are available for the Bell Labs synthesizer and in the Festival speech synthesizer. The following are two of the papers written about SABLE and its applications:
SABLE: A Standard for TTS Markup, Sproat et al. (http://www.research.att.com/~rws/SABPAP/sabpap.htm)
SABLE: an XML-based Aural Display List For The WWW, Sproat and Raman. (http://www.bell-labs.com/project/tts/csssable.html)
Spoken Text Markup Language (STML)
(http://www.cstr.ed.ac.uk/publications/1997/Sproat_1997_a.ps) STML is an SGML-based language for controlling text-to-speech engines, developed jointly by Bell Laboratories and the Centre for Speech Technology Research, Edinburgh University.
Microsoft Speech API Control Codes
(http://www.microsoft.com/iit/) SAPI defines a set of inline control codes for manipulating speech output by SAPI speech synthesizers.
VoiceXML
(http://www.voicexml.com/) The VoiceXML specification for dialog system development includes a set of prompt elements for generating speech synthesis and other audio output that are very similar to elements of JSML and SABLE.
Pelachaud, C. and Prevost, S. (1995) Talking heads: Physical, linguistic and cognitive issues in facial animation. Course Notes for Computer Graphics International '95.
This Working Draft was compiled and adapted from various existing specifications and sources.