A Survey of Problems in Text to Speech by Computer

by Scott Jann

University of Minnesota
Linguistics 3971
Winter 1998

1.0 Introduction

One area of computational linguistics with many practical applications is speech synthesis. The ability to convert written text into an understandable speech signal is useful for many purposes. It is often more natural for textual information on a computer to be spoken to the user than for the user to have to read the screen. Not only is it more natural, for users who cannot read the screen, such as the blind, it is the only option.

This paper looks at a particular piece of speech synthesis, as mentioned above: converting textual information into an audible speech signal. Specifically, it looks at unrestricted text to speech, which is to say the ability to process any arbitrary text. Arbitrary text could be the result of a database query, the contents of a web page, or just a phrase entered by the user. This is the same task a person performs when reading aloud.

The idea of unrestricted input is important because this task differs considerably from one with a restricted vocabulary. With a restricted vocabulary, it is a reasonable solution to simply record the required words or phrases. For example, a talking clock which announces the time on the hour could have the phrases for one o’clock through twelve o’clock recorded, and a subway system which announces its location could have the phrase “The train is now at…”, followed by a recording of each stop the train makes, stored on tape or in a computer chip.

If someone took this approach to the task of synthesizing a speech signal for every phrase a user might enter, he would quickly run out of resources for recording. Having realized that the first attempt was futile, he might instead assume that 10,000 words are sufficient for most purposes and set out to record someone saying those 10,000 words. He would be very disappointed with the result: a patchwork of discontinuous words. A phrase would sound either like a completely disjointed list of words or, if the words were recorded with the same tone, like a slow, monotonous rambling.

The point of this description is to show that converting unrestricted text to speech is not at all a simple problem. When we speak, words blend into each other, certain syllables are stressed, and tone rises and falls over portions of a phrase. Not only do we do this, we expect it, and we rely on it to understand what is said. As described above, it is not feasible to record every combination of how words can be arranged, much less how they can vary in stress and other ways. The solution is to build a speech signal from scratch by applying knowledge from every area of linguistics: phonetics, phonology, morphology, syntax, and semantics.

2.0 Background

In order to synthesize a speech signal, the process is, very roughly, taking the input, converting it into detailed phonetic data, and then converting that into a waveform which can be sent to a speaker or stored for later playback. However, it isn’t as simple as taking the words, converting them to a phonetic representation, and passing that to a mechanism which generates a speech signal; even that apparently simple pipeline hides a great deal of complexity. For example, working with English often requires handling groups of letters in the written form which don’t have any obvious relation to the actual pronunciation (e.g. “tough”), and on top of that there are homographs and foreign words which can’t possibly fit into a set of rules for matching spelling to pronunciation. In addition to the raw sounds which make up how a word is pronounced, one must add stress on the appropriate syllables, intonation implied by the punctuation, and pauses both for punctuation and for where a person would need to take a breath.

2.1 Implementation of MITalk

One of the most valuable resources I found in trying to understand how a text to speech system is built was written by the authors of the MITalk system (see Allen et al.). They documented how they implemented their real text to speech system. This was particularly useful since they walk through every stage of how they tackled the problem, and it is nearly unique as a detailed published description of an implementation. The other systems I looked at are all on the market now, so for competitive reasons detailed information on their implementation is not published. Another reason that MITalk is valuable to look at is that it was a groundbreaking project which paved the way (with the MITalk documentation as a road map) for all of the modern systems.

To introduce the steps involved in text to speech, I will summarize the process which MITalk uses; a rough code sketch of the overall pipeline follows the list of stages:

2.1.1 Preprocessing Stage
2.1.2 Syntactic Stage
2.1.3 Phonological Stage
2.1.4 Phonetic Stage
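
Below is a minimal sketch, in Python, of how these four stages fit together conceptually. The stage functions and their toy bodies are my own placeholders for illustration; they are not taken from MITalk and do no real linguistic work.

    # Hypothetical four-stage text-to-speech pipeline, loosely following the
    # MITalk organization listed above. Every stage here is a toy placeholder.

    def preprocess(text):
        # Expand numbers, abbreviations, and symbols into ordinary words.
        return text.replace("Dr.", "Doctor")  # toy example only

    def parse_syntax(text):
        # Assign rough parts of speech and phrase boundaries.
        return [(w, "NOUN" if w[0].isupper() else "OTHER") for w in text.split()]

    def to_phonemes(tagged_words):
        # Look each word up in a lexicon or apply letter-to-sound rules,
        # then add stress and other prosodic markers (stubbed out here).
        return ["/" + w.lower() + "/" for w, _tag in tagged_words]

    def synthesize(phonemes):
        # Turn the phonetic representation into waveform samples; a real
        # system would drive a formant or concatenative synthesizer here.
        return b""  # placeholder for audio data

    def text_to_speech(text):
        return synthesize(to_phonemes(parse_syntax(preprocess(text))))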

3.0 Analysis

3.1 Method

I looked for potentially problematic areas in each stage of the text to speech process. This includes aspects which MITalk didn’t address or which were troublesome in its implementation. The next section describes these issues and also how several of the modern text to speech systems handled a simple test related to each issue.

Due to the limited availability of the packages that exist, the testing I did was quite limited, and for the most part restricted to three systems: TrueTalk, produced by Entropic Research Laboratory; PlainTalk, produced by Apple Computer; and Laureate, produced by BT Labs. The PlainTalk system is available for Macintosh computers free of charge from Apple (it also comes as part of the Macintosh operating system). Since I use a Macintosh, access to this was quite easy. The Laureate and TrueTalk systems, on the other hand, I accessed via demos on their web sites. These are interactive demos to which you can submit text; they produce a sound file in several formats which you can then download. I couldn’t perform the same level of examination of the other two systems (DECTalk from Digital and EUROVOCS from ELIS) because they didn’t have a way to interactively try the product online, and purchasing the full versions was impossible due to cost and the hardware needed to run them.

3.2 Issues

3.2.1 Preprocessing Stage

First of all, while preprocessing the input, one must take into account the various pronunciations of numbers. The number 1,000,000 should be pronounced “one-million”. Amounts of money such as “$24.04” are pronounced “twenty-four dollars and four cents”. Digits after the decimal point must also be treated differently; for example, “2.10” is pronounced “two point one zero”. Lastly, years such as “1944” differ from standard numbers: “nineteen forty-four” vs. “one-thousand nine hundred and forty-four”.

This was handled in the text preprocessor of MITalk. This set of rules handles all variations of numbers, although it ignores that people can freely say either “one-thousand one-hundred” or “eleven hundred”.
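
As an illustration only (these are not MITalk’s actual rules), a preprocessor might separate these cases with a few pattern checks. The Python sketch below covers just the examples mentioned, and the cardinal speller is a stub.

    import re

    DIGIT_NAMES = "zero one two three four five six seven eight nine".split()

    def spell_cardinal(digits):
        # Stub: a real speller turns "1944" into "one-thousand nine hundred and forty-four".
        return "<cardinal " + digits + ">"

    def expand_number(token):
        # Hypothetical number expansion covering only the cases discussed above.
        if token.startswith("$") and "." in token:        # money: $24.04
            dollars, cents = token[1:].split(".")
            return spell_cardinal(dollars) + " dollars and " + spell_cardinal(cents) + " cents"
        if re.fullmatch(r"\d+\.\d+", token):              # decimal: 2.10 -> "two point one zero"
            whole, frac = token.split(".")
            return spell_cardinal(whole) + " point " + " ".join(DIGIT_NAMES[int(d)] for d in frac)
        if re.fullmatch(r"1\d{3}", token):                # likely a year: 1944
            return spell_cardinal(token[:2]) + " " + spell_cardinal(token[2:])
        return spell_cardinal(token.replace(",", ""))     # plain cardinal: 1,000,000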

The only number-related test I performed was passing “1,100” and “1100” to the PlainTalk system; it produced “one-thousand one-hundred” and “eleven hundred” respectively.

Abbreviations aren’t as simple as one might think. For instance, “Dr.” could mean “doctor” or “drive”. For the record, MITalk always treated “Dr.” as “doctor”. I experimented with this example on PlainTalk (with the Victoria, High Quality voice, since the behavior varies with the voice selected) and found that it can produce the correct output if the word before or after the abbreviation is capitalized. If the capitalization is missing, it produces “drive” unless there is no word before the abbreviation.
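
The observed behavior suggests a simple capitalization heuristic. The function below is my own hypothetical reconstruction of such a rule, not the rule PlainTalk or MITalk actually uses.

    def expand_dr(prev_word, next_word):
        # Hypothetical heuristic: a capitalized following word suggests a title,
        # a capitalized preceding word suggests a street name.
        if next_word and next_word[0].isupper():   # "Dr. Smith" -> a title
            return "doctor"
        if prev_word and prev_word[0].isupper():   # "Elm Dr." -> a street
            return "drive"
        return "doctor"                            # fall back to the more common reading

    # expand_dr("Elm", "is")   -> drive
    # expand_dr(None, "Smith") -> doctor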

3.2.2 Syntactic Stage

Handling intonation correctly, including parenthetical expressions, questions, and statements, also depends on the syntactic parser. Intonation gives the listener not only information about whether the phrase is a question or a declarative, but can also give clues about things like noun phrase boundaries, and it can be used to emphasize or “question” particular elements (the latter being at the semantic level).

I tried listening to a number of questions with PlainTalk and found it to be very poor at marking them. A declarative phrase and a question sounded exactly the same, except that in the question the voice would arc into a higher register during the last word, in a way which sounded quite unnatural. Laureate, however, sounded considerably more natural with the same question.
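
One simple way a system might mark a question is to raise the pitch contour toward the end of the utterance. The sketch below shows the general idea; the linear rise shape, the 20% tail, and the 1.3 factor are arbitrary assumptions of mine, not values taken from any of the systems tested.

    def apply_question_rise(f0_contour, rise_fraction=0.2, rise_factor=1.3):
        # f0_contour: a list of fundamental-frequency values (Hz), one per frame.
        n = len(f0_contour)
        start = int(n * (1.0 - rise_fraction))      # only the tail of the utterance rises
        out = list(f0_contour)
        for i in range(start, n):
            progress = (i - start) / max(1, n - 1 - start)
            out[i] = f0_contour[i] * (1.0 + (rise_factor - 1.0) * progress)
        return out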

3.2.3 Phonological Stage

The way these packages deal with stress seems to be fairly accurate. For instance, when I tried producing the name “Casey” and the letters “K.C.” with both PlainTalk and TrueTalk, the signal was stressed correctly, and I had no trouble distinguishing which was which. As for non-contrastive cases, none of the samples I listened to had stress that sounded awkward.

Another difficulty is handling homographs. This depends on having a robust syntactic parser and having recorded (in the lexicon, or by separate noun and verb rules) that the noun “project” is not pronounced like the verb “project”, or that “read” is pronounced differently as a main verb than as a participle. Both TrueTalk and PlainTalk were capable of distinguishing these two examples. TrueTalk was additionally able to distinguish somewhat more obscure homographs such as names, as in the phrase “Nice ([ni:s], the city in France) is a nice place to visit.”
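
A common way to record such distinctions is a lexicon keyed by both spelling and part of speech, so that the parser’s tag selects the pronunciation. The entries below are a hypothetical illustration using rough ARPAbet-style strings, not the contents of any of these systems’ lexicons.

    # Hypothetical part-of-speech-sensitive lexicon entries for two homographs.
    HOMOGRAPHS = {
        ("project", "NOUN"):       "P R AA1 JH EH0 K T",   # PROject
        ("project", "VERB"):       "P R AH0 JH EH1 K T",   # proJECT
        ("read",    "VERB"):       "R IY1 D",              # main verb: "I read every day"
        ("read",    "PARTICIPLE"): "R EH1 D",              # participle: "I have read it"
    }

    def pronounce(word, pos, lexicon=HOMOGRAPHS):
        # Fall back to letter-to-sound rules (stubbed here) when the pair is unknown.
        return lexicon.get((word.lower(), pos), "<letter-to-sound: " + word + ">")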

3.2.4 Phonetic Stage

Generating the speech signal after all of this processing has been done is the final step. It is also one of the most important, since it is what the listener hears. Every package that I have heard has some sort of robotic quality to the voice. My guess is that when the formants are produced, the computer is much too precise, and none of the minor variations which creep in when a person produces speech are present. If we had a complete picture of human auditory perception, it might be possible to find what is leading to the perceivable artifacts in the signal.
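
If that guess is right, one remedy would be to perturb the synthesis parameters slightly from frame to frame. The sketch below is purely speculative; the 1% figure is an arbitrary assumption, not a published recommendation.

    import random

    def add_natural_variation(formant_tracks, amount=0.01, seed=None):
        # formant_tracks: a list of frames, each a list of formant frequencies (Hz).
        # Each value is nudged by up to +/- `amount` so the output is less
        # mechanically precise.
        rng = random.Random(seed)
        return [[f * (1.0 + rng.uniform(-amount, amount)) for f in frame]
                for frame in formant_tracks]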

It is also worth noting that since the time that the MITalk system was written in 1979, the hardware to produce sound has become standard on personal computers. It is no longer worthwhile to make a proprietary device for this purpose. Thus, the actual sound production from the system isn’t an issue any longer.

3.2.5 Semantic Level

These systems use punctuation and some syntactic information to determine relative speed and pausing. In the systems that I listened to, the output is understandable but not quite natural. It seems that semantic information can affect these variables as well, and that is what is missing. For instance, commas and periods vary in how long a pause they introduce depending on the relationship between the phrases on either side.
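
As an illustration of the general approach, punctuation can be mapped to a base pause length and then scaled by context. The durations below are invented for the sketch; the context weight stands in for the semantic judgment that current systems lack.

    # Illustrative base pause lengths (milliseconds) keyed by punctuation mark.
    PAUSE_MS = {",": 150, ";": 250, ":": 250, ".": 400, "?": 400, "!": 400}

    def pause_after(mark, context_weight=1.0):
        # context_weight would be raised or lowered depending on how closely
        # the phrases on either side are related (a semantic judgment).
        return int(PAUSE_MS.get(mark, 0) * context_weight)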

It appears that TrueTalk is able to do some semantic-level processing. It is advertised as handling “There are senators and there are good senators.”, where the “good” is emphasized. This is probably just a matter of recognizing that the same head noun is being conjoined with itself and treating it as a special case, then applying stress to the adjective after the conjunction.

On the topic of recognizing which word is being “questioned” in a question, I think it is beyond the scope of any system to determine which particular word should receive stress. Unless it is marked in the written text (by bold or capital letters), it is nearly impossible to recognize what is being questioned, just as it is for a human reader except in certain contexts. This is really a tool we use solely in discourse.

4.0 Summary

None of the systems I heard were perfect, but they were for the most part understandable. In the future we will continue to learn more about how humans produce and perceive speech. This knowledge can then be applied to refining text to speech methods so that the output is closer to what human listeners expect, and thus sounds more natural. The more natural it sounds, the easier it is to understand, and the more pleasant it is to listen to.

5.0 References

5.1 Selected Text to Speech Packages: