A Survey of Problems in Text to Speech by Computer
by Scott Jann
University of Minnesota
Linguistics 3971
Winter 1998
1.0 Introduction
One area of computational linguistics with many practical applications is speech synthesis. The ability to convert written text into an understandable speech signal is useful for many purposes. It is more natural for textual information on a computer to be spoken to the user than for the user to have to read the screen. Not only is it more natural; for users who cannot read the screen, such as the blind, it is the only option.
This paper looks at one particular piece of speech synthesis, as mentioned above: converting textual information into an audible speech signal. Specifically, it looks at unrestricted text to speech, that is, the ability to process any arbitrary text. Arbitrary text could be the result of a database query, the contents of a web page, or just a phrase entered by the user. When done by a person, this task is simply reading aloud.
The idea of unrestricted input is important because this task differs considerably from one with a restricted vocabulary. In the restricted-vocabulary case, it is a reasonable solution to simply record the required words or phrases. For example, a talking clock which announces the time on the hour could have one through twelve o’clock recorded, or a subway which announces its location could have the phrase “The train is now at…” recorded on a tape or stored in a computer chip, to be played before the name of each stop the train makes.
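The restricted-vocabulary approach can be sketched in a few lines. The clip filenames below are hypothetical, and real devices of the era used tape or ROM rather than files, but the idea is the same: playback reduces to selecting from a fixed list of recordings.

```python
# Sketch of a restricted-vocabulary talking clock: every utterance is
# assembled from a small fixed set of pre-recorded clips.
# The filenames are hypothetical placeholders for the recordings.
RECORDINGS = {n: f"clip_{n}_oclock.wav" for n in range(1, 13)}

def announce(hour):
    """Return the ordered list of sound clips to play for a whole hour."""
    if hour not in RECORDINGS:
        raise ValueError("only whole hours one through twelve are recorded")
    return ["clip_it_is_now.wav", RECORDINGS[hour]]
```

The limitation is exactly the one discussed next: the system can say nothing outside its recorded list.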
If someone took this approach to the task of synthesizing a speech signal for every phrase a user might enter, he would quickly run out of recording resources. Having bought all the resources for the first attempt before realizing it was futile, he might then decide that 10,000 words are sufficient for most purposes and set out to record someone saying those 10,000 words. He would be very disappointed with the result: a patchwork of discontinuous words. A phrase would sound either like a completely disjointed list of words, or, if the words were all recorded with the same tone, like a slow monotonous rambling.
The point of this description is to show that converting unrestricted text to speech is not at all a simple problem. When we speak, words blend into each other, certain syllables are stressed, tone falls over certain portions of a phrase. Not only do we do this, we expect it, and rely on it for understanding what is said. As above, it is not feasible to record every combination of how words can be arranged, much less how they can vary in stress and other ways. The solution is to build a speech signal from scratch by applying knowledge from every area of linguistics: phonetics, phonology, morphology, syntax, and semantics.
2.0 Background
Very roughly, synthesizing a speech signal means taking the input, converting it into detailed phonetic data, and then converting that into a waveform which can be sent to a speaker or stored for later playback. However, it isn’t as simple as converting the words to a phonetic representation and passing that into a mechanism which generates a speech signal; even that much would not be simple. For example, working with English often requires handling groups of letters in the written form which have no obvious relation to the actual pronunciation (e.g. “tough”), and on top of that there are homographs and foreign words which can’t possibly fit into a set of rules matching spelling to pronunciation. In addition to the raw sounds which make up how a word is pronounced, one must stress the appropriate syllables, add the intonation implied by the punctuation, and insert pauses both for punctuation and where a person would need to take a breath.
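The rough flow just described can be sketched as a chain of stages. The stage names and the toy stand-in implementations below are purely illustrative; no real system divides the work this crudely, and MITalk (described next) uses many more stages:

```python
# A minimal sketch of the text-to-speech flow, with toy stand-ins for
# each stage. Everything here is illustrative, not a real implementation.
def preprocess(text):
    # expand abbreviations, normalize case, split into words (toy version)
    return text.replace("Dr.", "doctor").lower().split()

def to_phonetic(words):
    # a real system uses a lexicon plus letter-to-sound rules; here the
    # spelling itself stands in for a phone string
    return [list(w) for w in words]

def add_prosody(phones):
    # crude default: mark the first segment of each word as stressed
    return [(p, 1 if i == 0 else 0) for word in phones for i, p in enumerate(word)]

def synthesize(segments):
    # a real synthesizer turns segments into a waveform; count them here
    return len(segments)
```

Each of the stages this sketch glosses over is examined in detail below.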
2.1 Implementation of MITalk
One of the most valuable resources I found in trying to understand how a text to speech system is built was written by the authors of the MITalk system (see Allen et al.). They documented how they implemented their real text to speech system, which was particularly useful since they walk through every stage of how they tackled the problem. It is also nearly unique as a detailed description of an implementation: the other systems I looked at are all on the market now, so, for the sake of competition, detailed information on their implementation is not published. Another reason MITalk is valuable to look at is that it was a groundbreaking project which paved the way (with the MITalk documentation as a road map) for all of the modern systems.
To introduce the steps involved in text to speech, I will summarize the process which MITalk uses:
2.1.1 Preprocessing Stage
- Text preprocessing handles things like breaking numbers into their written form and expanding abbreviations.
- Morphological analysis takes each word, finds its part(s) of speech, and breaks the word down into morphemes where possible. This breakdown consists of finding any affixes the word may contain. The stage employs a lexicon of nearly 12,000 morphemes, each entry carrying information on pronunciation and part of speech. It also takes into account morphemes like “fire” + “ing” which aren’t entirely recognizable in the surface form of “firing”.
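The “fire” + “ing” case can illustrate what such a decomposer must do: strip a candidate suffix and, when the bare stem is not in the lexicon, try restoring a final silent “e”. The tiny lexicon and suffix list below are illustrative, not MITalk’s actual 12,000-morpheme lexicon or rule set:

```python
# Hedged sketch of morpheme decomposition with silent-"e" restoration:
# "firing" is not "fire" + "ing" on the surface, so after stripping the
# suffix we also try the stem with "e" appended. Toy data, for illustration.
LEXICON = {"fire": "verb", "bark": "verb", "dog": "noun"}
SUFFIXES = ["ing", "ed", "s"]

def decompose(word):
    """Return (stem, suffix) if the word splits into known morphemes."""
    if word in LEXICON:
        return (word, None)
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            for candidate in (stem, stem + "e"):  # restore silent "e"
                if candidate in LEXICON:
                    return (candidate, suffix)
    return None
```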
2.1.2 Syntactic Stage
- The phrase-level parser attempts to syntactically parse the given input using an ATN (augmented transition network) grammar. The result is a tree describing the syntactic relationships in the phrase, such as [S [NP [Det the] [N dog]] [VP [V barked]]].
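Such a bracketed parse can be held in a simple nested structure; the sketch below uses nested tuples of (label, children…), with a leaf as (label, word). This is just one possible data-structure representation, not the ATN formalism MITalk actually used:

```python
# The parse [S [NP [Det the] [N dog]] [VP [V barked]]] as nested tuples.
tree = ("S",
        ("NP", ("Det", "the"), ("N", "dog")),
        ("VP", ("V", "barked")))

def leaves(node):
    """Collect the words at the fringe of the tree, left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]  # a leaf: (label, word)
    return [w for child in children for w in leaves(child)]
```

Later stages walk a structure like this to find phrase boundaries for pausing and intonation.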
2.1.3 Phonological Stage
- Morphophonemics and stress adjustment takes the output of the morphological analysis and phrase-level parsing and produces a stream of phonetic segments marked with stress and syllable information.
- Letter-to-sound and lexical stress finds a pronunciation for words that weren’t resolved in the previous stages. It employs a number of rules for converting letters to sounds, and rules for how the resulting sounds should be stressed.
- The phonological component applies syntactic information from the phrase-level parser to make any appropriate changes. It also applies several rules which allow it to resolve allophones, and it handles adding pauses for punctuation or to simulate taking a breath.
- The prosodic component assigns each segment a stress level and a duration.
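The letter-to-sound step can be illustrated with a toy set of context-sensitive spelling rules of the kind applied to words missing from the lexicon. The two rules and the phone symbols below are simplified for illustration; real rule sets (MITalk’s included) are far larger and more subtle:

```python
# Toy context-sensitive letter-to-sound rules, tried longest-match first:
#   "ph" -> /f/ ;  "c" -> /s/ before e, i, y, otherwise /k/ ;
#   any other letter stands for itself as a placeholder phone.
def letters_to_sounds(word):
    phones, i = [], 0
    while i < len(word):
        if word[i:i + 2] == "ph":             # digraph rule
            phones.append("f"); i += 2
        elif word[i] == "c":                   # context-sensitive rule
            nxt = word[i + 1] if i + 1 < len(word) else ""
            phones.append("s" if nxt in "eiy" else "k"); i += 1
        else:                                  # default rule
            phones.append(word[i]); i += 1
    return phones
```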
2.1.4 Phonetic Stage
- The fundamental frequency generator uses the stress information and part of speech information to generate a frequency for the voicing throughout the signal.
- The phonetic component creates parameters for every 5ms of the signal, to be fed to the synthesizer control mechanism. These parameters represent all of the voicing and formant frequencies for the signal.
- The Klatt formant synthesizer takes the signal parameters and generates an audio signal through a loudspeaker.
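The core building block of a Klatt-style formant synthesizer is the digital resonator: a second-order filter whose coefficients are derived from a formant’s centre frequency and bandwidth (the coefficient formulas below follow Klatt’s published description; the example frequency, bandwidth, and 10 kHz sample rate are merely illustrative). A cascade of several of these shapes the voicing source into a vowel-like spectrum:

```python
import math

# One formant resonator: y[n] = A*x[n] + B*y[n-1] + C*y[n-2], with
# coefficients set from the formant frequency F and bandwidth BW.
def resonator(source, freq_hz, bw_hz, sample_rate=10000):
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bw_hz * t)
    b = 2 * math.exp(-math.pi * bw_hz * t) * math.cos(2 * math.pi * freq_hz * t)
    a = 1.0 - b - c            # normalizes the gain to 1 at DC
    out, y1, y2 = [], 0.0, 0.0
    for x in source:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out
```

Fed an impulse, the filter rings at (roughly) the formant frequency and decays at a rate set by the bandwidth, which is exactly the behaviour of a vocal-tract resonance.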
3.0 Analysis
3.1 Method
I looked for potentially problematic areas in each stage of the text to speech process. This includes aspects which MITalk didn’t address or were troublesome in its implementation. The next section describes these issues and also how several of the modern text to speech systems handled a simple test related to the issue.
Because of the limited availability of the packages that exist, the testing I did was quite limited, and for the most part confined to three systems: TrueTalk, produced by Entropic Research Laboratory; PlainTalk, produced by Apple Computer; and Laureate, produced by BT Labs. The PlainTalk system is available for Macintosh computers free of charge from Apple (it also comes as part of the Macintosh operating system); since I use a Macintosh, access to it was quite easy. The Laureate and TrueTalk systems, on the other hand, I accessed via interactive demos on their web sites, to which you can submit text and then download the resulting sound file in several formats. I couldn’t examine the other two systems (DECTalk from Digital and EUROVOCS from ELIS) at the same level because they didn’t offer a way to try the product interactively online, and their cost and hardware requirements made purchasing the full versions impossible.
3.2 Issues
3.2.1 Preprocessing Stage
First of all, while preprocessing the input, one must take into account the various pronunciations of numbers. The number 1,000,000 should be pronounced “one million”. Amounts of money such as “$24.04” are pronounced “twenty-four dollars and four cents”. Digits after the decimal point must also be treated differently; for example, “2.10” is pronounced “two point one zero”. Lastly, years such as “1944” differ from standard numbers: “nineteen forty-four” vs. “one thousand nine hundred and forty-four”.
This was handled in the text preprocessor of MITalk. Its set of rules handles all these variations of numbers, although it ignores the fact that people can freely say either “one thousand one hundred” or “eleven hundred”.
The only test I ran related to numbers was passing “1,100” and “1100” to the PlainTalk system. It produced “one thousand one hundred” and “eleven hundred” respectively.
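One way a preprocessor could reproduce the “1,100” vs. “1100” distinction PlainTalk showed is to read comma-grouped numbers as quantities and bare four-digit numbers year-style. The sketch below is my own guess at such a rule, handles only integers up to 99,999, and ignores money and decimals entirely:

```python
# Toy number expander: "1,100" -> quantity reading, "1100" -> year reading.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digits(n):
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")

def say_number(text):
    if "," in text:                      # comma-grouped: read as a quantity
        n = int(text.replace(",", ""))
        parts = []
        if n // 1000:
            parts.append(two_digits(n // 1000) + " thousand")
        if (n % 1000) // 100:
            parts.append(ONES[(n % 1000) // 100] + " hundred")
        if n % 100:
            parts.append(two_digits(n % 100))
        return " ".join(parts)
    n = int(text)
    if 1000 <= n <= 9999:                # bare four digits: read year-style
        tail = "hundred" if n % 100 == 0 else two_digits(n % 100)
        return two_digits(n // 100) + " " + tail
    return two_digits(n)
```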
Abbreviations aren’t as simple as one might think. For instance, “Dr.” could mean “doctor” or “drive”. For the record, MITalk always treated “Dr.” as “doctor”. I experimented with this example on PlainTalk (with the Victoria, High Quality voice, since the behavior varies with the voice selected) and found that it can produce the correct output if the word before or after the abbreviation is capitalized. If the capitalization is missing, it produces “drive” unless there is no word before the abbreviation.
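One plausible heuristic along these lines: a capitalized word after the abbreviation suggests a title (“Dr. Smith” → doctor), while a capitalized word before it suggests a street name (“Elm Dr.” → drive). This is an illustrative rule of my own, not the rule PlainTalk or MITalk actually uses:

```python
# Hypothetical context heuristic for expanding the ambiguous "Dr.".
def expand_dr(words, i):
    """Expand words[i] == 'Dr.' by looking at its neighbours."""
    after = words[i + 1] if i + 1 < len(words) else ""
    before = words[i - 1] if i > 0 else ""
    if after[:1].isupper():
        return "doctor"      # "Dr. Smith"
    if before[:1].isupper():
        return "drive"       # "Elm Dr."
    return "doctor"          # fallback when context gives no signal
```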
3.2.2 Syntactic Stage
Handling intonation correctly, including parenthetical expressions, questions, and statements, depends on the syntactic parser as well. Intonation gives the listener not only information about whether the phrase is a question or a declarative, but can also give clues about things like noun phrase boundaries, and it can be used to emphasize or “question” particular elements (the latter being at the semantic level).
I tried listening to a number of questions with PlainTalk and found it to be very poor at marking them. A declarative phrase and a question sounded exactly the same, except that in the question the voice would arc into a higher register during the last word, in a way which sounded quite unnatural. Laureate, however, sounded considerably more natural with the same question.
3.2.3 Phonological Stage
The way these packages deal with stress seems to be quite accurate. For instance, when I tried producing the name “Casey” and the letters “K.C.” with both PlainTalk and TrueTalk, the signal was stressed correctly, and I had no trouble distinguishing which was which. As for non-contrastive sounds, none of the samples I listened to had any stress that sounded awkward.
Another difficulty is handling homographs. This depends on having a robust syntactic parser, and on having recorded (in the lexicon, or in separate noun and verb rules) that the noun “project” is not pronounced like the verb “project”, or that “read” is pronounced differently as a main verb than as a participle. Both TrueTalk and PlainTalk were capable of distinguishing these two examples. TrueTalk was additionally able to distinguish somewhat more obscure homographs, such as names in the phrase “Nice ([ni:s], the city in France) is a nice place to visit.”
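The lexicon side of this can be sketched as a table keyed on word plus part of speech, with the parser’s tag selecting the pronunciation. The ARPAbet-style phone strings and stress digits below are rough illustrations, not entries from any of these systems’ lexicons:

```python
# Sketch of POS-driven homograph resolution: the pronunciation chosen
# depends on the tag the syntactic parser assigned to the word.
HOMOGRAPHS = {
    ("project", "noun"):       "P R AA1 JH EH0 K T",  # PROject
    ("project", "verb"):       "P R AH0 JH EH1 K T",  # proJECT
    ("read",    "verb"):       "R IY1 D",             # present tense
    ("read",    "participle"): "R EH1 D",             # past participle
}

def pronounce(word, pos):
    """Look up a pronunciation; fall back to the spelling if unknown."""
    return HOMOGRAPHS.get((word, pos), word)
```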
3.2.4 Phonetic Stage
Generating the speech signal after all of this processing is the final step. It is also one of the most important, since it is what the listener hears. Every package that I have heard has some sort of robotic quality to the voice. My guess is that when the formants are produced, the computer is much too precise, and none of the minor variations that creep into human speech are present. If we had a complete picture of human auditory perception, it might be possible to find what is producing the perceivable artifacts in the signal.
It is also worth noting that since the time that the MITalk system was written in 1979, the hardware to produce sound has become standard on personal computers. It is no longer worthwhile to make a proprietary device for this purpose. Thus, the actual sound production from the system isn’t an issue any longer.
3.2.5 Semantic Level
These systems use punctuation and some syntactic information to determine relative speed and pausing. In the systems that I listened to, the output is understandable, but not quite natural. It seems that semantic information can affect these variables as well, and that is what is missing. For instance, commas and periods vary in how long a pause they produce depending on the relationship between the phrases on either side.
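A punctuation-driven pause model of the kind these systems apparently use might look like the sketch below. The base durations are invented round numbers, and the clause-length scaling is only a crude stand-in for the semantic relatedness argued to be missing:

```python
# Toy pause model: punctuation maps to a base pause in milliseconds,
# lengthened when both surrounding clauses are long. All values invented.
BASE_PAUSE_MS = {",": 150, ";": 250, ".": 400, "?": 400}

def pause_after(punct, words_before, words_after):
    base = BASE_PAUSE_MS.get(punct, 0)
    if min(words_before, words_after) > 8:   # long clauses -> longer break
        return int(base * 1.5)
    return base
```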
It appears that TrueTalk is able to do some semantic-level processing. It boasts of handling “There are senators and there are good senators.”, where “good” is emphasized. This is probably just a special case: the system recognizes that the same head noun is conjoined with itself, and applies stress to the adjective after the conjunction.
On the topic of recognizing which word is being “questioned” in a question, I think determining which particular word to stress is beyond the scope of any system. Unless it is marked in the written text (by bold type or capital letters), it is nearly impossible to recognize what is being questioned, just as it is for a human reader except in certain contexts. This is really a tool we use solely in discourse.
4.0 Summary
None of the systems I heard were perfect, but they were for the most part understandable. In the future we will continue to learn more about how humans produce and perceive speech, and this can be applied to refining text to speech methods so that their output is closer to what human listeners expect, and thus sounds more natural. The more natural the output sounds, the easier it is to understand, and the more pleasant it is to listen to.
5.0 References
- Allen, Jonathan, et al. From text to speech: The MITalk system, Cambridge University Press, Cambridge, 1987.
- Bristow, Geoff. Electronic speech synthesis, McGraw-Hill, New York, 1984.
- O’Grady, William, et al. Contemporary Linguistics, St. Martin’s Press, New York, 1991.