Voice Recognition
Introduction
Voice recognition is a technology that allows a user to use his or her voice as an input device. Voice recognition may be used to dictate text into the computer or to give commands to the computer (such as opening application programs, pulling down menus, or saving work).

Older voice recognition applications require each word to be separated by a distinct pause, which allows the machine to determine where one word ends and the next begins. These kinds of voice recognition applications are still used to navigate the computer and to operate applications such as web browsers or spreadsheets.

Newer voice recognition applications allow a user to dictate text fluently into the computer. These newer applications can recognize speech at up to 160 words per minute. Applications that allow continuous speech are generally designed to recognize and format text, rather than to control the computer system itself.

Voice recognition uses a neural net to "learn" to recognize your voice. As you speak, the voice recognition software remembers the way you say each word. This customization allows voice recognition to work even though everyone speaks with varying accents and inflections.

In addition to learning how you pronounce words, voice recognition software also uses grammatical context and frequency of use to predict the word you wish to input. These powerful statistical tools allow the software to narrow down the massive language database before you even finish speaking the next word.
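As a toy illustration of how grammatical context and frequency of use can narrow the candidate list before acoustic matching finishes, the sketch below ranks acoustically plausible words by bigram counts. The words and counts are invented for illustration, not taken from any real language model.

```python
# Sketch: rank acoustically plausible candidates by how often they
# follow the previous word. The bigram counts are invented.
BIGRAM_COUNTS = {
    ("the", "quick"): 50,
    ("the", "queen"): 30,
    ("the", "quiche"): 2,
}

def rank_candidates(previous_word, candidates):
    """Order candidate words by frequency of following previous_word."""
    return sorted(candidates,
                  key=lambda w: BIGRAM_COUNTS.get((previous_word, w), 0),
                  reverse=True)

result = rank_candidates("the", ["quiche", "queen", "quick"])
assert result == ["quick", "queen", "quiche"]  # most frequent continuation first
```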

While the accuracy of voice recognition has improved over the past few years, some users still experience problems with accuracy, either because of the way they speak or the nature of their voice.

How it Works
Voice recognition technology utilizes the distinctive aspects of the voice to verify the identity of individuals. Voice recognition is occasionally confused with speech recognition, a technology which translates what a user is saying (a process unrelated to authentication). Voice recognition technology, by contrast, verifies the identity of the individual who is speaking. The two technologies are often bundled – speech recognition is used to translate the spoken word into an account number, and voice recognition verifies the vocal characteristics against those associated with this account.

Voice recognition can utilize any audio capture device, including mobile and land telephones and PC microphones. The performance of voice recognition systems can vary according to the quality of the audio signal as well as variation between enrollment and verification devices, so acquisition normally takes place on a device likely to be used for future verification.

During enrollment, an individual is prompted to select a passphrase or to repeat a sequence of numbers. The passphrase selected should be approximately 1-1.5 seconds in length - very short passphrases lack enough identifying data, and long passphrases have too much, both resulting in reduced accuracy. The individual is generally prompted to repeat the passphrase or number set a handful of times, making the enrollment process somewhat longer than that of most other biometrics.
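The 1-1.5 second window can be enforced with a simple duration check. The sketch below assumes a telephone-grade 8 kHz sampling rate, which is an assumption on our part rather than something the text specifies.

```python
# Sketch of the enrollment-length check: passphrases of roughly
# 1-1.5 seconds carry enough identifying data without hurting accuracy.
# The 8 kHz sample rate is an assumed telephony default.

def passphrase_length_ok(num_samples, sample_rate=8000,
                         min_sec=1.0, max_sec=1.5):
    """Return True if the recorded utterance falls in the target window."""
    duration = num_samples / sample_rate
    return min_sec <= duration <= max_sec

assert passphrase_length_ok(10000)      # 1.25 s at 8 kHz -> acceptable
assert not passphrase_length_ok(4000)   # 0.5 s -> too short
assert not passphrase_length_ok(20000)  # 2.5 s -> too long
```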

Strengths and Weaknesses
One of the challenges facing large-scale implementations of biometrics is the need to deploy new hardware to employees, customers and users. One strength of telephony-based voice recognition implementations is that they are able to circumvent this problem, especially when they are implemented in call center and account access applications. Without additional hardware at the user end, voice recognition systems can be installed as a subroutine through which calls are routed before access to sensitive information is granted. The ability to use existing telephones means that voice recognition vendors have hundreds of millions of authentication devices available for transactional usage today.

Similarly, voice recognition is able to leverage existing account access and authentication processes, eliminating the need to introduce unwieldy or confusing authentication scenarios. Automated telephone systems utilizing speech recognition are currently ubiquitous due to the savings possible by reducing the number of employees necessary to operate call centers. Voice recognition and speech recognition can function simultaneously using the same utterance, allowing the technologies to blend seamlessly. Voice recognition can function as a reliable authentication mechanism for automated telephone systems, adding security to automated telephone-based transactions in areas such as financial services and health care.

Though inconsistent with many users’ perceptions, certain voice recognition technologies are highly resistant to imposter attacks, even more so than some fingerprint systems. While false non-matching can be a common problem, this resistance to false matching means that voice recognition can be used to protect reasonably high-value transactions.

Since the technology has not been traditionally used in law enforcement or tracking applications where it could be viewed as a Big Brother technology, there is less public fear that voice recognition data can be tracked across databases or used to monitor individual behavior. Thus, voice recognition largely avoids one of the largest hurdles facing other biometric technologies, that of perceived invasiveness.

Voice Recognition Applications
Voice recognition is a strong solution for implementations in which vocal interaction is already present. It is not a strong solution when speech is introduced as a new process. Telephony is the primary growth area for voice recognition, and will likely be by far the most common area of implementation for the technology. Telephony-based applications for voice recognition include account access for financial services, customer authentication for service calls, and challenge-response implementations for house arrest and probation-related authentication. These solutions route callers through enrollment and verification subroutines, using vendor-specific hardware and software integrated with an institution's existing infrastructure.

Voice recognition has also been implemented in physical access solutions for border crossing, although this is not the technology's ideal deployment environment.

Voice Recognition Market Size
Though revenues from the technology are relatively small today, voice recognition will draw substantially greater revenues through 2007. Most likely to be deployed in telephony-based environments (such as account access for financial services and customer authentication for service calls), voice recognition revenues are projected to grow from $12.2m in 2002 to $142.1m in 2007. Voice recognition revenues are expected to comprise approximately 4% of the entire biometric market.

Voice Verification in Telephone Banking
Telephone banking is increasingly popular with customers, and will be increasingly attractive to banks and other financial institutions as they start to implement highly cost effective automated speech recognition technology to handle routine transactions (the subject of another "financial futures" web page).

But the procedures for verifying customers over the telephone are unsatisfactory, both in terms of customer convenience and also, increasingly, from a security point of view.

The Problem

The usual approach to verifying customers - proving that they are who they claim to be - is to use some sort of PIN or password. To avoid the customer having to say the password out loud, they are usually prompted for, say, the second and fourth letters in the password.

There are several problems with this approach:
* Firstly, passwords and PINs are difficult to remember and unwieldy for customers to use in this manner.
* Secondly, it takes time - identification and verification of the caller is often the lengthiest component of a transaction and this translates directly to the bottom line.
* Thirdly, the security itself leaves a lot to be desired - many customers write down their passwords or reveal them to the operator (in extreme cases they may self select the same PIN that they use for ATM withdrawals). Many call centres prompt the caller for additional 'secret' items such as their mother's maiden name, but this only exacerbates the other two problems.
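The partial-password prompt described above amounts to picking a few random character positions and checking only those. A minimal sketch, with an illustrative password and positions:

```python
import random

# Sketch of the partial-password prompt: the caller is asked for, say,
# the second and fourth characters rather than the whole password, so it
# is never spoken in full. Example values are illustrative only.

def pick_positions(password, k=2):
    """Choose k distinct 1-based character positions to challenge."""
    return sorted(random.sample(range(1, len(password) + 1), k))

def check_response(password, positions, answers):
    """Verify the caller's answers against the chosen positions."""
    return all(password[p - 1].lower() == a.lower()
               for p, a in zip(positions, answers))

# Positions [2, 4] of "sunflower" are 'u' and 'f'.
assert check_response("sunflower", [2, 4], ["u", "f"])
assert not check_response("sunflower", [2, 4], ["u", "x"])
```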

The Solution? Voice Verification
Technology now exists which enables individuals to be reliably, rapidly and cost-effectively verified on the basis of the physical characteristics of their voice.

Several vendors now supply commercial voice verification technology. A good example is Nuance Communications, based in California, using essentially the same technology that underlies their speaker-independent speech recognition software. But in this case recognition is speaker dependent - the customer is only allowed to use the system if their individual voiceprint matches their identity (normally established through an account number).

A new customer automatically enrolls in the system over the telephone by repeating about ten four-digit numbers or reading a short piece of text. The software extracts from this a number of physical characteristics which are unique to that voice. In all subsequent transactions, the caller, once identified, is asked to repeat a couple of randomly generated PINs or, for example, names of cities (this is to prevent fraudsters from tape-recording a customer saying their password or PIN). If the voiceprint matches the one stored against the account number the transaction proceeds; if not, the customer is referred to a supervisor.
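A highly simplified sketch of that challenge-response flow follows. The random four-digit challenge defeats tape-recording replay; the "voiceprints" and the Euclidean threshold are stand-ins for a vendor's real feature extraction and matcher, with all numbers invented.

```python
import random

def random_challenge():
    """Generate a random 4-digit string so a recording of a fixed
    passphrase cannot simply be replayed."""
    return "".join(random.choice("0123456789") for _ in range(4))

def verify(stored_print, live_print, threshold=1.0):
    """Accept if the live voiceprint is close enough to the enrolled one.
    Euclidean distance and the threshold are illustrative stand-ins."""
    dist = sum((a - b) ** 2 for a, b in zip(stored_print, live_print)) ** 0.5
    return dist <= threshold

challenge = random_challenge()
assert len(challenge) == 4 and challenge.isdigit()
assert verify([1.0, 2.0], [1.1, 2.1])      # small deviation -> accept
assert not verify([1.0, 2.0], [5.0, 9.0])  # mismatch -> refer to supervisor
```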

Pilot tests of the technology are encouraging. High verification accuracy can be combined with a low probability of false rejection, which is suitable for most banking operations, and the whole procedure is faster, easier and much more cost effective. Surprisingly, only a few kilobytes of storage are required for each voiceprint, and because the claimed identity of the customer is already established, a single comparison is all that is required, so verification is quite rapid (using the same technology for voice identification is of course much slower, since the system must find a match among many voiceprints).
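The speed argument above - one comparison for verification versus a scan over every enrolled voiceprint for identification - can be made concrete with a toy database. The 2-D "voiceprints" and account names are illustrative stand-ins.

```python
# Sketch contrasting verification (one comparison against the claimed
# account's voiceprint) with identification (an O(N) search over the
# whole database). Voiceprints here are toy 2-D vectors.

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def identify(live_print, database):
    """Identification: scan every enrolled voiceprint for the closest match."""
    return min(database, key=lambda acct: dist(database[acct], live_print))

db = {"acct1": [0.0, 0.0], "acct2": [5.0, 5.0]}
assert identify([0.2, 0.1], db) == "acct1"
assert identify([4.8, 5.1], db) == "acct2"
```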

Voice verification is particularly appropriate for automated speech recognition dialogues and we expect that a seamless combination of the two technologies will rapidly become the norm for most simple telephone banking transactions.

Of course voice verification is much less applicable to other delivery channels such as branch banking or screen-based systems (although pilot systems have been built). For an intriguing new approach to customer verification over the Internet based on face recognition, see the "financial futures" web page on Passfaces or check out the ID-Arts web site.

Details
The speaker-specific characteristics of speech are due to differences in physiological and behavioral aspects of the speech production system in humans. The main physiological aspect of the human speech production system is the vocal tract shape. The vocal tract is generally considered as the speech production organ above the vocal folds, which consists of the following:
* laryngeal pharynx (beneath the epiglottis),
* oral pharynx (behind the tongue, between the epiglottis and velum),
* oral cavity (forward of the velum and bounded by the lips, tongue, and palate),
* nasal pharynx (above the velum, rear end of nasal cavity), and
* nasal cavity (above the palate and extending from the pharynx to the nostrils).
The shaded area in figure 1 depicts the vocal tract.


The vocal tract modifies the spectral content of an acoustic wave as it passes through it, thereby producing speech. Hence, it is common in speaker verification systems to make use of features derived only from the vocal tract. In order to characterize the features of the vocal tract, the human speech production mechanism is represented as a discrete-time system of the form depicted in figure 2.


The acoustic wave is produced when the airflow from the lungs is carried by the trachea through the vocal folds. This source of excitation can be characterized as phonation, whispering, frication, compression, vibration, or a combination of these. Phonated excitation occurs when the airflow is modulated by the vocal folds. Whispered excitation is produced by airflow rushing through a small triangular opening between the arytenoid cartilage at the rear of the nearly closed vocal folds. Frication excitation is produced by constrictions in the vocal tract. Compression excitation results from releasing a completely closed and pressurized vocal tract. Vibration excitation is caused by air being forced through a closure other than the vocal folds, especially at the tongue. Speech produced by phonated excitation is called voiced, that produced by phonated excitation plus frication is called mixed voiced, and that produced by other types of excitation is called unvoiced.

It is possible to represent the vocal tract in a parametric form as the transfer function H(z). In order to estimate the parameters of H(z) from the observed speech waveform, it is necessary to assume some form for H(z). Ideally, the transfer function should contain poles as well as zeros. However, if only the voiced regions of speech are used, then an all-pole model for H(z) is sufficient. Furthermore, linear prediction analysis can be used to efficiently estimate the parameters of an all-pole model. Finally, it can also be noted that the all-pole model is the minimum-phase part of the true model and has an identical magnitude spectrum, which contains the bulk of the speaker-dependent information.
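As a sketch of the linear prediction analysis mentioned above, the autocorrelation method with the Levinson-Durbin recursion estimates the all-pole coefficients. The decaying sinusoid and the model order below are illustrative test inputs; a real system would analyze windowed frames of voiced speech.

```python
import math

def autocorr(x, order):
    """Deterministic autocorrelation r[0..order] of signal x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a[1..order]
    of the all-pole model (assumes r[0] > 0)."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        err *= (1 - k * k)
    return a[1:], err

# A decaying sinusoid is exactly an order-2 all-pole impulse response:
# x[n] = 2r*cos(w)*x[n-1] - r^2*x[n-2], here with r = 0.95, w = 0.3.
signal = [math.sin(0.3 * n) * 0.95 ** n for n in range(200)]
coeffs, residual = levinson_durbin(autocorr(signal, 2), 2)
```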

The above discussion also underlines the text-dependent nature of the vocal-tract models. Since the model is derived from the observed speech, it is dependent on the speech. Figure 3 illustrates the differences in the models for two speakers saying the same vowel.


Choice of features
The LPC features were very popular in the early speech-recognition and speaker-verification systems. However, comparison of two LPC feature vectors requires the use of computationally expensive similarity measures such as the Itakura-Saito distance and hence LPC features are unsuitable for use in real-time systems. Furui suggested the use of the cepstrum, defined as the inverse Fourier transform of the logarithm of the magnitude spectrum, in speech-recognition applications. The use of the cepstrum allows for the similarity between two cepstral feature vectors to be computed as a simple Euclidean distance. Furthermore, Atal has demonstrated that the cepstrum derived from the LPC features results in the best performance in terms of FAR and FRR for a speaker verification system. Consequently, we have decided to use the LPC derived cepstrum for our speaker verification system.
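The two steps named above - converting LPC coefficients to cepstral coefficients, then comparing cepstral vectors with a plain Euclidean distance - can be sketched as follows. The coefficient values are invented for illustration.

```python
# Sketch: derive cepstral coefficients from LPC predictor coefficients
# using the standard recursion, then score two vectors with the cheap
# Euclidean distance that makes the cepstrum attractive for real time.

def lpc_to_cepstrum(a, n_ceps=None):
    """c[n] = a[n] + sum_{k=1}^{n-1} (k/n) c[k] a[n-k], for n = 1..n_ceps."""
    p = len(a)
    n_ceps = n_ceps or p
    c = []
    for n in range(1, n_ceps + 1):
        term = a[n - 1] if n <= p else 0.0
        term += sum((k / n) * c[k - 1] * a[n - k - 1]
                    for k in range(1, n) if n - k <= p)
        c.append(term)
    return c

def euclidean(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

c1 = lpc_to_cepstrum([1.2, -0.5])  # c = [1.2, -0.5 + 0.5*1.2*1.2] = [1.2, 0.22]
c2 = lpc_to_cepstrum([1.1, -0.4])
score = euclidean(c1, c2)          # small distance -> similar vocal tracts
```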

Speaker Modeling
Using cepstral analysis as described in the previous section, an utterance may be represented as a sequence of feature vectors. Utterances spoken by the same person but at different times result in similar yet different sequences of feature vectors. The purpose of voice modeling is to build a model that captures these variations in the extracted set of features.

Two types of models have been used extensively in speaker verification and speech recognition systems: stochastic models and template models. The stochastic model treats the speech production process as a parametric random process and assumes that the parameters of the underlying stochastic process can be estimated in a precise, well-defined manner. The template model attempts to model the speech production process in a non-parametric manner by retaining a number of sequences of feature vectors derived from multiple utterances of the same word by the same person. Template models dominated early work in speaker verification and speech recognition because the template model is intuitively more reasonable. However, recent work in stochastic models has demonstrated that these models are more flexible and hence allow for better modeling of the speech production process.

A very popular stochastic model for modeling the speech production process is the Hidden Markov Model (HMM). HMMs are extensions of the conventional Markov models, wherein the observations are a probabilistic function of the state, i.e., the model is a doubly embedded stochastic process where the underlying stochastic process is not directly observable (it is hidden). The HMM can only be viewed through another set of stochastic processes that produce the sequence of observations. Thus, the HMM is a finite-state machine, where a probability density function p(x | s_i) is associated with each state s_i. The states are connected by a transition network, where the state transition probabilities are a_{ij} = p(s_j | s_i).
A fully connected three-state HMM is depicted in figure 4.

For speech signals, another type of HMM, called a left-right model or a Bakis model, is found to be more useful. A left-right model has the property that as time increases, the state index increases or stays the same; that is, the system states proceed from left to right. Since the properties of a speech signal change over time in a successive manner, this model is very well suited for modeling the speech production process.
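The structural difference between a fully connected (ergodic) model and a left-right (Bakis) model shows up directly in the transition matrix: in a left-right model every entry below the diagonal is zero. A small sketch with illustrative probabilities:

```python
def is_left_right(A):
    """True if transition matrix A[i][j] is zero whenever j < i,
    i.e. the state index can never decrease."""
    n = len(A)
    return all(A[i][j] == 0.0 for i in range(n) for j in range(n) if j < i)

# A 3-state left-right model: each state may self-loop or move forward.
A_bakis = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
# A fully connected (ergodic) model allows any state-to-state move.
A_ergodic = [
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
    [0.4, 0.1, 0.5],
]
assert is_left_right(A_bakis)
assert not is_left_right(A_ergodic)
```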

Pattern Matching
The pattern matching process involves the comparison of a given set of input feature vectors against the speaker model for the claimed identity and computing a matching score. For the Hidden Markov models discussed above, the matching score is the probability that a given set of feature vectors was generated by the model.
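That matching score - the probability that the model generated the observed feature vectors - is classically computed with the forward algorithm. The sketch below uses discrete emissions for brevity, whereas a real system would use the continuous densities p(x | s_i); all probabilities here are invented.

```python
def forward(A, B, pi, obs):
    """Forward algorithm: P(obs | model) for an HMM with transition
    probabilities A[i][j], discrete emission probabilities B[i][o],
    and initial state probabilities pi."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# A toy 2-state model scoring a 3-symbol observation sequence.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
score = forward(A, B, pi, [0, 1, 0])  # = 0.099375
```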

