PPT On Speaker Recognition
1. STUDY OF SPEAKER RECOGNITION
2. INTRODUCTION
Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in the input speech waves.
This technique makes it possible to use the speaker’s voice to verify their identity and control access to services such as voice dialing, voice mail, telephone shopping, security control for confidential information areas and many more.
3. OBJECTIVE
To extract, characterize and recognize the information in the speech signal that conveys speaker identity.
It consists of comparing a speech signal from an unknown speaker with stored data from known speakers. A system that has been trained with a number of speakers can then recognize which of them is talking: it determines who has spoken by matching the input signal against the pre-stored samples.
4. Principles of Speaker Recognition
Human speech contains numerous discriminative features that can be used to identify speakers. Speech carries significant energy from zero frequency up to around 5 kHz. The speech signal is a slowly time-varying signal, but when examined over a sufficiently short period of time its characteristics are fairly stationary. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.
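As a minimal sketch of this short-time view (the function name, frame length and hop are illustrative choices, not values from the slides), the snippet below splits a sampled signal into short frames and computes a per-frame log energy, the simplest kind of short-time characterization:

```python
import numpy as np

def short_time_log_energy(signal, frame_len=400, hop=160):
    """Per-frame log energy of a sampled signal.

    frame_len=400 and hop=160 correspond to 25 ms frames advanced every 10 ms
    at a 16 kHz sampling rate (illustrative values, not taken from the slides).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    energies = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energies.append(np.log(np.sum(frame ** 2) + 1e-10))  # +1e-10 avoids log(0)
    return np.array(energies)

# Example: a synthetic one-second signal sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) * np.hanning(fs)  # slowly varying amplitude envelope
print(short_time_log_energy(x)[:5])
```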
5. Speaker recognition methods can be divided into
text-independent
text-dependent
In a text-independent system, the task is to identify the person speaking irrespective of what is being said, whereas in a text-dependent system the speaker's identity is recognized from his or her speaking one or more specific phrases, such as passwords or PIN codes.
Here we describe a text-independent speaker identification system.
6. Speaker recognition comprises two tasks: identification and verification.
Speaker identification is the process of determining which registered speaker produced a given utterance, whereas verification is the process of accepting or rejecting the identity claim of a speaker.
Speaker recognition systems contain two main modules:
Feature extraction
Feature matching
Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker.
Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers.
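A minimal sketch of these two decision rules, with a toy Euclidean distance standing in for the real matching score and all names being illustrative rather than taken from the slides: identification returns the closest enrolled speaker, verification thresholds the distance to a single claimed speaker.

```python
def euclidean(a, b):
    # Toy distance between two feature vectors (stand-in for a real matching score)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def identify(test_vector, speaker_models):
    """Identification: return the registered speaker whose stored model is closest."""
    return min(speaker_models, key=lambda spk: euclidean(test_vector, speaker_models[spk]))

def verify(test_vector, claimed_model, threshold):
    """Verification: accept the identity claim only if the distance is small enough."""
    return euclidean(test_vector, claimed_model) <= threshold

# Toy usage: 2-D vectors stand in for the feature sets of enrolled speakers
models = {"speaker_A": [1.0, 2.0], "speaker_B": [4.0, 0.5]}
print(identify([1.1, 1.9], models))                    # -> 'speaker_A'
print(verify([1.1, 1.9], models["speaker_B"], 0.5))    # -> False
```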
7. All speaker recognition systems have two distinct phases:
Enrollment or training phase
It is the process of familiarizing the system with the voice characteristics of the speakers registering so that the system can build reference models for those speakers.
Input speech → feature extraction → generate reference model
Operational or testing phase
Testing is the actual recognition task. In this phase, the input speech is matched with stored reference models and a recognition decision is made.
Test speech → feature extraction → comparison with stored reference models → decision
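The two phases can be sketched as two small functions. This is only an outline under stated assumptions: the feature extractor below is a one-line stand-in for the MFCC front end of the following slides, and the mean-vector "model" is a placeholder for the VQ codebook described later; all names are illustrative.

```python
import numpy as np

reference_models = {}

def extract_features(speech, frame_len=400):
    # Stand-in for the MFCC front end on the following slides:
    # frame the signal and take per-frame log energy as a 1-D "feature".
    frames = speech[: len(speech) // frame_len * frame_len].reshape(-1, frame_len)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]

def enroll(speaker_id, training_speech):
    """Training phase: build and store a reference model for a registering speaker."""
    feats = extract_features(training_speech)
    reference_models[speaker_id] = feats.mean(axis=0)  # toy model; the deck uses a VQ codebook

def recognize(test_speech):
    """Testing phase: compare test features against every stored reference model."""
    feats = extract_features(test_speech).mean(axis=0)
    scores = {spk: np.linalg.norm(feats - m) for spk, m in reference_models.items()}
    return min(scores, key=scores.get)  # decision: the closest reference model
```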
8. Speech Feature extraction
It is the signal-processing front end:
The sampled speech signal is converted into a set of feature vectors that characterize the properties of speech which separate different speakers. This is performed in both the training and testing phases.
Here, the parametric representation of the speech signal uses Mel-Frequency Cepstrum Coefficients (MFCC).
MFCC is based on the human peripheral auditory system. The technique uses two types of filters, linearly spaced and logarithmically spaced, to capture the important characteristics of speech. This is expressed on the mel-frequency scale (linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz).
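The slide does not give a formula for the mel scale; a commonly used approximation from the MFCC literature is mel(f) = 2595 log10(1 + f/700), which is roughly linear below 1000 Hz and logarithmic above. The sketch below uses it to place filter centres at a constant mel spacing (the number of filters and the upper frequency are illustrative choices):

```python
import numpy as np

def hz_to_mel(f):
    # Common approximation: roughly linear below ~1000 Hz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centre frequencies of 20 filters spaced uniformly on the mel scale up to 4 kHz
centres_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(4000.0), 22)[1:-1]
print(np.round(mel_to_hz(centres_mel)))
```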
9. Mel-frequency cepstrum coefficients processor
The main purpose of the MFCC processor is to mimic the behaviour of the human ear. The input speech signal is sampled, and the sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion.
10. Framing
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N).
Windowing
Each individual frame is windowed so as to minimize the signal discontinuities at its beginning and end. The idea is to reduce spectral distortion by using the window to taper the signal to zero at the start and end of each frame. A Hamming window is used, which has the form w(n) = 0.54 - 0.46 cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1.
Fast Fourier Transform (FFT)
The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm for computing the Discrete Fourier Transform (DFT).
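A sketch of these three steps with NumPy; the frame length N, frame advance M and sampling rate are illustrative choices, not values fixed by the slides:

```python
import numpy as np

def frame_window_fft(signal, N=256, M=100):
    """Framing, Hamming windowing and FFT of a sampled speech signal.

    N - samples per frame
    M - frame advance (M < N, so consecutive frames overlap by N - M samples)
    Returns the magnitude spectrum of each frame.
    """
    n_frames = 1 + (len(signal) - N) // M
    window = np.hamming(N)        # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    spectra = []
    for i in range(n_frames):
        frame = signal[i * M : i * M + N] * window   # taper frame ends toward zero
        spectra.append(np.abs(np.fft.rfft(frame)))   # frequency-domain magnitude
    return np.array(spectra)

# Example on a synthetic 16 kHz signal
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
print(frame_window_fft(x).shape)   # (number of frames, N//2 + 1)
```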
11. Mel-frequency Wrapping
Human perception of the frequency content of sounds does not follow a linear scale. Thus, for each tone with an actual frequency f, a subjective pitch is measured on a scale called the 'mel' scale. The filter bank has a triangular bandpass frequency response, and the spacing is determined by a constant mel-frequency interval.
Cepstrum
In this final step, we convert the log mel spectrum back to time. The result is called the mel-frequency cepstrum coefficients (MFCC). Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT).
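A sketch of the mel wrapping and cepstrum steps, assuming the per-frame magnitude spectra produced by the frame_window_fft sketch above and the same mel-scale approximation as before; the filter count and number of cepstral coefficients (20 and 12) are typical but illustrative values:

```python
import numpy as np

def triangular_filterbank(n_filters=20, n_fft=256, fs=16000):
    """Triangular bandpass filters whose centres are spaced by a constant mel interval."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft / 2 + 1) * edges_hz / (fs / 2.0)).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:centre] = np.linspace(0.0, 1.0, centre - left, endpoint=False)
        fbank[i, centre:right] = np.linspace(1.0, 0.0, right - centre, endpoint=False)
    return fbank

def mfcc_from_spectra(spectra, n_filters=20, n_coeffs=12, fs=16000):
    """Mel wrapping + log + DCT: per-frame magnitude spectra -> MFCC vectors."""
    n_fft = (spectra.shape[1] - 1) * 2
    log_mel = np.log(spectra @ triangular_filterbank(n_filters, n_fft, fs).T + 1e-10)
    k = np.arange(n_coeffs)[:, None]             # cepstral coefficient index
    n = np.arange(n_filters)[None, :]            # mel filter index
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))  # DCT-II basis
    return log_mel @ dct.T
```

Chaining frame_window_fft with mfcc_from_spectra yields one MFCC vector per frame, which is the feature stream passed to the matcher described on the next slide.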
12. Feature matching
The vector quantization (VQ) approach is used for its ease of implementation and high accuracy. It is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its centre, called a codeword. The collection of codewords is called a codebook. The codebook effectively reduces the amount of data while preserving the essential information of the original distribution.
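The slides do not spell out how the codebook is trained or matched. A minimal sketch under stated assumptions: plain k-means clustering stands in for codebook training (the classical choice in VQ-based speaker recognition is the LBG algorithm), and matching scores a test utterance by its average distortion against each enrolled codebook; the codebook size and all names are illustrative.

```python
import numpy as np

def train_codebook(features, n_codewords=8, n_iters=20, seed=0):
    """Build a VQ codebook from training feature vectors with simple k-means."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), n_codewords, replace=False)]
    for _ in range(n_iters):
        # Assign each feature vector to its nearest codeword
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        # Move each codeword to the centroid of its cluster
        for k in range(n_codewords):
            members = features[nearest == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def average_distortion(features, codebook):
    """Average distance from each test vector to its nearest codeword; the
    identified speaker is the one whose codebook gives the smallest value."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()
```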
13. Thank You.