AUTOMATIC SPEECH RECOGNITION APPROACH FOR DIVERSE VOICE COMMANDS

Abstract: The aim of this work is to investigate and implement speech recognition algorithms. MATLAB is used to program and simulate the proposed algorithms. Two systems are created in this work. The first system relies on the shape of the cross-correlation plot, and the second performs speech recognition using the Wiener filter. In the simulations, the spoken words are recorded using a microphone. When the same speaker makes all three recordings, the success rate of this approach is very high. Thus, the designed systems work accurately for basic speech recognition.


INTRODUCTION
Speech recognition is considered one of the great inventions enabled by present-day computer systems and other pervasive devices. It has created a new world of opportunities and challenges for software and hardware developers around the globe, especially those building IVRs and other telephony applications. Building such large speech recognition applications has given rise to many internal and external difficulties. Rather than pressing buttons on a computer or a smart pervasive device and viewing output on a screen, modern users speak to the computer via a microphone. This creates a level of uncertainty in the input itself, as the automatic speech recognition process uses uncertainties or likelihoods to arrive at a recognition result. These processes and methods bring both strengths and weaknesses to the speech recognition process. The most obvious weakness is the uncertainty in the speech input, namely the potential for misrecognition. Developing a speech recognition software module requires substantial judgment, effort and care, yet there are always instances when the application misrecognizes the user's speech input. A speech recognition module therefore needs more extensive error handling than other software applications. If the confidence score of a specific recognition result is low, it becomes important to confirm the user's input. The system may have to ask users to repeat the input so as to raise the confidence score. Often the user's speech is not understood by the system because of a noisy environment. If a speech engine returns low confidence values for the same user several times, it may be necessary to transfer that user to a human operator so that the user can complete his or her transaction.
Speech recognition has become a household application nowadays. Speech recognition devices are built into modern electronic gadgets. The Internet is flooded with audio data and software for speech detection and recognition. Speech makes it more convenient to operate electronic systems than typing on a keyboard or operating buttons. Voice recognition systems now have numerous applications that require interconnection, such as automatic call processing, query-based information systems, weather reports, etc. [1].
With speech recognition systems, human life is becoming better and more convenient. The past decade has seen dramatic progress in voice recognition technology, to such an extent that high-performance algorithms and systems have become accessible. This raises the efficiency of daily life and makes people's lives more convenient. Speech recognition is the technology with the help of which a computer can identify the components of human speech [3]. Speech recognition is one of the many available biometric recognition schemes [4]. The process begins by capturing the spoken utterance with a microphone and ends with the recognized words being output by the system. Speech is technically defined as a sequence of basic units called phonemes [5]. Automated Speech Recognition (ASR) systems convert the analog speech signals received through microphones into digital signals, which are then segmented to recover phonemes. The ASR system refers to the vocabulary and grammar rules to decode words or phrases from the phoneme sequence.

SPEECH SIGNAL REPRESENTATION
The speech signal carries information not only about the words or message being spoken but also about the identity of the speaker. Speaker identification is done by representing the speech signal in terms of certain features, which are grouped into feature vectors that serve to decrease the dimension and redundancy of the input to the speaker identification system while retaining the speaker-specific information. However, information that is irrelevant to speaker discrimination is a common problem for all feature sets, and it is the topic of ongoing research that strives to determine feature sets of low complexity that can be applied to speaker identification [15]. The nature of a feature set depends on which part of the speech signal the features are expected to portray, and thus on the type of information to be extracted. For this reason, feature sets can be grouped into source-based features and system-based features. The source is the actual sound wave that is transmitted from the diaphragm through the glottis, so these features are used to determine the characteristics of the vocal cords, where this waveform is shaped. The fundamental frequency is the most readily determined parameter. The system characteristics can be extracted for the vocal tract, the nasal cavity and the lip radiation. These features model the filter characteristics of the vocal tract and can be derived from information contained in both voiced and unvoiced speech. The system features reflect the physiology of the speaker. For every feature extraction method, it is important to know exactly what is being extracted, in order to avoid loss of accuracy and ambiguity.
When performing signal processing analysis, the DC level of the target signal is not very useful unless the signal is applied to a real analog circuit, such as an A/D converter, which has a supply voltage requirement. The DC level is likewise of little use when analyzing signals in the frequency domain. Sometimes the magnitude of the DC level in the frequency domain interferes with the analysis, when the target signal is concentrated mostly in the lower frequency band. Under the wide-sense stationary (WSS) condition for a stochastic process, the mean and variance of the signal do not change over time. The author therefore reduces this effect by subtracting the mean value from the recorded signals, which removes the zero-frequency component of the DC level from the frequency spectrum [2]. A spectrogram, which is a short-time Fourier transform, is used to show the energy of a signal as a function of time and frequency [8], thus allowing us to locate areas of energy in the speech signal. It represents only the amplitude of the speech signal, as no phase information is retained. Phase information is not important for discrimination between speakers [7], so it can be omitted to simplify the calculations; that is, only the magnitude of the spectrum of the speech signal is used.
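The mean-subtraction step can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code; the function name `remove_dc` and the sample values are hypothetical. Since the zero-frequency bin of a discrete Fourier transform equals the sum of the samples, a signal whose mean has been removed has a (numerically) zero DC component.

```python
def remove_dc(signal):
    """Subtract the mean so that the zero-frequency (DC) component vanishes."""
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]

# Hypothetical recorded samples riding on a DC offset of about 0.55.
recorded = [0.5, 0.7, 0.4, 0.6]
centered = remove_dc(recorded)

# DFT bin 0 is the plain sum of the samples, so it is ~0 after centering.
dc_after = sum(centered)
```

The shape of the waveform is unchanged; only its offset is removed, so the remaining spectrum is not disturbed by a large value at zero frequency.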

SPEECH RECOGNITION PROCESS
The speech recognition process requires frequency analysis. The frequency analysis is carried out in MATLAB using the following processes.

i. Spectrum Normalization
Normalization maintains a common measurement standard, since comparing spectrums measured on different scales makes it difficult to assess the differences between speech signals. Normalization therefore reduces the error when comparing the spectrums, which benefits speech recognition [11]. Once the absolute values of the FFT have been normalized, the next step in programming the speech recognition is to observe the spectrums of the recorded signals. Finally, the algorithms are compared based on the differences between the test (target) signal and the training (reference) signals [10]. So, before comparing the spectrum differences for different words, the first step is to normalize the spectrum using linear normalization.
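A minimal Python sketch of this step follows, offered as an illustration and not as the authors' MATLAB code. It assumes that the linear normalization is the min-max form that maps the magnitude spectrum into [0, 1]; the function names and the sample signal are hypothetical, and a naive DFT stands in for the FFT so that the example needs only the standard library.

```python
import cmath

def magnitude_spectrum(signal):
    """Magnitude |X(k)| of a naive O(N^2) DFT; stands in for abs(fft(x))."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        acc = sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc))
    return spectrum

def linear_normalize(values):
    """Min-max scaling into [0, 1] (assumed form of the linear normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # flat spectrum: nothing to scale
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical recording: a sinusoid with a period of 4 samples.
samples = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
normalized = linear_normalize(magnitude_spectrum(samples))
```

Under this assumption the scaling is affine, so the positions and relative heights of the spectral peaks are preserved and two recordings made at different loudness levels become directly comparable.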

ii. The Cross-correlation Algorithm
There is a significant body of data on the fundamental frequency of the voice (F0) in the speech of speakers who differ in age and gender [12]. For the same speaker, different words also occupy different frequency bands, owing to the different vibrations of the vocal cords, and the shapes of their spectrums differ as well. These observations are the foundation of speech recognition in this work. To perform speech recognition, the spectrum of the third recorded signal is compared with the spectrums of the first two recorded reference signals. By checking which of the two reference signals better matches the third recorded signal, the system judges which reference word was spoken the third time. The first algorithm to consider when measuring the correlation of two signals is their cross-correlation. The cross-correlation function is very useful for estimating a shift parameter [13]; here the shift parameter is the frequency shift. The cross-correlation of two signals x(n) and y(n) is defined as:
$$r_{xy}(m) = \sum_{n} x(n)\, y(n+m)$$
From this equation, the principal idea of the cross-correlation algorithm is to shift one signal relative to the other, multiply the two position by position, and sum the products.
The FIR Wiener Filter
The FIR Wiener filter is shown in Fig. 7. Let $\hat{d}(n)$ denote the filter output for the input x(n) and filter coefficients w(n), and let d(n) be the desired signal. The estimation error is then given as:
$$e(n) = d(n) - \hat{d}(n) = d(n) - \sum_{l=0}^{p-1} w(l)\, x(n-l)$$
where p is the number of filter coefficients. The Wiener filter design consists of choosing a suitable filter order and finding the filter coefficients with which the system obtains the best estimate. In other words, with the proper coefficients the system minimizes the mean square error (MSE):
$$\xi = E\{|e(n)|^2\}$$
To obtain suitable filter coefficients we minimize the MSE; a sufficient method is to set the derivative of $\xi$ with respect to w*(k) to zero. We then get:
$$E\{e(n)\, x^*(n-k)\} = 0, \quad k = 0, 1, \ldots, p-1$$
The above equation is known as the orthogonality principle or the projection theorem [14]. After rearrangement, with $r_x(k) = E\{x(n)\, x^*(n-k)\}$ and $r_{dx}(k) = E\{d(n)\, x^*(n-k)\}$, the final equation becomes:
$$\sum_{l=0}^{p-1} w(l)\, r_x(k-l) = r_{dx}(k), \quad k = 0, 1, \ldots, p-1$$
This equation may be written in matrix form:
$$\begin{bmatrix} r_x(0) & r_x^*(1) & \cdots & r_x^*(p-1) \\ r_x(1) & r_x(0) & \cdots & r_x^*(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_x(p-1) & r_x(p-2) & \cdots & r_x(0) \end{bmatrix} \begin{bmatrix} w(0) \\ w(1) \\ \vdots \\ w(p-1) \end{bmatrix} = \begin{bmatrix} r_{dx}(0) \\ r_{dx}(1) \\ \vdots \\ r_{dx}(p-1) \end{bmatrix}$$
The matrix equation is the Wiener-Hopf equation [6]:
$$R_x w = r_{dx}$$
From the above equation, the input signal x(n) and the desired signal d(n) are the only quantities that need to be known. The cross-correlation r_dx is computed from x(n) and d(n). At the same time, the auto-correlation r_x(n) is computed from x(n), and this r_x(n) forms the matrix R_x in MATLAB. Given R_x and r_dx, the filter coefficients can be obtained, and from the filter coefficients the minimum mean square error can be obtained. The minimum mean square error is:
$$\xi_{\min} = r_d(0) - \sum_{l=0}^{p-1} w(l)\, r_{dx}^*(l)$$
where $r_d(0) = E\{|d(n)|^2\}$.
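As an illustration of how the filter coefficients and the minimum mean square error can be obtained from R_x and r_dx, the following Python sketch solves the Wiener-Hopf system for real-valued signals. It is a sketch under stated assumptions, not the authors' MATLAB code: the correlations are biased sample estimates, the linear system is solved by plain Gaussian elimination, and all function names and test signals are hypothetical.

```python
def correlate(a, b, lag):
    """Biased sample estimate of E{a(n) b(n - lag)} for real sequences, lag >= 0."""
    n = len(a)
    return sum(a[i] * b[i - lag] for i in range(lag, n)) / n

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting (assumes a nonsingular matrix)."""
    size = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * size
    for r in range(size - 1, -1, -1):
        acc = sum(aug[r][c] * sol[c] for c in range(r + 1, size))
        sol[r] = (aug[r][size] - acc) / aug[r][r]
    return sol

def wiener_fir(x, d, p):
    """Return (w, mmse) for a p-coefficient FIR Wiener filter estimating d from x."""
    r_x = [correlate(x, x, k) for k in range(p)]
    r_dx = [correlate(d, x, k) for k in range(p)]
    # Symmetric Toeplitz auto-correlation matrix R_x with entries r_x(|k - l|).
    big_r = [[r_x[abs(k - l)] for l in range(p)] for k in range(p)]
    w = solve(big_r, r_dx)
    mmse = correlate(d, d, 0) - sum(w[l] * r_dx[l] for l in range(p))
    return w, mmse

# Hypothetical check: the desired signal is the input scaled by 2.
x = [1.0, -2.0, 3.0, 0.5, -1.5, 2.0]
d = [2.0 * v for v in x]
w, mmse = wiener_fir(x, d, p=2)
```

In this check the desired signal is exactly twice the input, so the solution is close to w = [2, 0] with a minimum mean square error close to zero. In a recognition setting, the reference spectrum would play the role of x(n) and the target spectrum the role of d(n), and the reference giving the smaller error would be taken as the match.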

LITERATURE SURVEY
The literature on speech recognition systems begins with Alexander Graham Bell's discovery of a method for transforming sound waves into electrical impulses, and with the first speech recognition system, developed by Davis et al. [6] for recognizing telephone-quality digits spoken at a normal speech rate. This attempt at automatic speech recognition (ASR) centered mainly on the design of an electronic circuit for recognizing ten telephone-quality digits. Spoken words were analyzed to obtain a 2-D plot of formant 1 vs. formant 2. A circuit was developed for pattern matching that finds the greatest correlation coefficient among a set of new incoming data. These features are grouped into feature vectors whose goal is to decrease redundancy and dimensionality in the input to the speaker recognition system. An indication circuit was built to display the spoken digit that had been recognized. The proposed approach lays stress on recognizing speech sounds and assigning suitable labels to these sounds. In the last five decades, various approaches and types of speech recognition systems have come into being. This evolution has had a noticeable impact on the growth of speech recognition systems for various languages worldwide. The exact nature of the feature set depends on which part of a speech signal the features are expected to portray, and thus on what type of information is to be extracted. In speech-to-text conversion, the output of the system is the text that represents the speech. Automatic speech recognition systems have been created for only a portion of the roughly 7300 existing languages, among which Hindi, English, Tamil, Bengali, Russian, Japanese, Portuguese, Sinhala, Chinese, Malayalam, Vietnamese, Spanish, Arabic and Filipino are well known. Most recognition work has been done for the English language. In the 1930s, a simple speech machine that responds to a small, limited set of words was invented. This machine is capable of acting on spoken words and producing speech. Since that time, the invention of speech recognition systems has become a popular area of research. The best example is the work of Olson and Belar in the 1950s at RCA Laboratories, who built a system to identify 10 syllables of a single talker (Olson et al.

PROPOSED METHODOLOGY
In this work, two designed systems are used for speech recognition. Both systems apply the theory introduced earlier in this work. The two designed systems were tested by the author and her friends. Each time the system code is run, MATLAB asks the operator to record speech signals three times. The first two recordings form the reference signals, and the third recording is used as the target signal.

Algorithm for Design System 1:
1. Assign the variables and set the sampling frequency to 16 kHz.

RESULTS
The first two recordings in the speech recognition process are used as reference signals. The third recording is used as the target signal, for which MATLAB should give the judgment. In the following results, "reference signals" stands for the first two recordings and "target signal" stands for the third recording. The words in quotes stand for the contents of the recordings. The author tested the designed systems on both easily recognized words and words that are difficult to recognize. In what follows, "From time 1 to time 10, 'on'" means that the operator ran the simulation 10 times and that the third recorded word is "on" in those first 10 simulations. Since the contents of both the reference words and the target word are known, the author can test whether the judgment given by MATLAB is correct. The statistical simulation results are collected in tables and are also plotted. In this section, only the plotted results are shown.
1. The information for the first statistical simulation results for system 1 is as follows:
Reference signals: "on" and "off".
Target signal: from time 1 to time 10, "on"; from time 11 to time 20, "off".
Speaker: Speaker 1 for both the reference signals and the target signal.
Environment: almost no noise.
The frequency spectrums of the three recorded signals "on", "off" and "on" are portrayed in Figure 8, but the axis is not the real frequency axis, since the figure is obtained by STFT.
Figure 9 shows the cross-correlations between the target signal "on" and the reference signals; the reference signal of the left plot is "on" and the reference signal of the right plot is "off". There is no large difference between the two graphs in Figure 9, since the pronunciations of "on" and "off" are close.
Figure 10: Symmetric errors in 20 simulations for references "on" and "off"
In Fig. 10, the blue curve is the simulated result when the reference speech word is "on", and the red curve is the simulated result when the reference speech word is "off". As stated at the start, the target speech word is "on" in the first 10 simulations and "off" in the second 10 simulations. Fig. 10 shows that the reference "on" curve has the lower value for the first 10 simulations, and the reference "off" curve has the lower value for the second 10.
The results show that the symmetric errors are smaller when the reference speech signal and the target speech signal match. The judgments are all correct.
2. The information for the second statistical simulation results for system 1 is as follows, with reference signals "Door" and "Key".
There is a large difference between the two graphs shown in Figure 12, since the pronunciations of the words "Door" and "Key" are different. As introduced in the theory part, better-matched signals have a more symmetric cross-correlation. Figure 12 demonstrates this point.
Figure 13: Frequency shifts in 20 simulations for references "Door" and "Key"
From Fig. 13, we can see that there are large differences in the frequency shifts, so the designed system gives its judgments directly according to the frequency shifts.
3. The information for the third statistical simulation results for system 1 is as follows:
Reference signals: "on" and "off".
Target signal: from time 1 to time 10, "on"; from time 11 to time 20, "off".
Speaker: Speaker 2 for both the reference signals and the target signal.
Environment: some noise at times.
Since "on" and "off" differ only slightly in frequency shift, the designed system gives its result using the symmetric errors only. These symmetric errors in 20 simulations for references "on" and "off" are shown in Figure 14.
Figure 14: Symmetric errors in 20 simulations for references "on" and "off" (noisy)
As shown in Fig. 14, the blue curve is the simulated result when the reference speech word is "on", and the red curve is the simulated result when the reference speech word is "off". As stated at the outset, the target speech word is "on" in the first 10 simulations and "off" in the second 10 simulations. Fig. 14 shows that the reference "on" curve has the lower value in the first 10 simulations, and the reference "off" curve has the lower value in the second 10. The results indicate that the symmetric errors are smaller when the reference speech signal and the target speech signal match. The judgments are all correct.
4. The information for the fourth statistical simulation results for system 1 is as follows:
Reference signals: "Door" and "Key".
Target signal: from time 1 to time 10, "Door"; from time 11 to time 20, "Key".
Speaker: Speaker 2 for both the reference signals and the target signal.
Environment: some noise at times.
The plotted simulation result is as below:
Figure 15: Frequency shifts in 20 simulations for references "Door" and "Key" (noisy)
Table 1 gives the simulation results for the reference signals "Door" and "Key" under the conditions stated at the beginning of this section.
Table 1: Simulation results for speech words "On", "Off", "Door" and "Key"
Since the simulation results are not as good as expected, only the table results are shown here.

CONCLUSION
In general, noise easily disturbs the designed speech recognition systems, as can be observed from Table 1. For designed system 1, better-matched signals have a more symmetric cross-correlation. For designed system 2, if the reference signal is the same as the target signal, the error in using this reference signal to model the target signal is smaller. This outcome is demonstrated by all the simulation results for designed system 2. When the two reference signals and the target signal are recorded by the same person, both systems work well for distinguishing different words, no matter where this person is from. But if the reference signals and the target signal are recorded by different people, neither system performs well. So, in order to improve the performance of the designed systems, we need to increase the immunity of the system to noise and to find the common characteristics of speech across different people.
The effect of input noise can be minimized by designing analog and digital filters for processing the input signals, which can also be used to build a large database of speech signals for different words. Studying more advanced algorithms for signal modeling can also help greatly to realize better speech recognition.
Linear normalization sets the values of the spectrum |X(ω)| into the interval [0, 1]. The normalization changes only the range of the spectrum values; it does not change the shape or the information of the spectrum itself. Normalization is therefore preferable for spectrum comparison. The change that linear normalization makes to a spectrum is shown in the following example. First, record a speech signal and compute its FFT. Then take the absolute values of the FFT spectrum. The FFT spectrum without normalization is as below:

Figure 1: Absolute values of the FFT spectrum without normalization
Applying linear normalization to the above spectrum gives the normalized spectrum below:
The cross-correlation algorithm consists of 3 steps. First, fix one of the two signals, x(n), and shift the other signal, y(n), left or right by some number of time units. Second, multiply the values of x(n) by the shifted signal y(n+m) position by position. Finally, take the sum of all the products x(n) · y(n+m). For instance, take two sequences x(n) = [0 0 0 1 0] and y(n) = [0 1 0 0 0], both of length N = 5. The cross-correlation of x(n) and y(n) is shown in the following figures:
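The three steps can also be traced numerically for the two example sequences. The following Python sketch is illustrative only (it is not the authors' MATLAB code, and the function name is hypothetical); it assumes two sequences of equal length and treats samples outside the recorded range as zero.

```python
def cross_correlation(x, y):
    """Return {m: sum over n of x(n) * y(n + m)} for every lag m with overlap."""
    n = len(x)
    result = {}
    for m in range(-(n - 1), n):
        total = 0
        for i in range(n):
            j = i + m                 # index into the shifted signal y(n + m)
            if 0 <= j < n:            # samples outside the range count as zero
                total += x[i] * y[j]
        result[m] = total
    return result

x = [0, 0, 0, 1, 0]
y = [0, 1, 0, 0, 0]
r = cross_correlation(x, y)
peak_lag = max(r, key=r.get)          # lag at which the two pulses line up
```

Here the only nonzero product occurs when the pulse of y (at index 1) is aligned with the pulse of x (at index 3), which happens at lag m = -2, so the cross-correlation is 1 at that lag and 0 elsewhere. The lag of the peak is the shift parameter that the algorithm estimates.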

Figure 7: Wiener filter
The input signal of the Wiener filter is x(n). Assume the filter coefficients are w(n), with p coefficients in total. The output $\hat{d}(n)$ is then the convolution of x(n) and w(n):
$$\hat{d}(n) = w(n) * x(n) = \sum_{l=0}^{p-1} w(l)\, x(n-l)$$
(Olson et al., 1956), and at MIT Lincoln Lab, Forgie and Forgie built a speaker-independent 10-vowel recognizer (Forgie et al., 1956). This work continued through the mid-1970s. The new speech recognition systems depended on LPC methods, which were put forward by Itakura, Rabiner and Levinson (Itakura 1975; Rabiner et al., 1979) and others. This research brought major benefits, as it shifted the methodology from the more intuitive template-based approach towards a more rigorous statistical modeling framework (Juang et al., 2004) in the 1980s.

2. Get the returned matrix signals after processing the recorded signal.
3. Get the frequency spectrum by transforming the input signal.
4. Normalize the spectrum by linear normalization.
5. Compute the cross-correlation of the target signal with each of the first two reference signals separately.
6. Check the frequency shift of the cross-correlations.
7. Compare the symmetry of the cross-correlations to find the matched signals.
The cross-correlation of two signals is given by:
$$r_{xy}(m) = \sum_{n} x(n)\, y(n+m)$$
Algorithm for Design System 2:
1. Assign the variables and set the sampling frequency to 16 kHz.
2. Record 3 voice signals. Use the first two recordings as the reference signals and the third recording as the target voice signal.
3. Get the returned matrix signals by processing the recorded signal.
4. Get the frequency spectrum by transforming the input signal.
5. Normalize the frequency spectrum using linear normalization.
6. Compute the auto-correlations of the 3 signals.
7. Compute the filter coefficients using the Wiener filter model.
8. Compute the minimum mean square error for each reference signal.
9. The smaller the minimum mean square error, the better the estimate.
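The decision logic of Design System 1 (steps 5 to 7) might be sketched in Python as follows. This is a hedged illustration and not the authors' MATLAB implementation. Several details are assumptions: the frequency shift is taken as the lag of the cross-correlation peak, the symmetric error is taken as the sum of squared differences between the cross-correlation at lags +m and -m, and the threshold `shift_gap` that selects between the two criteria is hypothetical. The short integer lists stand in for normalized spectrums.

```python
def cross_correlation(x, y):
    """{m: sum over n of x(n) * y(n + m)} for two equal-length sequences."""
    n = len(x)
    return {m: sum(x[i] * y[i + m] for i in range(n) if 0 <= i + m < n)
            for m in range(-(n - 1), n)}

def frequency_shift(r):
    """Lag of the cross-correlation peak (assumed meaning of 'frequency shift')."""
    return max(r, key=r.get)

def symmetric_error(r):
    """Assumed symmetry measure: squared mismatch between lags +m and -m."""
    max_lag = max(r)
    return sum((r[m] - r[-m]) ** 2 for m in range(1, max_lag + 1))

def classify(target, references, shift_gap=2):
    """Pick the reference word that best matches the target spectrum."""
    scores = {}
    for word, ref in references.items():
        r = cross_correlation(ref, target)
        scores[word] = (abs(frequency_shift(r)), symmetric_error(r))
    shifts = [shift for shift, _ in scores.values()]
    if max(shifts) - min(shifts) >= shift_gap:
        # Clearly different shifts: decide on the frequency shift alone.
        return min(scores, key=lambda w: scores[w][0])
    # Similar shifts: fall back to the symmetry of the cross-correlation.
    return min(scores, key=lambda w: scores[w][1])

# Hypothetical normalized spectrums for two reference words and one target.
references = {"door": [0, 1, 3, 1, 0, 0, 0, 0],
              "key":  [0, 0, 0, 0, 0, 1, 3, 1]}
target = [0, 1, 3, 1, 0, 0, 0, 0]
decision = classify(target, references)
```

With these stand-in spectrums, the "door" reference coincides with the target (zero shift and a perfectly symmetric cross-correlation), while the "key" reference peaks four bins away, so the shift criterion alone decides the word. For word pairs with nearly equal shifts, such as "on" and "off" in the results, the fallback to the symmetric error applies.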

Figure 9: Cross-correlations between the target signal "on" and reference signals

Figure 11: Frequency spectrums for three signals: "Door", "Key", and "Door"
The cross-correlations of the target signal "Door" with the reference signals are shown below. The reference signal of the left plot is "Door", and the reference signal of the right plot is "Key".