To address speech recognition in low-quality audio, we have used our proprietary data to develop a speech-to-text (STT) system for the Urdu language. It takes a WAV audio file as input and produces text as output. The proposed solution is robust to noise, which is reduced during pre-processing, and the system can be extended to other languages.
Developing an ASR system requires expertise in signal processing, machine learning, and natural language processing. Our STT system comprises the following processes:
Noise removal is a crucial step in ASR: it improves recognition accuracy by eliminating distortions from the speech signal. Our solution minimizes noise efficiently in most day-to-day scenarios and performs exceptionally well in the presence of background speech.
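The source does not specify the noise-reduction algorithm, but a minimal sketch of one common approach, spectral subtraction, illustrates the idea: estimate the average noise magnitude spectrum from a noise-only segment and subtract it from each frame of the signal.

```python
import numpy as np

def spectral_subtraction(signal, noise_profile, frame_len=256, hop=128):
    """Illustrative noise suppression: subtract an estimated noise
    magnitude spectrum from each frame, keep the noisy phase, and
    reconstruct by overlap-add (not the system's actual algorithm)."""
    # Average noise magnitude spectrum from a noise-only recording.
    noise_frames = [noise_profile[i:i + frame_len]
                    for i in range(0, len(noise_profile) - frame_len + 1, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    window = np.hanning(frame_len)  # 50% overlap satisfies overlap-add
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # Floor at zero so magnitudes never go negative.
        clean_mag = np.maximum(mag - noise_mag, 0.0)
        out[start:start + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase))
    return out
```

Real systems refine this with over-subtraction factors and spectral floors to limit "musical noise" artifacts.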
Our system uses signal processing techniques to extract features, including Mel-frequency cepstral coefficients (MFCCs) and other time-frequency representations of speech.
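The standard MFCC pipeline can be sketched end to end with numpy; the frame sizes and filter counts below are common defaults, not the system's actual configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """MFCC sketch: framing, FFT power spectrum, triangular mel
    filterbank, log compression, then a DCT-II to decorrelate."""
    n_fft = 512
    # Triangular mel filterbank spanning 0 Hz .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # DCT-II basis keeping the first n_ceps coefficients.
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))

    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
        log_mel = np.log(fbank @ power + 1e-10)
        feats.append(dct @ log_mel)
    return np.array(feats)
```

In practice a library such as librosa or Kaldi's feature extractor would be used, but the computation is the same.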
Acoustic models map acoustic features to phonemes or other speech units. We employ HMMs, DNNs, and other machine learning techniques to develop our acoustic model. In the hybrid HMM-DNN approach, additional hidden layers capture complex relationships, and input features are taken from a wider time window for better context. To achieve optimal results, we have tuned custom acoustic model parameters.
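The two ideas above, frame splicing for a wider time window and a multi-layer network producing per-frame phoneme posteriors, can be sketched with a toy numpy forward pass (the layer sizes and phoneme count here are hypothetical, not the system's actual architecture):

```python
import numpy as np

def splice(feats, context=4):
    """Stack each frame with +/- `context` neighbours so the network
    sees a wider time window; edge frames are padded by repetition."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

def dnn_posteriors(feats, weights, biases):
    """Forward pass of a toy DNN acoustic model: spliced features in,
    per-frame posterior probabilities over phoneme units out."""
    h = splice(feats)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)           # ReLU hidden layers
    logits = h @ weights[-1] + biases[-1]
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)      # softmax over phonemes
```

In the hybrid setup these posteriors are converted to scaled likelihoods and consumed by the HMM decoder.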
ASR/STT systems use statistical properties of language, learned from text corpora, to predict the likelihood of word sequences. Language models can use n-grams, RNNs, or other machine learning techniques. N-gram models are particularly popular here; the probability of each n-gram is stored as a log probability in the model file.
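A minimal bigram example shows how such a model is estimated from a corpus and how sentences are scored by summing log probabilities (add-one smoothing is used here purely for simplicity; production models use stronger smoothing such as Kneser-Ney):

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate a bigram LM with add-one smoothing; probabilities are
    kept as log probabilities, as in ARPA-style model files."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])            # history counts
        bigrams.update(zip(tokens, tokens[1:]))
    v = len(vocab)
    logp = {bg: math.log((n + 1) / (unigrams[bg[0]] + v))
            for bg, n in bigrams.items()}
    return logp, unigrams, v

def sentence_logprob(sentence, logp, unigrams, v):
    """Score a sentence as the sum of bigram log probabilities,
    falling back to the smoothed floor for unseen bigrams."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(logp.get(bg, math.log(1 / (unigrams.get(bg[0], 0) + v)))
               for bg in zip(tokens, tokens[1:]))
```

During decoding, these scores are combined with acoustic scores to rank candidate transcriptions.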
ASR/STT systems use dynamic programming and beam search strategies to find the most likely transcription. Beam search reduces the size of the search space: it identifies and keeps the top k words in the vocabulary for the first position, then computes conditional probabilities for subsequent words, again retaining only the top k hypotheses at each step.
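The pruning strategy just described can be sketched in a few lines; `cond_logprob` stands in for whatever model (here a toy lookup table in the test) supplies the conditional log probability of the next word:

```python
import math

def beam_search(cond_logprob, vocab, max_len, k=3):
    """Generic beam search sketch: at each step, expand every live
    hypothesis by every vocabulary word, then keep only the k
    highest-scoring hypotheses instead of the full search space."""
    beams = [([], 0.0)]  # (word sequence, cumulative log probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "</s>":
                candidates.append((seq, score))  # finished hypothesis
                continue
            for w in vocab:
                candidates.append((seq + [w], score + cond_logprob(seq, w)))
        # Prune: retain only the k best hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]
```

With k equal to the vocabulary size this degenerates to exhaustive search; small k trades a little accuracy for a large speedup.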
ASR/STT systems require labelled speech data, obtained through manual transcription, for training acoustic and language models. Our system has been trained on 300 hours of custom data. Training an STT system starts with data collection; the data is then pre-processed and annotated with text transcriptions. This is followed by grapheme-to-phoneme (G2P) conversion, which creates a phonetic representation of the data using phonetic conversion rules and statistical analysis. Finally, different training models are used to develop the system; these may include, but are not restricted to, monophone, triphone, SGMM2, and neural-network-based models.
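The rule-based side of G2P conversion can be sketched as a greedy longest-match scan over a word; the rule table in the test is a hypothetical romanized-Urdu fragment for illustration only, not the system's actual rule set:

```python
def g2p(word, rules):
    """Greedy longest-match grapheme-to-phoneme sketch: scan left to
    right, consume the longest grapheme cluster that has a rule, and
    emit its phoneme."""
    phonemes, i = [], 0
    max_len = max(len(g) for g in rules)
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phonemes.append(rules[chunk])
                i += size
                break
        else:
            # No rule matched: pass the character through unchanged.
            phonemes.append(word[i])
            i += 1
    return phonemes
```

Statistical G2P models (e.g., joint-sequence models) handle the many words such hand-written rules miss.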
The ASR system was tested on 14K audio files using the Word Error Rate (WER) metric, which provides a consistent way to compare system performance over time. The WER is approximately 20% and is calculated using the following equation:

Word Error Rate = (S + D + I) / N

S = substitutions, D = deletions, I = insertions, N = total number of words in the reference
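The equation above can be computed directly as the word-level edit distance between the reference and the hypothesis, divided by the reference length:

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, where the numerator is the minimum
    number of word edits, found by dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                     # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                     # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub,          # substitution or match
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions.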
- The ASR recognizes the variability of speech patterns in Urdu, resulting in excellent performance
- The system was trained on 300 hours of annotated data; by comparison, the Common Voice dataset for Urdu has only 46 hours of validated training data
- A pre-processing stage removes background noise, improving speech recognition quality
- Our ASR is context-aware, as it has been trained on specific use cases, improving performance for both specific and general cases
- Capability: Converts speech to text in Urdu.
- Accuracy: Our Urdu ASR system has a word error rate of less than 20% on low-quality audio, making it competitive with some of the best ASR systems in the world.
- Noise Reduction: Our noise reduction algorithm suppresses the background noise of call centers and thus increases the accuracy of the ASR.
- Customization: The ASR system can be quickly fine-tuned for specific use cases using customer data.
- Cloud-Based Deployment: The system can be easily deployed on cloud infrastructure using our Docker container.
- Efficiency: The ASR converts speech to text in real time.
- Robustness: Capable of recognizing English words used in Urdu speech.