Speech-to-Text Transcription
ASR (Automatic Speech Recognition), also known as STT (Speech-to-Text), converts speech signals into digital form and processes them to generate a transcription of the audio as output.
Difference Between Standard and Customized Models for Speech Analytics Applications
Standard models are not fine-tuned on client-specific data or contact center calls.
Customized models are boosted with domain-specific words and keywords related to the client or their industry. This optimization improves prediction accuracy for these words, thereby strengthening speech analytics models.
Custom models are biased toward recognizing the intents, keywords, and phrases essential for effective speech analytics.
When to Submit a Fine-Tuning Request
For general use cases, a standard model is preferred initially, and a deployment request is raised accordingly. The standard model's accuracy must then be benchmarked against the use case, especially if it is implemented for a Voicebot. If persistent issues arise, a fine-tuning ticket should be raised, stating the primary reason for fine-tuning and the preferred additional resources.
The following examples illustrate scenarios for raising fine-tuning requests:
| Use Case | Possible Reasons for Fine-Tuning | Preferred Additional Resources |
| --- | --- | --- |
| Voice Bot | Intent keywords are not predicted. Dialogue intent keywords are mistranscribed. | Lucid chart of the Voice Bot along with a list of variations of intent keywords. List and variations of configured dialogue intents. |
| RTS/Speech Analytics | Domain words are not predicted. | List of primary domain words, such as brand names or possible SA intents. |
| Quality Management | QM parameters are not detected due to mistranscription of action keywords. | A list of all QM parameters along with possible variations of the keywords associated with each parameter. |
Custom Training an ASR Model for Speech Analytics Use Case
Steps involved in fine-tuning the models:
Data Gathering:
Client calls are transferred via an SFTP folder.
Calls are shared following the format specified at the end of this article.
Annotation:
Annotation is primarily performed on call audio provided by the client.
The audio is then segmented using VAD (Voice Activity Detection) and sent to language experts for annotation (see the segmentation sketch after this list).
Thorough quality assurance (QA) is conducted upon completion of each batch of annotation to ensure high accuracy.
The annotated data is utilized to fine-tune the model, incorporating insights from real client calls.
Recording:
A comprehensive list of domain-specific words is curated to prioritize accuracy for the ASR model's use case.
Sentences are constructed with variations for each of the mentioned words.
The constructed sentences are then recorded and utilized to fine-tune the model specifically for these words.
Vocabulary Validation/Convergence or Rule-Based:
In certain languages, an additional step may be necessary to validate and ensure consistency in the vocabulary across predictions.
For specific use cases, such as Word2Num and symbols, a rule-based approach is implemented to address unique requirements.
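The VAD segmentation mentioned in the annotation step above can be illustrated with a short sketch. The snippet below uses the webrtcvad package as one possible segmentation tool; the package choice, frame length, and file handling are assumptions for illustration and do not describe the exact internal pipeline.

```python
# Minimal sketch: mark speech vs. non-speech frames in a mono 16-bit PCM WAV
# using webrtcvad (one possible VAD library; not necessarily the internal tool).
# For stereo call recordings, run this once per channel.
import wave
import webrtcvad

FRAME_MS = 30  # webrtcvad accepts 10, 20, or 30 ms frames

def speech_frames(path: str, aggressiveness: int = 2):
    """Yield (timestamp_in_seconds, is_speech) for each frame of the file."""
    vad = webrtcvad.Vad(aggressiveness)              # 0 = least, 3 = most aggressive
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 1 and wav.getsampwidth() == 2, "expects 16-bit mono PCM"
        rate = wav.getframerate()                    # must be 8000/16000/32000/48000 Hz
        samples_per_frame = int(rate * FRAME_MS / 1000)
        t = 0.0
        while True:
            frame = wav.readframes(samples_per_frame)
            if len(frame) < samples_per_frame * 2:   # 2 bytes per 16-bit sample
                break
            yield t, vad.is_speech(frame, rate)
            t += FRAME_MS / 1000.0

# Consecutive speech frames can then be merged into segments and exported
# as individual clips for annotation.
```

Similarly, the rule-based Word2Num step can be sketched as a simple post-processing pass over the transcript. The mapping below is a toy example that assumes English digit words only; production rules cover far more cases (ordinals, magnitudes, symbols).

```python
# Toy sketch of a rule-based word-to-number ("Word2Num") pass over ASR output.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}

def digits_from_words(transcript: str) -> str:
    """Collapse runs of spelled-out digits into numerals."""
    out, run = [], []
    for token in transcript.split():
        if token.lower() in UNITS:
            run.append(str(UNITS[token.lower()]))
        else:
            if run:
                out.append("".join(run))
                run = []
            out.append(token)
    if run:
        out.append("".join(run))
    return " ".join(out)

print(digits_from_words("my account number is four two zero one"))
# -> "my account number is 4201"
```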
Measuring the Accuracy of the ASR Model for Speech Analytics Application
To assess accuracy, a portion of annotations is kept separate from the training dataset, forming a 'Holdout Set.' This set is consistently used during each new model iteration to validate accuracy. The model's predictions on this holdout set are compared with the 'Ground Truth' obtained from annotation results.
The Word Error Rate (WER) is then calculated using the formula:
WER = (#insertions + #deletions + #substitutions) / (number of words in the reference)
Additionally, the F1 score is frequently employed to evaluate the accuracy of the model for the provided domain words list.
Note: In specific languages, such as Cantonese or Mandarin, where characters play a crucial role, metrics like Character Error Rate (CER) or Sentence Error Rate (SER) may offer a more effective demonstration of the accuracy of the Automatic Speech Recognition (ASR) model.
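As a concrete illustration, word-level WER can be computed with a standard edit-distance routine. The snippet below is a minimal sketch; off-the-shelf libraries such as jiwer provide the same calculation (plus CER) out of the box.

```python
# Minimal sketch: word-level WER via edit distance between reference and hypothesis.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits (insertions, deletions, substitutions)
    # needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("please confirm my booking", "please confirm the booking"))  # 0.25
```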
Language Support
We offer standard ASR models for a diverse range of languages, including English, German, Arabic, French, Spanish, Hindi, Indian Regional Languages (such as Tamil, Telugu, Gujarati, Marathi, Kannada, Malayalam), Cantonese, Japanese, Korean, Portuguese, Italian, Russian, Kazakh, Chinese (Mandarin), and Croatian.
Third-Party Integration
We support third-party ASR integrations from various providers, as well as any in-house models developed by the client.
Process for Raising a Request
To initiate a request, please follow the steps outlined below:
Questions to Address in the Request:
<Specify the questions that need to be answered during the request process>
Details to Provide
Primary Reason for Fine-Tuning:
Name of the Client:
Domain:
Brief Description of the Client:
Language/s:
Accent:
Success Criteria:
Other Requirements (Preferred Resources for Fine-tuning):
VB Flow/Lucid (If Specified) + Intent List:
Hints (If Any):
Partner ID:
Environment ID:
Client Data Availability:
Client Data Pipeline/Link:
Domain Words List (If Specified):
Data Requirements
For optimal ASR model development, we require a dataset with the following specifications:
Audio Duration
- 100+ hours of audio data.
Audio Format
- WAV format.
- Sample rate of 8000 Hz or higher.
- Stereo configuration.
Acceptable Formats
- PCM S16LE.
- PCM A-law.
- PCM mu-law.
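As a quick sanity check before sharing data, the format requirements above can be validated programmatically. The sketch below uses the soundfile library as one possible tool; the accepted subtypes mirror the list above, and the function itself is illustrative rather than a prescribed step.

```python
# Minimal sketch: validate a call recording against the audio requirements above.
import soundfile as sf

ACCEPTED_SUBTYPES = {"PCM_16", "ALAW", "ULAW"}   # PCM S16LE, A-law, mu-law

def check_call_audio(path: str) -> list:
    """Return a list of problems found; an empty list means the file qualifies."""
    info = sf.info(path)
    problems = []
    if info.format != "WAV":
        problems.append(f"format is {info.format}, expected WAV")
    if info.samplerate < 8000:
        problems.append(f"sample rate {info.samplerate} Hz is below 8000 Hz")
    if info.channels != 2:
        problems.append(f"{info.channels} channel(s), expected stereo (2)")
    if info.subtype not in ACCEPTED_SUBTYPES:
        problems.append(f"subtype {info.subtype} is not an accepted PCM encoding")
    return problems
```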
Data Representation
- Representative of actual production data and use cases.
- Inclusion of audio samples for each intent or use case.
Speakers and Diversity
- Inclusion of different speakers.
- Well-distributed representation of both male and female speakers.
- Variation in speakers' locations, accents, and languages.
Meta Information
- A unique identifier for different advisors (Mandatory or Good to have).
- Call date and time (Mandatory or Good to have).
- Additional metadata for mining insights (optional):
- Advisor gender.
- Caller ID.
- Location of the contact center.
Additional Text Corpus for STT Model Fine-Tuning
- Emails.
- Documentation provided to customer care executives during onboarding.
- Common phrases or domain words commonly used in contact centers.
FAQs
What are Stereo Calls?
Stereo, or stereophonic, sound uses two audio channels to keep the speakers (customer and agent) separate. In contrast, mono, or monophonic, audio is a single channel with both speakers mixed into one signal, creating the perception of sound coming from a single position. Only stereo audio is used, and required, for ASR fine-tuning.
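For illustration, a stereo call can be separated into its two channels with a few lines of code. The snippet below is a minimal sketch using the soundfile library; the mapping of left/right channels to customer/agent is an assumption that varies by telephony setup.

```python
# Minimal sketch: split a stereo call recording into one file per channel.
import soundfile as sf

def split_stereo(path: str) -> None:
    audio, rate = sf.read(path)                       # shape: (num_samples, 2) for stereo
    assert audio.ndim == 2 and audio.shape[1] == 2, "expected a stereo file"
    sf.write("channel_left.wav", audio[:, 0], rate)   # e.g. customer
    sf.write("channel_right.wav", audio[:, 1], rate)  # e.g. agent

split_stereo("call_0001.wav")
```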
Does Sprinklr Provide Support for Third-Party ASR Models?
Yes, Sprinklr supports third-party ASR models and is open to in-house or other third-party integrations as requested.
Does Sprinklr Support Multi-Lingual Models?
Yes, Sprinklr supports multi-lingual models, but timelines may vary accordingly.
What are the Expected Timelines?
The expected timelines range from 4 to 8 weeks, depending on the language.
What is Considered Good Accuracy for an ASR Model?
The accuracy of the final fine-tuned model depends on factors such as language, annotation accuracy, and training data. Expected accuracy varies accordingly; for example, English models generally achieve higher accuracy (WER < 10%) with minimal training, while Arabic models may have a higher error rate (WER > 20%, CER > 15%) even after multiple rounds of annotation and fine-tuning.
Why Does Sprinklr Need Customer Data? Are There Security Concerns?
Sprinklr uses client-provided data only to fine-tune ASR models. The data, voices, and calls are not exposed for commercial use. Stringent policies and NDAs signed by annotation resources ensure that client data is handled securely.
Why is a List of Domain Words/Additional Resources Required for Fine-Tuning?
For keyword-based models used in final use cases (for example, voice bot intents), certain words need to be transcribed precisely. Providing a list of variations for these keywords helps train the model and boosts accuracy in the final output.