Speech-to-Text Transcription in Voicebots

ASR (Automatic Speech Recognition), also known as STT (Speech-to-Text), technology converts speech signals into digital form and processes them to generate transcriptions of the audio.

Language Support

We offer standard ASR models for a diverse range of languages, including English, German, Arabic, French, Spanish, Hindi, Indian Regional Languages (such as Tamil, Telugu, Gujarati, Marathi, Kannada, Malayalam), Cantonese, Japanese, Korean, Portuguese, Italian, Russian, Kazakh, Chinese (Mandarin), and Croatian.

Third-Party Integration

We support both third-party ASR integrations from various providers and any in-house models developed by the client.

Difference Between Standard and Customized Models for Speech Analytics Applications

  • Standard models lack fine-tuning on client-specific data and contact center calls.

  • Customized models are boosted with domain-specific words/keywords related to the client or their industry. This boosting improves prediction accuracy for these words, thereby reinforcing speech analytics models.

  • Custom models exhibit a bias towards specific intents, keywords, and phrases essential for successful Speech Analytics and Voice Bot implementations.

  • Standard models may not include the client's brand/product names, potentially leading to inaccuracies and misinterpretations in real-time scenarios.

  • Standard models do not support hinting and contextualization in Voice Bot implementations. Hinting and contextualization are an approach in which ASR models receive cues from the Voice Bot about the expected user utterance, based on the specific question the Voice Bot has just posed. For example, the bot offers hints to the ASR model at nodes where it asks for information such as a phone number.

When to Submit a Fine-Tuning Request

For general use cases, the initial preference should be given to a standard model, and deployment requests should be raised accordingly. The standard model's accuracy must then be benchmarked against the specific use case, especially if it is implemented for a Voice Bot.

If persistent issues remain after benchmarking, raise a fine-tuning ticket that states the primary reason for fine-tuning and the preferred additional resources.

Below are examples illustrating the scenarios for raising fine-tuning requests:

Use Case: Voice Bot

  • Possible reasons for fine-tuning: intent keywords not predicted; dialogue intent keywords being mistranscribed.

  • Preferred additional resources: a Lucid chart of the Voice Bot along with a list of variations of intent keywords; the list and variations of dialogue intents configured.

Use Case: RTS/Speech Analytics

  • Possible reasons for fine-tuning: domain words not predicted.

  • Preferred additional resources: a list of primary domain words, such as brand names or possible SA intents.

Use Case: Quality Management

  • Possible reasons for fine-tuning: QM parameters not detected due to mistranscription of action keywords.

  • Preferred additional resources: a list of all QM parameters along with possible variations of the keywords associated with each parameter.

Custom Training an ASR Model for a Speech Analytics Use Case

Steps involved in fine-tuning the models:

  1. Data Gathering:

    1. Client calls are transferred via an SFTP folder.

    2. Calls should be shared following the format specified at the end of this article.

  2. Annotation: 

    1. Primarily performed using call audios provided by the client.

    2. Audios are then segmented by VAD (Voice Activity Detection) and sent for annotation by language experts.

    3. Thorough quality assurance (QA) is conducted upon completion of each batch of annotation to ensure high accuracy.

    4. The annotated data is utilized to fine-tune the model, incorporating insights from real client calls.

  3. Recording:

    1. A comprehensive list of domain-specific words is curated to prioritize accuracy for the ASR model's use case.

    2. Sentences are constructed with variations for each of the mentioned words.

    3. The constructed sentences are then recorded and utilized to fine-tune the model specifically for these words.

  4. Vocabulary Validation/Convergence or Rule-Based:

    1. In certain languages, an additional step may be necessary to validate and ensure consistency in the vocabulary across predictions.

    2. For specific use cases, such as word-to-number conversion (Word2Num) and symbols, a rule-based post-processing approach is implemented; a minimal sketch follows this list.
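
To make the rule-based step concrete, below is a minimal sketch of Word2Num post-processing for spoken digit strings such as phone numbers. The mappings and function name are illustrative assumptions, not Sprinklr's production implementation, which covers many more patterns and languages.

```python
import re

# Illustrative Word2Num rules for spoken digit strings (e.g., phone numbers).
DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}
MULTIPLIERS = {"double": 2, "triple": 3}

def words_to_digits(text: str) -> str:
    """Convert 'nine eight double seven one two' -> '987712'.
    Tokens that are not digit words are passed through unchanged."""
    out, repeat = [], 1
    for token in text.lower().split():
        if token in MULTIPLIERS:
            repeat = MULTIPLIERS[token]
        elif token in DIGITS:
            out.append(DIGITS[token] * repeat)
            repeat = 1
        else:
            out.append(f" {token} ")  # keep ordinary words as-is
            repeat = 1
    return re.sub(r"\s+", " ", "".join(out)).strip()

print(words_to_digits("nine eight double seven one two"))  # 987712
```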

Measuring the Accuracy of the Model

Word Error Rate (WER): This metric is particularly significant in real-time speech (RTS) and Speech Analytics (SA) implementations, where word-level accuracy matters more than the precise transcription of entire sentences, since there is no conversational flow to maintain.
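
For reference, WER is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) between reference and hypothesis, divided by the number of reference words. A minimal sketch, not Sprinklr's internal scorer:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("call me at nine", "call me at night"))  # 1 error / 4 words = 0.25
```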

Sentence Error Rate (SER): This metric gains prominence in voice bot implementations, where the accuracy of entire sentences significantly influences the overall flow of the bot's conversation.
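
SER, by contrast, counts an entire utterance as wrong if its transcription differs from the reference at all; a minimal sketch:

```python
def ser(references: list[str], hypotheses: list[str]) -> float:
    """Sentence Error Rate = sentences with at least one error / total sentences."""
    wrong = sum(ref.split() != hyp.split()
                for ref, hyp in zip(references, hypotheses))
    return wrong / max(len(references), 1)
```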

F1 Score on Keywords: This metric is calculated for a specific keyword list crucial to the use case or client. It ensures accurate predictions in transcripts for essential elements such as brand names, client products or services, and bot intents.
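
One reasonable way to score this (the exact counting convention may differ from Sprinklr's) is to treat each keyword's presence per utterance as a prediction and compute precision and recall over those counts:

```python
def keyword_f1(references: list[str], hypotheses: list[str],
               keywords: set[str]) -> float:
    """F1 over keyword occurrences: a keyword in an utterance counts as a true
    positive when it appears in both the reference and the hypothesis.
    Keywords are assumed to be lowercase single words."""
    tp = fp = fn = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = set(ref.lower().split())
        hyp_words = set(hyp.lower().split())
        for kw in keywords:
            in_ref, in_hyp = kw in ref_words, kw in hyp_words
            tp += in_ref and in_hyp
            fp += not in_ref and in_hyp
            fn += in_ref and not in_hyp
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

# e.g., keyword_f1(refs, hyps, {"refund", "emi", "warranty"})
```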

Contextualization and Hinting

Voice Bot implementations often require contextualization to improve the accuracy of specific aspects or parameters in ASR model predictions, which in turn raises overall bot accuracy. Incorporating hinting in the ASR model is pivotal to achieving this context-based precision.

Identifying the Need for Hints

The necessity for hints becomes apparent when the bot journey includes nodes where the user's reply is expected to follow certain grammatical and/or structural patterns. For example, instances involving assertions, phone numbers, order IDs, PNRs, etc., prompt the addition of hints in the bot's configuration. These hints are then transmitted to the ASR model whenever the corresponding node is activated, ensuring accuracy for that particular parameter.
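
As an illustration, the request below shows roughly what a node-level hint attached to an ASR call might look like; all field names and values are hypothetical, not Sprinklr's actual API contract:

```python
# Hypothetical shape of an ASR request carrying a node-level hint.
# Field names are illustrative assumptions, not an actual Sprinklr API.
asr_request = {
    "call_id": "call-1234",
    "language": "en-US",
    "hint": {
        "type": "Numeric_10",  # the node expects a 10-digit phone number
        "context_phrases": ["my number is", "you can reach me on"],
    },
}
```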

Examples of Hints: Several common hints include:

  • Assertion

  • Shop name*

  • Product

  • Service

  • Email ID

  • Alpha_numeric_5/PNR

  • Alpha_numeric_10/Order ID

  • Password

  • Numeric_10/Phone number

  • Numeric_5/Customer ID

  • Numeric

  • Names

  • Dates

*Names and other proper nouns (such as shop names) can achieve decent accuracy through fine-tuning only when a specified list of these names is available; in the generic case, no exhaustive dataset of names exists.
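
One common way to use such structural hints downstream is to validate (or re-score) the transcription against the format the hint implies. A hedged sketch, with hint names taken from the list above and regex patterns as assumptions for demonstration, not production rules:

```python
import re

# Illustrative mapping from hint type to the structure the reply should satisfy.
HINT_PATTERNS = {
    "Numeric_10": re.compile(r"\d{10}"),            # phone number
    "Numeric_5": re.compile(r"\d{5}"),              # customer ID
    "Alpha_numeric_5": re.compile(r"[A-Z0-9]{5}"),  # PNR
    "Alpha_numeric_10": re.compile(r"[A-Z0-9]{10}"),  # order ID
}

def matches_hint(transcript: str, hint: str) -> bool:
    """Check whether a normalized transcript fits the structure a hint expects."""
    normalized = transcript.replace(" ", "").upper()
    pattern = HINT_PATTERNS.get(hint)
    return bool(pattern and pattern.fullmatch(normalized))

print(matches_hint("98765 43210", "Numeric_10"))  # True
```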

Process for Raising a Request

To initiate a request, please follow the steps outlined below:

Questions to Address in the Request:

<Specify the questions that need to be answered during the request process>

Details to Provide

Primary Reason for Fine-Tuning:

Name of the Client:

Domain:

Brief Description of the Client:

Language/s:

Accent:

Success Criteria:

Other Requirements (Preferred Resources for Fine-tuning):

VB Flow/Lucid (If Specified) + Intent List:

Hints (If Any):

Partner ID:

Environment ID:

Client Data Availability:

Client Data Pipeline/Link:

Domain Words List (If Specified):

Data Requirements

For optimal ASR model development, we require a dataset with the following specifications:

Audio Duration

- 100+ hours of audio data.

Audio Format

- WAV format.

- Sample rate of 8000 Hz or higher.

- Stereo configuration.

Acceptable Formats

- PCM S16LE.

- PCM A-law.

- PCM mu-law.
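
A quick way to check files against the requirements above is Python's standard wave module; note that wave reads only uncompressed PCM WAV, so A-law/mu-law files would need an audio library instead. A minimal sketch:

```python
import wave

def check_call_audio(path: str) -> list[str]:
    """Validate a PCM WAV file against the dataset requirements:
    8000 Hz or higher, stereo, 16-bit samples (PCM S16LE)."""
    issues = []
    with wave.open(path, "rb") as wav:  # raises wave.Error for non-PCM files
        if wav.getframerate() < 8000:
            issues.append(f"sample rate {wav.getframerate()} Hz < 8000 Hz")
        if wav.getnchannels() != 2:
            issues.append(f"{wav.getnchannels()} channel(s); stereo required")
        if wav.getsampwidth() != 2:
            issues.append(f"{wav.getsampwidth() * 8}-bit samples; 16-bit expected")
    return issues
```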

Data Representation

- Representative of actual production data and use cases.

- Inclusion of audio samples for each intent/use case.

Speakers and Diversity

- Inclusion of different speakers.

- Well-distributed representation of both male and female speakers.

- Variation in speakers' locations, accents, and languages.

Meta Information

- A unique identifier for different advisors (Mandatory/Good to have).

- Call date and time (Mandatory/Good to have).

- Additional metadata for mining insights (optional):

  - Advisor gender.

  - Caller ID.

  - Location of the contact center.

Additional Text Corpus for STT Model Fine-Tuning

- Emails.

- Documentation provided to customer care executives during onboarding.

- Common phrases or domain words commonly used in contact centers.

FAQs

What are Stereo Calls?

Stereo, or stereophonic, sound uses two audio channels, one per speaker (customer and agent). In contrast, mono, or monophonic, audio uses a single channel with both speakers mixed into one signal, creating the perception of sound emanating from a single position. Only stereo audio is used, and required, for ASR fine-tuning.
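
For reference, de-interleaving a 16-bit stereo PCM WAV into its two speaker channels takes only the standard library; a minimal sketch (assumes a little-endian platform, matching PCM S16LE):

```python
import wave
from array import array

def split_stereo(path: str) -> tuple[array, array]:
    """Split a 16-bit stereo WAV into (left, right) sample arrays.
    In contact-center recordings, each channel carries one speaker."""
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 2 and wav.getsampwidth() == 2
        samples = array("h", wav.readframes(wav.getnframes()))
    return samples[0::2], samples[1::2]  # frames interleave L then R
```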

Does Sprinklr Provide Support for Third-Party ASR Models?

Yes, Sprinklr supports third-party ASR models and is open to in-house or other third-party integrations as requested.

Does Sprinklr Support Multi-Lingual Models?

Yes, Sprinklr supports multi-lingual models, but timelines may vary accordingly.

What are the Expected Timelines?

The expected timelines range from 4 to 8 weeks, depending on the language.

What is Considered Good Accuracy for an ASR Model?

The accuracy of the final fine-tuned model depends on factors such as language, annotation accuracy, and training data, so expected accuracy varies accordingly. For example, English models generally achieve higher accuracy (WER < 10%) with minimal training, while Arabic models may show higher error rates (WER > 20%, CER > 15%) even after multiple rounds of annotation and fine-tuning.

Why Does Sprinklr Need Customer Data? Are There Security Concerns?

Sprinklr only uses client-provided data to finetune ASR models. The data, voices, and calls are not exposed for commercial use. Stringent policies and NDAs signed by annotation resources ensure 100% security when handling client data.

Why is a List of Domain Words/Additional Resources Required for Fine-Tuning?

For keyword-based models used in final use cases (e.g., voice bot intents), certain words need precise transcription. Providing a list of variations for these keywords helps in training the model, boosting accuracy in the final output.
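
As an illustration, such a resource might be shared in a simple structured form like the following (all keyword names and variations are hypothetical):

```python
# Hypothetical example of a domain-keyword list with variations,
# as might accompany a fine-tuning request.
domain_keywords = [
    {"keyword": "Acme Pay", "variations": ["acme pay", "acme pe", "acmepay"]},
    {"keyword": "refund",   "variations": ["refund", "re-fund", "refunds"]},
    {"keyword": "EMI",      "variations": ["e m i", "emi", "easy monthly installment"]},
]
```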