Unified Vector Representation for Noisy Articles

July 22, 2024 · 8 min read

If you're dealing with documents, conversations or articles that contain valuable content but are cluttered with noise, and you want to perform multiple tasks or build various products on top of them, you're in the right place. 

In this blog post, we'll outline how to construct meaningful representations of such content for tasks like product detection, intent detection, sentiment analysis and more. What's more, we will also discuss how these representations can be effortlessly adapted for new tasks using a lightweight process. So, let’s get started.  

The background 

Over the past eighteen months, the AI landscape has been predominantly shaped by Large Language Models (LLMs), with a competitive push towards developing expansive models featuring billions of parameters to tackle a diverse array of tasks. 

This surge in LLM development has spotlighted the need for cost-effective deployment strategies, leading to the adoption of inference-optimization frameworks such as vLLM and MLC LLM (Machine Learning Compilation). 

Additionally, the LLM boom has significantly advanced methods for generating embeddings — a critical component in numerous Natural Language Processing (NLP) applications — from lengthy, unstructured documents. 

Generalized conversation embeddings can help: 

  • Make the models for downstream tasks (such as intent recognition or sentiment detection) lightweight, since the embeddings can be reused. 
  • Facilitate unsupervised discovery over conversations, such as knowledge base discovery, QM checklist discovery and FAQ discovery. 

Contemporary solutions to embeddings 

Embeddings can model images, voice and video, but in this article we will focus on text embeddings. Below are some prevalent techniques associated with text embeddings: 

Sparse vectors 

The simplest way to represent a text is with a vector whose dimensionality equals the vocabulary size: each dimension corresponds to one word in the vocabulary, and its value is the number of times that word occurs in the text. 

There are other, more sophisticated techniques for building sparse vectors, such as TF-IDF and BM25. The two biggest shortcomings of sparse vector techniques are: 

  • The dimensionality of the vector changes with vocabulary size. 
  • The vectors have no notion of semantic meaning. 
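
To make this concrete, below is a minimal sketch of such a count vector over a toy two-sentence corpus; the sentences and the plain-Python implementation are purely illustrative.

from collections import Counter

corpus = [
    "the boy loves to party",
    "the boy likes the party",
]

# One dimension per unique word seen in the corpus.
vocab = sorted({word for text in corpus for word in text.split()})

def count_vector(text):
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

for text in corpus:
    print(text, "->", count_vector(text))

# Adding a document with unseen words would grow the vocabulary and change the
# dimensionality, and "loves" and "likes" land on unrelated dimensions -- the
# two shortcomings listed above.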

Dense vectors 

“The boy ____ to party."

If we are to fill in the blanks, words such as "loves" or "likes" will readily come to mind. Given their contextual appropriateness, it stands to reason that the embedding of the chosen word should closely align with the embeddings of these suggestions. Thus, word embeddings in a given context tend to cluster around semantically similar concepts, underscoring the nuanced understanding of language that embeddings capture.  This leads us to the following conclusion: 

A Word is defined by the context it appears in.  

Introducing Word2Vec: At the heart of this approach lies the Skip-Gram model, which seeks to determine a word's embedding based on its surrounding context. The model aims to enhance the cosine similarity between embeddings of words that appear in close proximity, while reducing the similarity among those that are more distant. Below is a straightforward example of how this is implemented:   
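
The original post illustrated this with a diagram; as a stand-in, here is a minimal PyTorch sketch of the idea. The vocabulary size, embedding dimension and word-index pairs are toy values chosen for illustration, not part of the original example.

import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64
embedding = nn.Embedding(vocab_size, embed_dim)  # shared by both branches of the pair

# (word, other word) index pairs; target 1.0 = co-occurring, -1.0 = distant.
pairs = torch.tensor([[12, 47], [12, 301], [88, 90], [88, 733]])
targets = torch.tensor([1.0, -1.0, 1.0, -1.0])

# CosineEmbeddingLoss raises cosine similarity for positive pairs and lowers it
# for negative pairs, mirroring the Skip-Gram-style objective described above.
loss_fn = nn.CosineEmbeddingLoss()
optimizer = torch.optim.Adam(embedding.parameters(), lr=1e-3)

for _ in range(100):
    left = embedding(pairs[:, 0])    # first instance of the shared model
    right = embedding(pairs[:, 1])   # second instance, same weights
    loss = loss_fn(left, right, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()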

In this setup, the model is a black box that generates an embedding for each word, and cosine similarity is computed on a pair of words. Such a network is known as a Siamese network and is the most popular style for training embedding networks. The weights are shared between both instances of the model so that the embeddings do not depend on a word's position in the pair. 

Apart from word2vec, BERT is another neural model that produces word embeddings. It is trained using Masked Language Modelling, where a word is predicted from its context, and Next Sentence Prediction. We have already covered the former; let us dive deeper into the latter. 

Sentence embeddings 

Having explored the generation of word embeddings, we now turn our attention to crafting sentence embeddings. Just as the relationships among words within a sentence informed our approach to word embeddings, a similar principle applies to sentences within a document. The intuition here is that sentences appearing in close proximity are likely to have related meanings, and thus, their embeddings should reflect this similarity. To achieve this, we leverage sentence encoders, which are adept at capturing the nuances of sentence structures within larger texts. Renowned examples of such encoders include SentenceBert and all-mpnet-base-v2, both of which are accessible through the Sentence Transformers library on Hugging Face. 

The main idea is to create a dataset of sentence pairs, some of which co-occur while others don't, and use a Siamese network to encode them. (A <bos>/<eos> token is added alongside the sentence, and the embedding of this token is used as the sentence embedding; because the transformer is bi-directional, all context is shared across all tokens.) 
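
As a quick illustration of how such an encoder is used, here is a short sketch with the Sentence Transformers library and all-mpnet-base-v2; the example sentences are placeholders.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

sentences = [
    "My speaker stopped connecting to Wi-Fi after the update.",
    "The device lost its wireless connection following the firmware upgrade.",
    "What is your refund policy?",
]

embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities: the first two sentences should score noticeably
# higher with each other than either does with the third.
print(util.cos_sim(embeddings, embeddings))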

Generating embeddings for noisy documents  

Building on the above concepts, we can further apply these principles to generate embeddings for entire documents by analyzing groups of sentences or patches.  

For example, one could randomly select sequences of 512 consecutive tokens within a document and classify them as similar or dissimilar, contingent upon a predefined similarity threshold. This piece will delve into the process of creating embeddings specifically for conversations, leveraging the foundational ideas of sentence and document embeddings to capture the dynamic nature of dialogs. 
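
A rough sketch of that patch idea follows, under the assumption that whitespace tokens, a 512-token window and a 0.7 threshold stand in for whatever tokenizer, window size and threshold one would actually use.

import random
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def random_patch(tokens, size=512):
    # Pick a random window of `size` consecutive tokens.
    start = random.randint(0, max(0, len(tokens) - size))
    return " ".join(tokens[start:start + size])

def label_patch_pair(document, threshold=0.7):
    tokens = document.split()  # whitespace split stands in for a real tokenizer
    patch_a, patch_b = random_patch(tokens), random_patch(tokens)
    score = util.cos_sim(
        model.encode(patch_a, convert_to_tensor=True),
        model.encode(patch_b, convert_to_tensor=True),
    ).item()
    return "similar" if score >= threshold else "dissimilar"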

In the current digital landscape, conversations are ubiquitous, particularly with the advancements in Large Language Models. Generating embeddings from these conversations can serve multiple use cases including generating insights, detecting intent and recognizing patterns across various platforms like social media and interactive voice response (IVR) systems.

However, as conversations grow longer and noisier, the quality of embeddings tends to deteriorate significantly. Traditional models, designed to capture all facets of information, from the core dialog to peripheral elements like greetings and salutations, often falter when tasked with longer, more complex exchanges. 

In the following discussion, we will outline methods for developing specialized, high-quality embeddings tailored to specific tasks, thereby enhancing their performance and utility. 

Prerequisites 

Readers are advised to read through the Sentence Transformers loss functions to understand the different kinds of losses. Two interesting losses are: 

  • TripletLoss 
  • BatchAllTripletLoss 

Batch losses tend to work better because there are more instances to optimise at each step (more examples to compare and pull together or push apart), but this seemingly requires knowing which class each dataset sample belongs to. But does it? Below is a brief description of how we can circumvent this limitation. 
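
Before that, here is a brief sketch of how the two losses are typically wired up with the Sentence Transformers training API; the example texts and integer class labels are placeholders.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-mpnet-base-v2")

# TripletLoss consumes explicit (anchor, positive, negative) triplets
# (shown for contrast; not trained below).
triplet_examples = [
    InputExample(texts=["wifi keeps dropping", "router loses connection", "how do I get a refund"]),
]
triplet_loss = losses.TripletLoss(model)

# BatchAllTripletLoss builds every valid triplet inside a batch, but needs a
# class label per example -- the requirement discussed above.
labeled_examples = [
    InputExample(texts=["wifi keeps dropping"], label=0),
    InputExample(texts=["router loses connection"], label=0),
    InputExample(texts=["how do I get a refund"], label=1),
]
batch_loss = losses.BatchAllTripletLoss(model)

loader = DataLoader(labeled_examples, batch_size=3, shuffle=True)
model.fit(train_objectives=[(loader, batch_loss)], epochs=1)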

Modelling conversations 

In order to remove noise from conversations, we used GPT-3.5-Turbo to summarize them, identifying elements such as the topic of discussion, the issues raised, the resolutions offered and so on. 

The summary in this context is designed to capture task-specific details pertinent to the intended application. For instance, when seeking embeddings suitable for an intent detection model, the summary should encapsulate information about the intent itself. To this end, one could employ a general-purpose document embedding model, such as all-mpnet-base-v2, to process these summaries.   
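
For instance, the summary similarity matrix that the training code below loads as cs_scores could be precomputed along these lines; the summaries shown here are placeholders, and the file name simply matches the one used later.

import pickle as pkl
from sentence_transformers import SentenceTransformer, util

summary_model = SentenceTransformer("all-mpnet-base-v2")

# One GPT-3.5-Turbo summary per conversation, in the same order as the training
# conversations, so that row i of the matrix corresponds to conversation i.
summaries = [
    "Customer reports a speaker disconnecting from Wi-Fi; agent suggests a firmware update.",
    "Customer asks about refund eligibility; agent explains the return policy.",
    # ...
]

summary_embeddings = summary_model.encode(summaries, convert_to_tensor=True)
cs_scores = util.cos_sim(summary_embeddings, summary_embeddings).cpu().numpy()

with open("./cs_scores_v2.pkl", "wb") as f:
    pkl.dump(cs_scores, f)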

Thus, the problem we are addressing is as follows (given the conversations and their respective summaries):  

Train the conversation embeddings to match the embeddings of their summaries. 

Trick: The BatchAllTripletLoss training methodology necessitates a class label for each instance; however, in our scenario, such labels are absent, as the class membership of each summary is unknown. A rudimentary approach might involve performing clustering on the summaries and utilizing the resulting cluster labels as a stand-in for class labels. Yet, this method is often suboptimal, as clustering can be an imprecise operation. A more refined strategy would be to dynamically generate labels during runtime for the instances within a specific batch.  

Implementation: We store the similarity matrix of all summaries and, whenever we get a batch, fetch the corresponding slice of that matrix. Based on threshold values, we then decide whether each triplet is valid (the implementation assumes NxNxN possible triplets for a batch size of N; a triplet is valid if the first two samples are similar and the first and third are dissimilar).

import numpy as np
import torch


def get_triplet_mask(labels, cs_scores):
    """Return a 3D mask where mask[a, p, n] is True iff the triplet (a, p, n) is valid.

    A triplet (i, j, k) is valid if:
    - i, j, k are distinct indices within the batch
    - the summaries of i and j are similar (score >= 0.8) and the summaries
      of i and k are dissimilar (score <= 0.5)

    Args:
        labels: torch.LongTensor of shape [batch_size] holding each sample's
            index into the precomputed summary similarity matrix
        cs_scores: NumPy array of pairwise summary similarities for the whole
            dataset (num_samples x num_samples)
    """
    # Check that i, j and k are distinct
    indices_equal = torch.eye(labels.size(0), device=labels.device).bool()
    indices_not_equal = ~indices_equal
    i_not_equal_j = indices_not_equal.unsqueeze(2)
    i_not_equal_k = indices_not_equal.unsqueeze(1)
    j_not_equal_k = indices_not_equal.unsqueeze(0)

    distinct_indices = (i_not_equal_j & i_not_equal_k) & j_not_equal_k

    # Fetch the batch's slice of the full similarity matrix at runtime
    labels_n = labels.cpu()
    score_matrix = torch.tensor(cs_scores[np.ix_(labels_n, labels_n)]).to(labels.device)
    positive_mask = torch.ge(score_matrix, 0.8)   # anchor-positive pairs
    negative_mask = torch.le(score_matrix, 0.5)   # anchor-negative pairs

    i_equal_j = positive_mask.unsqueeze(2)
    i_not_equal_k = negative_mask.unsqueeze(1)

    valid_labels = i_not_equal_k & i_equal_j

    return valid_labels & distinct_indices

The above code shows the major changes that need to be made to the library implementation to obtain a new loss function that computes similarity at runtime. 

import pickle as pkl
from typing import Dict, Iterable

from torch import Tensor, nn
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import BatchHardTripletLossDistanceFunction


class SmartBatchAllLoss(nn.Module):
    def __init__(self, model: SentenceTransformer,
                 distance_metric=BatchHardTripletLossDistanceFunction.eucledian_distance, margin: float = 5):
        super(SmartBatchAllLoss, self).__init__()
        self.sentence_embedder = model
        self.triplet_margin = margin
        self.distance_metric = distance_metric
        # Precomputed summary similarity matrix, indexed by conversation id
        self.cs_scores = pkl.load(open("./cs_scores_v2.pkl", "rb"))

    def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor):
        rep = self.sentence_embedder(sentence_features[0])['sentence_embedding']
        # Mirrors the library's batch_all_triplet_loss, but builds the triplet
        # mask at runtime via get_triplet_mask above
        return self.batch_all_triplet_loss(labels, rep)
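
A hedged sketch of how this loss might be plugged into training: each conversation is labelled with its own index, so the loss can look up the corresponding rows of the precomputed summary similarity matrix. The conversation texts and hyperparameters below are placeholders.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample

conversations = ["<conversation 1 text>", "<conversation 2 text>"]  # placeholders

# label = the conversation's index into cs_scores (same order as the summaries
# used to build ./cs_scores_v2.pkl).
train_examples = [
    InputExample(texts=[conv], label=idx) for idx, conv in enumerate(conversations)
]

model = SentenceTransformer("all-mpnet-base-v2")
train_loader = DataLoader(train_examples, batch_size=32, shuffle=True)

loss = SmartBatchAllLoss(model)
model.fit(train_objectives=[(train_loader, loss)], epochs=2, warmup_steps=100)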

Results 

To test the results, a custom dataset was constructed as pairs of conversations, and each pair was tagged as SIMILAR or DISSIMILAR based on its content (two conversations were marked as similar if they discussed the same product and the same issue). Roughly 50% of the data was chosen for each class.
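
A sketch of this evaluation, assuming the pairs are held as (conversation_a, conversation_b, tag) tuples; everything here other than the thresholding logic is a placeholder.

import numpy as np
from sentence_transformers import SentenceTransformer, util

eval_pairs = [
    ("<conversation a>", "<conversation b>", "SIMILAR"),  # placeholder rows
]

def accuracy_at_thresholds(model, thresholds):
    scores, labels = [], []
    for conv_a, conv_b, tag in eval_pairs:
        emb = model.encode([conv_a, conv_b], convert_to_tensor=True)
        scores.append(util.cos_sim(emb[0], emb[1]).item())
        labels.append(1 if tag == "SIMILAR" else 0)
    scores, labels = np.array(scores), np.array(labels)
    # Accuracy when pairs scoring >= threshold are predicted SIMILAR.
    return {t: float(((scores >= t).astype(int) == labels).mean()) for t in thresholds}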

Below is a plot of accuracy vs. threshold, where the threshold is the similarity value above which a pair of conversations is marked as similar. In the plot:

  • Mpnet_model is the pretrained model available online. 
  • model_hardlabel is the model trained by clustering the summaries of conversations and using those cluster labels in the loss. 
  • model_softlabel is the method we discussed above. 

As can be seen from the graph, the model maintains high accuracy even at thresholds close to the extremes (at exactly 0 everything would be categorised as SIMILAR, giving only 50% accuracy, and the same happens at a threshold of 1). This means the model is very confident in its decisions: it assigns very low scores to DISSIMILAR conversations and high scores to SIMILAR ones. 

Conclusion 

The above experiments show that the best way to extract conversation embeddings is by leveraging their relationship with clean summaries. Some use cases where this approach is highly useful: 

  • Reducing latency: the embeddings serve as a common entry point for tasks like intent detection, sentiment analysis and product detection, and provide a way to extract the most similar conversations in real time during live chat (for example, in agent assist, to find the ideal control flow to follow). 
  • Improving accuracy: the embeddings are highly clean, which improves the accuracy of any solution based on clustering or aggregation of conversations, such as an intent detection pipeline or agent quality issue extraction. 
