Introduction
Embeddings are high-dimensional vectors that represent pieces of information as points in a shared vector space, making it possible to measure how similar those pieces of information are. This principle is foundational to numerous natural language processing (NLP) tasks and models, notably including large language models (LLMs). By leveraging embeddings, these models achieve a high degree of precision in understanding and generating human language. The analysis of embeddings facilitates a range of complex NLP tasks, such as semantic search, sentiment analysis, and language translation, by capturing the subtleties and contexts of words or phrases. This mechanism underscores the sophistication of modern AI systems in mimicking human linguistic capabilities, showcasing the intricate interplay between mathematical representations and language understanding.
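To make that similarity measurement concrete: the most common choice is cosine similarity between two embedding vectors. Below is a minimal sketch in Python, with tiny made-up vectors standing in for real embeddings (which typically have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: close to 1.0 = very similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 4-dimensional "embeddings", invented purely for illustration
king  = np.array([0.9, 0.1, 0.8, 0.3])
queen = np.array([0.8, 0.2, 0.9, 0.4])
apple = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(king, queen))  # high: related concepts point in similar directions
print(cosine_similarity(king, apple))  # lower: unrelated concepts
```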
A Brief History of Embeddings
Conventional approaches to generating embeddings have evolved over time, from simple one-hot encoding to more sophisticated techniques that capture the complexity and nuances of language. Here are a few notable methods, followed by a short code sketch of the first three:
One-Hot Encoding: Represents each word as a sparse vector whose length equals the size of the vocabulary, with a 1 in the position corresponding to that word and 0s everywhere else. This method, however, doesn’t capture semantic relationships between words.
TF-IDF (Term Frequency-Inverse Document Frequency): A numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It enhances the vector representation of text by weighting words according to how often they appear in a document, while down-weighting words that are common across the whole corpus.
Word2Vec: A machine learning technique that learns embeddings from the contexts in which words appear, placing semantically related words close together in a high-dimensional space. It employs a shallow, two-layer neural network that maps raw text into a vector space, assigning each unique word a specific vector. The arrangement of these vectors reflects the words’ meanings and relationships.
GloVe (Global Vectors for Word Representation): An unsupervised learning algorithm for obtaining vector representations for words by aggregating global word-word co-occurrence statistics from a corpus. The resulting embeddings are trained to reflect how likely pairs of words are to co-occur, capturing both their semantic and syntactic similarities.
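As a rough illustration of the first three methods, here is a minimal sketch assuming scikit-learn and gensim are available; the three-sentence corpus is made up, so the Word2Vec neighbours it produces are meaningless and only demonstrate the API:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# --- One-hot encoding: one dimension per vocabulary word, every word orthogonal to every other ---
vocab = sorted({word for doc in corpus for word in doc.split()})
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}
print(one_hot["cat"])                        # sparse 0/1 vector, no notion of similarity

# --- TF-IDF: weights words by in-document frequency, down-weighted if common across the corpus ---
tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(corpus)    # one weighted vector per document
print(doc_vectors.shape)                     # (3 documents, vocabulary size)

# --- Word2Vec: dense vectors learned from each word's surrounding context ---
tokenized = [doc.split() for doc in corpus]
w2v = Word2Vec(sentences=tokenized, vector_size=50, window=2, min_count=1, seed=0)
print(w2v.wv["cat"][:5])                     # first few dimensions of a dense embedding
print(w2v.wv.most_similar("cat", topn=3))    # nearest neighbours in the learned space
```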
Large Language Models (LLMs) Approach
With the advent of LLMs like BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and their successors, the approach to embeddings has shifted significantly. LLMs have the following advantages over classical embedding approaches:
Contextual: Unlike previous methods that offer a single static representation for each word, LLMs provide dynamic embeddings based on the word’s context in a sentence. This means that the same word can have different embeddings depending on its usage, allowing for a much richer representation of language nuances (see the sketch after this list).
Transformer Architectures: At the heart of these models lies the transformer architecture, which uses self-attention mechanisms to weigh the importance of different words in a sentence when generating an embedding for a given word. This allows the model to consider the full context of a word by looking at all the words in a sentence simultaneously, rather than in isolation or in a fixed-size window; a minimal sketch of this mechanism also follows the list.
Pre-training and Fine-tuning: LLMs are typically pre-trained on a vast corpus of text data in an unsupervised manner, learning a general language model. They are then fine-tuned on specific tasks with smaller labeled datasets, allowing the embeddings to be adapted for particular applications while leveraging the rich linguistic knowledge acquired during pre-training.
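To make the contextual point concrete, here is a minimal sketch assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint; the two sentences are made up, and the point is only that the word “bank” receives a different vector in each context:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "She deposited the cheque at the bank.",   # financial sense
    "They had a picnic on the river bank.",    # geographic sense
]

bank_vectors = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bank")                          # position of the word "bank"
    bank_vectors.append(outputs.last_hidden_state[0, idx])

# the two "bank" embeddings differ because their contexts differ
similarity = torch.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0)
print(similarity.item())   # noticeably below 1.0, unlike a static word embedding
```

And to sketch the self-attention mechanism itself, the standard scaled dot-product attention, softmax(QKᵀ / √d_k)·V, can be written out directly with NumPy on made-up inputs:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — each output row is a weighted mix of all rows of V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # how much each token attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the whole sentence
    return weights @ V

# toy "sentence" of 4 tokens with 8-dimensional representations
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# in a real transformer, Q, K and V are separate learned projections of x
out = scaled_dot_product_attention(x, x, x)
print(out.shape)   # (4, 8): every token's new representation sees the full sentence at once
```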
LLM-Embeddings in Action
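Here is a minimal sketch of the idea, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (the choice of library, model, and example sentences is illustrative, not prescribed): we embed a question, an expected answer, and an unrelated sentence, then compare cosine similarities.

```python
from sentence_transformers import SentenceTransformer, util

# model choice is illustrative; any sentence-embedding model works the same way
model = SentenceTransformer("all-MiniLM-L6-v2")

query     = "What is the capital of France?"
expected  = "Paris is the capital and most populous city of France."
unrelated = "The recipe calls for two cups of flour and a pinch of salt."

query_emb, expected_emb, unrelated_emb = model.encode([query, expected, unrelated])

print("query vs expected answer :", util.cos_sim(query_emb, expected_emb).item())
print("query vs unrelated text  :", util.cos_sim(query_emb, unrelated_emb).item())
```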
As we can see, the expected answer has a higher similarity to the query than the unrelated text does. You can imagine that, on a larger scale, embeddings would allow us to retrieve related information from a vector database.