In 2018, researchers at Google AI released the BERT model, a landmark work that transformed the NLP domain. However, BERT had a notable drawback: it was large and therefore relatively slow. To address this, researchers at Hugging Face proposed DistilBERT, which uses knowledge distillation to compress the model.
What is the DistilBERT Model?
The DistilBERT model was proposed in the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, and in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than google-bert/bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE (General Language Understanding Evaluation) benchmark.
Why the DistilBERT Model?
As transfer learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, the team at Hugging Face proposed a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, the Hugging Face team leveraged knowledge distillation during the pre-training phase and showed that it is possible to reduce the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster.
How Does the DistilBERT Model Work?
DistilBERT is based on knowledge distillation, a model compression technique that transfers knowledge from a large, cumbersome model (the teacher) to a smaller model (the student) while retaining most of its performance.
1. Transformer Architecture:
DistilBERT, like its bigger brother BERT, relies on the Transformer architecture. This powerful approach excels at various NLP tasks.
In essence, a transformer encodes a sequence of text (words or sub-words) using an encoder-decoder structure.
The encoder captures relationships between words, and the decoder (not used in DistilBERT) can then be used for tasks like machine translation.
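As a quick illustration of the encoder in action, here is a small sketch using the Hugging Face transformers library and the distilbert-base-uncased checkpoint: the model turns each token of a sentence into a contextual vector.

```python
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("DistilBERT encodes every token in context.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: (batch_size, sequence_length, hidden_size=768)
print(outputs.last_hidden_state.shape)
```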
2. Knowledge Distillation:
DistilBERT leverages a technique called knowledge distillation to achieve its efficiency.
Imagine a complex teacher model (like BERT) that has been trained on a massive amount of data. This teacher possesses a wealth of knowledge about language.
DistilBERT acts as a student model, aiming to learn from the teacher.
Knowledge distillation doesn't simply copy the teacher's predictions. Instead, it uses a combined training strategy (sketched in code after this list):
The student model is trained on the original labeled data (same as the teacher).
Additionally, the student model is trained to mimic the outputs (predictions) of the teacher model on the same data. This injects the teacher's knowledge into the student.
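As a rough illustration of this dual objective, here is a minimal PyTorch sketch of a generic distillation loss that mixes the hard-label loss with a softened teacher-matching term. The temperature T and weight alpha are illustrative hyperparameters, not the exact values or loss formulation used to train DistilBERT.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard-label loss: the student learns from the original labels
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft-target loss: the student mimics the teacher's output distribution,
    # softened by temperature T to expose more of the teacher's knowledge
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Weighted combination of the two objectives
    return alpha * hard_loss + (1 - alpha) * soft_loss
```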
DistilBERT's Specific Techniques:
Reduced Model Size: DistilBERT shrinks the Transformer architecture by using roughly half as many layers as BERT base while keeping the same hidden size.
Intermediate Supervision: During training, DistilBERT not only predicts the final output but also learns from intermediate activations (outputs) of the teacher model. This provides richer information for the student to learn from.
Cosine Loss: In addition to the usual classification loss, a cosine similarity loss is used between the hidden representations of the student and teacher models. This loss encourages the student to align its internal representations with the teacher, promoting knowledge transfer.
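A minimal sketch of such a cosine alignment term, assuming student and teacher hidden states of the same dimensionality (as is the case for DistilBERT and BERT base, both 768-dimensional):

```python
import torch
import torch.nn.functional as F

def cosine_alignment_loss(student_hidden, teacher_hidden):
    # student_hidden, teacher_hidden: (batch, seq_len, dim) hidden states
    # A target of 1 asks each pair of vectors to point in the same direction
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1),
        device=student_hidden.device,
    )
    return F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )
```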
Where to Use the DistilBERT Model?
DistilBERT's strength lies in its ability to offer good accuracy while consuming fewer resources compared to larger models like BERT. This makes it suitable for a variety of tasks where these factors are important. Here are some common use cases for DistilBERT:
Text Classification: DistilBERT excels at classifying text into predefined categories. This is useful for tasks like sentiment analysis (positive, negative, or neutral reviews), spam detection, or topic labeling for news articles (see the short sketch after this list).
Question Answering: Extractive question answering, where the answer is a snippet within a given passage, can be tackled effectively with DistilBERT.
Text Summarization: As an encoder-only model, DistilBERT can support extractive summarization, for example by scoring and selecting the most representative sentences of a document or article.
Low-power devices: Due to its smaller size, DistilBERT can be deployed on devices with limited computational resources, such as smartphones or embedded systems. This opens doors for real-time NLP applications on these devices.
Faster inference: Since DistilBERT is faster than BERT, it can be used in scenarios where quicker response times are crucial, such as chatbots or virtual assistants.
Pre-training for smaller datasets: DistilBERT itself can be further fine-tuned on smaller datasets specific to a particular task, making it useful even when large amounts of labeled data aren't available.
Overall, DistilBERT is a versatile tool for various NLP tasks when efficiency and good accuracy are both desired.
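For the text classification use case above, here is a minimal sketch using the Hugging Face transformers pipeline and a publicly available DistilBERT checkpoint fine-tuned on SST-2 (the checkpoint choice is an assumption; any DistilBERT classification model would work).

```python
from transformers import pipeline

# DistilBERT fine-tuned for binary sentiment classification (SST-2)
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

result = classifier("The onboarding flow was smooth and the support team was helpful.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```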
Argyle Enigma Tech Labs Use Case: Semantic Search on Community Posts Using DistilBERT.
Problem Statement: The use case aims to find posts in a dataset based on their semantic similarity to a given query. The specific scenario involves identifying relevant investment strategies related to financial literacy from a collection of community posts.
The use case leverages DistilBERT, a smaller and faster version of the BERT (Bidirectional Encoder Representations from Transformers) model, which is pre-trained to understand the context of words in a sentence. The steps involved in the process are listed below, followed by a minimal code sketch:
Data Preparation: Load community posts from a file and convert them into a list of texts.
Model and Tokenizer Loading: Load the pre-trained DistilBERT model and its corresponding tokenizer using the transformers library.
Sentence Embedding Extraction: Define a function ‘get_embedding’ to convert input text into embeddings using DistilBERT. The embeddings are derived by tokenizing the input text, passing it through the model, and averaging the last hidden states.
Embedding Calculation for Text Data: Compute embeddings for all posts in the dataset using the ‘get_embedding’ function.
Query Embedding and Similarity Calculation: Define a query related to investment strategies in financial literacy; Calculate the embedding for this query; Compute the cosine similarity between the query embedding and each post embedding.
Retrieving Results: Rank the posts by cosine similarity and return the most relevant ones for the input query.
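The post does not include the full script; the following is a minimal sketch of these steps using the Hugging Face transformers library and scikit-learn, with a small placeholder list standing in for the real community posts.

```python
import torch
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load the pre-trained DistilBERT model and its tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
model.eval()

def get_embedding(text):
    # Tokenize, run through DistilBERT, and mean-pool the last hidden states
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Placeholder posts; in practice these are loaded from the community dataset
posts = [
    "Start with a budget and build an emergency fund before investing.",
    "Index funds are a low-cost way to diversify your investments.",
    "Tips for improving your credit score quickly.",
]
post_embeddings = [get_embedding(p) for p in posts]

# Embed the query and rank posts by cosine similarity
query = "investment strategies for financial literacy"
query_embedding = get_embedding(query)
scores = cosine_similarity([query_embedding], post_embeddings)[0]

for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {posts[idx]}")
```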
Conclusion:
DistilBERT represents a significant advancement in natural language processing, offering remarkable efficiency and speed while retaining most of BERT’s performance through knowledge distillation. At Argyle Enigma Tech Labs, we have successfully applied DistilBERT for semantic search, demonstrating its capability to deliver fast and accurate results. This innovation paves the way for more accessible and practical NLP applications, enabling sophisticated language tasks across a wider range of devices and scenarios. The future of NLP is promising, with DistilBERT and similar models driving further advancements and innovations.