April 9, 2024 | 4 min read
LLM2Vec: Large language models are secretly powerful text encoders
Parishad BehnamGhader, Research Scientist, ServiceNow
Vaibhav Adlakha, Visiting Researcher, ServiceNow
Marius Mosbach, Research Scientist, ServiceNow
Dzmitry Bahdanau, Research Lead, ServiceNow

Nicolas Chapados and Siva Reddy also contributed to this content.

Text-embedding models convert a piece of text, such as a search query, document, or piece of code, into a sequence of real-valued numbers. Given such embeddings, we can measure the similarity, or relatedness, of pieces of text. This facilitates various important applications, such as search, clustering, retrieval, and classification.
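For instance, two embeddings are commonly compared with cosine similarity. Here is a minimal sketch; the four-dimensional vectors are made up for illustration (real embedding models produce hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings": a query and two candidate documents.
query    = [0.9, 0.1, 0.0, 0.2]
doc_near = [0.8, 0.2, 0.1, 0.1]  # semantically close to the query
doc_far  = [0.0, 0.1, 0.9, 0.0]  # unrelated

assert cosine_similarity(query, doc_near) > cosine_similarity(query, doc_far)
```

Ranking documents by this score against a query embedding is the basic building block behind the search and retrieval applications mentioned above.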

With the widespread availability of decoder-only large language models (LLMs), such as GPT-4, LLaMA2, Mistral-7B, and StarCoder2, a pressing question in the natural language processing (NLP) research community is how best to use these models to construct powerful text embeddings.

We’re excited to present LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders, a simple and efficient method that transforms any decoder-only LLM into a powerful text encoder in an unsupervised fashion. It relies only on lightweight adapters (LoRA), so the base model's weights never need to be modified.

Below we give an overview of the key components of LLM2Vec and present the exciting results we got when benchmarking LLM2Vec models on the challenging Massive Text Embeddings Benchmark (MTEB).

Our LLM2Vec-Mistral model ranks first on the MTEB leaderboard in the unsupervised category, first in the supervised category among models trained only on publicly available embedding data (E5), and seventh overall (the six higher-ranked models are trained on synthetic data generated by GPT-4 or similar-scale models).

[Figure: The three steps of LLM2Vec: enabling bidirectional attention, masked next-token prediction, and unsupervised contrastive learning (Mila, McGill, ServiceNow)]

A simple and efficient recipe

At its core, LLM2Vec consists of three simple steps:

  1. Enabling bidirectional attention
  2. Adaptation via masked next-token prediction (MNTP)
  3. Adaptation via unsupervised contrastive learning
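Step 2 is a twist on standard masked language modeling: a fraction of input tokens is masked, but the masked token at position i is predicted from the model's output at position i - 1, matching the next-token setup the decoder was pretrained with. A toy sketch of the label construction, with made-up token ids and a hypothetical mask id:

```python
MASK_ID = 0  # hypothetical id of the mask token

def mntp_inputs_and_targets(token_ids, masked_positions):
    """Build masked inputs and (prediction_position, target_id) pairs for MNTP.

    The token at position i is replaced by the mask, but its identity is
    predicted from the model's output at position i - 1, unlike BERT-style
    masked language modeling, which predicts from position i itself.
    """
    inputs = list(token_ids)
    targets = []
    for i in masked_positions:
        inputs[i] = MASK_ID
        targets.append((i - 1, token_ids[i]))  # predict masked token from previous position
    return inputs, targets

tokens = [101, 7, 42, 13, 99]
inputs, targets = mntp_inputs_and_targets(tokens, masked_positions=[2, 4])
# inputs  -> [101, 7, 0, 13, 0]
# targets -> [(1, 42), (3, 99)]
```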

Adapting a model with the LLM2Vec approach is highly efficient and works with parameter-efficient fine-tuning methods such as LoRA. Additionally, the adaptation can be performed using a general domain corpus such as Wikipedia, requires only a few hundred training steps, and can be run on a single GPU.
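Step 3, unsupervised SimCSE-style contrastive learning, passes each sentence through the model twice with different dropout masks to obtain a positive pair, treating other sentences in the batch as negatives. A minimal numpy sketch of the resulting InfoNCE-style loss; the embeddings below are random stand-ins for two dropout views of a batch:

```python
import numpy as np

def info_nce_loss(view_a, view_b, temperature=0.05):
    """InfoNCE loss over a batch where view_a[i] and view_b[i] are a positive pair."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature             # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # cross-entropy on the matched pairs

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 32))
noisy = batch + 0.01 * rng.normal(size=(4, 32))  # stand-in for a second dropout view

loss_aligned = info_nce_loss(batch, noisy)
loss_random = info_nce_loss(batch, rng.normal(size=(4, 32)))
# Matched views give a much lower loss than randomly paired embeddings.
```

Minimizing this loss pulls the two views of the same sentence together while pushing apart embeddings of different sentences in the batch.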

State-of-the-art performance

LLM2Vec is not only simple and efficient, but it also leads to state-of-the-art performance on the challenging MTEB, both in the unsupervised and supervised setting (among models trained only on publicly available data).

Unsupervised results

We applied LLM2Vec to some of the best-performing LLMs available and evaluated the resulting text-embedding models on MTEB. In the unsupervised setting (i.e., without using any labeled training data for contrastive learning), our LLM2Vec-transformed models achieved a new state-of-the-art score of 56.80, outperforming the previous best unsupervised approach by a large margin.

[Table: Unsupervised MTEB results applying LLM2Vec to S-LLaMA-1.3B, LLaMA-2-7B, and Mistral-7B, compared to encoder-only models]
[Table: Supervised MTEB results for the same models, compared to previous work trained on public data only]

Supervised results

LLM2Vec can also be easily combined with supervised contrastive learning. As our results show, applying LLM2Vec before supervised contrastive learning leads to a substantial improvement.

Moreover, LLM2Vec in combination with Mistral-7B, currently the best-performing 7 billion-parameter LLM, leads to a new state-of-the-art performance of 64.80 on MTEB among models trained only with publicly available data.

Highly sample-efficient

LLM2Vec-transformed models require considerably less training data to perform well compared to models trained without the LLM2Vec transformation.

These results make us particularly excited about challenging real-world scenarios where large amounts of labeled data might be costly to acquire.

Use it on your own data

We’ve made it easy for you to use our LLM2Vec-transformed models. The LLM2Vec class is a wrapper on top of Hugging Face models that supports sequence encoding and pooling operations. The steps below show an example of how to use the library.

[Figure: Diagrams showing the amount of training data needed for Sheared-LLaMA-1.3B, Llama-2-7b-chat-hf, and Mistral-7B-Instruct-v0.2]

Preparing the model

Here, we first initialize the model and apply the MNTP-trained LoRA weights on top. After merging the MNTP weights into the model, we can either:

  • Load the unsupervised-trained LoRA weights (trained with the SimCSE objective on a Wikipedia corpus)
  • Load the supervised-trained LoRA weights (trained with contrastive learning on public E5 data)
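As a sketch, this preparation follows the checkpoints released on the Hugging Face Hub; the repository names below are the McGill-NLP releases and may change, and the weights are several gigabytes to download:

```python
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
from peft import PeftModel

base = "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp"

# Load the MNTP-adapted base model (trust_remote_code enables the
# custom bidirectional-attention implementation).
tokenizer = AutoTokenizer.from_pretrained(base)
config = AutoConfig.from_pretrained(base, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base, trust_remote_code=True, config=config, torch_dtype=torch.bfloat16
)

# Apply and merge the MNTP-trained LoRA weights.
model = PeftModel.from_pretrained(model, base)
model = model.merge_and_unload()

# Then load either the unsupervised (SimCSE) or the supervised LoRA weights on top.
model = PeftModel.from_pretrained(model, base + "-unsup-simcse")
```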

Applying LLM2Vec wrapper

Then, we define our LLM2Vec encoder model as follows:

from llm2vec import LLM2Vec 
 
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
 

Inference

This model returns a text embedding for any input given in the form [[instruction1, text1], [instruction2, text2]] or [text1, text2]. During training, we provide instructions for both sentences in symmetric tasks and only for the queries in asymmetric tasks.
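Putting it together, encoding and scoring look roughly like this. This is a hedged sketch: the convenience constructor and checkpoint names follow the library's released examples, the instruction string and sample texts are illustrative, and running it requires downloading the full model:

```python
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
    pooling_mode="mean",
    max_length=512,
)

# Asymmetric task: instructions are attached to the queries only.
queries = [
    ["Retrieve relevant passages for the question:", "What is masked next-token prediction?"],
]
documents = ["MNTP adapts a decoder-only LLM by predicting masked tokens from the previous position."]

q_reps = l2v.encode(queries)    # shape: (num_queries, hidden_dim)
d_reps = l2v.encode(documents)  # shape: (num_documents, hidden_dim)

# Cosine similarity between each query and each document embedding.
q_reps = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps = torch.nn.functional.normalize(d_reps, p=2, dim=1)
scores = q_reps @ d_reps.T
```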


Summary

As demonstrated above, LLM2Vec is a simple unsupervised approach that can transform any pretrained decoder-only LLM into a strong text encoder.

If you’re as excited about LLM2Vec as we are, check out our hands-on tutorial, which walks you through the different steps of our method. We also welcome contributions on GitHub and invite the community to share their LLM2Vec-transformed models.

Research: Project page

Code: LLM2Vec on GitHub

Tutorial: Learn how to apply LLM2Vec to LLaMA-2

Find out more about ServiceNow AI Research.
