Making Text Embedders Few-Shot Learners

llm
research paper
Author

Santosh Sawant

Published

September 25, 2024

LLM-based embedding models have demonstrated remarkable improvements in in-domain accuracy and generalization, particularly when trained with supervised learning approaches. Despite these advances, however, embedding models still struggle to follow unseen task instructions and to execute complex retrieval tasks.

On the other hand, LLMs with decoder-only architectures demonstrate remarkable in-context learning (ICL) capabilities. This lets them handle both familiar and novel tasks by using examples provided within their input context.

So can we leverage the ICL feature in LLMs to enhance the process of text embedding generation?

To this end, researchers have introduced a novel model, bge-en-icl, which uses few-shot examples to produce high-quality text embeddings. The approach integrates task-related examples directly into the query side, resulting in significant improvements across various tasks.

All of this is done through few-shot contrastive training. Consider a query-passage pair (q_i, p_i) in an embedding task. An example template is first constructed as follows:

⟨Instruct⟩ {task definition} ⟨query⟩ {qi} ⟨response⟩ {pi}

Here, “task definition” is the description of the specific embedding task. This example template is applied to new input queries for each embedding task. For a relevant query-passage pair (q+, p+), the modified query q+_exp is constructed as follows:

{example 1} ... {example n} ⟨Instruct⟩ {task definition} ⟨query⟩ {q+} ⟨response⟩
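The two templates above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `format_example` and `build_icl_query`, the ASCII tag strings (`<Instruct>`, `<query>`, `<response>`), and the newline separator between examples are all assumptions made for the sketch.

```python
def format_example(task_definition: str, query: str, passage: str) -> str:
    """Fill the example template with one query-passage demonstration."""
    # Tag strings are ASCII stand-ins for the template markers (an assumption).
    return f"<Instruct> {task_definition} <query> {query} <response> {passage}"


def build_icl_query(
    task_definition: str,
    examples: list[tuple[str, str]],
    query: str,
) -> str:
    """Build the modified query: n demonstrations followed by the new query.

    The final <response> slot is left empty, since the model embeds the
    query rather than generating an answer.
    """
    parts = [format_example(task_definition, q, p) for q, p in examples]
    parts.append(f"<Instruct> {task_definition} <query> {query} <response>")
    # Joining with a newline is an assumption; the source only shows "...".
    return "\n".join(parts)


if __name__ == "__main__":
    task = "Given a web search query, retrieve relevant passages."
    demos = [("what is a text embedding", "A text embedding is a vector ...")]
    print(build_icl_query(task, demos, "how does contrastive training work"))
```

With an empty `examples` list the function falls back to a plain instruction-plus-query prompt, which makes it easy to compare zero-shot and few-shot inputs with the same code path.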

All modified queries and passages in the corpus are encoded using the same LLM to obtain their embedding representations. Specifically, an [EOS] token is appended to the end of each modified query and passage, which is then fed into the LLM; the embeddings (h_{q+_exp}, h_{p+}) are obtained by extracting the final layer’s [EOS] vector. Lastly, a standard InfoNCE loss function L is applied, using both in-batch negatives and hard negatives for training.
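To make the last two steps concrete, here is a small pure-Python sketch of last-token ([EOS]) pooling and an InfoNCE loss with in-batch and hard negatives. It is an illustration under stated assumptions rather than the paper's implementation: cosine similarity, the temperature value, right-padded inputs, and the choice to use only each query's own hard negatives (not those of other queries) are all assumptions, and the function names are hypothetical.

```python
import math


def last_token_embedding(hidden_states, attention_mask):
    """Pick the final-layer vector at the last non-padded position.

    Assumes right padding, so that position holds the appended [EOS] token.
    """
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]


def cosine(u, v):
    """Cosine similarity between two vectors (similarity choice is an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, positives, hard_negatives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    queries[i] and positives[i] are embedding vectors; hard_negatives[i] is a
    list of vectors mined for query i. For query i the candidates are its own
    positive, the other queries' positives (in-batch negatives), and its own
    hard negatives. The temperature default is illustrative only.
    """
    total = 0.0
    for i, q in enumerate(queries):
        candidates = positives + hard_negatives[i]
        logits = [cosine(q, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the true positive, which sits at index i.
        total += log_z - logits[i]
    return total / len(queries)


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    p = [[1.0, 0.0], [0.0, 1.0]]
    neg = [[[0.0, 1.0]], [[1.0, 0.0]]]
    print(info_nce(q, p, neg))
```

The loss is close to zero when each query is most similar to its own passage and grows as negatives become more similar than the positive, which is the signal the contrastive training relies on.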

This approach necessitates no modifications to the model’s architecture; instead, it involves altering the prompt on the query side to include in-context learning features in the embedding generation task. Despite its simplicity, it proves highly effective on the MTEB and AIR-Bench benchmarks.

Paper: https://arxiv.org/pdf/2409.15700