Recent advances in text embedding models have been instrumental for various downstream tasks, including document retrieval, sentence similarity, classification, and clustering. However, building general-purpose text embedding models remains challenging, as such models require large amounts of training data to comprehensively cover the desired domains and skills. Large language models (LLMs) offer a powerful alternative in such scenarios.
So to what extent can we leverage LLMs directly to improve text embedding models?
Gecko is a versatile text embedding model distilled from large language models. It uses a two-step distillation process that begins with generating diverse, synthetic paired data using an LLM. Next, data quality is further refined by retrieving a set of candidate passages for each query and relabeling the positive and hard negative passages using the same LLM, as sketched below.
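To make the first distillation step concrete, the sketch below shows how a (task, query) pair might be generated from a sampled web passage via few-shot prompting. The `generate` callable stands in for any LLM completion API, and the prompt wording and output parsing are illustrative assumptions, not the paper's exact prompt.

```python
from typing import Callable, Tuple

# Placeholder for an LLM completion call; the actual model, API, and
# few-shot prompt used by Gecko are not reproduced here.
LLMGenerate = Callable[[str], str]

PROMPT_TEMPLATE = """You are given a web passage. First, describe a retrieval
task this passage could be used for, then write a query for that task.

Passage: {passage}

Task:"""

def generate_task_and_query(passage: str, generate: LLMGenerate) -> Tuple[str, str]:
    """Step 1 of the distillation: ask the LLM for a task description and a query."""
    completion = generate(PROMPT_TEMPLATE.format(passage=passage))
    # Assume the LLM writes the task first and the query after a "Query:" marker;
    # a real prompt would constrain the output format more tightly.
    task_text, _, query_text = completion.partition("Query:")
    return task_text.strip(), query_text.strip()
```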
Gecko is built on a 1.2B-parameter pre-trained transformer language model that undergoes two additional training stages: pre-finetuning and fine-tuning. During pre-finetuning, the model is trained on a large self-supervised text corpus, such as question-answer pairs from online forums and QA websites.
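The pre-finetuning objective is not spelled out above; a common choice for this kind of self-supervised pair data is an in-batch contrastive (InfoNCE-style) loss. The PyTorch sketch below is an assumption about that setup, not a reproduction of Gecko's exact recipe.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              passage_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss over (query, passage) pairs.

    query_emb, passage_emb: [batch, dim] embeddings of paired texts,
    e.g. a question and its answer from a QA forum. Every other passage
    in the batch serves as a negative for a given query.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                      # scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # i-th query matches i-th passage
    return F.cross_entropy(logits, labels)
```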
For fine-tuning, Gecko uses a novel dataset, FRet, the Few-shot Prompted Retrieval dataset. Given a passage sampled from the web, FRet first uses an LLM to generate a relevant task and a query for the passage. Each query and task is then fed into a pre-trained embedding model to retrieve nearest-neighbor passages, which are scored by the LLM to mine positive and hard negative passages.
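The mining step can be pictured as follows: retrieve candidates with an existing embedding model, then let the LLM rescore them. In this simplified sketch, `llm_score` is a placeholder for the LLM's relevance judgment, and picking the top candidate as the positive and the lowest-scored retrieved candidates as hard negatives is one plausible reading of the relabeling, not the paper's exact procedure.

```python
from typing import Callable, List, Sequence, Tuple

def mine_positive_and_negatives(
    query: str,
    candidate_passages: Sequence[str],
    llm_score: Callable[[str, str], float],
    num_negatives: int = 1,
) -> Tuple[str, List[str]]:
    """Step 2 of the distillation: relabel retrieved candidates with the LLM.

    The highest-scoring candidate becomes the positive (it may differ from the
    passage the query was originally generated from), and low-scoring retrieved
    candidates serve as hard negatives.
    """
    scored = sorted(candidate_passages,
                    key=lambda passage: llm_score(query, passage),
                    reverse=True)
    positive = scored[0]
    hard_negatives = scored[-num_negatives:]   # lowest-scored of the retrieved set
    return positive, hard_negatives
```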
By combining this LLM-generated and LLM-ranked data with human-annotated data, Gecko-1B with 768-dimensional embeddings achieves the best performance on the popular MTEB benchmark. Gecko with 256 embedding dimensions outperforms all existing entries with 768-dimensional embeddings, and Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with models 7x larger and embeddings 5x higher-dimensional. Moreover, Gecko often outperforms systems that use either larger base models (7B) or higher-dimensional embeddings (1k to 4k).