Using Llamafiles for Embeddings in Local RAG Applications

A good text embedding model is the linchpin of retrieval-augmented generation (RAG). Because indexing large datasets in a vector store is computationally expensive, we think llamafile is a great option for scalable RAG on local hardware, especially given llamafile's ongoing performance optimizations.

To make local RAG easier, we found some of the best embedding models with respect to performance on RAG-relevant tasks and released them as llamafiles. In this post, we'll talk about these models and why we chose them. We'll also show how to use one of these llamafiles to build a local RAG app.

Note: This post only covers English-language models. Some of them might be multilingual, but we did not take performance on non-English tasks into account when choosing models to recommend.

| | Model | License | Memory Usage (GB, fp32) | Embedding Size |
|---|---|---|---|---|
| Best overall | Salesforce/SFR-Embedding-Mistral | CC-BY-NC | 26.49 | 4,096 |
| Best overall, commercial-friendly license | intfloat/e5-mistral-7b-instruct | MIT | 26.49 | 4,096 |
| Best small | mixedbread-ai/mxbai-embed-large-v1 | Apache 2.0 | 1.25 | 1,024 |

Best overall: Salesforce/SFR-Embedding-Mistral (llamafile link). Why does it work so well? The Salesforce team carried out additional multi-task finetuning on top of intfloat/e5-mistral-7b-instruct using the training datasets of several tasks from the Massive Text Embedding Benchmark (MTEB). For more details, see their blog post.

Best overall, commercial-friendly license: intfloat/e5-mistral-7b-instruct (llamafile link). Why does it work so well? Synthetic data generation. The authors finetuned mistral-7b-instruct on various synthetic text embedding datasets generated by another LLM. For more information, see their paper.

Best small: mixedbread-ai/mxbai-embed-large-v1 (llamafile link). Why does it work so well? Data building and curation. The authors scraped and curated 700 million text pairs and trained a BERT model using contrastive training. Then, they finetuned the model on an additional 30 million text triplets using AnglE loss. For more information, see their blog post.

Which embedding model should I use?

The embedding model you should pick for your app depends on a variety of factors:

Model size and memory constraints: The larger models require more memory and are slower, so if you have lots of memory available, go with SFR-Embedding-Mistral or e5-mistral-7b-instruct. Otherwise, go with the smaller mxbai-embed-large-v1. In addition, if you have limited indexing time or you have a very large collection of documents to index, a smaller model will get the job done much faster.

Document length: You should also consider the type of data you'd like to "store" in your embeddings. The larger models have a longer maximum sequence length, so they can pack a longer document into a single embedding. Smaller models tend to be better suited to short texts like a sentence or a paragraph. (However, if you're memory-bound and want to use a smaller model to index longer documents, you can simply snippetize the docs into smaller pieces; see the sketch below.) In this post, we did not specifically look at model performance by text length, so if you're looking for a model for long documents, you may have to run your own evaluation.

Generation model size: Also note that you may have to use a separate model or llamafile for the 'generation' part of RAG. mxbai-embed-large-v1 does not generate text at all (in fact, you'll get an error if you attempt this) and, since the larger models were fine-tuned specifically for embedding tasks, their generation capabilities might be somewhat worse than a model tuned for text generation. You may need to account for this when choosing a model. You'll need enough memory available for both the embedding model and the text generation model, assuming you'll be running them on the same machine.
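On the document-length point above, here's a minimal sketch of what snippetizing might look like. The word-based splitting, chunk size, and overlap are placeholder choices for illustration, not recommendations; real apps often split on sentences or tokens instead.

```python
# Minimal sketch: split ("snippetize") a long document into smaller, overlapping
# chunks so a short-context embedding model can index it.
# The chunk size and overlap below are illustrative values, not recommendations.
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split `text` into overlapping chunks of roughly `chunk_size` words."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), step)
    ]
```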

We'll also mention that the authors of the MTEB benchmark--a widely-used embeddings benchmark suite described in more detail below--wrote a great companion guide to using their leaderboard for model selection. Their post provides a lot more detail about the model selection process than we do here and is an excellent guide to choosing an embedding model for your specific task/data. We highly recommend reading that so you can decide whether the models we recommend are actually right for your use case.

How do I use these llamafiles in my RAG app?

Llamafile is now integrated with two popular RAG app development frameworks:

- LangChain
- LlamaIndex

We recommend getting familiar with one of these two frameworks via their respective quickstart guides, then following the llamafile-specific documentation for the framework you choose.

In addition to the higher-level libraries above, we've provided a very minimal example of a local RAG app using llamafile here.
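If you prefer to skip the frameworks entirely, the core of such an app is just HTTP calls to a llamafile running in server mode. Below is a rough sketch; the filename, flags, port, endpoint path, and response shape are assumptions that may differ across llamafile versions, so treat it as a starting point rather than a reference.

```python
# Rough sketch: request an embedding from a llamafile running in server mode.
# Assumes an embedding llamafile was started beforehand with something like
# (placeholder filename and flags; check your llamafile's --help):
#   ./mxbai-embed-large-v1.llamafile --server --embedding --nobrowser --port 8080
import requests

EMBEDDING_URL = "http://localhost:8080/embedding"  # llama.cpp-style endpoint (assumed)

def embed(text: str) -> list[float]:
    """Return an embedding vector for `text` from the local llamafile server."""
    resp = requests.post(EMBEDDING_URL, json={"content": text})
    resp.raise_for_status()
    # Response shape assumed to be {"embedding": [...]}; adjust if your
    # server version returns a different structure.
    return resp.json()["embedding"]

if __name__ == "__main__":
    vec = embed("Llamafiles make local RAG apps easy to run.")
    print(f"Got an embedding with {len(vec)} dimensions")
```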

How did we choose these models? The short version...

As our starting point, we looked at the Massive Text Embedding Benchmark (MTEB) leaderboard, which evaluates models across a diverse battery of tasks that use text embeddings in various ways. The benchmark includes 7 task categories: classification, pair classification, clustering, reranking, retrieval, semantic textual similarity (STS), and summarization. Each task category is associated with a collection of datasets, e.g. the retrieval task includes datasets like MS MARCO and QuoraRetrieval. Models are evaluated on each dataset, then model scores are aggregated by task category. The final leaderboard ranking is determined by the average model score across all task categories.

However, you might notice that the models we recommend above are different from the top-ranked models according to this leaderboard. This is because we made two modifications during our selection process.

First, we filtered the results to include only "RAG-relevant" MTEB task categories that test embeddings in a scenario similar to how they might be used in a RAG application, where you need to retrieve text snippets related to a query from a vector store. We kept results related to clustering, reranking, retrieval, and semantic textual similarity (STS) and ignored results related to classification, pair classification, and summarization.

Second, we used a different metric to determine the overall ranking of models in the leaderboard. While the MTEB leaderboard (and many others) ranks models by their average score across tasks, we instead determined the final ranking using mean task rank, or MTR. Essentially, instead of averaging raw model scores, we rank each model on each MTEB dataset and then average those ranks to get the model's MTR.

Why? In brief: Each task category uses its own metric, so model scores on different datasets don't necessarily have the same "units". STS scores tend to fall around 80 whereas retrieval task scores are much lower, in the mid-50s. Since scores on different task types/datasets are not necessarily commensurate, it doesn't make a lot of sense to average across them. So, instead of averaging across raw model scores, we average across model rank. This essentially maps all the scores into the same shared "unit". We didn't come up with this ourselves: This follows the advice for multiple comparisons across multiple datasets detailed in Statistical Comparisons of Classifiers over Multiple Data Sets (Demšar, 2006). For a longer explanation of this process, see the Appendix.
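To make the procedure concrete, here's a small sketch of the rank-then-average computation with pandas. The scores below are made-up placeholders that only show the shape of the calculation, not real leaderboard numbers.

```python
# Sketch of computing mean task rank (MTR): rank models within each dataset,
# then average each model's ranks across datasets.
import pandas as pd

# Hypothetical scores: one row per model, one column per dataset.
scores = pd.DataFrame(
    {
        "QuoraRetrieval": [88.1, 87.9, 89.0],
        "STSBenchmark":   [86.0, 84.5, 83.2],
    },
    index=["model_x", "model_y", "model_z"],
)

per_dataset_ranks = scores.rank(axis=0, ascending=False)  # rank 1 = best score
mtr = per_dataset_ranks.mean(axis=1)                      # average rank per model
print(mtr.sort_values())
```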

After these two changes, our top-10 leaderboard looks like:

| Rank | Model | RAG MTR% | RAG Average (40 datasets) | Clustering Average (11 datasets) | Reranking Average (4 datasets) | Retrieval Average (15 datasets) | STS Average (10 datasets) |
|---|---|---|---|---|---|---|---|
| 1 | SFR-Embedding-Mistral | 5.72 | 63.66 | 51.67 | 60.64 | 59.00 | 85.05 |
| 2 | e5-mistral-7b-instruct | 9.92 | 62.33 | 50.26 | 60.20 | 56.89 | 84.63 |
| 3 | voyage-large-2-instruct | 10.23 | 63.68 | 53.35 | 60.09 | 58.28 | 84.58 |
| 4 | GritLM-7B | 11.91 | 62.33 | 50.61 | 60.49 | 57.41 | 83.35 |
| 5 | mxbai-embed-large-v1 | 14.08 | 60.51 | 46.71 | 60.11 | 54.39 | 85.00 |
| 6 | GritLM-8x7B | 14.49 | 61.24 | 50.14 | 59.80 | 55.09 | 83.26 |
| 7 | UAE-Large-V1 | 14.74 | 60.47 | 46.73 | 59.88 | 54.66 | 84.54 |
| 8 | voyage-lite-02-instruct | 14.87 | 62.91 | 52.42 | 58.24 | 56.60 | 85.79 |
| 9 | google-gecko.text-embedding-preview-0409 | 15.12 | 61.10 | 47.48 | 58.90 | 55.70 | 85.07 |
| 10 | GIST-large-Embedding-v0 | 15.71 | 59.99 | 46.55 | 60.05 | 53.44 | 84.59 |

You can see our full, revised leaderboard here.

To make our final recommendations, we 1) eliminated the closed-source models (the voyage and google-gecko models were the only closed-source entries in the top 10) and 2) restricted our list to models whose architecture is compatible with the GGUF file format (also see this blog post), which llamafile requires.

Conclusion

We hope this post was helpful for getting started with llamafiles and embeddings.

For reference, here are some of the links referenced in this post:

If you have questions or feedback, reach out to us on Discord.

References

@article{demvsar2006statistical,
  title={Statistical comparisons of classifiers over multiple data sets},
  author={Dem{\v{s}}ar, Janez},
  journal={Journal of Machine Learning Research},
  volume={7},
  pages={1--30},
  year={2006},
  publisher={JMLR.org}
}

@article{muennighoff2022mteb,
  title={MTEB: Massive Text Embedding Benchmark},
  author={Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
  journal={arXiv preprint arXiv:2210.07316},
  year={2022},
  doi={10.48550/ARXIV.2210.07316},
  url={https://arxiv.org/abs/2210.07316}
}

Appendix

How did we choose these models? The long version...

Now, you might say: "Why did you write such a long post about this? What's the problem? Just use SFR-Embedding-Mistral, it's right there at the top!" In the end, you'd be right, but for the wrong reasons.

As the MTEB authors note, "Does [the leaderboard] make it easy to choose the right model for your application? You wish!" What they meant was, while this leaderboard is a great way to see embedding quality in general, it doesn't necessarily make it obvious which model is best for your specific application.

One problem is simply the complexity of the leaderboard, which is actually a good problem! The MTEB benchmark includes hundreds of datasets grouped into 7 broad task categories: classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. Some of these task categories--e.g. retrieval, STS--test embeddings in a scenario similar to how they might be used in a RAG application, where you need to retrieve text snippets related to a query from a vector store. Others, like classification, do not. So, looking at overall performance on MTEB tells us a lot about embedding quality in general but the best overall model is not necessarily the best model for RAG applications.

Here's the top-10 leaderboard showing rank by mean performance across tasks side-by-side with rank by mean performance across RAG-related tasks only:

| Model | Rank by avg | Rank by RAG avg | Avg | RAG avg |
|---|---|---|---|---|
| voyage-large-2-instruct | 1 | 1 | 68.28 | 63.68 |
| SFR-Embedding-Mistral | 2 | 2 | 67.56 | 63.66 |
| gte-Qwen1.5-7B-instruct | 3 | 3 | 67.34 | 63.06 |
| voyage-lite-02-instruct | 4 | 4 | 67.13 | 62.91 |
| GritLM-7B | 5 | 5 | 66.76 | 62.33 |
| e5-mistral-7b-instruct | 6 | 6 | 66.63 | 62.33 |
| GritLM-8x7B | 8 | 7 | 65.66 | 61.24 |
| gte-large-en-v1.5 | 9 | 8 | 65.39 | 61.11 |
| google-gecko.text-embedding-preview-0409 | 7 | 9 | 66.31 | 61.10 |
| LLM2Vec-Meta-Llama-3-supervised | 10 | 10 | 65.01 | 60.87 |

There isn't much difference until we get to ranks 7, 8, and 9.

However, there is a second problem with the MTEB leaderboard: regardless of which task subset you use, models are ranked according to their average performance across those tasks. The model with the highest average score across tasks wins. This is a very common way of ranking multiple models across multiple datasets, but just because it's a common method doesn't mean it's a good method.

Here is average model performance for each task type in the RAG subset:

| Model | RAG avg | Clustering avg | Reranking avg | Retrieval avg | STS avg |
|---|---|---|---|---|---|
| voyage-large-2-instruct | 63.68 | 53.35 | 60.09 | 58.28 | 84.58 |
| SFR-Embedding-Mistral | 63.66 | 51.67 | 60.64 | 59.00 | 85.05 |
| gte-Qwen1.5-7B-instruct | 63.06 | 55.83 | 60.13 | 56.24 | 82.42 |
| voyage-lite-02-instruct | 62.91 | 52.42 | 58.24 | 56.60 | 85.79 |
| GritLM-7B | 62.33 | 50.61 | 60.49 | 57.41 | 83.35 |
| e5-mistral-7b-instruct | 62.33 | 50.26 | 60.21 | 56.89 | 84.63 |
| GritLM-8x7B | 61.24 | 50.14 | 59.80 | 55.09 | 83.26 |
| gte-large-en-v1.5 | 61.11 | 47.96 | 58.50 | 57.91 | 81.43 |
| google-gecko.text-embedding-preview-0409 | 61.10 | 47.48 | 58.90 | 55.70 | 85.07 |
| LLM2Vec-Meta-Llama-3-supervised | 60.87 | 46.45 | 59.68 | 56.63 | 83.58 |

As you can see, there is a lot of variation in the magnitude of scores across the different tasks. STS scores tend to be in the low 80s whereas clustering scores tend to be in the high 40s/low 50s. Does it really make sense to average across these numbers?

Note that each task has its own metric. STS uses the Spearman correlation coefficient whereas retrieval uses NDCG@10 (the MTEB paper lists them all). While all of these task metrics are technically bounded by the range [0, 1], the actual scores models get for these metrics aren't going to be spread across that interval the same way.

Here are the descriptive statistics for the distribution of model scores on the STS and retrieval tasks:

| | STS | Retrieval |
|---|---|---|
| count | 155 | 133 |
| mean | 79.26 | 44.93 |
| std | 5.86 | 10.97 |
| min | 39.10 | 7.94 |
| 25% | 78.08 | 41.17 |
| 50% | 80.84 | 48.48 |
| 75% | 82.58 | 51.99 |
| max | 85.79 | 59.00 |

In other words, a good average score on the STS task is ~83 and a good average score on the retrieval task is ~52.

Now let's say we have two models, Model A and Model B, that get the following scores on these two tasks:

| | STS Score | Retrieval Score | Average | Rank |
|---|---|---|---|---|
| Model A | 82.58 (75%) | 41.17 (25%) | 61.88 | 2 |
| Model B | 78.08 (25%) | 51.99 (75%) | 65.04 | 1 |

Model A is good at STS but bad at retrieval; Model B is bad at STS but good at retrieval. Importantly, they are each equally good at one task and equally bad at the other. However, when we rank according to average score, Model B wins. In effect, this ranking method prioritizes the retrieval task over the STS task. This seems pretty arbitrary, since presumably we care about performance on all tasks equally.

So, what's the alternative? In Statistical Comparisons of Classifiers over Multiple Data Sets, Janez Demšar recommends using the non-parametric Friedman test to ascertain whether there is a statistically significant difference among a set of classifier scores on a set of datasets. The first step of this test is ranking each model on each dataset, then finding the average rank of each model across the tasks. We refer to this metric as mean task rank, or MTR.

To illustrate how to compute MTR, we'll go back to our toy example above. Model A is ranked 1st on STS and 2nd on retrieval. Model B is ranked 2nd on STS and 1st on retrieval. To compute MTR for Model A, we take the average (1+2)/2 = 1.5, and likewise for Model B. Then, we rank the entire table with respect to MTR.

| | STS Score | STS Rank | Retrieval Score | Retrieval Rank | MTR | Rank by MTR |
|---|---|---|---|---|---|---|
| Model A | 82.58 | 1 | 41.17 | 2 | 1.5 | 1.5 |
| Model B | 78.08 | 2 | 51.99 | 1 | 1.5 | 1.5 |

Now, Model A and Model B tie, which (at least to me) seems like a more accurate outcome than the version where Model B won.
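For completeness, here's a tiny sketch that reproduces the toy example with pandas, showing the raw-score average favoring Model B while MTR produces a tie:

```python
# Reproduces the Model A / Model B toy example: averaging raw scores favors
# Model B, while mean task rank (MTR) declares a tie.
import pandas as pd

toy = pd.DataFrame(
    {"STS": [82.58, 78.08], "Retrieval": [41.17, 51.99]},
    index=["Model A", "Model B"],
)

avg_score = toy.mean(axis=1)                           # ~61.88 vs ~65.04: Model B "wins"
mtr = toy.rank(axis=0, ascending=False).mean(axis=1)   # both 1.5: a tie
print(pd.DataFrame({"average score": avg_score, "MTR": mtr}))
```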