Benchmarking Embedding Models for Semantic Search
I’m working on a restaurant similarity search engine and leveraging semantic search to deliver more accurate results. The goal is to enable natural language queries — like “cozy brunch spots” or “authentic Mexican food” — and return relevant matches from my dataset. I won’t dive too deep into the mechanics of semantic search in this post; the focus here is on benchmarking different embedding models to find the best fit for my use case. Along the way, I’ll also explain why I decided to replace the original model I was using.
An embedding model is a machine learning model that transforms text into numerical representations (vectors or embeddings), capturing the semantic meaning of words or phrases to enable similarity comparisons. These models have been trained specifically for this sort of task. For instance, a vector could look like this: `[0.1, 0.2, 0.3, 0.4, 0.5]`.
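To make that concrete, here is a minimal sketch of turning a sentence into a vector using the sentence-transformers library. This is an illustration only; the quantized models benchmarked later in this post are served by a different runtime, but the text-in, vector-out idea is the same.

```python
from sentence_transformers import SentenceTransformer

# Example model only; any of the embedding models benchmarked later could stand in here.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode a short restaurant description into a dense vector.
embedding = model.encode("Cozy cafe serving brunch classics and espresso.")

print(embedding.shape)  # (384,) for all-MiniLM-L6-v2
print(embedding[:5])    # the first five dimensions of the vector
```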
The goal of this post is to identify the top-performing embedding model for retrieving restaurants that best match a given search query. By generating vectors for both search queries and restaurant descriptions, we can measure their cosine similarity to evaluate how likely a restaurant is to appear in the search results. Additionally, we’ll fine-tune cosine similarity thresholds to strike the perfect balance between quality (precision) and quantity (recall), ensuring that the results are both accurate and comprehensive.
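For reference, the cosine similarity score between a query vector and a description vector is just the normalized dot product. Here is a small NumPy sketch with made-up vectors and an assumed threshold of 0.75:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors (1.0 means identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for a query and a restaurant description.
query_vec = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
restaurant_vec = np.array([0.1, 0.25, 0.3, 0.35, 0.5])

score = cosine_similarity(query_vec, restaurant_vec)
is_match = score >= 0.75  # the restaurant is returned only if it clears the threshold
```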
Understanding metrics
Before diving deeper, let’s take a moment to clarify the metrics we’ll use to evaluate our models. These metrics — precision, recall, and a derived metric called F-score — are key to understanding performance.
If you’re already familiar with these concepts, feel free to skip ahead. For the rest of us, let’s break it down with a simple analogy:
Imagine you’re searching for Italian restaurants, and the algorithm gives you 10 results.
- Precision measures accuracy. How many of the 10 results are actually Italian restaurants? If 8 out of 10 are correct, your precision is 80%.
- Recall measures coverage. Out of all the Italian restaurants in the database, how many did the algorithm find? If there are 9 Italian restaurants total and the search returns 8 of them, your recall is about 89% (8 out of 9).
Together, these metrics help us assess the balance between quality (precision) and completeness (recall).
But you mentioned a third metric…F-score?
Sometimes, we want a single metric to summarize both precision and recall, especially when we need to balance the two. This is where the F-score comes in. It combines precision and recall into one number by calculating their harmonic mean.
For example, if your precision is 80% and recall is 89%, the F1-score would be approximately 84%: F1 = 2 × (precision × recall) / (precision + recall) = 2 × (0.80 × 0.89) / (0.80 + 0.89) ≈ 0.84.
Note: The F1-score assumes that precision and recall are equally important. However, in some cases, you might prioritize one over the other, for example:
- If precision is more critical (e.g., reducing false positives in a medical diagnosis system), you may want to place more weight on precision.
- If recall is more important (e.g., ensuring no relevant results are missed in a search engine), you may want to prioritize recall instead.
To adjust this balance, you can use the F-beta score, a generalization of the F1-score: F-beta = (1 + beta²) × precision × recall / (beta² × precision + recall), where:
- beta < 1: Places more emphasis on precision.
- beta > 1: Places more emphasis on recall.
- beta = 1: Balances precision and recall equally (this is the F1-score).
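To tie these numbers together, here is a small sketch reproducing the Italian-restaurant example (8 correct results out of the 10 returned, 9 relevant restaurants in total):

```python
def fbeta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: the weighted harmonic mean of precision and recall."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

precision = 8 / 10   # 8 of the 10 returned results are Italian restaurants
recall = 8 / 9       # 8 of the 9 Italian restaurants in the database were found

print(round(fbeta(precision, recall), 2))            # F1  ~ 0.84
print(round(fbeta(precision, recall, beta=0.5), 2))  # favors precision
print(round(fbeta(precision, recall, beta=2.0), 2))  # favors recall
```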
Let’s get started.
Note: this post was originally a Jupyter notebook. You can head over here should you want to run the code directly.
Setting up the notebook
We’re going to import everything that we need for this experiment.
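The actual import cell lives in the notebook. A plausible minimal set for an experiment like this (pandas for the data, NumPy for the math, matplotlib/seaborn for the plots) might look like the following; the client that loads the quantized embedding models isn’t shown, since the post doesn’t specify the runtime.

```python
# Assumed imports; the notebook's actual cell may differ.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```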
Load up restaurant data
To get started, I’ve provided some sample data from my dataset, conveniently saved as a CSV file. All we need to do is load it up and get to work.
For readers following along on the blog post rather than the Jupyter notebook, you can download the data here.
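Loading the CSV is a one-liner with pandas; the filename below is an assumption.

```python
import pandas as pd

# Assumed filename; adjust to wherever you saved the sample data.
restaurants_df = pd.read_csv("restaurants.csv")
restaurants_df.head()
```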
id | description | name |
---|---|---|
1885 | Authentic Mexican tacos, prepared by Mexican chefs, now available on Crescent Street and St-Denis Street in Montreal. | Tacos Victor Crescent |
6250 | La Prep offers a wide variety of fresh, made-to-order meals daily. From pastries to salads and sandwiches, … | La Prep |
6797 | 9th-floor Montreal restaurant with a seaside-themed terrace offering stunning downtown views and H3-signature cuisine & cocktails. | Terrasse Alizé |
1937 | Unfussy Italian cafe & coffee shop with WiFi, sports on TV & a menu of pizza & pastas until late. | Expresso Bar |
6306 | Vintage design bar & restaurant with salads, sandwiches, burgers & amuse bouches on the menu. | LOCAL75 Bistro Pub |
Most of these restaurant descriptions were either sourced from Google or generated using Gemini, leveraging the metadata I had for each restaurant.
Writing test cases
To effectively evaluate our system, we need to design search queries and identify which restaurants should be retrieved for each query. These queries mimic real-world searches to ensure they align closely with the restaurants in our dataset. By doing this, we can simulate user interactions and assess the system’s performance accurately.
In addition to defining test cases, we’ll also specify the models we want to compare. For this experiment, I’ve selected a variety of all-MiniLM models at different quantization levels and a model from Nomic AI. This will allow us to assess performance across a range of architectures and precision levels.
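The test cases can be expressed as a simple mapping from query to the IDs of the restaurants that should come back, alongside the list of model names to compare. The structure below is a sketch: the queries and model names match the ones reported later, but the expected IDs are placeholders drawn from the sample rows above.

```python
# Each query maps to the set of restaurant IDs that a good search should return.
# The IDs here are placeholders; the real notebook uses the full dataset's IDs.
test_cases = {
    "Looking for a place with authentic Mexican food.": {1885},
    "Find casual spots with burgers and fries.": {6306},
    "Find upscale places with cocktails and unique dishes.": {6797},
}

# Models under comparison, at different quantization levels.
model_names = [
    "All-MiniLM-L6-v2 Q4_K_S",
    "All-MiniLM-L6-v2 Q4_K_M",
    "All-MiniLM-L6-v2 Q5_K_M",
    "All-MiniLM-L12-v2 Q4_K_M",
    "nomic-embed-text Q5_K_M",
]
```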
Now, let’s build a set of helper functions to handle the heavy lifting for our experiment. I’ll include clear documentation for each function to help you understand its purpose and functionality.
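The notebook’s helpers aren’t reproduced here, but the heart of the threshold analysis might look roughly like this sketch: given the similarity scores between one query and every restaurant, sweep a grid of candidate thresholds and keep the one with the best F-beta.

```python
import numpy as np


def best_threshold(scores: np.ndarray, relevant: np.ndarray, beta: float = 1.0):
    """Sweep candidate thresholds and keep the one that maximizes the F-beta score.

    scores:   cosine similarities between one query and every restaurant description.
    relevant: boolean array, True where the restaurant should be retrieved.
    Returns (threshold, fbeta, recall, precision) for the best cutoff.
    """
    best = (0.0, 0.0, 0.0, 0.0)
    # A grid of 50 candidate thresholds between 0 and 1.
    for threshold in np.linspace(0.0, 1.0, 50):
        retrieved = scores >= threshold
        true_positives = np.sum(retrieved & relevant)
        if retrieved.sum() == 0 or true_positives == 0:
            continue
        precision = true_positives / retrieved.sum()
        recall = true_positives / relevant.sum()
        fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        if fbeta > best[1]:
            best = (float(threshold), float(fbeta), float(recall), float(precision))
    return best
```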
Measuring Performance
Now it’s time to evaluate each query against all the models and measure their performance. For this, we’ll do the following (a sketch of the loop follows the list):
- Load each model: Loop through all selected models, initialize them and generate embeddings.
- Create the necessary dataframes: Prepare the data for analysis, including cosine similarity scores for each query and restaurant pairing.
- Perform threshold analysis: Identify the optimal cosine similarity thresholds yielding the best F-beta score.
- Prepare reporting data: Generate a consolidated dataset that summarizes the results, ready for reporting and comparison.
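Here is a rough sketch of that loop, reusing the small `cosine_similarity` and `best_threshold` helpers sketched earlier, the `test_cases` and `model_names` from the previous section, and a hypothetical `embed` function standing in for whichever client serves each quantized model.

```python
import numpy as np
import pandas as pd

rows = []
for model_name in model_names:
    # `embed` is a stand-in for whatever client loads the quantized model
    # and returns one vector per input text.
    description_vectors = embed(model_name, restaurants_df["description"].tolist())

    for query, expected_ids in test_cases.items():
        query_vector = embed(model_name, [query])[0]
        scores = np.array([
            cosine_similarity(query_vector, vec) for vec in description_vectors
        ])
        relevant = restaurants_df["id"].isin(expected_ids).to_numpy()

        threshold, fbeta, recall, precision = best_threshold(scores, relevant)
        rows.append({
            "model_name": model_name,
            "query": query,
            "best_threshold": threshold,
            "best_fbeta": fbeta,
            "recall_from_fbeta": recall,
            "precision_from_fbeta": precision,
        })

results_df = pd.DataFrame(rows)
```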
This should give you a good overview of the performance of each model for individual queries.
We pair each model with each query and measure individual performance. Remember, each query is tested against all of the restaurant descriptions, which is ultimately what’s represented below. The dataframe should look something like this:
model_name | query | best_threshold | best_fbeta | recall_from_fbeta | precision_from_fbeta |
---|---|---|---|---|---|
All-MiniLM-L6-v2 Q4_K_S | Find casual spots with burgers and fries. | 0.714286 | 0.333333 | 0.400000 | 0.285714 |
All-MiniLM-L6-v2 Q4_K_S | Find restaurants with rotisserie chicken | 0.775510 | 0.500000 | 0.333333 | 1.000000 |
All-MiniLM-L6-v2 Q4_K_S | Find upscale places with cocktails and unique dishes. | 0.734694 | 0.500000 | 0.500000 | 0.500000 |
All-MiniLM-L6-v2 Q4_K_S | Looking for a place with authentic Mexican food. | 0.755102 | 1.000000 | 1.000000 | 1.000000 |
All-MiniLM-L6-v2 Q4_K_S | Looking for sushi places offering poke bowls. | 0.755102 | 1.000000 | 1.000000 | 1.000000 |
Analysis
While we could simply pick the model with the highest average F1-score and call it a day, it would be interesting to take a deeper look. By analyzing how each individual model performs on specific queries, we can uncover their strengths and weaknesses.
We can plot the performance of each model on each of the individual queries.
I won’t show all individual queries here; you can view them in the Jupyter notebook.
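One way to produce such a plot, assuming the `results_df` dataframe from the evaluation sketch and seaborn, is a grouped bar chart of the best F-beta per query:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One bar per model for each query, showing the best F-beta achieved.
plt.figure(figsize=(10, 5))
sns.barplot(data=results_df, x="query", y="best_fbeta", hue="model_name")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```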
Examining the performance of each model per query reveals some interesting patterns. While `All-MiniLM-L6-v2 Q4_K_S`, `All-MiniLM-L6-v2 Q4_K_M`, and `All-MiniLM-L6-v2 Q5_K_M` show similar results, `All-MiniLM-L12-v2 Q4_K_M` appears to lean more towards optimizing for precision at the expense of recall. On the other hand, `nomic-embed-text Q5_K_M` seems to prioritize recall slightly more than precision.
We can now take the mean F-beta score for each model. This will give us a clearer picture of which model performs better overall, balancing precision and recall according to our chosen F-beta weighting.
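With the per-query results in a dataframe, this is a single groupby (column names as in the earlier sketch):

```python
# Average the best F-beta score per model across all queries.
avg_fbeta = (
    results_df.groupby("model_name")["best_fbeta"]
    .mean()
    .reset_index(name="avg_fbeta")
)
```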
model_name | avg_fbeta |
---|---|
All-MiniLM-L12-v2 Q4_K_M | 0.644589 |
All-MiniLM-L6-v2 Q4_K_M | 0.617367 |
All-MiniLM-L6-v2 Q4_K_S | 0.631653 |
All-MiniLM-L6-v2 Q5_K_M | 0.609524 |
nomic-embed-text Q5_K_M | 0.715801 |
The `nomic-embed-text` model stands out with the best F-beta score, showcasing a balanced approach between precision and recall. However, our earlier per-query analysis suggests that if we shifted from a balanced F1 to a weighted F-beta (beta < 1) emphasizing precision, `All-MiniLM-L12-v2 Q4_K_M` might emerge as the top performer. This highlights how the choice of model ultimately depends on what’s most valuable in your specific context: precision, recall, or balance.
It’s also worth noting that `nomic-embed-text` benefits from a larger context window, which we’re not fully leveraging. Utilizing longer descriptions could further enhance this model’s accuracy and overall performance.
Conclusion
The goal of this experiment was to compare various embedding models to identify the best performer for our restaurant similarity search. We evaluated restaurant descriptions against a variety of search queries using cosine similarity, determining the most appropriate thresholds for each model and query.
By analyzing the overall performance of each model through their respective F-beta scores, we found that the `nomic-embed-text` model from Nomic AI outperformed the others when using an equally weighted F1 score. Its balanced approach to precision and recall, combined with its larger context window, makes it particularly well-suited for this task, especially if longer descriptions are utilized.
This analysis highlights not only the overall best-performing model but also the trade-offs between precision and recall across different models. It emphasizes the importance of tailoring the choice of model to the specific needs of the application—whether that means prioritizing precision, recall, or maintaining a balance between the two.
I hope you enjoyed reading this analysis. As always, you can send me any questions or feedback; I’d be happy to answer!