tiny-gte: Efficient Transformer for Semantic Search

Explore tiny-gte, a distilled transformer model for efficient sentence embeddings. Learn about its performance and architecture for use in vector databases and RAG.
Author: kareem

Published: October 21, 2023

What is tiny-gte?

The tiny-gte model is a specialized sentence-transformers model designed for extreme efficiency without sacrificing too much accuracy. It maps sentences and paragraphs into a 384-dimensional dense vector space, making it perfect for tasks like clustering, semantic search, and Retrieval-Augmented Generation (RAG).

The model is a distilled version of thenlper/gte-small. Through distillation it retains comparable performance (only slightly lower on benchmarks such as MTEB) while being roughly half the size of its parent model.
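To make the distillation idea concrete: a common objective for embedding distillation trains the student to reproduce the teacher's output vectors for the same inputs. The snippet below is a minimal numpy sketch of such an MSE objective using stand-in random vectors; it is illustrative only and not tiny-gte's actual training recipe.

```python
import numpy as np

# Stand-in "teacher" (gte-small) and "student" (tiny-gte) embeddings.
# In real distillation these come from running both models on the same text.
rng = np.random.default_rng(0)
teacher_emb = rng.normal(size=(8, 384))                        # teacher outputs
student_emb = teacher_emb + 0.01 * rng.normal(size=(8, 384))   # near-matching student

def mse_distillation_loss(student, teacher):
    """Mean squared error between student and teacher embeddings."""
    return float(np.mean((student - teacher) ** 2))

loss = mse_distillation_loss(student_emb, teacher_emb)
```

Minimizing this loss pushes the smaller student's vector space to align with the teacher's, which is how the distilled model inherits most of the parent's semantic behavior.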

Model Details

If you are building production-grade AI systems, the size and latency of your embedding model matter. Here is why tiny-gte stands out:

  • Ultra-Small Footprint: It weighs in at roughly 45MB. To put that in perspective, the popular all-MiniLM-L6-v2 is nearly double the size at ~80MB.
  • Dimensionality: It produces 384D embeddings, which is the “sweet spot” for many vector databases, balancing search precision with storage costs.
  • Architecture: Based on the BERT architecture but optimized through distillation.
  • Parent Model: Distilled from thenlper/gte-small, inheriting its robust understanding of semantic relationships.
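The 384-dimensional vectors the model produces are typically compared with cosine similarity. Here is a minimal numpy sketch using synthetic 384-D vectors (not real model outputs) to show the comparison:

```python
import numpy as np

# Two synthetic 384-dimensional "embeddings": `b` is a slightly
# perturbed copy of `a`, standing in for two semantically close texts.
rng = np.random.default_rng(42)
a = rng.normal(size=384)
b = a + 0.1 * rng.normal(size=384)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_self = cosine_similarity(a, a)       # identical vectors score 1.0
sim_related = cosine_similarity(a, b)    # close vectors score near 1.0
```

Vector databases index these embeddings and rank results by exactly this kind of similarity score.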

Why Size Matters: The Benefits of Small Models

In the world of LLMs, we often hear that “bigger is better.” However, for embedding models used in search pipelines, smaller models offer several critical advantages:

  1. Lower Latency: Smaller models require fewer FLOPs, meaning faster inference times. This is crucial for real-time search applications.
  2. Reduced Hosting Costs: You can run tiny-gte on cheaper hardware, even on CPU-only instances, without a significant performance bottleneck.
  3. Edge Deployment: At 45MB, this model can easily be deployed on mobile devices or in browser-based applications using Transformers.js.
  4. Memory Efficiency: You can fit more instances of the model in memory, allowing for higher throughput in multi-tenant systems.
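The storage side of these advantages is easy to quantify. For float32 embeddings, the per-vector cost follows directly from the dimensionality; the back-of-the-envelope arithmetic below is pure calculation, not a measurement:

```python
# Storage cost of 384-dimensional float32 embeddings.
DIMS = 384
BYTES_PER_FLOAT32 = 4

bytes_per_vector = DIMS * BYTES_PER_FLOAT32        # 1536 bytes per embedding
vectors_per_gib = (1024 ** 3) // bytes_per_vector  # embeddings that fit in 1 GiB
```

At ~1.5KB per vector, a single GiB of RAM holds roughly 700k document embeddings, which is why 384D is a popular operating point for vector databases.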

Use Cases for tiny-gte

  • Real-time Document Retrieval: Quickly finding relevant context for an LLM prompt in a RAG pipeline.
  • Mobile AI Applications: Enabling semantic search within offline mobile apps where storage space is limited.
  • Large-scale Clustering: Processing millions of documents where the computational cost of larger models would be prohibitive.
  • Edge Search: Using tiny-gte with libraries like txtai or fastembed for local, private search on your own machine.
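The retrieval step shared by the RAG and search use cases above boils down to a top-k nearest-neighbor lookup over the embedding matrix. This numpy sketch uses synthetic unit-normalized vectors in place of real tiny-gte outputs:

```python
import numpy as np

# Synthetic corpus of 100 unit-normalized 384-D "document embeddings".
rng = np.random.default_rng(7)
doc_embeddings = rng.normal(size=(100, 384)).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# A query close to document 3 (a perturbed copy of its vector).
query = doc_embeddings[3] + 0.05 * rng.normal(size=384)
query /= np.linalg.norm(query)

def top_k(query_vec, docs, k=5):
    """Indices of the k most similar documents (cosine sim via dot product,
    since all vectors are unit-norm)."""
    scores = docs @ query_vec
    return np.argsort(-scores)[:k]

hits = top_k(query, doc_embeddings, k=3)
```

In a real pipeline the brute-force dot product is replaced by an approximate index (e.g. in a vector database), but the ranking logic is the same.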

Performance Comparison: tiny-gte vs. gte-small

On the Massive Text Embedding Benchmark (MTEB), tiny-gte performs impressively well given its size. While gte-small might lead by a few percentage points in specific retrieval tasks, tiny-gte often provides better “value per megabyte.” If your application can tolerate a 1-2% drop in accuracy in exchange for 2x faster inference, tiny-gte is the clear winner.
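The "value per megabyte" framing can be made explicit as score divided by model size. The numbers below are hypothetical placeholders chosen only to illustrate the calculation; they are not measured MTEB scores for either model:

```python
# Hypothetical size/score figures (NOT measured benchmark results),
# used only to illustrate the score-per-MB comparison.
models = {
    "gte-small": {"size_mb": 90, "score": 61.4},
    "tiny-gte":  {"size_mb": 45, "score": 60.2},
}

value_per_mb = {name: m["score"] / m["size_mb"] for name, m in models.items()}
best = max(value_per_mb, key=value_per_mb.get)
```

Under numbers like these, a 1-2 point score gap is dwarfed by a 2x size difference, which is the tradeoff the comparison above describes.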

Internal Resources

If you’re looking for more technical deep dives or information on my research, check out these sections: