
Embedding models: a survey for the garden

A non-coder's guide to text embedding models, from word counting to semantic understanding, surveyed for use in a personal knowledge garden.

Research for the knowledge graph project: what embedding model should power the garden’s semantic linking? This surveys the landscape from simple to sophisticated, with a focus on models that run locally without API calls.

What embeddings do

An embedding turns a piece of text into a list of numbers (a “vector”). Texts with similar meaning end up as similar lists of numbers. You can then compare any two pieces of content by comparing their vectors, and the math tells you how semantically close they are, even if they share no words at all.
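Concretely, "comparing vectors" usually means cosine similarity: the cosine of the angle between the two vectors, 1.0 for identical direction, near 0 for unrelated. A minimal sketch with made-up three-dimensional vectors (real models produce hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings", illustrative only.
cat    = np.array([0.9, 0.1, 0.0])
feline = np.array([0.8, 0.2, 0.1])
stock  = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(cat, feline))  # high: similar meaning
print(cosine_similarity(cat, stock))   # low: unrelated
```

A good embedding model's job is to place "cat" and "feline" near each other like this, even though they share no letters.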

The history of embeddings is a story of getting better at capturing what text means, not just which words it contains.

Five eras of text embedding

Era 1: counting words (1970s-2000s)

TF-IDF and BM25 are the simplest approaches. They create a sparse vector where each dimension is a word from the vocabulary, weighted by how distinctive that word is. “The” gets near-zero weight; “serendipity” gets high weight.

No semantic understanding at all. “Car” and “automobile” are completely unrelated. Only exact word overlap counts. Still, BM25 remains the backbone of search engines like Elasticsearch because it’s fast and surprisingly effective for keyword search.
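A sketch with scikit-learn (assuming it is installed) makes the era's blind spot visible: two documents about the same thing that share no words score exactly zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the car needs repair",         # doc 0
    "the automobile needs repair",  # synonym, but shares "needs repair" with doc 0
    "my automobile broke down",     # synonym, shares no words at all with doc 0
]
X = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix
sims = cosine_similarity(X)

print(sims[0, 1])  # > 0, thanks to the overlap on "needs repair"
print(sims[0, 2])  # exactly 0.0: "car" and "automobile" never match
```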

Key papers:

  • Sparck Jones, “A statistical interpretation of term specificity” (1972): introduced IDF
  • Salton & Buckley, “Term-weighting approaches in automatic text retrieval” (1988): formalized TF-IDF

Era 2: discovering latent topics (1988-2013)

Latent Semantic Analysis (LSA) takes a TF-IDF matrix and compresses it using matrix math (singular value decomposition). This discovers hidden topics: if “car” and “automobile” tend to appear in similar documents, they end up near each other in the compressed space.

The first method to capture synonymy. But it treats every document as a bag of words (order doesn’t matter), and every word gets a single representation regardless of context.
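A toy sketch of the LSA pipeline with scikit-learn: "car" and "automobile" never appear in the same document, so their raw term vectors are orthogonal, but after SVD compression they land near each other because they share contexts ("engine", "oil"). The four-document corpus is made up for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "car engine oil",
    "automobile engine oil",
    "bread oven flour",
    "cake oven flour",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)                              # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)

vocab = vec.vocabulary_

def term_vec(word):
    """A word's position in the compressed topic space."""
    return svd.components_[:, vocab[word]]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(term_vec("car"), term_vec("automobile")))  # high: shared contexts
print(cos(term_vec("car"), term_vec("bread")))       # low: different topic
```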

Key paper:

  • Deerwester et al., “Indexing by latent semantic analysis” (1990): introduced LSA

Era 3: learning word meaning (2013-2018)

Word2Vec, GloVe, and FastText train neural networks to predict words from their context. The byproduct is a dense vector per word (typically 300 dimensions) that captures meaning. The famous result: king - man + woman ≈ queen.

The catch: these produce word-level vectors. To get a document embedding, you average all word vectors, which muddies the signal for mixed-topic texts. And each word gets one fixed vector regardless of context: “bank” the river and “bank” the institution share the same representation.
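The averaging problem, sketched with made-up word vectors (real Word2Vec or GloVe vectors have ~300 dimensions): a document mixing two topics averages into a vector that sits between both and is sharply close to neither.

```python
import numpy as np

# Hypothetical 3-d word vectors, illustrative only.
vecs = {
    "finance": np.array([1.0, 0.0, 0.0]),
    "bank":    np.array([0.9, 0.1, 0.0]),
    "river":   np.array([0.0, 1.0, 0.0]),
    "hiking":  np.array([0.0, 0.9, 0.1]),
}

def doc_embedding(words):
    """The era-3 recipe: average the word vectors."""
    return np.mean([vecs[w] for w in words], axis=0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

finance_doc = doc_embedding(["finance", "bank"])
mixed_doc   = doc_embedding(["finance", "bank", "river", "hiking"])

print(cos(finance_doc, vecs["finance"]))  # sharp: single-topic document
print(cos(mixed_doc, vecs["finance"]))    # diluted: the topic signal is muddied
```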

FastText (Meta, 2016) improves on this by using character fragments, so it can handle typos and words it hasn’t seen before.

Doc2Vec extends Word2Vec to learn document-level vectors directly, but it’s finicky and largely superseded.

Key papers:

  • Mikolov et al., “Efficient estimation of word representations in vector space” (2013): Word2Vec
  • Pennington et al., “GloVe: Global vectors for word representation” (2014)
  • Bojanowski et al., “Enriching word vectors with subword information” (2017): FastText’s subword approach

Era 4: contextual understanding (2018-present)

Sentence-BERT and the sentence-transformers library brought the transformer revolution to embeddings. Unlike static word vectors, every word now gets a different vector depending on its context. “Bank” in “river bank” vs “savings bank” produces different representations. Word order matters. Negation works.

These models produce a single dense vector (typically 768 dimensions) for an entire sentence or paragraph. Pre-trained models work out of the box and run on a laptop.

Key papers:

  • Vaswani et al., “Attention is all you need” (2017): the transformer architecture
  • Devlin et al., “BERT” (2018): bidirectional pre-training
  • Reimers & Gurevych, “Sentence-BERT” (2019): made BERT practical for semantic similarity

Era 5: LLM-based embeddings (2023-present)

OpenAI, Cohere, and Google offer embedding APIs using billion-parameter models. These are marginally better in quality but require sending your data to an external service. The more interesting development is that open-source alternatives have nearly closed the gap and run locally.

Open-source models that run locally

This is the interesting part for the garden. All of these run on a laptop CPU, no GPU needed, no API calls, no data leaving the machine.

The classics (sentence-transformers)

| Model | Parameters | Size | Embedding dims | Max input | Languages |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | ~80 MB | 384 | 256 tokens | English |
| all-mpnet-base-v2 | 110M | ~420 MB | 768 | 384 tokens | English |

Simple to use, well-documented, but English-only and showing their age.

Current generation (2024-2026)

| Model | Parameters | Size | Embedding dims | Max input | Languages | Dutch |
|---|---|---|---|---|---|---|
| EmbeddingGemma (Google) | 308M | <200 MB quantized | 128-768 | 2,048 tokens | 100+ | Yes |
| Nomic Embed v2 | 305M | ~600 MB | 256-768 | 8,192 tokens | 100+ | Yes |
| BGE-M3 (BAAI) | 568M | ~1.2 GB | 1024 | 8,192 tokens | 100+ | Yes |
| Jina v3 | 570M | ~1.1 GB | 32-1024 | 8,192 tokens | 89+ | Yes |
| Qwen3-Embedding 0.6B | 600M | ~523 MB | 32-1024 | 32,768 tokens | 100+ | Yes |
| E5-NL | varies | varies | varies | 512 tokens | Dutch | Native |

All available through Ollama, which makes running them locally as simple as pulling a Docker image.

Key papers for these models:

  • Chen et al., “BGE M3-Embedding: multi-lingual, multi-functionality, multi-granularity text embeddings” (2024)
  • Nussbaum et al., “Nomic Embed: training a reproducible long context text embedder” (2024)
  • Sturua et al., “jina-embeddings-v3: multilingual embeddings with Task LoRA” (2024)

Two concepts worth knowing

Matryoshka embeddings: named after Russian nesting dolls. The model packs the most important information into the first N dimensions of the vector. You can truncate a 768-dimensional vector to 256 dimensions and keep ~98% of quality. Supported by EmbeddingGemma, Nomic, Jina, and Qwen3.
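The truncation step itself is trivial: keep the first k dimensions and re-normalize. A sketch with a random stand-in vector; the Matryoshka training is what makes the truncated prefix meaningful, ordinary models lose far more quality when cut this way.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k dims of a Matryoshka embedding, then re-normalize."""
    head = vec[:k]
    return head / np.linalg.norm(head)

# Stand-in for a 768-dim model output.
rng = np.random.default_rng(0)
full = rng.normal(size=768)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)  # 3x less storage per item
print(small.shape)            # (256,)
print(np.linalg.norm(small))  # 1.0: ready for cosine similarity via dot product
```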

Key paper: Kusupati et al., “Matryoshka representation learning” (NeurIPS 2022)

Instruction-aware embeddings: newer models let you describe what kind of similarity you’re looking for. Instead of just comparing text, you can say “find documents about the same topic” or “find documents with a similar argument structure.” This evolved from simple prefixes (2023) to full natural language instructions (2025).
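In practice, the prefix style often just means prepending a short task string before encoding. The strings below are Nomic Embed's documented prefixes; other models use different conventions, so check the model card before copying them.

```python
# Task prefixes as documented for Nomic Embed; other models differ.
PREFIXES = {
    "document": "search_document: ",
    "query": "search_query: ",
    "cluster": "clustering: ",
    "classify": "classification: ",
}

def with_prefix(texts, task):
    """Prepend the task prefix so the model knows which similarity you mean."""
    return [PREFIXES[task] + t for t in texts]

corpus = with_prefix(["Embedding models for a knowledge garden"], "document")
query = with_prefix(["which model links notes semantically?"], "query")
print(corpus[0])  # "search_document: Embedding models for a knowledge garden"
```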

What this means for the garden

The garden has ~176 items, mostly 100-500 words, covering diverse topics (AI ethics, book reviews, voice coaching, design principles, programming). The requirements are:

  1. Semantic, not keyword: we want to find that an article about “authentic voice online” relates to a book review about “the ethics of AI writing,” even if they share no words
  2. Local: no API calls, no data leaving the machine
  3. Dutch would be nice: some content and future content may be in Dutch
  4. Small enough: this runs at build time or as a script, not as a service

The sentence-transformers classics would work for a quick prototype. For production, EmbeddingGemma (tiny, multilingual, Matryoshka) or Nomic v2 (long context, multilingual, fully open) look like the strongest candidates.
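The build-time job itself is small: embed every item once, compute the pairwise similarity matrix, and keep each item's top matches above a threshold. A model-agnostic sketch with stand-in vectors; in the real pipeline the embeddings come from whichever model is chosen, and the threshold is tuned to its score distribution.

```python
import numpy as np

def top_related(embeddings: np.ndarray, k: int = 3, threshold: float = 0.5):
    """For each item, indices of its k most similar items scoring above threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T          # full cosine-similarity matrix
    np.fill_diagonal(sims, -1.0)      # never link an item to itself
    links = {}
    for i, row in enumerate(sims):
        best = np.argsort(row)[::-1][:k]
        links[i] = [int(j) for j in best if row[j] >= threshold]
    return links

# Stand-in embeddings for 4 items; item 1 is deliberately near item 0.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(4, 8))
vecs[1] = vecs[0] + 0.1 * rng.normal(size=8)

print(top_related(vecs, k=2, threshold=0.5))
```

At ~176 items this is a few thousand comparisons, effectively instant, which is why model quality matters far more here than speed.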

Choosing a model for your dataset

The experiment comparing all three models on 10 garden items (full results) showed they agree on the strongest and weakest connections. The differences are in how they distribute similarity scores, and that distribution matters more than raw quality.

Four questions to ask:

1. How discriminating do you need the scores to be? nomic-embed-text scores everything between 0.54 and 0.85: a narrow band where most items look somewhat related. bge-m3 spreads from 0.39 to 0.73: related pairs score clearly higher than unrelated ones. embeddinggemma goes widest (0.15 to 0.67). If you need to set a threshold for “related enough to suggest,” a wider spread makes that easier.

2. What languages are in your content? All three claim 100+ languages, but claims and quality differ. If your content is mixed-language (this garden has both English and Dutch), you want a model trained with strong multilingual data. bge-m3 was specifically designed for multilingual retrieval. nomic and embeddinggemma support Dutch but it’s not their focus.

3. How robust is it to thin content? A personal knowledge base has items ranging from 43 words (a quick thought) to 1,800 words (a full essay). embeddinggemma struggled with the shortest items and publisher blurbs, scoring them near zero against everything. bge-m3 and nomic handled short content better, likely because they’re less sensitive to writing style and more focused on semantic content.

4. Does it need to be a generative model? embeddinggemma is based on Gemma, a generative LLM distilled down for embeddings. nomic uses a custom BERT variant. bge-m3 is a pure BERT-family encoder: it only reads, never generates. For an embedding-only task like semantic linking, a non-generative encoder is a better fit. It’s architecturally focused on understanding text, not predicting next tokens. Less overhead, more predictable behavior, and no risk of the model “thinking” in a generative direction that doesn’t serve the task.

This isn’t a universal rule. For tasks where you need to generate text alongside embeddings (like retrieval-augmented generation with inline citations), an LLM-based embedding model can make sense because its representations are closer to what the generator expects. But for pure similarity search on a static corpus, where the only question is “how related are these two texts?”, a focused encoder is the right tool. The MTEB benchmarks confirm this: the best LLM-based embeddings and the best encoder-based embeddings perform nearly identically on retrieval tasks. The difference is efficiency and predictability, not raw quality.

For this garden, bge-m3 won on all four: clearest discrimination, strongest multilingual support, reliable handling of short items, and a focused encoder architecture. The cost is size (1.2 GB, versus the roughly 274 MB quantized nomic build that Ollama ships) and slightly slower inference. At build-time scale, that cost is irrelevant.

Status

This survey answers the “what embedding model?” question from the project tracker. Three candidate models are now set up locally via Ollama. bge-m3 was selected for the full-scale run based on the criteria above.
