Knowledge graph for the garden
Research into making the garden smarter about its own connections, from automatic backlinks to a full world model.
This field note tracks three threads: automatically discovering semantic links between content, building a “world model” from the garden’s own structure, and using that model to find relevant external content.
The idea
Right now, connections in this garden are manual. Wiki-links, tags, and the recommender system all work, but they depend on me knowing what connects to what. What if the garden could discover its own structure?
Three goals:
- Smarter backlinks: automatically suggest and maintain semantically relevant links between content, beyond explicit wiki-links
- Garden world model: extract a knowledge graph from all content, capturing entities, topics, and relationships
- External discovery: use the world model to periodically find relevant articles, papers, and resources on the internet
Inspiration: Apple’s Saga
The Saga platform (SIGMOD 2022) builds knowledge graphs at scale using a hybrid batch-incremental design. Not everything transfers to a garden with ~200 items, but several ideas do:
What transfers:
- Delta processing: don’t rebuild the link graph from scratch on every change. Track what’s added/updated and only reprocess those items
- Extended triples with provenance: store links as (source, relationship, target, confidence, method). The “method” metadata (manual, auto-extracted, external) lets you filter and audit
- Blocking + lightweight similarity: don’t compare every pair (200 items = 19,900 pairs). Use cheap signals first (shared tags, overlapping terms), then run expensive semantic similarity only within candidate blocks
- Confidence scores: auto-generated links get a confidence score. High confidence = shown as “related”. Low confidence = queued for review
- Gap identification: profile the content graph for orphan nodes and sparse regions to find what’s missing
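The blocking and provenance ideas above can be sketched in a few lines. The item shape (`id`, `tags`) and the triple field names here are illustrative assumptions, not the garden's actual schema:

```javascript
// Tag-based blocking: only items sharing at least one tag become
// candidate pairs, avoiding the full O(n^2) comparison.
// Item shape ({ id, tags }) is a hypothetical schema for illustration.
function candidatePairs(items) {
  const byTag = new Map();
  for (const item of items) {
    for (const tag of item.tags) {
      if (!byTag.has(tag)) byTag.set(tag, []);
      byTag.get(tag).push(item.id);
    }
  }
  const seen = new Set();
  const pairs = [];
  for (const ids of byTag.values()) {
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        const key = [ids[i], ids[j]].sort().join("|");
        if (!seen.has(key)) {
          seen.add(key);
          pairs.push([ids[i], ids[j]]);
        }
      }
    }
  }
  return pairs;
}

// Extended triple with provenance, per the Saga idea:
// (source, relationship, target, confidence, method)
function makeTriple(source, target, confidence, method) {
  return { source, relationship: "relates-to", target, confidence, method };
}
```

With four items tagged `x`, `x,y`, `y`, and `z`, this yields only two candidate pairs instead of six, and the `z` item stays isolated until semantic similarity (or gap profiling) says otherwise.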
From the follow-up paper (SIGMOD 2023):
- Embeddings for similarity: compute embeddings per item (sentence-transformers works fine at this scale), use cosine similarity to find non-obvious connections
- Semantic annotation of external content: when bookmarking external content, auto-link it to existing items via embedding similarity
- Plausibility scoring: use the graph’s embeddings to check whether a proposed new link “makes sense”
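Cosine similarity itself is simple enough to inline rather than pull in a library. The vectors here stand in for whatever embeddings the encoder returns; nothing about this function is specific to bge-m3:

```javascript
// Cosine similarity between two embedding vectors of equal length.
// Values near 1 mean the items cover very similar ground; values
// near 0 mean the embeddings are unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```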
Research questions
- What’s the simplest pipeline that gives useful results at garden scale?
- Can this run at build time (Astro) or does it need a separate process?
- How to handle the review step: present candidates for manual approval vs. auto-linking?
- What embedding model works well for short-form content with mixed topics? Answered: bge-m3, a non-generative BERT-family encoder, selected for score discrimination, multilingual support, thin-content robustness, and its focused encoder architecture (not an LLM). See model selection criteria.
- How to balance serendipity (surprising connections) with relevance? Partially answered: a 0.55 cosine similarity threshold yields 2,771 candidates across 157 items. Most high-scoring pairs are obvious (same-series items, book pairs on the same topic); the interesting cross-collection candidates are buried in noise. See threshold tuning analysis for filtering strategies and research.
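One filtering idea for the buried-in-noise problem: rank cross-collection pairs above same-collection pairs, so surprising connections surface before obvious same-series matches. This is a sketch of one possible heuristic, not the approach the threshold tuning analysis settled on; the `collection` field is an assumption about item metadata:

```javascript
// Rank scored candidate pairs: cross-collection pairs first (the
// serendipitous ones), then by similarity score descending within
// each group. Pair shape ({ source, target, score }) is illustrative.
function rankCandidates(pairs) {
  return [...pairs].sort((a, b) => {
    const aCross = a.source.collection !== a.target.collection ? 1 : 0;
    const bCross = b.source.collection !== b.target.collection ? 1 : 0;
    return bCross - aCross || b.score - a.score;
  });
}
```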
Rough architecture
Content items (markdown)
↓
1. Extract key phrases + entities per item
2. Compute embeddings (bge-m3 via Ollama)
3. Block by shared tags/entities
4. Within blocks: cosine similarity on embeddings
5. Threshold → auto-links (high confidence) + candidates (review)
6. Store as extended triples: (item, relates-to, item, confidence, method)
↓
Static link graph (deployed with site)
↓
Periodic enrichment job:
- Profile graph for orphans / sparse clusters
- Search external sources for gap topics
- Embed results, match to internal items
- Flag new candidates for review
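Step 5 of the pipeline (threshold → auto-links + review candidates) can be sketched as a single classification pass. The thresholds here are illustrative defaults, not the garden's settled values, though the candidate floor matches the 0.55 used in the first full-scale run:

```javascript
// Split scored pairs into high-confidence auto-links and a review
// queue, emitting each as an extended triple with provenance.
// Threshold values and the pair shape are assumptions for illustration.
function classifyPairs(scored, autoThreshold = 0.8, candidateThreshold = 0.55) {
  const autoLinks = [];
  const reviewQueue = [];
  for (const { source, target, score } of scored) {
    const triple = {
      source,
      relationship: "relates-to",
      target,
      confidence: score,
      method: "auto-extracted",
    };
    if (score >= autoThreshold) autoLinks.push(triple);
    else if (score >= candidateThreshold) reviewQueue.push(triple);
    // Pairs below the candidate threshold are dropped entirely.
  }
  return { autoLinks, reviewQueue };
}
```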
Status
- Read and annotate Saga papers (reading notes)
- Survey embedding models (model survey)
- Set up models locally + compare on 10 items (experiment)
- Full-scale embedding run on 157 items (experiment)
- Analyze threshold and filtering strategies (write-up)
- Research key phrase extraction methods (write-up)
- Prototype: extract key phrases from 10 garden items (experiment)
- Full-scale key phrase extraction on 92 items (analysis)
- Design review UI for link candidates
- First review round: 50 approved cross-links applied to 50 files
- Explore page: interactive spatial map built on the embedding + key phrase data (spun off as separate project)
- Nightly rebuild pipeline: scheduled task for embeddings, key phrases, and explore data
- Chunk size analysis: content length distorts embeddings, normalized re-embedding (write-up)
- Enrich thin content: video transcripts, book summaries for better embeddings
- Re-generate link candidates with normalized embeddings (210 items, 0.70 threshold, 269 new candidates)
- Auto-generate trails from project-tagged content (built into build-explore-data.cjs)
- Enriched embedding input: keyphrases + reason fields for better map accuracy
- Stable map positions: pinned in map-roots.json (field note)
- Wiki-link graph as positioning signal (nudge linked items toward each other)
- External content discovery (spun off as separate project)