How to Design Your Knowledge Base for RAG
A practical framework for designing knowledge bases that power RAG systems. Covers when to use BM25, vector databases, and knowledge graphs.
January 20, 2026 • 9 min read
TL;DR
Your knowledge base design should match your data, not the hype. Start with BM25 if queries are predictable. Use vector search for unstructured content. Add a knowledge graph only when entity relationships matter for answering questions.
I've designed knowledge bases that ranged from simple BM25 search to full graph+vector hybrids. The biggest lesson? Most teams overcomplicate this. They jump straight to vector databases when a well-tuned keyword search would work better for their data.
Here's the framework I use to decide how to design a knowledge base.
The Three-Level RAG Hierarchy
Level 1: Statistical and Rule-Based Retrieval
When to use: Predictable queries, structured data, known access patterns
When the domain is well understood, this is often the highest-accuracy approach, and it requires no vector database at all.
| Algorithm | Best For |
|---|---|
| BM25 / TF-IDF | Keyword-heavy domains (technical documentation) |
| Recency Weighting | News, logs, time-sensitive content |
| Frequency / Popularity | FAQs, common support queries |
| Rule-Based Routing | When you know which document answers which query type |
| Metadata Filtering | Date ranges, categories, authors, document types |
Many production systems use BM25 + metadata filters and outperform naive vector search because they're precise when the domain is well-understood.
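As a rough illustration, here's a minimal Level 1 sketch using the rank_bm25 package plus a simple metadata filter. The corpus, field names, and cutoff are invented for the example; any keyword index (Elasticsearch, OpenSearch, SQLite FTS) works the same way conceptually.

```python
from rank_bm25 import BM25Okapi

# Toy corpus with metadata (fields are invented for this example)
docs = [
    {"text": "How to reset your password in the admin console", "category": "support", "year": 2024},
    {"text": "Quarterly revenue reporting guidelines", "category": "finance", "year": 2023},
    {"text": "Password policy and rotation requirements", "category": "security", "year": 2024},
]

def search(query, category=None, top_k=2):
    # 1. Metadata filter first: cheap and precise
    candidates = [d for d in docs if category is None or d["category"] == category]
    # 2. BM25 keyword ranking over the filtered subset
    tokenized = [d["text"].lower().split() for d in candidates]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [d for d, s in ranked[:top_k] if s > 0]

print(search("reset password", category="support"))
```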
Level 2: Vector Database Retrieval
When to use: Data is substantial but not interrelated: books, images, and documents whose content is self-contained.
This is the right choice when semantic similarity matters and you need to find conceptually related content rather than exact keyword matches.
Chunking Methods
| Method | How It Works | Best For |
|---|---|---|
| Fixed Size | Split every N tokens with overlap | General purpose, simple baseline |
| Sentence Based | Split on sentence boundaries | Conversational content, Q&A |
| Paragraph / Section | Respect document structure | Well-formatted docs (markdown, HTML) |
| Semantic Chunking | Split when embedding similarity drops | Varied content, topic shifts |
| Recursive / Hierarchical | Try large chunks, split if too big | Mixed document types |
| Document-Specific | Code: by function; Legal: by clause | Domain-specific corpora |
| Agentic Chunking | LLM decides chunk boundaries | High-value, complex documents |
Chunk Size Guidelines
- Smaller chunks (128-256 tokens): Higher precision, better for specific facts
- Larger chunks (512-1024 tokens): More context, but noisier retrieval
Chunk size matters as much as chunking method. Start with 256-512 tokens and adjust based on retrieval quality.
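For reference, a minimal fixed-size chunker with overlap. Sizes here are counted in whitespace-separated words as a rough proxy for tokens; a real pipeline would use the embedding model's tokenizer.

```python
def chunk_fixed(text, chunk_size=256, overlap=32):
    """Split text into fixed-size chunks with overlap.

    Word counts stand in for tokens; swap in a real tokenizer
    (e.g. the one used by your embedding model) in production.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: a 1,000-word document yields 5 overlapping chunks
chunks = chunk_fixed("lorem ipsum " * 500, chunk_size=256, overlap=32)
print(len(chunks), len(chunks[0].split()))
```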
Add-Ons to Enhance Level 2
These techniques layer on top of basic vector retrieval to improve accuracy and handle edge cases.
Hybrid Search (Sparse + Dense)
Combine BM25 (keyword) and vector search with weighted fusion. Catches both exact keyword matches and semantic similarities.
```
Query → [BM25 Top-K] ─┐
                      ├→ RRF Fusion → Results
Query → [Dense Top-K] ┘
```
When to add: Almost always. This is becoming the default in production systems.
Example scenarios:
- Query: "Error code 0x8007045D". BM25 catches exact error code match, Vector catches related I/O error docs
- Query: "Python async await tutorial". BM25 catches exact "async await" keywords, Vector catches conceptually similar concurrency docs
- Query: "GDPR Article 17 compliance". BM25 catches exact "Article 17" reference, Vector catches related "right to erasure" content
Reranking Layer
Retrieve a broad candidate set (top 50) with fast vector search, then rerank with a cross-encoder model to return the top N.
```
Query → Vector Search (Top 50) → Cross-Encoder Rerank → Top 5
```
When to add: When precision matters more than raw speed. A cross-encoder pass substantially improves result quality for a modest latency cost.
| Query | Vector retrieves (noisy) | Reranker promotes |
|---|---|---|
| "Refund policy for damaged items?" | General refund docs, shipping docs | Specific damaged goods policy |
| "Terminate employee in California?" | HR docs, general termination, CA laws | CA-specific termination procedures |
| "Ibuprofen and alcohol side effects?" | Ibuprofen info, alcohol info | Specific interaction warnings |
HyDE (Hypothetical Document Embedding)
Generate a hypothetical answer first, embed that, then search for similar real documents.
1Query: "How do I handle database migrations in production?"2 │3 ▼4LLM generates hypothetical answer:5"To handle database migrations in production, you should use a6migration tool like Flyway or Liquibase. Always backup your7database first, run migrations during low-traffic periods..."8 │9 ▼10Embed the hypothetical answer11 │12 ▼13Search vector DB for real documents similar to this answer
Why it works: Queries are short and may not match document language. A hypothetical answer is longer, uses domain terminology, and is semantically closer to actual documents that answer the question.
When to add: Vague queries, abstract questions, or when query-document vocabulary mismatch is high.
| Query | Hypothetical Answer | Finds Documents About |
|---|---|---|
| "Why is my app slow?" | "N+1 queries, memory leaks, unoptimized indexes..." | Database optimization, memory profiling |
| "Best way to structure a team?" | "Cross-functional squads, matrix orgs, pod-based models..." | Org design, team topologies |
| "How do I not lose money?" | "Diversify investments, maintain emergency funds..." | Investment strategies, risk management |
Other Query Transformations
When to add: When users submit vague, incomplete, or poorly worded queries.
| Technique | What It Does | Example |
|---|---|---|
| Query Expansion | Rewrite query into multiple variants, search all | "JS not working" → "JavaScript errors", "JS debugging", "script not loading" |
| Step-Back Prompting | Abstract the query first, then search | "Is 140/90 BP bad?" → "Blood pressure ranges and health implications" |
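A sketch of the query-expansion row above: rewrite the query into variants, search each, and merge the results. The prompt wording is an assumption, and `search(query, top_k)` and `llm(prompt)` are placeholders for your own retriever and LLM call.

```python
def expand_and_search(query, search, llm, n_variants=3, top_k=5):
    """Rewrite the query into several variants, search all, and merge."""
    prompt = (f"Rewrite this search query {n_variants} different ways, "
              f"one per line, keeping the same intent:\n{query}")
    variants = [query] + [v.strip() for v in llm(prompt).splitlines() if v.strip()]

    seen, merged = set(), []
    for variant in variants:
        for doc_id in search(variant, top_k):
            if doc_id not in seen:   # deduplicate across variants
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```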
Hierarchical / Parent-Child Retrieval
Embed small chunks for precision, but retrieve the parent chunk (or full document) for context.
```
Index:    Small chunks (256 tokens)
Retrieve: Parent chunks (1024 tokens) or full sections
```
When to add: When retrieved chunks lack sufficient context for good generation.
| Small Chunk | Problem | Parent Chunk Provides |
|---|---|---|
| "The fee is 2.5% per transaction" | 2.5% of what? Which transactions? | Full pricing section with tiers, caps |
| "Patients should avoid this if pregnant" | Avoid what? What medication? | Full drug info with name, dosage |
| "Returns must be initiated within 30 days" | 30 days from what? | Full returns policy with definitions |
Contextual Retrieval
Prepend LLM-generated context to each chunk before embedding, explaining where the chunk fits in the source document.
1Original: "The company reported $5M revenue."2Contextual: "This chunk is from Q3 2024 earnings report, Revenue section. The company reported $5M revenue."
When to add: When chunks lose meaning without document context (pronouns, references, abbreviations).
| Original Chunk | Problem | With Context |
|---|---|---|
| "It increased by 15% YoY" | What increased? Which year? | "From Acme Corp 2024 Annual Report, Operating Expenses section: ..." |
| "Users must complete this before proceeding" | Complete what? Where? | "From Onboarding Guide, Step 3 - Identity Verification: ..." |
| "The API returns a 429 error in this case" | Which API? What case? | "From Payment Gateway Docs, Rate Limiting section: ..." |
Agentic RAG
Instead of one-shot retrieval, the LLM orchestrates multi-step search:
- Analyze query
- Decide which index/tool to search
- Evaluate results
- Iterate if insufficient
When to add: Complex, multi-part questions that can't be answered in a single retrieval pass.
| Query | Agent Steps |
|---|---|
| "Compare Q3 revenue to competitors, suggest pricing changes" | Internal docs → Market data → Pricing strategy → Synthesize |
| "What caused Tuesday's outage? Fixed similar issues before?" | Incident reports → Extract cause → Historical incidents → Check status |
| "Find FAANG candidates with ML experience, draft outreach" | Resume search → Rank → Email templates → Personalize |
Self-RAG / Corrective RAG
After retrieval, the LLM critiques whether retrieved docs actually answer the question. If not, it reformulates and retries.
```
Retrieve → Evaluate Relevance →
    If sufficient   → Generate
    If insufficient → Reformulate query → Retrieve again
```
When to add: When hallucination reduction is critical and you can tolerate extra latency.
| Query | First Retrieval | Self-Correction |
|---|---|---|
| "Cancellation fee for enterprise plans?" | General pricing (no enterprise) | Reformulates → "enterprise cancellation terms" |
| "Integrate Salesforce with OAuth?" | Generic OAuth + unrelated Salesforce | Reformulates → "Salesforce OAuth integration tutorial" |
| "Tax implications of RSU vesting?" | General RSU overview (no tax) | Reformulates → "RSU taxation at vesting" |
Level 3: Graph + Vector Database
When to use: Data is interlinked in ways you know and understand
This is essential when relationships between entities are first-class citizens in your domain and users ask relational or comparative questions.
Ideal Use Cases
| Domain | Why Graph + Vector |
|---|---|
| Organizational Data | "Who reports to X's manager?" requires traversing relationships |
| Legal / Regulatory | Rules depend on each other: "If A applies, what exceptions exist under B?" |
| Medical / Scientific | Entities interconnect: bacteria → resistance → drugs → diseases |
| Product Catalogs | "Find alternatives to X that are compatible with Y" |
| Knowledge Bases | Multi-hop reasoning across connected concepts |
When Graph Adds Value
- Relationships are explicit and known
- Users ask questions requiring multi-hop traversal
- Accuracy on entity relationships is critical
- You need audit trails for why something was retrieved
Graph + Vector increases answer correctness when your data has inherent structure that pure semantic search would lose.
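A minimal graph-plus-vector sketch using networkx: resolve entities mentioned in the query, expand a hop or two along known relationships, and restrict vector search to documents attached to those entities. The graph contents and the `vector_search` callable are invented for the example.

```python
import networkx as nx

# Toy knowledge graph: drugs, conditions, and the documents that mention them.
G = nx.Graph()
G.add_edge("ibuprofen", "nsaid", relation="is_a")
G.add_edge("nsaid", "stomach_bleeding", relation="risk_of")
G.add_edge("stomach_bleeding", "doc_interaction_warnings", relation="described_in")

def graph_expand(entities, hops=2):
    """Collect every node within `hops` of the query entities."""
    expanded = set(entities)
    for entity in entities:
        if entity in G:
            expanded |= set(nx.single_source_shortest_path_length(G, entity, cutoff=hops))
    return expanded

def graph_filtered_search(query, query_entities, vector_search):
    # 1. Multi-hop traversal from the entities detected in the query.
    related = graph_expand(query_entities)
    allowed_docs = {n for n in related if n.startswith("doc_")}
    # 2. Vector search restricted to documents reachable in the graph.
    #    `vector_search(query, allowed_ids)` is a placeholder for your vector DB call.
    return vector_search(query, allowed_ids=allowed_docs)

print(graph_expand({"ibuprofen"}))  # reaches nsaid and stomach_bleeding
```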
Decision Flowchart
```
START
  │
  ▼
Is your data structured with known query patterns?
  │
  ├─ YES → Level 1: Statistical/Rule-Based
  │        (BM25 + metadata filters)
  │
  └─ NO
      │
      ▼
     Are relationships between entities critical to answers?
      │
      ├─ YES → Level 3: Graph + Vector
      │        (Knowledge graph + embeddings)
      │
      └─ NO → Level 2: Vector Database
              (Choose appropriate chunking method)
```
Quick Reference
| Data Characteristic | Recommended Approach |
|---|---|
| FAQ / Support tickets | Level 1: BM25 + popularity ranking |
| Technical documentation | Level 1: BM25 + metadata filters |
| Books / Long-form content | Level 2: Vector DB + semantic chunking |
| Mixed document corpus | Level 2: Vector DB + recursive chunking |
| Org charts / Hierarchies | Level 3: Graph + Vector |
| Legal with dependencies | Level 3: Graph + Vector |
| Medical knowledge base | Level 3: Graph + Vector |
Summary
- Start simple: If your queries are predictable, Level 1 (statistical methods) often produces the highest accuracy with the least complexity.
- Scale to vectors: When content is unstructured and self-contained, a well-chunked vector database handles semantic similarity effectively.
- Add graphs for relationships: When entities interconnect in meaningful ways, combining graph traversal with vector search significantly improves answer correctness.
The right approach depends on your data's nature, not on what's most technically sophisticated.
Thoughts? Hit me up at [email protected]