When people talk about “search,” they often mean two very different things. One is keyword search - the decades-old technique that matches exact words. The other is semantic search - the newer, AI-powered approach that matches meaning.
In the industry, people often pick a side and debate it like it’s tabs vs spaces. The reality? In production, hybrid search (keyword + semantic) usually wins. It combines the precision of keywords, the recall of semantics, and the quality boost of modern reranking models.
This post will walk through:
- How each method works and where it excels.
- What the benchmarks say (with accuracy numbers you can cite).
- How to structure a hybrid pipeline with modern tools.
- Cost and latency trade-offs.
- Common pitfalls and how to avoid them.
1. Keyword Search: The Trusty Workhorse
Keyword search (often good old BM25) treats text as weighted terms. It doesn’t “understand” meaning - it matches strings - but it’s fast, predictable, and cheap.
Why it’s still a baseline in 2025:
- Exactness - Perfect for legal, compliance, IDs, log search, and code.
- Low infra cost - Compact inverted indexes run well even at huge scales.
- Explainability - Easy to see why a document matched.
Limitations:
- Misses synonyms and paraphrases (“car” vs “automobile”).
- Struggles with natural language questions (“books about people who travel in space”).
Benchmark note:
In the BEIR benchmark (18 zero-shot retrieval datasets), BM25 still holds up surprisingly well: many dense retrievers underperform it out-of-domain without reranking. It’s why AI engineers still use it as a yardstick.
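To make the term-matching behavior concrete, here is a toy pure-Python sketch of BM25 scoring (standard k1/b defaults; in production you'd rely on Lucene, OpenSearch, or Elasticsearch's battle-tested implementation):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with classic BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: in how many docs does each term appear?
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue  # no string match, no contribution
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "the car broke down on the highway".split(),
    "an automobile is a road vehicle".split(),
    "the cat sat on the mat".split(),
]
print(bm25_scores(["car"], docs))
```

Note that the "automobile" document scores exactly zero for the query "car" - the synonym gap described above, made visible in four lines of output.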
2. Semantic Search: Understanding Beyond Words
Semantic search represents queries and documents as vectors in a high-dimensional space, capturing meaning rather than exact strings.
Strengths:
- Finds related terms and concepts without exact word matches.
- Works well for conversational queries and vague memory searches.
- Multilingual capabilities (search in English, find in Spanish).
Limitations:
- Infra complexity: embeddings + Approximate Nearest Neighbor (ANN) index.
- Less transparent scoring (harder to “show the math”).
- Latency and memory trade-offs.
Benchmark note:
Dense retrievers like DPR can achieve ~50% top-1 accuracy on Natural Questions (NQ), but their out-of-domain performance can lag - hence the push toward hybrid retrieval.
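The core mechanic is similarity in vector space. A minimal sketch, using hand-made 3-d vectors as stand-ins for real model embeddings (the document names and vector values here are purely illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 3-d "embeddings"; a real system would use a model like E5 or BGE
# and hundreds of dimensions.
doc_vecs = {
    "car repair guide":       [0.90, 0.10, 0.00],
    "automobile maintenance": [0.85, 0.20, 0.05],
    "cake recipes":           [0.00, 0.10, 0.95],
}
query_vec = [0.88, 0.15, 0.02]  # imagine: embedding of "fixing my car"

ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
print(ranked)
```

Unlike BM25, "automobile maintenance" ranks near the top despite sharing no words with the query - the vectors encode the relatedness.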
3. What the Numbers Say: Hybrid Beats Both
Let’s get concrete.
BEIR results:
- BM25: nDCG@10 ≈ 43.4 (average).
- BM25 + cross-encoder reranker: nDCG@10 ≈ 52.5 (+9 points).
- ColBERTv2 (late interaction): nDCG@10 ≈ 53–55 on several datasets.
- Hybrid retrieval (BM25 + dense + RRF + rerank) frequently outperforms both by 10–15% across varied domains.
SPLADE-v2 (sparse-neural) achieves dense-level performance while keeping sparse inverted indexes. On BEIR, it often matches or beats dense retrievers without the vector infra cost.
ColBERTv2 uses multiple vectors per document (late interaction) and consistently ranks above single-vector dense methods while being more storage-efficient than earlier versions.
Vendor data:
- OpenSearch benchmarks show hybrid + RRF outperforming either method alone across all tested datasets.
- Elastic’s own experiments show hybrid with RRF improving recall and ranking quality in multilingual settings.
4. When Each Shines (and When It Doesn’t)
| Scenario | Keyword Search | Semantic Search | Hybrid Search |
| --- | --- | --- | --- |
| Compliance/legal retrieval | ✅ Excellent | ❌ Not ideal | ✅ (keyword-weighted) |
| Conversational FAQ | ❌ Misses nuance | ✅ Captures intent | ✅ Best of both |
| E-commerce with strict filters | ✅ Great | ✅ Great for discovery | ✅ Perfect combo |
| Multilingual content | ❌ Weak | ✅ Strong | ✅ Best coverage |
| Code or log search | ✅ Gold standard | ❌ Overkill | ❌ Usually not worth hybrid |
5. The Hybrid Pipeline (Modern 2025 Stack)
Hybrid isn’t “run both and hope.” Modern stacks have a clear pattern:
Step 1 - Retrieve in parallel
- Lexical: BM25 or BM25+.
- Vector: embedding model (E5-v2, BGE-M3, or domain-specific) + an ANN index (e.g., HNSW for CPU serving, FAISS for GPU batch search, ScaNN for high-throughput CPU).
Step 2 - Fuse results
- Use Reciprocal Rank Fusion (RRF): stable, easy, and now built into OpenSearch, Elastic, and Azure AI Search.
- Or use a normalized linear combination if you need explicit weighting (Elastic’s new Linear Retriever).
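RRF is small enough to show in full. A sketch with the conventional k=60 constant (the same formulation the OpenSearch and Elastic built-ins use; document IDs here are made up):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking  = ["d3", "d1", "d7", "d2"]  # lexical retriever output
dense_ranking = ["d1", "d4", "d3", "d9"]  # vector retriever output
print(rrf_fuse([bm25_ranking, dense_ranking]))
```

Because RRF only looks at ranks, never raw scores, it sidesteps the scale-mismatch problem between BM25 and cosine similarity entirely - which is why it's the default fusion choice.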
Step 3 - Re-rank top-K
- Cross-encoder reranker (e.g., Cohere Rerank 3.5, open-source MiniLM).
- Keep K small (100–300) for latency.
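The rerank step itself is just "re-score the head, keep the tail." In this sketch the cross-encoder is a placeholder `score_pair` function - here a toy Jaccard token overlap stands in for a real model call (e.g., a MiniLM cross-encoder or a rerank API):

```python
def rerank_top_k(query, candidates, score_pair, k=100):
    """Re-score only the top-k fused candidates with an expensive pairwise scorer."""
    head, tail = candidates[:k], candidates[k:]
    reranked = sorted(head, key=lambda doc: score_pair(query, doc), reverse=True)
    return reranked + tail  # tail keeps its fusion order

# Placeholder scorer: Jaccard token overlap stands in for a cross-encoder.
def score_pair(query, doc):
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

candidates = ["hybrid search systems", "keyword search basics", "gardening tips"]
print(rerank_top_k("hybrid search", candidates, score_pair, k=2))
```

The structure is the point: the expensive model only ever sees K documents, so K directly caps the latency cost of reranking.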
6. Cost and Latency Truths
- Keyword search: cheapest infra, lowest latency, smallest indexes.
- Vector search: bigger indexes; query cost depends on ANN algorithm (HNSW, FAISS, ScaNN).
- Hybrid: small cost increase over BM25, big quality gain; reranking adds latency - cap K and monitor p95.
FAISS on GPUs can be ~8× faster than earlier GPU ANN baselines, while HNSW offers excellent CPU performance with recall >0.95 at reasonable latency.
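"Monitor p95" is cheap to do yourself. A minimal nearest-rank percentile sketch over collected per-query latencies (the sample values are invented):

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a list of latency samples."""
    s = sorted(latencies_ms)
    return s[int(0.95 * (len(s) - 1))]

# e.g. end-to-end latencies (ms) collected over a window of queries
samples = [12, 15, 11, 140, 13, 16, 14, 12, 95, 13]
print(p95(samples))
```

Track this per pipeline stage (retrieve, fuse, rerank) and the reranker will almost always be where the tail latency lives.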
7. Common Pitfalls (and Fixes)
- Score mismatch - BM25 and vector scores are on different scales → use RRF or normalization.
- Over-embedding - Route simple exact-match queries to BM25 only.
- Too large a K for reranking - Measure p95 latency; reduce top-K if needed.
- Blindly following leaderboards - Always evaluate on your own domain data.
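The score-mismatch fix without RRF is min-max normalization followed by a weighted blend - the idea behind linear fusion. A sketch (the weight `w` and the score values are illustrative; tune `w` on your own data):

```python
def minmax(scores):
    """Rescale a score list to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def linear_fuse(lexical, dense, w=0.5):
    """Normalize each score list, then blend with weight w on the lexical side."""
    a, b = minmax(lexical), minmax(dense)
    return [w * x + (1 - w) * y for x, y in zip(a, b)]

bm25_raw  = [12.4, 7.1, 0.3]    # unbounded BM25 scale
dense_raw = [0.82, 0.91, 0.12]  # cosine-similarity scale
print(linear_fuse(bm25_raw, dense_raw, w=0.5))
```

Blending the raw scores directly would let the unbounded BM25 values drown out the cosine similarities; normalizing first puts both retrievers on equal footing before `w` decides the balance.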
8. How to Measure Success
Use rank-aware metrics on a held-out query set:
- nDCG@k - overall ranking quality.
- MRR@k - how quickly the first relevant result appears.
- Recall@k - coverage.
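All three metrics fit in a few lines. A binary-relevance sketch (graded relevance changes the nDCG gain term, but the shape is the same; the ranking and judgment set here are made up):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant docs retrieved in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant result, 0 if none in top k."""
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: gain 1 per relevant doc, log2 rank discount."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal

ranked = ["d2", "d5", "d1", "d9"]   # system output
relevant = {"d1", "d2"}             # human judgments
print(recall_at_k(ranked, relevant, 3),
      mrr_at_k(ranked, relevant, 3),
      round(ndcg_at_k(ranked, relevant, 3), 3))
```

Run the same judged query set through each of the four configurations below and compare.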
Compare:
- BM25 baseline
- Dense retrieval
- Hybrid
- Hybrid + rerank
9. The 2024–2025 Shift
In the last 12 months:
- Hybrid + RRF became first-class in major search platforms.
- Cross-encoders like Rerank 3.5 made reranking faster and multilingual.
- Sparse-neural models (SPLADE-v2) bridged the gap between keyword and dense embedding models.
- ANN libraries improved - FAISS, HNSW, and ScaNN now have better recall-latency trade-offs.
10. Takeaway
Don’t choose between keyword and semantic.
Choose hybrid: parallel retrieval → RRF/score fusion → rerank top-K.
It’s how modern search stacks hit the sweet spot of coverage, precision, and cost-efficiency.