Writing · Case study · 2026 · Codebase RAG
RepoDoc: codebase RAG built as infrastructure
Retrieval runs over what each file means, indexing is a durable Postgres lease queue, and every token is metered against a per-project budget.
fig. 00 - results
- 768-dim embedding per file summary
- top-5 pgvector retrieval + top-3 memory
- 5-min lease per indexing job
- 402 returned on a project over budget
Section 01
The problem
Most of the work in understanding a codebase isn't reading the file you have open - it's finding the three files you didn't know to open.
Onboarding to an unfamiliar repo means reconstructing a mental model that lives nowhere: which module owns auth, where the rate limit is configured, what actually runs on a cron. Grep finds strings; it doesn't find concepts. Documentation, when it exists, drifts from the code the day after it's written.
"RAG over a codebase" is the obvious answer and the obvious trap. The demo is easy: chunk the files, embed the chunks, retrieve top-k, call an LLM. It falls apart on real repos for reasons that have nothing to do with prompting. Raw code embeds lexically - variable names and syntax - so a query like "how does authentication work" retrieves whatever file happens to share tokens with the question, not the file that implements the concept. And ingesting a whole repository is a systems problem, not a model problem: it has to survive serverless time limits, partial failures, and the fact that an LLM call per file turns "index this repo" into an unbounded bill.
Section 02
The thesis
RepoDoc takes four opinionated positions, and they're the reason it behaves differently from a generic RAG wrapper.
- 01 Embed what a file means, not what it says
During indexing, each file is summarized by an LLM into a ≤100-word description of its purpose, and that summary is embedded - gemini-embedding-001 at 768 dimensions - not the raw source. Retrieval is then a cosine search over intent against pgvector, top-5. The cost is an extra LLM call per file at index time and a dependency on summary quality; the payoff is that "where are rate limits configured" retrieves the rate limiter even when the query shares no tokens with it.
- 02 The database is the queue
Indexing is modeled as an IndexingJob row in Postgres with a lease, not as a call to SQS or Redis or BullMQ. A worker claims a job with an atomic compare-and-swap, holds a five-minute lease, and releases it on completion or failure. One datastore, transactional with the data it indexes, no extra infrastructure to operate.
- 03 Cost is a runtime constraint, not a dashboard
Every AI request writes a QueryMetrics row (model, tokens, latency, estimated USD, retrieval and memory counts, cold-start and cache flags, success/error). A project can set monthlyCostLimitUsd; when it's exceeded, a query returns 402 and in-flight indexing pauses itself and requeues. Spend is bounded in the hot path, not reconciled after the bill arrives.
- 04 Durable repo memory, separate from retrieval
RepoDoc extracts facts from each Q&A exchange into a RepoMemory store and pulls the top matches back as secondary context on later questions - capturing intent and decisions that live in conversations, not in any file. It's labeled distinctly from code, under one rule: when memory and code conflict, the code wins.
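As a rough illustration of positions 01 and 04, the sketch below ranks file summaries by cosine similarity in memory and assembles a context in which code and memory are labeled separately. It is a stand-in, not RepoDoc's code: the real search runs in pgvector, and `SummaryRow`, `topK`, `buildContext`, and the section labels are hypothetical names chosen for this example.

```typescript
// In-memory stand-in for the pgvector search (assumption: the real query runs in SQL).
interface SummaryRow {
  path: string;        // file path
  summary: string;     // LLM-written description of the file's purpose
  embedding: number[]; // embedding of the summary, not of the raw source
}

// Cosine similarity between two vectors of the same dimension.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k rows whose summary embeddings are closest to the query embedding.
function topK(query: number[], rows: SummaryRow[], k: number = 5): SummaryRow[] {
  return rows
    .map((row) => ({ row, score: cosine(query, row.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.row);
}

// Keep code and memory visibly separate so the model can apply "code wins".
// The label wording is hypothetical.
function buildContext(code: SummaryRow[], memory: string[]): string {
  return [
    "## Code (authoritative)",
    ...code.map((r) => `${r.path}: ${r.summary}`),
    "## Repo memory (secondary; if it conflicts with code, trust the code)",
    ...memory,
  ].join("\n");
}
```

Calling `topK(queryEmbedding, rows)` with the default `k` mirrors the top-5 retrieval; passing the result and the top memory matches to `buildContext` yields a prompt section in which the two sources cannot be confused.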
Section 03
Indexing: the database is the queue
Indexing is a job, not a request. Triggers enqueue an IndexingJob; a worker claims it atomically and processes the repo file by file.
[Diagram: Triggers, Claim (exactly once), Per file]
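The claim step can be modeled in a few lines. The sketch below imitates a conditional update against an in-memory table and trusts the claim only when exactly one row changed. It is an illustration under assumptions: in Postgres the atomicity comes from the single UPDATE statement, which this model does not reproduce, and the `lockedBy` field and the `updateManyIfClaimable` helper are hypothetical.

```typescript
type JobStatus = "queued" | "processing" | "done" | "failed";

interface IndexingJob {
  id: string;
  status: JobStatus;
  lockedAt: number | null; // epoch ms when the lease was taken
  lockedBy: string | null; // hypothetical field, used here only to show who won
}

const LEASE_MS = 5 * 60 * 1000; // five-minute lease

// Stand-in for a conditional "UPDATE ... WHERE": returns how many rows changed.
function updateManyIfClaimable(
  table: IndexingJob[],
  id: string,
  workerId: string,
  now: number
): { count: number } {
  let count = 0;
  for (const job of table) {
    // A processing job whose lease is older than LEASE_MS counts as stale.
    const stale =
      job.status === "processing" &&
      job.lockedAt !== null &&
      now - job.lockedAt > LEASE_MS;
    if (job.id === id && (job.status === "queued" || stale)) {
      job.status = "processing";
      job.lockedAt = now;
      job.lockedBy = workerId;
      count++;
    }
  }
  return { count };
}

// The claim is trusted only when exactly one row was flipped.
function claimJob(table: IndexingJob[], id: string, workerId: string, now: number): boolean {
  return updateManyIfClaimable(table, id, workerId, now).count === 1;
}
```

Two workers calling `claimJob` for the same queued job get `true` and `false` in that order; the loser simply moves on. Once the lease is older than `LEASE_MS`, the same call succeeds for a different worker, which is the recovery path for a crashed one.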
Section 04
Query: grounded answer, or a pre-index fallback
Every query passes the same gate. If the project has no embeddings yet, RepoDoc answers from a live GitHub fetch instead of making the user wait for a full index.
[Diagram: Gate, Indexed path, Pre-index path (embeddings == 0)]
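A minimal sketch of the gate and of the pre-index file selection follows. The routing rule (zero embeddings means the pre-index path) and the cap of 22 files come from this write-up; the scoring heuristic, the file-name lists, and the names `routeQuery`, `scorePath`, and `pickPreindexFiles` are assumptions made for illustration.

```typescript
const MAX_PREINDEX_FILES = 22;

type Route =
  | { path: "indexed"; preindex: false }
  | { path: "preindex"; preindex: true };

// The gate: no embeddings yet means answer from a live fetch instead of waiting.
function routeQuery(embeddingCount: number): Route {
  return embeddingCount === 0
    ? { path: "preindex", preindex: true }
    : { path: "indexed", preindex: false };
}

// Hypothetical scoring: READMEs first, then configs, then entrypoints;
// shallower paths beat deeper ones.
function scorePath(path: string): number {
  const parts = path.split("/");
  const name = parts[parts.length - 1].toLowerCase();
  const depth = parts.length - 1;
  const configs = ["package.json", "tsconfig.json", "dockerfile", "pyproject.toml"];
  let score = 0;
  if (/^readme/.test(name)) score += 100;
  if (configs.indexOf(name) !== -1) score += 60;
  if (/^(index|main|app|server)\.[a-z]+$/.test(name)) score += 40;
  return score - depth * 5;
}

// Pick at most MAX_PREINDEX_FILES paths, best score first, ties broken by path.
function pickPreindexFiles(tree: string[]): string[] {
  return tree
    .map((path) => ({ path, score: scorePath(path) }))
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score || a.path.localeCompare(b.path))
    .slice(0, MAX_PREINDEX_FILES)
    .map((x) => x.path);
}
```

The cap keeps the fallback cheap: however large the repository tree is, the pre-index answer reads a bounded number of files.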
Section 05
Surviving the serverless wall
A large repo can't be indexed in one 60-second invocation, and a worker can die mid-job. Both are handled by the same lease + cursor mechanism.
[Diagram: Timing out gracefully, Recovering a dead worker]
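The time-box and cursor can be sketched as one function that processes files in a stable order and stops before its budget runs out. This is an illustration, not the worker itself: `runSlice`, the 45-second default, and the idea of sorting by path are assumptions, and the clock is injected so the behavior can be checked without waiting.

```typescript
interface JobState {
  status: "queued" | "processing" | "done";
  resumeAfter: string | null; // last file path that was fully processed
}

// Hypothetical budget: stop well before a 60-second platform limit.
const TIME_BOX_MS = 45000;

function runSlice(
  job: JobState,
  files: string[], // every file path in the repo
  processFile: (path: string) => void,
  now: () => number, // injected clock, so tests can fake time
  timeBoxMs: number = TIME_BOX_MS
): JobState {
  const startedAt = now();
  const after = job.resumeAfter;
  // A stable order is what makes a path usable as a cursor.
  const pending = files
    .slice()
    .sort()
    .filter((p) => after === null || p > after);
  let cursor = after;
  for (const path of pending) {
    if (now() - startedAt >= timeBoxMs) {
      // Out of time: requeue with the cursor instead of hitting the platform limit.
      return { status: "queued", resumeAfter: cursor };
    }
    processFile(path);
    cursor = path;
  }
  return { status: "done", resumeAfter: cursor };
}
```

A worker that reclaims a dead worker's job calls the same function with the stored cursor, so timeout recovery and crash recovery share one code path and neither re-processes finished files.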
Section 06
Why this is hard
- 01 Claiming a job exactly once under concurrent workers
claimJob is a conditional updateMany - it flips queued/stale-processing → processing and trusts the claim only when res.count === 1. Two workers that wake on the same job can't both win; it's a compare-and-swap on the row, no advisory locks.
- 02 Surviving the serverless 60-second wall
Indexing a large repo can't finish in one invocation. The worker time-boxes itself, and when it runs out it writes a resumeAfter cursor, requeues the job, and re-kicks - so indexing makes forward progress across many short runs instead of dying at the platform timeout.
- 03 Recovering a dead worker without double-processing
A lease is five minutes. A job stuck in processing with lockedAt older than that is reclaimable; @@index([status, lockedAt]) makes finding it cheap. A crashed worker's job is picked up by the next one and resumed from its cursor.
- 04 Being useful before indexing finishes
queryCodebasePreindex fetches up to 22 high-value files (READMEs, configs, entrypoints) straight from the GitHub tree and answers from those, flagging the response preindex: true, while kicking the indexer in the background.
- 05 Bounding spend mid-flight
isProjectOverBudget short-circuits queries to 402, and the indexer checks budget between files - a job that would blow the limit pauses rather than running the meter up.
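The budget check in item 05 might look roughly like the sketch below: sum the month's estimated spend, refuse queries with 402 once the limit is exceeded, and re-check between files while indexing. The names `isProjectOverBudget` and `monthlyCostLimitUsd` come from this write-up; the `estimatedUsd` field name, the calendar-month window in UTC, the flat cost per file, and the `handleQuery` and `indexWithBudget` helpers are assumptions for illustration.

```typescript
interface QueryMetric {
  projectId: string;
  estimatedUsd: number; // hypothetical field name for the estimated cost
  createdAt: Date;
}

interface Project {
  id: string;
  monthlyCostLimitUsd: number | null; // null means no limit set
}

// Sum this project's estimated spend for the current calendar month (UTC).
function monthToDateSpend(metrics: QueryMetric[], projectId: string, now: Date): number {
  return metrics
    .filter(
      (m) =>
        m.projectId === projectId &&
        m.createdAt.getUTCFullYear() === now.getUTCFullYear() &&
        m.createdAt.getUTCMonth() === now.getUTCMonth()
    )
    .reduce((sum, m) => sum + m.estimatedUsd, 0);
}

function isProjectOverBudget(project: Project, metrics: QueryMetric[], now: Date): boolean {
  if (project.monthlyCostLimitUsd === null) return false;
  return monthToDateSpend(metrics, project.id, now) > project.monthlyCostLimitUsd;
}

// Query path: short-circuit before any model call is made.
function handleQuery(project: Project, metrics: QueryMetric[], now: Date): { status: number; body: string } {
  if (isProjectOverBudget(project, metrics, now)) {
    return { status: 402, body: "Monthly cost limit reached" };
  }
  return { status: 200, body: "ok" };
}

// Indexing path: check between files and requeue instead of overspending.
function indexWithBudget(
  project: Project,
  metrics: QueryMetric[],
  files: string[],
  costPerFileUsd: number, // hypothetical flat cost, for the sake of the example
  now: Date
): { indexed: string[]; status: "done" | "queued" } {
  const indexed: string[] = [];
  for (const path of files) {
    if (isProjectOverBudget(project, metrics, now)) return { indexed, status: "queued" };
    metrics.push({ projectId: project.id, estimatedUsd: costPerFileUsd, createdAt: now });
    indexed.push(path);
  }
  return { indexed, status: "done" };
}
```

Because the check runs before each file rather than after the job, the overshoot is bounded by the cost of a single file.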
Section 07
Design decisions & tradeoffs
Embed LLM summaries, not raw code
- Why: code embeds lexically; intent is what you query by.
- Tradeoff: an LLM call per file at index time, and retrieval is only as good as the summaries.
Postgres as the job queue (lease + CAS)
- Why: one transactional datastore, nothing extra to run.
- Tradeoff: polling, not push; not built for very high job throughput.
OpenRouter as the single chat gateway
- Why: route models per task without SDK churn, one billing surface - gemini-2.5-flash for chat, gemini-2.5-pro for README generation, Claude Haiku for docs.
- Tradeoff: an extra network hop and no native multi-provider failover.
Budget enforced in the request path and mid-index
- Why: AI cost is unbounded by default; a ceiling has to be live to matter.
- Tradeoff: a hard limit can interrupt indexing, which is why jobs are resumable.
Rate limiting and secret encryption fail open
- Why: for a single-operator product, a Redis blip or an unset key shouldn't 500 every request.
- Tradeoff: a deliberately weaker posture under those failures - documented, not hidden.
Section 08
Failure modes
Worker dies mid-index
- Its lease expires after five minutes; the next worker reclaims the job and resumes from the resumeAfter cursor.
- No stuck jobs, no re-embedding from scratch.
Serverless invocation times out
- The time-box requeues with a cursor before the platform kills the function.
- Indexing continues on the next invocation.
Project queried before it's indexed
- The pre-index path answers from live GitHub fetches.
- The result is marked preindex: true.
Project exceeds its budget
- Queries return 402 with a clear message.
- Running indexing pauses and requeues instead of overspending.
Redis unavailable / provider error
- The rate limiter falls back to a per-instance in-memory window.
- During indexing, each summary and embedding is retried twice with backoff; a persistent failure marks the job failed with the error string, surfaced in the UI for retry.
Section 09
Security model
Auth
- Clerk middleware guards everything except an explicit public allow-list; each API route re-checks the session and returns 401 if it is missing.
- Project access is scoped by owner and deletedAt: null on every query.
Input & SQL
- Zod validates request bodies; Prisma parameterizes all queries.
- The only raw SQL is the pgvector similarity search and embedding writes - both parameter-bound.
Secrets at rest
- Stored GitHub tokens are encrypted with AES-256-GCM in an envelope format, keyed by ENCRYPTION_KEY.
- If ENCRYPTION_KEY is unset, the code falls back to storing plaintext so the app keeps working - set the key in every environment to actually get encryption.
Webhooks
- The Clerk webhook verifies the svix HMAC signature and rejects on mismatch.
- The cron worker route authorizes with a constant-time (timingSafeEqual) shared-secret check.
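A constant-time shared-secret check can be sketched with Node's `crypto.timingSafeEqual`. That function throws when its inputs differ in length, so this example hashes both sides with SHA-256 first. The `Authorization: Bearer` header format and the names `safeEqual` and `authorizeCron` are assumptions; the write-up does not say how the secret is transmitted.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Compare two strings without leaking, via timing, where they first differ.
// Hashing first gives both buffers the same length, which timingSafeEqual requires.
function safeEqual(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

// Hypothetical header format: "Authorization: Bearer <secret>".
function authorizeCron(authorizationHeader: string | undefined, secret: string): boolean {
  // An unset secret must never authorize anything.
  if (!authorizationHeader || secret.length === 0) return false;
  const prefix = "Bearer ";
  if (authorizationHeader.slice(0, prefix.length) !== prefix) return false;
  return safeEqual(authorizationHeader.slice(prefix.length), secret);
}
```

A plain `===` on the secret would return as soon as the first differing character is found, which is the timing signal this comparison is meant to remove.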
Billing webhook
- The Gumroad billing webhook authenticates with a constant-time shared-secret check and maps a product permalink to a plan.
- The plan auto-downgrades to Starter on refund, chargeback, or cancellation.
Rate limiting
- Per-identity fixed-window limiting, preferring the platform-set x-real-ip over the spoofable x-forwarded-for.
- Returns 429 with Retry-After; it fails open under store failure.
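The limiter's behavior can be illustrated with a per-instance, in-memory fixed window, which is also the shape of the fallback used when the shared store is unavailable. This is a sketch under assumptions: `HeaderMap`, `clientIdentity`, `FixedWindowLimiter`, and `checkWithFallback` are hypothetical names, and the limit and window size are left to the caller.

```typescript
type HeaderMap = { [name: string]: string | undefined };

interface Decision {
  allowed: boolean;
  status: 200 | 429;
  retryAfterSec: number; // value for the Retry-After header when blocked
}

// Prefer the platform-set header; x-forwarded-for can be supplied by the client.
function clientIdentity(headers: HeaderMap): string {
  const real = headers["x-real-ip"];
  if (real) return real.trim();
  const forwarded = headers["x-forwarded-for"];
  if (forwarded) return forwarded.split(",")[0].trim();
  return "unknown";
}

class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  check(id: string, now: number): Decision {
    // Windows are aligned to multiples of windowMs.
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    const current = this.windows.get(id);
    if (!current || current.start !== windowStart) {
      this.windows.set(id, { start: windowStart, count: 1 });
      return { allowed: true, status: 200, retryAfterSec: 0 };
    }
    if (current.count < this.limit) {
      current.count++;
      return { allowed: true, status: 200, retryAfterSec: 0 };
    }
    const retryAfterSec = Math.ceil((windowStart + this.windowMs - now) / 1000);
    return { allowed: false, status: 429, retryAfterSec };
  }
}

// If the shared store throws, degrade to the per-instance limiter instead of failing the request.
function checkWithFallback(
  primary: (id: string, now: number) => Decision,
  fallback: FixedWindowLimiter,
  id: string,
  now: number
): Decision {
  try {
    return primary(id, now);
  } catch (_err) {
    return fallback.check(id, now);
  }
}
```

The in-memory window counts only the requests one instance sees, so under a store outage the effective limit loosens as instances scale out. That is the weaker posture the design accepts in exchange for staying available.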
Section 10
Testing
The suite is small and pointed - it covers the three places where a regression would be silent and expensive - and it runs in CI on every change.
- 01 What's covered
20 Jest unit tests across three files exercise the query gate, the GitHub loader, and the RAG retrieval, with Clerk, Prisma, Octokit, and the RAG layer mocked so each unit is tested in isolation.
- 02 Gated in CI
GitHub Actions runs the suite (test:ci) on every push and pull request to main, so a regression in those paths blocks the merge rather than reaching production.
- 03 The deliberate tradeoff
No integration tests against a live database and no end-to-end suite - everything is mocked at the boundaries, kept intentionally lightweight for a single-operator project.