Ask my research corpus.
A focused RAG system over a curated corpus of papers on physics-informed neural networks, neural operators, model compression, and LLM inference — the literature I actually use day-to-day. Retrieval runs entirely client-side using BM25 over precomputed indices; an LLM call (your choice of provider) writes the answer. Open the behind-the-scenes panel to see the retrieved chunks, scores, and exact prompt.
Ask anything about the corpus.
Retrieval is precomputed and instant. The LLM fills in the answer using only retrieved context — try one of these:
What's actually happening here?
Most "chat with your PDFs" demos hide the retrieval step behind a black box. This one shows everything: the chunks pulled, their similarity scores, and the exact prompt sent to the LLM. That transparency is the whole point — if retrieval fails, you can see why.
Architecture
- Static corpus: 24 hand-curated chunks from 16 scientific ML papers, shipped as a 30KB JSON file. No vector database, no backend.
- Retrieval: BM25 with light stemming, computed in the browser. Tokens from titles, sections, and tags are weighted via duplication (title × 3, tags × 2, section × 2, text × 1).
- Reranking: After BM25, results get a small boost for tag overlap with the query. Top-4 retained.
- Generation: Any OpenAI-compatible endpoint — OpenAI, Groq, Together, OpenRouter, local Ollama, anything that speaks
/v1/chat/completions. You bring your own key; it stays in localStorage and never leaves your device except to the endpoint you configure.
Honest trade-offs
- BM25 vs dense embeddings. Real production RAG uses dense vector embeddings (e.g.
text-embedding-3-small) which handle synonyms and paraphrasing better. BM25 wins for a corpus this small (24 chunks) and ships with zero infrastructure — but I'd swap it for dense + a reranker (e.g. Cohere Rerank) at production scale. - No streaming. Responses arrive in one chunk after the LLM call completes. Easy to add streaming with SSE; left out to keep the demo focused.
- No conversational memory. Each question is independent. Multi-turn would need history-aware query rewriting (e.g.
condense_questionpattern), which is straightforward but adds an extra LLM call.
Why not "chat with all the PDFs"?
A focused corpus is the whole point. The system is good at scientific ML because the chunks are tagged, sectioned, and curated. Throw 10,000 random PDFs at the same architecture and recall collapses. Choosing a scope is a real engineering decision — and a more interesting portfolio piece than another generic Q&A bot.
The corpus
Covers Raissi et al. (PINNs), Karniadakis et al. (PINN review), Sahli Costabal (cardiac PINN), Li et al. (FNO), Lu et al. (DeepONet), Chen et al. (Neural ODE), Hinton (distillation), Han (deep compression), Frankle & Carbin (lottery ticket), Jacob (QAT), Dettmers (LLM.int8), Frantar (GPTQ), Dao (FlashAttention), Kwon (vLLM), Buoso (WarpPINN), Holzapfel & Ogden, Guccione, plus de Avila Belbute-Peres (differentiable physics), JAX, Adam, and Transformer.