colon-md 1 hour ago

Last week, I read Karpathy's gist on building a personal wiki LLM (https://gist.github.com/karpathy/442a6bf555914893e9891c11519...) and decided to try it.

The RAG pitch: take your own corpus of docs, layer an LLM over it, and get a thing that answers questions grounded in your stuff. The wiki+RAG hybrid is the interesting architectural variant.

So I started building the "traditional" retrieval architectures (pure dense, BM25, hybrid RRF, rerank) to pit against the wiki+RAG variant with structure layered over the chunks.

After a few days of code cleanup I have an eval testbench, and the wiki LLM is only 50% built. I'm releasing the testbench now because I think it's just as valuable as the RAG design itself.

What the repo does: runs four hosted RAG services against identical inputs (same 81-doc enterprise corpus, same 50 questions stratified across single-hop / multi-hop / contradiction / unanswerable, same retrieve-only scoring of 0.7×recall + 0.3×precision):

  - Azure AI Search: 84.0  (recall 90.9%, precision 67.8%)
  - Vertex AI RAG Engine: 82.6  (94.5%, 54.7%)
  - Bedrock Knowledge Bases: 82.5  (87.9%, 70.1%)
  - OpenAI File Search: 78.5  (89.3%, 53.4%)
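For concreteness, the composite score above can be sketched like this. The function and variable names are illustrative, not the repo's actual API; applying the formula to the aggregate Azure numbers (recall 90.9%, precision 67.8%) reproduces the 84.0 headline, which is at least consistent with how the scores line up.

```python
def question_score(gold: set[str], retrieved: set[str]) -> float:
    """Retrieve-only score for one question: 0.7*recall + 0.3*precision
    over retrieved chunk IDs. Names here are hypothetical, not the repo's API."""
    if not retrieved:
        return 0.0
    hits = len(gold & retrieved)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(retrieved)
    return 0.7 * recall + 0.3 * precision

# Example: 2 gold chunks, 2 retrieved, 1 overlap ->
# recall 0.5, precision 0.5, score 0.5
print(question_score({"doc1#c3", "doc2#c1"}, {"doc1#c3", "doc9#c0"}))
```

The per-question scores are then averaged over the 50-question set to get the leaderboard numbers.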

Here's a surprising finding (maybe not surprising to you): all four major RAG services hallucinated an answer on every unanswerable question, 0/5 abstention correctness across the board. I was sort of expecting enterprise RAG providers like GCP, AWS, Azure, and OpenAI to respond "I don't know" to unanswerable questions.
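A naive sketch of what an abstention-correctness check can look like: score an unanswerable question as correct only if the answer is a refusal. The phrase matching below is a stand-in assumption; a real testbench would more likely use an LLM judge or a structured "no answer found" signal from the service.

```python
# Illustrative refusal markers; not the actual list any testbench uses.
REFUSAL_MARKERS = ("i don't know", "cannot be answered", "not in the provided")

def abstained(answer: str) -> bool:
    """True if the answer reads as a refusal rather than a made-up answer."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def abstention_correctness(answers: list[str]) -> float:
    """Fraction of unanswerable questions where the system abstained."""
    return sum(abstained(a) for a in answers) / len(answers)

# A 0/5 result means every one of the five unanswerable questions
# got a confident (hallucinated) answer instead of a refusal.
print(abstention_correctness(["The answer is 42."] * 5))
```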