Show HN: Open-source self-hosted LLM comparison tool for your own prompt


2 points by ycsuck a day ago

Hi HN,

I just pushed Duelr v0.1.1 to GitHub and would love your feedback.

Each month a new model claims to be “30% better,” yet I was still copy-pasting prompts into half a dozen playground tabs. Existing evaluation suites (Promptfoo, LangSmith, etc.) are great but heavyweight; I wanted a CLI / local web tool that gives you an answer in under a minute.

How it works:

- Paste a prompt (or JSON template) and click Compare All.
- Duelr fires the same request at the models you selected, such as GPT-4o, Claude 4 Sonnet, and Groq (more drivers coming).
- It shows, side by side:
  - response text
  - end-to-end latency (ms)
  - token counts and the resulting cost of the request
  - a simplicity/readability score (more heuristics coming)

The goal is to see how each model performs on your real prompts instead of on generic academic benchmarks.
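The core comparison loop can be sketched in a few lines of Python. This is a minimal illustration, not Duelr's actual code: the driver interface, model names, and per-token prices here are all hypothetical stand-ins.

```python
import time

# Hypothetical per-1K-token prices in USD -- illustrative only,
# not Duelr's real pricing tables.
PRICES = {"stub-model": {"in": 0.001, "out": 0.002}}

def compare(prompt, drivers):
    """Fire the same prompt at every driver and collect side-by-side metrics."""
    rows = []
    for name, call in drivers.items():
        start = time.perf_counter()
        # Each driver is assumed to return (response_text, tokens_in, tokens_out).
        text, tokens_in, tokens_out = call(prompt)
        latency_ms = (time.perf_counter() - start) * 1000
        price = PRICES[name]
        cost = tokens_in / 1000 * price["in"] + tokens_out / 1000 * price["out"]
        rows.append({
            "model": name,
            "response": text,
            "latency_ms": round(latency_ms, 1),
            "cost_usd": round(cost, 6),
        })
    return rows

# Stub driver standing in for a real API client.
def stub_driver(prompt):
    return ("ok: " + prompt[:20], len(prompt.split()), 3)

rows = compare("Summarize this paragraph in one sentence.",
               {"stub-model": stub_driver})
print(rows)
```

In practice each driver wraps a provider SDK call and reads token counts from the provider's usage metadata; the stub keeps the sketch runnable offline.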

Repo: https://github.com/stashlabs/duelr

P.S.: First time I’ve ever open-sourced a full tool.