Show HN: CodeLens.AI – Community benchmark comparing 6 LLMs on real code tasks

codelens.ai

5 points by skrid 2 days ago

Hi HN! I built CodeLens.AI - a community-driven AI benchmark using real developer code tasks.

The problem: Existing benchmarks use synthetic problems. I wanted to know which LLM is best at MY actual code challenges.

How it works (rough sketch of the pipeline after this list):

• Submit code + describe your task ("refactor this", "find security issues", etc.)

• 6 models solve it in parallel: GPT-5, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, Gemini 2.5 Pro, o3

• AI judge scores each solution (correctness, security, performance, etc.)

• You vote on the real winner

• Public leaderboard shows which models actually win on real-world tasks
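
Conceptually it's a fan-out-and-judge pipeline. Here's a minimal Python sketch of the idea — call_model and judge are placeholder stubs I made up for illustration, not the actual CodeLens.AI code:

    import asyncio

    MODELS = ["gpt-5", "claude-opus-4.1", "claude-sonnet-4.5",
              "grok-4", "gemini-2.5-pro", "o3"]
    CRITERIA = ["correctness", "security", "performance", "readability"]

    async def call_model(model: str, task: str, code: str) -> str:
        # Placeholder for a real provider API call (OpenAI, Anthropic, etc.)
        await asyncio.sleep(0)
        return f"{model} solution for: {task}"

    async def judge(solution: str, task: str) -> dict:
        # Placeholder: another LLM call that scores one solution per criterion
        await asyncio.sleep(0)
        return {c: 0.0 for c in CRITERIA}

    async def evaluate(task: str, code: str) -> dict:
        # Fan the same task out to all six models concurrently
        solutions = await asyncio.gather(
            *(call_model(m, task, code) for m in MODELS))
        # Score each solution with the AI judge, also concurrently
        scores = await asyncio.gather(
            *(judge(s, task) for s in solutions))
        return dict(zip(MODELS, scores))

    print(asyncio.run(evaluate("refactor this", "def f(x): return x")))

The community vote then acts as a human check on the AI judge's scores, which is what feeds the leaderboard.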

There are currently 10 evaluations live (100% vote completion rate). Early patterns are emerging:

• GPT-5 leads overall with a 40% win rate (4/10 wins)

• Gemini 2.5 Pro dominates security tasks

• GPT-5 is strongest at refactoring

• Claude Sonnet 4.5 is strongest at optimization tasks

A queue system keeps costs predictable ($10/day covers 15 free evaluations for the community).
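
The cap is simple arithmetic: $10 / 15 ≈ $0.67 per six-model evaluation. A hypothetical sketch of the budget gate (the names and reset logic here are my own, not the production code):

    from collections import deque
    from datetime import date

    DAILY_BUDGET_USD = 10.0
    COST_PER_EVAL_USD = DAILY_BUDGET_USD / 15  # ~$0.67 per evaluation

    class EvalQueue:
        def __init__(self):
            self.pending = deque()
            self.spent_today = 0.0
            self.day = date.today()

        def submit(self, task):
            self.pending.append(task)  # everything queues; nothing is dropped

        def next_to_run(self):
            if date.today() != self.day:  # new day: reset the budget
                self.day, self.spent_today = date.today(), 0.0
            if self.pending and self.spent_today + COST_PER_EVAL_USD <= DAILY_BUDGET_USD:
                self.spent_today += COST_PER_EVAL_USD
                return self.pending.popleft()
            return None  # budget exhausted: task waits for tomorrow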

Free during beta - would love your feedback!

https://codelens.ai