Find a cheaper LLM that holds your quality.
Bring your prompts. We screen 4,804 models, test the qualifying candidates against your prompts on nine metrics, and recommend the right swap, with the receipts.
4,804 models
131 providers
~1h 52m median audit
[01] the wedge
Other tools score the world. We score your work.
MMLU paid no invoices.
your prompts will.
[02] the landscape
Every model. Every provider.
4,804 models indexed
131 providers
1,262 open-weights
// live count from models.dev · refreshed hourly · we test the qualifying subset against your prompts.
[03] what we measure
Nine dimensions. One picture.
Quality on the right, speed on the bottom, cost on the left. The bigger the candidate’s polygon, the better the fit. Hover any axis to see why it matters.
baseline · candidate
metric inspector
Each ring step is a 20% improvement on that axis. A bigger polygon means a better fit. The smaller polygon is the GPT-5 baseline.
- 01 Judge score
- 02 Win rate
- 03 Schema validity
- 04 Tool-call accuracy
- 05 TTFT
- 06 Inter-token latency
- 07 P99 latency
- 08 Blended $/1M
- 09 Cost / success
−93% blended cost
−56% time to first token
−4% judge quality (delta)
+42% win rate
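The two cost axes and the percentage deltas above can be sketched in a few lines. This is a minimal illustration, not the platform's formula: the 75/25 input-to-output token mix, the function names, and every price below are assumptions made up for the example.

```python
def blended_price_per_million(input_price, output_price, input_share=0.75):
    """Weighted $/1M tokens. input_share is an assumed traffic mix."""
    return input_price * input_share + output_price * (1 - input_share)


def cost_per_success(total_cost, successes):
    """Spend divided by the number of outputs that passed the checks."""
    if successes == 0:
        return float("inf")  # nothing succeeded, so every success is infinitely costly
    return total_cost / successes


def percent_delta(baseline, candidate):
    """Signed change of candidate relative to baseline, in percent."""
    return (candidate - baseline) / baseline * 100


# Illustrative prices in $/1M tokens (not real quotes).
baseline_blend = blended_price_per_million(2.0, 10.0)
candidate_blend = blended_price_per_million(0.25, 0.5)

print(f"baseline  ${baseline_blend:.4f} per 1M")
print(f"candidate ${candidate_blend:.4f} per 1M")
print(f"delta     {percent_delta(baseline_blend, candidate_blend):.1f}%")
print(f"cost per success: ${cost_per_success(1.50, 120):.4f}")
```

Cost per success matters because a cheap model that fails schema checks half the time can cost more per usable output than a pricier one that rarely fails.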
[04] how it works
Three calls. No proxy. No router.
audit window · ~1h 52m median end-to-end
Bring your prompts
A prompt + ≥20 real sample inputs. JSONL upload or the form.
We test the alternatives
Fanned across qualifying candidates at 10× parallelism. Real outputs, cost, latency.
You see the receipts
Per-metric deltas, side-by-side outputs, judge transcripts. One-line swap.
// stateless analyzer. we never sit in your request path.
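For the JSONL route in the first step, a pre-upload check might look like the sketch below. The page does not publish a schema, so the `input` field name and the `load_samples` helper are hypothetical. Only the 20-sample minimum comes from the text above.

```python
import json

MIN_SAMPLES = 20  # the page asks for at least 20 real sample inputs


def load_samples(lines):
    """Parse JSONL lines and return the sample inputs.

    Assumes one JSON object per line with an "input" field (hypothetical name).
    """
    samples = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if "input" not in record:
            raise ValueError(f"line {number}: missing 'input' field")
        samples.append(record["input"])
    if len(samples) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples, got {len(samples)}")
    return samples


# Usage with in-memory lines; a file object opened for reading works the same way.
demo_lines = [json.dumps({"input": f"support ticket {i}"}) for i in range(20)]
print(len(load_samples(demo_lines)))
```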
[05] anatomy
From your prompt to the receipt.
Six plates make the platform legible. Each one is one phase of the audit, drawn.
plate i · Routing fan-out
A single prompt fans across qualifying candidates in parallel. Each one processes every sample, then a judge model evaluates head-to-head against your baseline.
▸ 1,128 calls · 8 × 47 × n=3
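The call count is plain multiplication: 8 candidates × 47 samples × 3 runs = 1,128. One way to bound such a fan-out at 10 calls in flight is an `asyncio.Semaphore`, sketched below. `call_model` is a hypothetical placeholder, not a real provider client.

```python
import asyncio

CANDIDATES = 8     # qualifying candidates after filtering
SAMPLES = 47       # sample inputs in this example audit
RUNS = 3           # n=3 repeats per sample
PARALLELISM = 10   # maximum calls in flight at once


async def call_model(candidate, sample, run):
    """Placeholder for a real provider call (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return (candidate, sample, run)


async def fan_out():
    gate = asyncio.Semaphore(PARALLELISM)

    async def bounded(candidate, sample, run):
        async with gate:  # at most PARALLELISM calls run concurrently
            return await call_model(candidate, sample, run)

    jobs = [
        bounded(c, s, r)
        for c in range(CANDIDATES)
        for s in range(SAMPLES)
        for r in range(RUNS)
    ]
    return await asyncio.gather(*jobs)


results = asyncio.run(fan_out())
print(len(results))  # 8 * 47 * 3 = 1128
```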
plate ii · Quality–cost frontier
Every candidate is plotted on a quality-vs-cost plane. The curve is the Pareto frontier: past it, you can't get cheaper without losing quality.
▸ recommendation = inflection point
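A Pareto frontier over (cost, quality) can be found with one sorted sweep. The sketch below uses invented candidates and scores, and it stops at the frontier. How the inflection point is chosen from it is not specified here, so that step is left out.

```python
def pareto_frontier(candidates):
    """Return names of candidates that no other candidate dominates.

    candidates: list of (name, cost, quality) tuples.
    Lower cost is better; higher quality is better.
    """
    frontier = []
    best_quality = float("-inf")
    # Cheapest first; on equal cost, the higher-quality candidate first.
    for name, cost, quality in sorted(candidates, key=lambda c: (c[1], -c[2])):
        if quality > best_quality:  # strictly better than anything cheaper
            frontier.append(name)
            best_quality = quality
    return frontier


# Illustrative numbers only: (name, blended $/1M, judge score).
demo = [
    ("a", 0.5, 0.70),
    ("b", 1.0, 0.82),
    ("c", 1.2, 0.80),   # dominated by b: pricier and worse
    ("d", 4.0, 0.90),
    ("e", 10.0, 0.90),  # dominated by d: same quality, higher cost
]
print(pareto_frontier(demo))  # ['a', 'b', 'd']
```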
plate iii · Candidate filter
Hard constraints cut the catalog first: context window, output mode, data residency, tool support. Most of the world's models drop out here.
▸ 4,804 → 24 → 8
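The hard-constraint pass amounts to a conjunction of checks per model. In the sketch below the `Model` and `Constraints` fields are hypothetical stand-ins for the four constraints named above, and the catalog entries are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    context_window: int       # tokens
    structured_output: bool   # supports the required output mode
    tool_support: bool
    regions: frozenset        # regions where data may be processed


@dataclass(frozen=True)
class Constraints:
    min_context: int
    needs_structured_output: bool
    needs_tools: bool
    allowed_regions: frozenset


def qualifies(model, c):
    """True only if the model passes every hard constraint."""
    return (
        model.context_window >= c.min_context
        and (model.structured_output or not c.needs_structured_output)
        and (model.tool_support or not c.needs_tools)
        and bool(model.regions & c.allowed_regions)
    )


catalog = [
    Model("alpha", 128_000, True, True, frozenset({"eu", "us"})),
    Model("beta", 8_000, True, True, frozenset({"us"})),      # context too small
    Model("gamma", 200_000, False, True, frozenset({"eu"})),  # no structured output
    Model("delta", 32_000, True, False, frozenset({"eu"})),
]
needs = Constraints(
    min_context=16_000,
    needs_structured_output=True,
    needs_tools=False,
    allowed_regions=frozenset({"eu"}),
)
print([m.name for m in catalog if qualifies(m, needs)])  # ['alpha', 'delta']
```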
plate iv · Sample variance
A median is a lie if the variance is huge. We run each sample three times (n=3), then plot the spread: a tight box beats a high mean.
▸ 47 samples · 3 runs each
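To see why a tight box can beat a high mean, compare the spread of two score lists. The sketch below uses Python's `statistics` module with invented judge scores. The interquartile range stands in for the box, which is an assumption about how the spread is drawn.

```python
import statistics


def spread_summary(scores):
    """Mean, median, and interquartile range for a list of scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {
        "mean": statistics.mean(scores),
        "median": median,
        "iqr": q3 - q1,  # height of the box in a box plot
    }


# Illustrative judge scores (0-10) pooled across repeated runs.
steady = [7, 7, 8, 7, 8, 7]
spiky = [10, 3, 10, 4, 10, 9]

for name, scores in (("steady", steady), ("spiky", spiky)):
    summary = spread_summary(scores)
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Here the spiky candidate has the higher mean but a much wider box, so the steady one is the safer swap.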
plate v · Judge tally
Blind pairwise: the judge sees both outputs without knowing the source, then votes. Aggregated across 750 matchups, the winner is the one most often picked.
▸ 750 head-to-heads · gpt-5 jury
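Blind pairwise voting comes down to shuffling the two outputs before the judge sees them, then mapping the vote back to its source. In this sketch the judge is a stand-in function that prefers the longer text. A real judge would be a model call, and all names here are hypothetical.

```python
import random
from collections import Counter


def blind_matchup(baseline_out, candidate_out, judge, rng):
    """Show both outputs in random order; return 'baseline' or 'candidate'."""
    pair = [("baseline", baseline_out), ("candidate", candidate_out)]
    rng.shuffle(pair)  # the judge never learns which slot holds which source
    pick = judge(pair[0][1], pair[1][1])  # judge returns 0 or 1
    return pair[pick][0]


def tally(matchups, judge, seed=0):
    """Count votes over a non-empty list of (baseline, candidate) output pairs."""
    rng = random.Random(seed)
    votes = Counter(blind_matchup(b, c, judge, rng) for b, c in matchups)
    total = sum(votes.values())
    return {"votes": dict(votes), "candidate_win_rate": votes["candidate"] / total}


def longer_wins(first, second):
    """Stand-in judge for the demo: prefers the longer output."""
    return 0 if len(first) >= len(second) else 1


demo_matchups = [
    ("short", "a longer answer"),
    ("a much longer baseline", "tiny"),
    ("x", "yy"),
]
result = tally(demo_matchups, longer_wins)
print(result["votes"], round(result["candidate_win_rate"], 3))
```

Because the stand-in judge only looks at length, the shuffle cannot change the outcome here. With a model judge, the shuffle is what removes position and brand bias.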
plate vi · Streaming inference
All eight candidates stream at once at 10× parallelism. Token-by-token, we log TTFT, inter-token latency, and total output length per sample.
▸ 10× parallel · ~42 min wall-clock
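The three per-sample measurements fall out of token arrival timestamps. A minimal sketch, assuming timestamps in seconds on the same clock as the request start, and using token count as a proxy for output length:

```python
def stream_metrics(request_start, token_times):
    """Derive latency metrics from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start  # time to first token
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0  # mean inter-token latency
    return {"ttft": ttft, "inter_token": mean_gap, "tokens": len(token_times)}


# Illustrative timestamps: first token at 0.5 s, then one every 0.25 s.
print(stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25]))
# {'ttft': 0.5, 'inter_token': 0.25, 'tokens': 4}
```

In a live run the timestamps would come from something like `time.perf_counter()` sampled as each streamed chunk arrives.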
// judge-model: GPT-5 by default · swap to claude-opus-4.7 in the API.
n=3 samples per prompt for variance control · same temperature across all candidates
[06] four reasons
Cost is the headline. It is rarely the only reason.
−87% median blended cost
Frontier rates for paring-knife work.
−56% median TTFT
Speed your users feel.
1,262 open-weights candidates
Self-hostable when the cloud isn't an option.
3.4× median TPM headroom
Headroom for the peak.
[07] pricing
Priced by the data we audit, not by seat or token.
Every plan is a fixed scope on a fixed price. You bring your prompts, we hand back the receipts. No per-seat sprawl, no usage-based surprises.
tier · 01
Spot Check
up to 50 prompts · 1 task type
Bring one task, get one recommendation. Five-day turnaround.
for solo teams or a single workload you want to sanity-check.
- 3 candidate models compared
- 9-metric scorecard on your data
- Single PDF + JSON receipt
- 5 business days
tier · 02
Production Audit
up to 500 prompts · 3 task types
Per-task recommendations with routing config you can ship in 72 hours.
for teams running real production traffic across multiple workloads.
- Per-workload model recommendation
- Routing config (OpenRouter / LiteLLM / direct)
- 9-metric scorecard + variance bands
- 72-hour turnaround
- Quarterly re-audit included
tier · 03
Continuous Routing
unlimited prompts · live data
We watch your traffic. Re-audit monthly. Route via our API or yours.
for teams routing at volume who want always-on optimization.
- Unlimited prompts + task types
- Live cost / latency / quality dashboard
- Auto-route via API (or export weights)
- Shared Slack channel
- Monthly model-fit review
tier · custom
On-prem, regulated data, or something we have not seen before.
HIPAA, GDPR, SOC 2, air-gapped, custom evals, agent harnesses, fine-tunes in the mix. If your audit does not fit the three tiers above, it probably fits here. We scope, you decide, we ship.
- on-prem / VPC deployment
- custom eval rubrics
- fine-tunes in the comparison
- agent / tool-use audits
- regulated data handling
- named MSA + DPA
pricing
talk to us
quoted per scope · NDA on request
or email aayush@trainmyllm.ai with a 2-line summary of your stack.
private beta · onboarding founding teams
Get in touch.
Leave your email and we’ll reach out, usually within a day, to scope your first audit and walk through what the receipts look like for your stack.
// or email us directly at aayush@trainmyllm.ai