v0.1 · public beta

Find a cheaper LLM that holds your quality.

Bring your prompts. We screen 4,804 models, test the qualifying candidates on your prompts across nine metrics, and recommend the right swap, with the receipts.

Get in touch · How it works

4,804 models · 131 providers · ~1h 52m median audit

audit · run_a47b9c

01 / 10

▸ 01 · loading your prompts

prompts.jsonl · 47 samples · 2.1 MB

01 ▸ “Classify ticket → {category, urgency, suggested_reply}”
02 ▸ “Summarize this article into 3 bullets in 50 words…”
03 ▸ “Extract entities from medical record (PII-aware)…”
04 ▸ “Generate Python that validates this JSON schema…”
05 ▸ “Translate product description to es-MX, preserve…”
06 ▸ “Rank these search results by intent match…”

loaded · 0 PII flags · 0 malformed · ready to tag →

// live story · 10 phases · ~28s loop · this is the audit, narrated.

[01] the wedge

Other tools score the world. We score your work.

MMLU paid no invoices.

your prompts will.

[02] the landscape

Every model. Every provider.

4,804 models indexed · 131 providers · 1,262 open-weights

// live count from models.dev · refreshed hourly · we test the qualifying subset against your prompts.

OpenAI

Anthropic

Google

Gemini

Nine dimensions. One picture.

Quality on the right, speed on the bottom, cost on the left. The bigger the candidate’s polygon, the better the fit. Hover any axis to see why it matters.

comparing openai · gpt-5 vs

baseline · candidate

metric inspector

Each ring step is a 20% improvement on that axis. Bigger polygon = better fit. Smaller polygon = the gpt-5 baseline.

01 Judge score
02 Win rate
03 Schema validity
04 Tool-call accuracy
05 TTFT
06 Inter-token latency
07 P99 latency
08 Blended $/1M
09 Cost / success
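
The page does not define the two cost metrics. A minimal sketch of one plausible reading, assuming "blended $/1M" is a token-weighted average of input and output prices and "cost / success" is total spend divided by the number of passing outputs. Both definitions, the function names, and the demo numbers are assumptions, not the product's formulas.

```python
def blended_price_per_million(in_price, out_price, in_tokens, out_tokens):
    """Token-weighted average of input and output prices, in $ per 1M tokens.

    Assumed definition: the page names the metric but does not spell it out.
    """
    total_tokens = in_tokens + out_tokens
    return (in_price * in_tokens + out_price * out_tokens) / total_tokens


def cost_per_success(total_cost, successes):
    """Total spend divided by the number of outputs that passed (assumed)."""
    if successes == 0:
        return float("inf")  # nothing passed, so every success is infinitely costly
    return total_cost / successes


# Illustrative numbers only: prices are $ per 1M tokens.
blended = blended_price_per_million(0.25, 1.00, 3_000_000, 1_000_000)
spend = (0.25 * 3_000_000 + 1.00 * 1_000_000) / 1_000_000
print(blended, cost_per_success(spend, 35))
```

Under this reading, a cheap model that fails often can still lose on cost / success, which is why the two metrics sit on separate axes.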

−93%

blended cost

−56%

time to first token

−4%

judge quality (delta)

+42%

win rate

[04] how it works

Three calls. No proxy. No router.

audit window

~1h 52m

median end-to-end

audit timeline · 1h 52m · 1,128 inference calls · ~750 judge calls

phases: fan · judge · rank

0m · 45m · 1h 25m · 1h 50m · 1h 52m

step 01

t = 0

Bring your prompts

A prompt + ≥20 real sample inputs. JSONL upload or the form.

request.http · POST · 200 OK

POST /v1/audit
{
  "current": "openai/gpt-5",
  "prompt": "Classify ticket → {…}",
  "samples": [ /* 47 items */ ],
  "priority": "cost"
}
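
A minimal sketch of assembling that request body from a JSONL upload. The four field names come from the example above. The helper name, the `{"input": ...}` sample shape, and the client-side minimum check are assumptions. No request is sent, since the page does not give a base URL.

```python
import json

MIN_SAMPLES = 20  # the page asks for at least 20 real sample inputs


def build_audit_request(current, prompt, jsonl_text, priority="cost"):
    """Assemble a body shaped like the POST /v1/audit example (hypothetical helper)."""
    # One JSON value per non-empty line; json.loads raises on a malformed line.
    samples = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if len(samples) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples, got {len(samples)}")
    return {
        "current": current,
        "prompt": prompt,
        "samples": samples,
        "priority": priority,
    }


# Assumed sample shape: {"input": "..."} is illustrative, not a documented schema.
demo_jsonl = "\n".join(json.dumps({"input": f"ticket {i}"}) for i in range(47))
body = build_audit_request("openai/gpt-5", "Classify ticket → {…}", demo_jsonl)
print(len(body["samples"]))
```

Validating the sample count before upload keeps an undersized file from starting an audit that cannot produce meaningful variance bands.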

step 02

t + 45m

We test the alternatives

Fanned across qualifying candidates at 10× parallelism. Real outputs, cost, latency.

audit.log · STREAMING

▸ filtering 4,804 → 24
▸ shortlist 24 → 8 by hard constraints
▸ fanning 8 × 47 × n=3 samples
▸ measuring output · cost · ttft
▸ 1,128 inference calls in ~42 min

step 03

t + 1h 52m

You see the receipts

Per-metric deltas, side-by-side outputs, judge transcripts. One-line swap.

receipt.txt · READY

▶ recommended deepseek/v3.1
quality_judge 96% (−4 pts)
win_rate 71%
blended $/1M $0.52 (−94%)
ttft 320ms (−56%)

// stateless analyzer. we never sit in your request path.

powered by models.dev · openrouter · litellm

[05] anatomy

From your prompt to the receipt.

Six plates make the platform legible. Each one is one phase of the audit, drawn.

plate i

Routing fan-out

A single prompt fans across qualifying candidates in parallel. Each candidate processes every sample, then a judge model compares its outputs head-to-head against your baseline.

▸ 1,128 calls · 8 × 47 × n=3
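
The call count is just the product of the three dimensions. A minimal sketch of the fan-out, assuming "10× parallelism" means ten concurrent workers. `call_model` is a stub standing in for a real inference call, and the candidate and sample names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def call_model(candidate, sample, run):
    """Stub for a real inference call (hypothetical). Returns a fake output."""
    return {"candidate": candidate, "sample": sample, "run": run,
            "output": f"{candidate}:{sample}"}


def fan_out(candidates, samples, n_runs=3, max_workers=10):
    """Run every (candidate, sample, run) combination on a bounded thread pool."""
    jobs = list(product(candidates, samples, range(n_runs)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves job order, so results line up with jobs.
        return list(pool.map(lambda job: call_model(*job), jobs))


candidates = [f"candidate-{i}" for i in range(8)]
samples = [f"sample-{i}" for i in range(47)]
results = fan_out(candidates, samples)
print(len(results))  # 8 × 47 × 3 = 1128
```

Threads suit this shape of work because each job spends its time waiting on the network rather than on the CPU.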

plate ii

Quality–cost frontier

Every candidate is plotted on a quality-vs-cost plane. The curve is the Pareto frontier: past it, you can't get cheaper without losing quality.

▸ recommendation = inflection point
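
A minimal sketch of extracting a Pareto frontier from (cost, quality) points. A candidate stays on the frontier only if no other candidate is both cheaper and better. The data is invented for illustration, and how the recommended point is picked along the curve is not shown here because the page does not specify it.

```python
def pareto_frontier(candidates):
    """Return candidates not dominated on (lower cost, higher quality)."""
    frontier = []
    best_quality = float("-inf")
    # Walk from cheapest to priciest; on a cost tie, consider higher quality first.
    for c in sorted(candidates, key=lambda c: (c["cost"], -c["quality"])):
        # Keep a candidate only if it beats everything cheaper than it.
        if c["quality"] > best_quality:
            frontier.append(c)
            best_quality = c["quality"]
    return frontier


# Illustrative points only (cost in $/1M, quality as a 0..1 judge score).
points = [
    {"name": "a", "cost": 0.5, "quality": 0.96},
    {"name": "b", "cost": 1.0, "quality": 0.90},  # dominated by "a"
    {"name": "c", "cost": 8.0, "quality": 1.00},
    {"name": "d", "cost": 0.2, "quality": 0.80},
    {"name": "e", "cost": 3.0, "quality": 0.97},
]
print([p["name"] for p in pareto_frontier(points)])
```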

plate iii

Candidate filter

Hard constraints cut the catalog first: context window, output mode, data residency, tool support. Most of the world's models drop out here.

▸ 4,804 → 24 → 8
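
A minimal sketch of a hard-constraint filter over the four dimensions named above. The `Model` and `Constraints` fields, the catalog entries, and the region codes are all hypothetical. The real catalog schema is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    context_window: int   # max tokens of context
    json_mode: bool       # structured output mode available
    tools: bool           # tool / function calling available
    regions: frozenset    # where the model can be served


@dataclass(frozen=True)
class Constraints:
    min_context: int
    need_json: bool
    need_tools: bool
    region: str


def qualifies(model, needs):
    """A model survives only if it meets every hard constraint."""
    return (
        model.context_window >= needs.min_context
        and (model.json_mode or not needs.need_json)
        and (model.tools or not needs.need_tools)
        and needs.region in model.regions
    )


# Hypothetical catalog entries for illustration.
catalog = [
    Model("model-a", 128_000, True, True, frozenset({"us", "eu"})),
    Model("model-b", 8_000, True, False, frozenset({"us"})),
    Model("model-c", 200_000, False, True, frozenset({"eu"})),
]
needs = Constraints(min_context=32_000, need_json=True, need_tools=True, region="eu")
shortlist = [m.name for m in catalog if qualifies(m, needs)]
print(shortlist)
```

Filtering before any inference call keeps the expensive fan-out limited to models that could actually replace the baseline.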

plate iv

Sample variance

A median is a lie if the variance is huge. We run each prompt n=3 times, then plot the spread: a tight box beats a high mean.

▸ 47 samples · 3 runs each
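
A minimal sketch of why spread matters alongside the average. It summarizes three runs per candidate with the standard library. The scores are invented, and the choice of population standard deviation and range as the spread measures is an assumption.

```python
from statistics import mean, pstdev


def spread(scores):
    """Summarize repeated runs of one prompt: center plus two spread measures."""
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),          # population stdev over the n runs
        "range": max(scores) - min(scores),
    }


# Illustrative judge scores for n=3 runs of the same prompt.
runs = {
    "tight": [0.90, 0.91, 0.89],
    "wide": [1.00, 0.75, 1.00],
}
summary = {name: spread(scores) for name, scores in runs.items()}
for name, stats in summary.items():
    print(name, stats)
```

Here the "wide" candidate has the higher mean but fails badly one run in three, which is the case the plate warns about.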

plate v

Judge tally

Blind pairwise: the judge sees both outputs without knowing the source, then votes. Aggregated across 750 matchups, the winner is the one most often picked.

▸ 750 head-to-heads · gpt-5 jury
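
A minimal sketch of a blind pairwise tally. The two outputs are shuffled before the judge sees them, so position carries no information about the source. `stub_judge` is a placeholder that prefers the shorter text. A real judge would be a model call, and the matchup data is invented.

```python
import random


def blind_pair(baseline_out, candidate_out, rng):
    """Shuffle the two outputs so the judge cannot tell which is which."""
    pair = [("baseline", baseline_out), ("candidate", candidate_out)]
    rng.shuffle(pair)
    return pair


def stub_judge(text_a, text_b):
    """Placeholder judge (hypothetical): returns 0 to pick A, 1 to pick B."""
    return 0 if len(text_a) <= len(text_b) else 1


def win_rate(matchups, judge, seed=0):
    """Fraction of matchups where the judge picked the candidate's output."""
    rng = random.Random(seed)
    wins = 0
    for baseline_out, candidate_out in matchups:
        pair = blind_pair(baseline_out, candidate_out, rng)
        picked = judge(pair[0][1], pair[1][1])
        # Map the judge's positional vote back to its hidden source label.
        if pair[picked][0] == "candidate":
            wins += 1
    return wins / len(matchups)


# Illustrative (baseline_output, candidate_output) pairs.
matchups = [
    ("long baseline answer", "short"),
    ("tiny", "a much longer candidate answer"),
    ("medium text!", "brief"),
]
rate = win_rate(matchups, stub_judge)
print(rate)
```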

plate vi

Streaming inference

All eight candidates stream at once at 10× parallelism. Token-by-token, we log TTFT, inter-token latency, and total output length per sample.

▸ 10× parallel · ~42 min wall-clock
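
A minimal sketch of deriving the streaming metrics from token arrival timestamps. It assumes TTFT is the gap between sending the request and the first token, and inter-token latency is the mean gap between consecutive tokens. The function name and the timestamps are illustrative.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, mean inter-token latency, and token count for one stream.

    request_start and token_times are seconds on the same clock,
    e.g. readings of time.perf_counter() taken as each token arrives.
    """
    if not token_times:
        raise ValueError("stream produced no tokens")
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "inter_token": inter_token, "tokens": len(token_times)}


# Illustrative timestamps: first token at 320 ms, then three more.
metrics = latency_metrics(0.0, [0.32, 0.34, 0.37, 0.39])
print(metrics)
```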

// judge-model: GPT-5 by default · swap to claude-opus-4.7 in the API.

n=3 samples per prompt for variance control · same temperature across all candidates

[06] four reasons

Cost is the headline. It is rarely the only reason.

01 Cost

−87%

median blended cost

Frontier rates for paring-knife work.

02 Latency

−56%

median TTFT

Speed your users feel.

03 Autonomy

1,262

open-weights candidates

Self-hostable when the cloud isn't an option.

04 Throughput

3.4×

median TPM headroom

Headroom for the peak.

[07] pricing

Priced by the data we audit, not by seat or token.

Every plan is a fixed scope at a fixed price. You bring your prompts, we hand back the receipts. No per-seat sprawl, no usage-based surprises.

tier · 01

Spot Check

up to 50 prompts · 1 task type

$1,500 / one-off audit

Bring one task, get one recommendation. Five-day turnaround.

for solo teams or a single workload you want to sanity-check.

◆ 3 candidate models compared
◆ 9-metric scorecard on your data
◆ Single PDF + JSON receipt
◆ 5 business days

Start a spot check

★ most chosen

tier · 02

Production Audit

up to 500 prompts · 3 task types

$8,000 / audit

Per-task recommendations with routing config you can ship in 72 hours.

for teams running real production traffic across multiple workloads.

◆ Per-workload model recommendation
◆ Routing config (OpenRouter / LiteLLM / direct)
◆ 9-metric scorecard + variance bands
◆ 72-hour turnaround
◆ Quarterly re-audit included

Run a production audit

tier · 03

Continuous Routing

unlimited prompts · live data

$3,500 / month

We watch your traffic. Re-audit monthly. Route via our API or yours.

for teams routing at volume who want always-on optimization.

◆ Unlimited prompts + task types
◆ Live cost / latency / quality dashboard
◆ Auto-route via API (or export weights)
◆ Shared Slack channel
◆ Monthly model-fit review

Set up continuous routing

tier · custom

On-prem, regulated data, or something we have not seen before.

HIPAA, GDPR, SOC 2, air-gapped, custom evals, agent harnesses, fine-tunes in the mix. If your audit does not fit the three tiers above, it probably fits here. We scope, you decide, we ship.

◆ on-prem / VPC deployment
◆ custom eval rubrics
◆ fine-tunes in the comparison
◆ agent / tool-use audits
◆ regulated data handling
◆ named MSA + DPA

pricing

talk to us

quoted per scope · NDA on request

Scope a custom audit

or email aayush@trainmyllm.ai with a 2-line summary of your stack.

private beta · onboarding founding teams

Get in touch.

Leave your email and we’ll reach out, usually within a day, to scope your first audit and walk through what the receipts look like for your stack.

4,804 candidate models · 131 providers · 1,262 open-weights

// or email us directly at aayush@trainmyllm.ai