v0.1 · public beta

Find a cheaper LLM that holds your quality.

Bring your prompts. We screen 4,804 models, test the qualifying candidates on your prompts across nine metrics, and recommend the right swap, with the receipts.

4,804

models

131

providers

~1h 52m

median audit

audit · run_a47b9c
01 / 10

01 · loading your prompts

prompts.jsonl · 47 samples · 2.1 MB
01 Classify ticket → {category, urgency, suggested_reply}
02 Summarize this article into 3 bullets in 50 words…
03 Extract entities from medical record (PII-aware)…
04 Generate Python that validates this JSON schema…
05 Translate product description to es-MX, preserve…
06 Rank these search results by intent match…
loaded · 0 PII flags · 0 malformed · ready to tag →

// live story · 10 phases · ~28s loop · this is the audit, narrated.

[01] the wedge

Other tools score the world. We score your work.

MMLU paid no invoices.

your prompts will.

[02] the landscape

Every model. Every provider.

4,804

models indexed

131

providers

1,262

open-weights

// live count from models.dev · refreshed hourly · we test the qualifying subset against your prompts.

OpenAI
Anthropic
Google
Gemini
Meta
Mistral
Cohere
xAI
Grok
Inflection
AI21
Nous
Perplexity
Microsoft
Azure AI
AWS
Bedrock
IBM
NVIDIA
Databricks
Snowflake
Vertex AI
Groq
Cerebras
SambaNova
Fireworks
Together
Anyscale
Baseten
DeepInfra
Replicate
Lambda
DeepSeek
Qwen
Moonshot
Zhipu
ChatGLM
Yi
01.AI
MiniMax
StepFun
Baidu
Doubao
Hunyuan
Baichuan
Hugging Face
Ollama
vLLM
OpenRouter
Vercel AI
LangChain
LlamaIndex
LM Studio
FLUX
Stability
Midjourney
Ideogram
Runway
Luma
Pika
ElevenLabs
Suno
Sora
auditing 63 of 131 providers, scanning…
active · scanned · indexed

[03] what we measure

Nine dimensions. One picture.

Quality on the right, speed on the bottom, cost on the left. The bigger the candidate’s polygon, the better the fit. Hover any axis to see why it matters.

comparing: openai · gpt-5 (baseline) vs deepseek v3.1

  • Judge score · 96.0%
  • Win rate · 71.0%
  • Schema validity · 99.7%
  • Tool-call accuracy · 92.0%
  • TTFT · 320ms
  • Inter-token latency · 18ms
  • P99 latency · 2.1s
  • Blended $/1M · $0.52
  • Cost / success · 0.18¢

legend: baseline · candidate

metric inspector

Each ring step is a 20% improvement on that axis. Bigger polygon = better fit. Smaller polygon = the gpt-5 baseline.

  • 01 Judge score
  • 02 Win rate
  • 03 Schema validity
  • 04 Tool-call accuracy
  • 05 TTFT
  • 06 Inter-token latency
  • 07 P99 latency
  • 08 Blended $/1M
  • 09 Cost / success
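
The two cost metrics can be sketched as below. The page does not define either one, so both formulas are assumptions: blended $/1M is taken as the token-weighted average of input and output prices, and cost / success as total spend divided by the number of outputs that pass your checks. The function and parameter names are hypothetical.

```python
def blended_price_per_million(input_price, output_price, input_tokens, output_tokens):
    """Token-weighted average price per 1M tokens (assumed definition).

    input_price / output_price are $ per 1M tokens; the token counts describe
    the observed mix of input and output tokens for the workload.
    """
    total_tokens = input_tokens + output_tokens
    if total_tokens == 0:
        raise ValueError("no tokens observed")
    weighted = input_price * input_tokens + output_price * output_tokens
    return weighted / total_tokens


def cost_per_success(total_cost, successes):
    """Total spend divided by the number of passing outputs (assumed definition)."""
    if successes == 0:
        return float("inf")  # nothing passed, so every success is infinitely expensive
    return total_cost / successes
```

Dividing by successes rather than by calls is what makes a cheap model that often fails look expensive.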

−93%

blended cost

−56%

time to first token

−4%

judge quality (delta)

+42%

win rate

[04] how it works

Three calls. No proxy. No router.

audit window

~1h 52m

median end-to-end

audit timeline · 1h 52m · 1,128 inference calls · ~750 judge calls
phases: fan · judge · rank
ticks: 0m · 45m · 1h 25m · 1h 50m · 1h 52m
step 01
t = 0

Bring your prompts

A prompt + ≥20 real sample inputs. JSONL upload or the form.

request.http · POST · 200 OK

    POST /v1/audit
    {
      "current": "openai/gpt-5",
      "prompt": "Classify ticket → {…}",
      "samples": [ /* 47 items */ ],
      "priority": "cost"
    }
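
A minimal sketch of assembling that request body from a JSONL upload. The four field names come from the request shown above; the helper names and the shape of each sample (an object with an `input` key) are assumptions for illustration, and nothing here sends a request.

```python
import json

MIN_SAMPLES = 20  # the page asks for at least 20 real sample inputs


def load_jsonl(text):
    """Parse JSONL text: one JSON value per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def build_audit_request(current, prompt, samples, priority="cost"):
    """Assemble the JSON body for POST /v1/audit (field names from the page)."""
    if len(samples) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples, got {len(samples)}")
    return {
        "current": current,
        "prompt": prompt,
        "samples": list(samples),
        "priority": priority,
    }
```

Calling `json.dumps(build_audit_request(...))` gives a body you could POST with the HTTP client of your choice.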
step 02
t + 45m

We test the alternatives

Fanned across qualifying candidates at 10× parallelism. Real outputs, cost, latency.

audit.log · STREAMING

    filtering 4,804 → 24
    shortlist 24 → 8 by hard constraints
    fanning 8 × 47 × n=3 samples
    measuring output · cost · ttft
    1,128 inference calls in ~42 min
step 03
t + 1h 52m

You see the receipts

Per-metric deltas, side-by-side outputs, judge transcripts. One-line swap.

receipt.txt · READY

    ▶ recommended    deepseek/v3.1
    quality_judge    96% (−4 pts)
    win_rate         71%
    blended $/1M     $0.52 (−94%)
    ttft             320ms (−56%)

// stateless analyzer. we never sit in your request path.

powered by models.dev · openrouter · litellm

[05] anatomy

From your prompt to the receipt.

Six plates make the platform legible. Each one is one phase of the audit, drawn.

plate i · Routing fan-out

A single prompt fans across qualifying candidates in parallel. Each one processes every sample, then a judge model evaluates head-to-head against your baseline.

1,128 calls · 8 × 47 × n=3
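
A sketch of the fan-out under one assumption: that "10× parallelism" means at most ten calls in flight at once. `call_model` is a hypothetical async function you would supply; the platform's actual scheduler is not described on this page.

```python
import asyncio
from itertools import product


async def run_fan_out(candidates, samples, call_model, runs=3, parallelism=10):
    """Run every (candidate, sample, run) job with at most `parallelism` in flight."""
    gate = asyncio.Semaphore(parallelism)

    async def one(candidate, sample, run):
        async with gate:  # blocks while `parallelism` calls are already running
            output = await call_model(candidate, sample)
        return candidate, sample, run, output

    jobs = [one(c, s, r) for c, s, r in product(candidates, samples, range(runs))]
    return await asyncio.gather(*jobs)
```

With 8 candidates, 47 samples, and 3 runs, this schedules 8 × 47 × 3 = 1,128 jobs, matching the count above.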

plate ii · Quality–cost frontier

Every candidate is plotted on a quality-vs-cost plane. The curve is the Pareto frontier: past it, you can't get cheaper without losing quality.

recommendation = inflection point
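
A minimal sketch of how a Pareto frontier can be computed from (cost, quality) points. The tuple layout is an assumption, and picking the inflection point along the frontier is left out because the page does not say how it is chosen.

```python
def pareto_frontier(candidates):
    """Return the non-dominated candidates, sorted by ascending cost.

    candidates: iterable of (name, cost, quality) tuples. A candidate is kept
    only if no cheaper-or-equal candidate has quality at least as high.
    """
    frontier = []
    best_quality = float("-inf")
    # Cheapest first; on a cost tie, the higher-quality candidate comes first.
    for name, cost, quality in sorted(candidates, key=lambda c: (c[1], -c[2])):
        if quality > best_quality:
            frontier.append((name, cost, quality))
            best_quality = quality
    return frontier
```

Anything left off the returned list is beaten on both axes by some other candidate, so it is never the right swap.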

plate iii · Candidate filter

Hard constraints cut the catalog first: context window, output mode, data residency, tool support. Most of the world's models drop out here.

4,804 → 24 → 8
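
The hard-constraint cut can be sketched as a plain predicate over catalog entries. The four constraints are the ones named above; the `Model` and `Constraints` field names are hypothetical, not the catalog's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    context_window: int       # max tokens the model accepts
    output_modes: frozenset   # e.g. {"text", "json"}
    regions: frozenset        # where inference can run
    supports_tools: bool


@dataclass(frozen=True)
class Constraints:
    min_context: int
    output_mode: str
    region: str
    needs_tools: bool


def passes(model, c):
    """True only if the model satisfies every hard constraint."""
    return (
        model.context_window >= c.min_context
        and c.output_mode in model.output_modes
        and c.region in model.regions
        and (model.supports_tools or not c.needs_tools)
    )


def filter_catalog(catalog, c):
    """Drop every model that fails at least one hard constraint."""
    return [m for m in catalog if passes(m, c)]
```

Because each check is pass/fail, the filter runs before any inference call is made and costs nothing.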

plate iv · Sample variance

A median is a lie if the variance is huge. We draw n=3 samples per prompt, then plot the spread: a tight box beats a high mean.

47 samples · 3 runs each
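
A sketch of the spread summary, assuming each run yields one numeric score per sample. The input layout and the choice of population standard deviation are assumptions; the page only says the spread is plotted.

```python
from statistics import mean, median, pstdev


def spread_report(scores_by_sample):
    """Summarize centre and spread for one candidate.

    scores_by_sample: {sample_id: [score_run_1, score_run_2, score_run_3]}
    """
    all_scores = [s for runs in scores_by_sample.values() for s in runs]
    # Largest disagreement between runs of the same sample.
    worst_gap = max(max(runs) - min(runs) for runs in scores_by_sample.values())
    return {
        "mean": mean(all_scores),
        "median": median(all_scores),
        "stdev": pstdev(all_scores),
        "worst_run_gap": worst_gap,
    }
```

Two candidates can share a median while one swings widely between runs of the same input, which is why the gap and the standard deviation are reported next to it.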

plate v · Judge tally

Blind pairwise: the judge sees both outputs without knowing the source, then votes. Across 750 matchups, the winner is the candidate picked most often.

750 head-to-heads · gpt-5 jury
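
A sketch of a blind pairwise tally. Shuffling the order of each pair is one common way to hide the source from the judge and is an assumption here, as is the `judge` callable, which stands in for the judge model and returns 0 or 1 for the output it prefers.

```python
import random
from collections import Counter


def blind_pair(baseline_out, candidate_out, rng):
    """Shuffle one matchup; return (outputs, sources) in the shuffled order."""
    pair = [("baseline", baseline_out), ("candidate", candidate_out)]
    rng.shuffle(pair)
    return [text for _, text in pair], [source for source, _ in pair]


def tally(matchups, judge, seed=0):
    """Count judge votes over (baseline_output, candidate_output) pairs."""
    rng = random.Random(seed)
    votes = Counter()
    for baseline_out, candidate_out in matchups:
        outputs, sources = blind_pair(baseline_out, candidate_out, rng)
        picked = judge(outputs[0], outputs[1])  # judge never sees `sources`
        votes[sources[picked]] += 1
    total = sum(votes.values())
    if total == 0:
        raise ValueError("no matchups to tally")
    return {"candidate_win_rate": votes["candidate"] / total, "votes": dict(votes)}
```

Keeping `sources` outside the judge call is the whole point: the vote is mapped back to a model only after it has been cast.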

plate vi · Streaming inference

All eight candidates stream at once at 10× parallelism. Token-by-token, we log TTFT, inter-token latency, and total output length per sample.

10× parallel · ~42 min wall-clock
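
A sketch of deriving the three logged values from token arrival timestamps. It assumes you record the send time and one timestamp per streamed token, and it reports inter-token latency as the mean gap; the page does not say which statistic is used.

```python
def latency_metrics(request_sent, token_times):
    """Compute streaming metrics for one sample.

    request_sent: time the request was sent, in seconds.
    token_times: arrival time of each streamed token, in seconds, ascending.
    """
    if not token_times:
        raise ValueError("no tokens received")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - request_sent,                  # time to first token
        "inter_token_s": sum(gaps) / len(gaps) if gaps else 0.0,  # mean gap between tokens
        "total_s": token_times[-1] - request_sent,
        "tokens": len(token_times),                               # total output length
    }
```

Timestamps from `time.monotonic()` are a reasonable source, since that clock never runs backwards during a run.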

// judge-model: GPT-5 by default · swap to claude-opus-4.7 in the API.

n=3 samples per prompt for variance control · same temperature across all candidates

[06] four reasons

Cost is the headline. It is rarely the only reason.

01 Cost

−87%

median blended cost

Frontier rates for paring-knife work.

02 Latency

−56%

median TTFT

Speed your users feel.

03 Autonomy

1,262

open-weights candidates

Self-hostable when the cloud isn't an option.

04 Throughput

3.4×

median TPM headroom

Headroom for the peak.

[07] pricing

Priced by the data we audit, not by seat or token.

Every plan is a fixed scope at a fixed price. You bring your prompts, we hand back the receipts. No per-seat sprawl, no usage-based surprises.

tier · 01

Spot Check

up to 50 prompts · 1 task type

$1,500 / one-off audit

Bring one task, get one recommendation. Five-day turnaround.

for solo teams or a single workload you want to sanity-check.

  • 3 candidate models compared
  • 9-metric scorecard on your data
  • Single PDF + JSON receipt
  • 5 business days

tier · 02 · ★ most chosen

Production Audit

up to 500 prompts · 3 task types

$8,000 / audit

Per-task recommendations with routing config you can ship in 72 hours.

for teams running real production traffic across multiple workloads.

  • Per-workload model recommendation
  • Routing config (OpenRouter / LiteLLM / direct)
  • 9-metric scorecard + variance bands
  • 72-hour turnaround
  • Quarterly re-audit included

tier · 03

Continuous Routing

unlimited prompts · live data

$3,500 / month

We watch your traffic. Re-audit monthly. Route via our API or yours.

for teams routing at volume who want always-on optimization.

  • Unlimited prompts + task types
  • Live cost / latency / quality dashboard
  • Auto-route via API (or export weights)
  • Shared Slack channel
  • Monthly model-fit review

tier · custom

On-prem, regulated data, or something we have not seen before.

HIPAA, GDPR, SOC 2, air-gapped, custom evals, agent harnesses, fine-tunes in the mix. If your audit does not fit the three tiers above, it probably fits here. We scope, you decide, we ship.

  • on-prem / VPC deployment
  • custom eval rubrics
  • fine-tunes in the comparison
  • agent / tool-use audits
  • regulated data handling
  • named MSA + DPA

pricing

talk to us

quoted per scope · NDA on request

Scope a custom audit

or email aayush@trainmyllm.ai with a 2-line summary of your stack.

private beta · onboarding founding teams

Get in touch.

Leave your email and we’ll reach out, usually within a day, to scope your first audit and walk through what the receipts look like for your stack.

4,804 candidate models·131 providers·1,262 open-weights

// or email us directly at aayush@trainmyllm.ai