case study · 01 · hearth
How Hearth cut their LLM bill 85% in three days.
A 120-person D2C customer-support copilot was running every workload on gpt-5 (classifying, summarizing, drafting) and paying $48k/month. We audited their actual traffic across 3.2M monthly calls, recommended three task-specific models, and shipped routing in 72 hours.
−85%
monthly cost
$487k
annual savings
94%
judge quality held
6 / 6
hard constraints met
plate · the audit[01] chapter
The client
Hearth · customer-support copilot for D2C brands
Hearth ships an AI assistant inside Zendesk and Front. Their reps lean on it for three things: classifying incoming tickets into category / urgency / suggested route, summarizing long threads for shift hand-off, and drafting first-pass replies for the rep to edit. Every call has gone through openai/gpt-5 since they launched.
Volume crossed 3.2M calls/month at the end of Q1 and the bill hit $48k/month. The CFO asked the engineering lead a question they couldn’t answer: “do we actually need gpt-5 for ticket triage?”
[02] chapter
The setup
Three workloads. One model. A bill that wouldn't stop growing.
before · single-model stackEvery workload routed to the same frontier model. The classifier needed nothing close to its reasoning depth. The drafting tasks didn’t need its context window. They were renting capability they never touched, three times over.
monthly cost by workload (before)
60% of volume
25% of volume
15% of volume
[03] chapter
The brief
Six hard constraints. Cost was first, but never alone.
Monthly cost ceiling
required
≤ $25,000
delivered
$7,380
Classification p95
required
< 500ms
delivered
240ms
Drafting p95
required
< 2.0s
delivered
580ms
Judge quality vs. gpt-5
required
≥ 92%
delivered
94% (avg)
Data residency
required
US-only
delivered
US-only ✓
Strict JSON + tools
required
100% / required
delivered
100% / 96%
[04] chapter
The audit
312 real prompts. 8 finalists. 2h 18m wall-clock.
Hearth handed over 312 anonymized prompts across the three task types, sample-weighted to match their production traffic. We ran each task type as a separate sub-audit so we could recommend a model per workload, not one blended choice.
- 312anonymized prompts collected
- 4,804candidate models filtered
- 41passed hard constraints (US, JSON, tools, 32k)
- 8finalists shortlisted per task
- 2,496inference calls + 750 judge calls
- 2h 18mwall-clock end-to-end (10× parallel)
flow · the audit pipeline[05] chapter
The winners
One model per workload. Each one beat gpt-5 on cost and speed without losing the task.
three winners · one per workloadalibaba/qwen3-32b
judge score
92%
cost / 1M
$0.32 / 1M
−96%
p95 latency
240ms
−67%
▸ JSON validity 100% · US-east only
deepseek/v3.1
judge score
96%
cost / 1M
$0.52 / 1M
−93%
p95 latency
480ms
−44%
▸ Self-host fallback available
anthropic/claude-haiku-4.5
judge score
95%
cost / 1M
$1.76 / 1M
−78%
p95 latency
580ms
−52%
▸ Tool-call accuracy 96%
[06] chapter
The new stack
OpenRouter routing rules. One day of integration.
after · routed stackHearth flipped the switch in a single 4-hour sprint. They put OpenRouter in front of every model call, added three routing rules keyed off their internal task_kind tag, and shipped behind a 10% canary for 48 hours before fully rolling out. Total integration: 72 hours from receipt to production.
route(task_kind) {
case "classify_ticket":
return "alibaba/qwen3-32b"
case "summarize_thread":
return "deepseek/v3.1"
case "draft_reply":
return "anthropic/claude-haiku-4.5"
default:
return "openai/gpt-5" // fallback
}[07] chapter
The math
From $48,000/month to $7,380/month.
waterfall · monthly cost$40.6k
saved / month
$487k
saved / year
72h
to integrate
[08] the takeaway
“We assumed gpt-5 was the floor. It turned out to be the ceiling, we just hadn’t looked.”
private beta · onboarding founding teams
Want a receipt for your stack?
Tell us what model you’re on and what you wish was different. We’ll come back with the audit.