back to trainmyllm.ai

case study · 01 ·  hearth

How Hearth cut their LLM bill 85% in three days.

A 120-person D2C customer-support copilot was running every workload on gpt-5 (classifying, summarizing, drafting) and paying $48k/month. We audited their actual traffic across 3.2M monthly calls, recommended three task-specific models, and shipped routing in 72 hours.

−85%

monthly cost

$487k

annual savings

94%

judge quality held

6 / 6

hard constraints met

Audit network diagram showing one input fanning through routing and judging into three task-specific outputs.plate · the audit

[01] chapter

The client

Hearth · customer-support copilot for D2C brands

Hearth ships an AI assistant inside Zendesk and Front. Their reps lean on it for three things: classifying incoming tickets into category / urgency / suggested route, summarizing long threads for shift hand-off, and drafting first-pass replies for the rep to edit. Every call has gone through openai/gpt-5 since they launched.

Volume crossed 3.2M calls/month at the end of Q1 and the bill hit $48k/month. The CFO asked the engineering lead a question they couldn’t answer: “do we actually need gpt-5 for ticket triage?”

CompanyHearth
VerticalD2C customer support · copilot
Team~120 people
Volume3.2M LLM calls / month
StackNext.js · Python (FastAPI) · Postgres · gpt-5 only
Spend (before)$48,000 / month
AuditedMay 2026 · 2 days end-to-end

[02] chapter

The setup

Three workloads. One model. A bill that wouldn't stop growing.

Diagram of three workloads converging into a single oversized model.before · single-model stack

Every workload routed to the same frontier model. The classifier needed nothing close to its reasoning depth. The drafting tasks didn’t need its context window. They were renting capability they never touched, three times over.

monthly cost by workload (before)

Ticket classification$28,800

60% of volume

Thread summarization$12,000

25% of volume

Reply drafting$7,200

15% of volume

[03] chapter

The brief

Six hard constraints. Cost was first, but never alone.

REQ · 01hard constraint

Monthly cost ceiling

required

≤ $25,000

delivered

$7,380

REQ · 02hard constraint

Classification p95

required

< 500ms

delivered

240ms

REQ · 03hard constraint

Drafting p95

required

< 2.0s

delivered

580ms

REQ · 04hard constraint

Judge quality vs. gpt-5

required

≥ 92%

delivered

94% (avg)

REQ · 05hard constraint

Data residency

required

US-only

delivered

US-only ✓

REQ · 06hard constraint

Strict JSON + tools

required

100% / required

delivered

100% / 96%

[04] chapter

The audit

312 real prompts. 8 finalists. 2h 18m wall-clock.

Hearth handed over 312 anonymized prompts across the three task types, sample-weighted to match their production traffic. We ran each task type as a separate sub-audit so we could recommend a model per workload, not one blended choice.

  • 312anonymized prompts collected
  • 4,804candidate models filtered
  • 41passed hard constraints (US, JSON, tools, 32k)
  • 8finalists shortlisted per task
  • 2,496inference calls + 750 judge calls
  • 2h 18mwall-clock end-to-end (10× parallel)
Schematic timeline of the audit: collect → filter → shortlist → fan-out → judge → rank.flow · the audit pipeline

[05] chapter

The winners

One model per workload. Each one beat gpt-5 on cost and speed without losing the task.

Schematic podium showing three winner model cards.three winners · one per workload
Ticket classification60% of volume

alibaba/qwen3-32b

judge score

92%

cost / 1M

$0.32 / 1M

−96%

p95 latency

240ms

−67%

JSON validity 100% · US-east only

Thread summarization25% of volume

deepseek/v3.1

judge score

96%

cost / 1M

$0.52 / 1M

−93%

p95 latency

480ms

−44%

Self-host fallback available

Reply drafting15% of volume

anthropic/claude-haiku-4.5

judge score

95%

cost / 1M

$1.76 / 1M

−78%

p95 latency

580ms

−52%

Tool-call accuracy 96%

[06] chapter

The new stack

OpenRouter routing rules. One day of integration.

Diagram of three workloads routed to three different task-specific models.after · routed stack

Hearth flipped the switch in a single 4-hour sprint. They put OpenRouter in front of every model call, added three routing rules keyed off their internal task_kind tag, and shipped behind a 10% canary for 48 hours before fully rolling out. Total integration: 72 hours from receipt to production.

router.config.ts · 14 lines
route(task_kind) {
  case "classify_ticket":
    return "alibaba/qwen3-32b"
  case "summarize_thread":
    return "deepseek/v3.1"
  case "draft_reply":
    return "anthropic/claude-haiku-4.5"
  default:
    return "openai/gpt-5"  // fallback
}

[07] chapter

The math

From $48,000/month to $7,380/month.

Waterfall chart showing baseline cost descending through smaller models, routing, region selection.waterfall · monthly cost
workloadbeforeafterdelta
Ticket classification$28,800$1,860-94%
Thread summarization$12,000$1,560-87%
Reply drafting$7,200$3,960-45%
total$48,000$7,38085%

$40.6k

saved / month

$487k

saved / year

72h

to integrate

[08] the takeaway

“We assumed gpt-5 was the floor. It turned out to be the ceiling, we just hadn’t looked.”

▸ M. Reyes · engineering lead, hearth

private beta · onboarding founding teams

Want a receipt for your stack?

Tell us what model you’re on and what you wish was different. We’ll come back with the audit.