case study · 01 · hearth

How Hearth cut their LLM bill 85% in three days.

A 120-person company building a D2C customer-support copilot was running every workload (classifying, summarizing, drafting) on gpt-5 and paying $48k/month. We audited their actual traffic across 3.2M monthly calls, recommended three task-specific models, and shipped routing in 72 hours.

−85%

monthly cost

$487k

annual savings

94%

judge quality held

6 / 6

hard constraints met

Audit network diagram showing one input fanning through routing and judging into three task-specific outputs.

plate · the audit

[01] chapter

The client

Hearth · customer-support copilot for D2C brands

Hearth ships an AI assistant inside Zendesk and Front. Their reps lean on it for three things: classifying incoming tickets into category / urgency / suggested route, summarizing long threads for shift hand-off, and drafting first-pass replies for the rep to edit. Every call has gone through openai/gpt-5 since they launched.

Volume crossed 3.2M calls/month at the end of Q1 and the bill hit $48k/month. The CFO asked the engineering lead a question they couldn’t answer: “Do we actually need gpt-5 for ticket triage?”

Company: Hearth
Vertical: D2C customer support · copilot
Team: ~120 people
Volume: 3.2M LLM calls / month
Stack: Next.js · Python (FastAPI) · Postgres · gpt-5 only
Spend (before): $48,000 / month
Audited: May 2026 · 2 days end-to-end

[02] chapter

The setup

Three workloads. One model. A bill that wouldn't stop growing.

Diagram of three workloads converging into a single oversized model.

before · single-model stack

Every workload routed to the same frontier model. The classifier needed nothing close to its reasoning depth. The drafting tasks didn’t need its context window. They were renting capability they never touched, three times over.

monthly cost by workload (before)

Ticket classification · $28,800 · 60% of volume
Thread summarization · $12,000 · 25% of volume
Reply drafting · $7,200 · 15% of volume

312 real prompts. 8 finalists. 2h 18m wall-clock.

Hearth handed over 312 anonymized prompts across the three task types, sample-weighted to match their production traffic. We ran each task type as a separate sub-audit so we could recommend a model per workload, not one blended choice.

312 · anonymized prompts collected
4,804 · candidate models filtered
41 · passed hard constraints (US, JSON, tools, 32k)
8 · finalists shortlisted per task
2,496 · inference calls + 750 judge calls
2h 18m · wall-clock end-to-end (10× parallel)
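The filter and fan-out steps in the funnel above can be sketched roughly as follows. This is a minimal illustration, not the audit's actual tooling: the `ModelSpec` shape, its field names, the `example/*` catalog entries, and the reading of "32k" as a 32,000-token minimum context are all assumptions made for the example. The one figure taken from the case study is the fan-out count, where each of the 312 prompts runs once against the 8 finalists for its task.

```typescript
// Hypothetical shape for one catalog entry; field names are illustrative.
interface ModelSpec {
  id: string;
  regions: string[]; // regions where inference can run
  jsonMode: boolean; // structured JSON output supported
  toolCalling: boolean; // tool / function calling supported
  contextTokens: number; // maximum context window
}

// The four hard constraints named above: US, JSON, tools, 32k.
// Assumption: "32k" is treated as a minimum of 32,000 context tokens.
function passesHardConstraints(m: ModelSpec): boolean {
  return (
    m.regions.indexOf("us") !== -1 &&
    m.jsonMode &&
    m.toolCalling &&
    m.contextTokens >= 32000
  );
}

// Made-up entries, present only to exercise the filter.
const catalog: ModelSpec[] = [
  { id: "example/a", regions: ["us", "eu"], jsonMode: true, toolCalling: true, contextTokens: 128000 },
  { id: "example/b", regions: ["eu"], jsonMode: true, toolCalling: true, contextTokens: 64000 },
  { id: "example/c", regions: ["us"], jsonMode: true, toolCalling: true, contextTokens: 8192 },
];

const eligible: ModelSpec[] = catalog.filter(passesHardConstraints);

// Fan-out: every prompt is sent once to each finalist for its task type.
function inferenceCalls(prompts: number, finalistsPerTask: number): number {
  return prompts * finalistsPerTask;
}

const totalInferenceCalls = inferenceCalls(312, 8); // 312 × 8 = 2,496
```

Only `example/a` survives the filter here: `example/b` has no US region and `example/c` falls short on context length.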

Schematic timeline of the audit: collect → filter → shortlist → fan-out → judge → rank.

flow · the audit pipeline

[05] chapter

The winners

One model per workload. Each one beat gpt-5 on cost and speed without losing the task.

Schematic podium showing three winner model cards.

three winners · one per workload

▶ Ticket classification · 60% of volume

alibaba/qwen3-32b

judge score

92%

cost / 1M

$0.32 / 1M

−96%

p95 latency

240ms

−67%

▸ JSON validity 100% · US-east only

▶ Thread summarization · 25% of volume

deepseek/v3.1

judge score

96%

cost / 1M

$0.52 / 1M

−93%

p95 latency

480ms

−44%

▸ Self-host fallback available

▶ Reply drafting · 15% of volume

anthropic/claude-haiku-4.5

judge score

95%

cost / 1M

$1.76 / 1M

−78%

p95 latency

580ms

−52%

▸ Tool-call accuracy 96%

[06] chapter

The new stack

OpenRouter routing rules. One day of integration.

Diagram of three workloads routed to three different task-specific models.

after · routed stack

Hearth flipped the switch in a single 4-hour sprint. They put OpenRouter in front of every model call, added three routing rules keyed off their internal task_kind tag, and shipped behind a 10% canary for 48 hours before fully rolling out. Total integration: 72 hours from receipt to production.

router.config.ts · 14 lines

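// Map Hearth's internal task_kind tag to a model ID.
// Untagged or unknown workloads fall back to gpt-5.
export function route(task_kind: string): string {
  switch (task_kind) {
    case "classify_ticket":
      return "alibaba/qwen3-32b";
    case "summarize_thread":
      return "deepseek/v3.1";
    case "draft_reply":
      return "anthropic/claude-haiku-4.5";
    default:
      return "openai/gpt-5"; // fallback
  }
}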
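One way the routing rules and the 10% canary could fit together is sketched below. This is an assumption-heavy illustration rather than Hearth's production code: `bucketOf`, `inCanary`, `pickModel`, `buildChatRequest`, and the use of a ticket ID as the canary key are all hypothetical. The URL is OpenRouter's OpenAI-compatible chat completions endpoint. The model IDs are copied from the case study as written and may need mapping to the exact slugs in the provider's catalog.

```typescript
// Model IDs as written in the case study; verify exact slugs before use.
const ROUTES: { [taskKind: string]: string } = {
  classify_ticket: "alibaba/qwen3-32b",
  summarize_thread: "deepseek/v3.1",
  draft_reply: "anthropic/claude-haiku-4.5",
};
const FALLBACK_MODEL = "openai/gpt-5";

// Deterministic 0..99 bucket from a stable key (hypothetical: a ticket ID),
// so the same ticket stays on the same side of the canary across retries.
function bucketOf(key: string): number {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) % 1000003;
  }
  return h % 100;
}

// True when the key falls inside the first `percent` buckets.
function inCanary(key: string, percent: number): boolean {
  return bucketOf(key) < percent;
}

// Use the task-specific model only for known task kinds inside the canary;
// everything else keeps going to the fallback model.
function pickModel(taskKind: string, ticketId: string, canaryPercent: number): string {
  const known = Object.prototype.hasOwnProperty.call(ROUTES, taskKind);
  if (known && inCanary(ticketId, canaryPercent)) {
    return ROUTES[taskKind];
  }
  return FALLBACK_MODEL;
}

interface ChatRequest {
  url: string;
  method: string;
  headers: { [name: string]: string };
  body: string;
}

// Build (but do not send) a chat completions request for the chosen model.
function buildChatRequest(model: string, prompt: string, apiKey: string): ChatRequest {
  return {
    url: "https://openrouter.ai/api/v1/chat/completions",
    method: "POST",
    headers: {
      Authorization: "Bearer " + apiKey,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: model,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}
```

During the canary window a caller would pass `canaryPercent = 10`, then raise it to 100 for the full rollout. The request object can be sent with any HTTP client, for example `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.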

[07] chapter

The math

From $48,000/month to $7,380/month.

Waterfall chart showing baseline cost descending through smaller models, routing, region selection.

waterfall · monthly cost

workload · before · after · delta
Ticket classification · $28,800 · $1,860 · −94%
Thread summarization · $12,000 · $1,560 · −87%
Reply drafting · $7,200 · $3,960 · −45%
total · $48,000 · $7,380 · −85%

$40.6k

saved / month

$487k

saved / year

72h

to integrate
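The savings figures above can be reproduced from the per-workload before and after costs. The short sketch below is only a check of the arithmetic. The dollar amounts come from the table, while the `CostRow` type and the helper names are illustrative.

```typescript
interface CostRow {
  workload: string;
  before: number; // USD per month on gpt-5
  after: number; // USD per month on the routed model
}

const rows: CostRow[] = [
  { workload: "Ticket classification", before: 28800, after: 1860 },
  { workload: "Thread summarization", before: 12000, after: 1560 },
  { workload: "Reply drafting", before: 7200, after: 3960 },
];

// Percentage change rounded to the nearest whole percent.
function pctChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}

let totalBefore = 0;
let totalAfter = 0;
for (let i = 0; i < rows.length; i++) {
  totalBefore += rows[i].before;
  totalAfter += rows[i].after;
}

const monthlySaved = totalBefore - totalAfter; // 48,000 − 7,380 = 40,620
const annualSaved = monthlySaved * 12; // 40,620 × 12 = 487,440
const totalDelta = pctChange(totalBefore, totalAfter); // −84.6% rounds to −85
```

The unrounded total reduction is about 84.6%, which the headline rounds to 85%. Annual savings of $487,440 round to the $487k shown.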

[08] the takeaway

“We assumed gpt-5 was the floor. It turned out to be the ceiling; we just hadn’t looked.”
▸ M. Reyes · engineering lead, hearth

private beta · onboarding founding teams

Want a receipt for your stack?

Tell us what model you’re on and what you wish was different. We’ll come back with the audit.

Get in touch: aayush@trainmyllm.ai