A friend running a legal-tech startup got his Q1 OpenAI invoice: $12,000, about $4,000 a month. He moved the same workload to a rented GPU running an open-source model. New monthly cost: under $800.
The kicker wasn't just the money. His response times dropped from 4 seconds to under 1 second because he stopped sharing a server with thousands of strangers.
This is not a story about ideology. It's about spreadsheet math that finally became too obvious to ignore.
What changed
Meta released Llama 4 in April. Two models matter for most builders:
Scout — runs on a single GPU, handles up to 10 million tokens of context
Maverick — benchmarks competitively with GPT-4o on coding and reasoning tasks
Both are open-weight. You download them, you run them, you don't ask permission.
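What "run them" looks like in practice: serving layers like vLLM expose an OpenAI-compatible API, so switching often means changing one base URL in your client code. Here's a minimal sketch, assuming a vLLM server already running on your GPU box; the model ID and prompt are placeholders, swap in your own:

```python
# Querying a self-hosted model through an OpenAI-compatible endpoint.
# Assumes vLLM is serving on port 8000 (its default); the model name
# is whatever you loaded at server startup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your GPU box, not api.openai.com
    api_key="not-needed",  # vLLM doesn't check the key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # example model ID
    messages=[{"role": "user", "content": "Summarize this contract clause: ..."}],
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the same protocol, most existing OpenAI client code ports over with two changed lines.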
The hardware part got simple, too. An AWS G5 instance with a single A10G GPU costs about $1 per hour on-demand. That's roughly $730 per month if you leave it running nonstop.
Spot pricing cuts that to $0.43/hour if your workload can handle occasional restarts.
For context, 50 million tokens through a GPT-4-class API costs about $1,000 at current rates, roughly $20 per million tokens. Past that point, the flat GPU bill wins: at 200 million tokens, you might pay $4,000 to OpenAI or the same $730 to AWS.
This calculator lets you plug in your own numbers.
This breakdown goes deeper into the total cost of ownership, including the engineering time that most people forget to account for.
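If you'd rather script the comparison than trust anyone's calculator, the math fits in a few lines. A minimal sketch using the figures above; the blended per-token rate is the big assumption, so plug in your own:

```python
# Back-of-the-envelope break-even, using the numbers above as assumptions:
# ~$20 per million API tokens ($1,000 / 50M) vs. a flat GPU bill.
API_COST_PER_M_TOKENS = 20.00   # blended $/1M tokens; adjust to your input/output mix
GPU_ON_DEMAND_MONTHLY = 730.00  # ~$1/hr * 730 hrs
GPU_SPOT_MONTHLY = 314.00       # ~$0.43/hr * 730 hrs

def monthly_costs(tokens_millions: float, gpu_monthly: float) -> tuple[float, float]:
    """Return (api_cost, gpu_cost) in dollars for one month at the given volume."""
    return tokens_millions * API_COST_PER_M_TOKENS, gpu_monthly

break_even_m = GPU_ON_DEMAND_MONTHLY / API_COST_PER_M_TOKENS
print(f"On-demand GPU breaks even at ~{break_even_m:.0f}M tokens/month")

for volume in (25, 50, 200):
    api, gpu = monthly_costs(volume, GPU_ON_DEMAND_MONTHLY)
    print(f"{volume}M tokens: API ${api:,.0f} vs GPU ${gpu:,.0f}")
```

On these assumptions the on-demand GPU breaks even around 37 million tokens a month, which is why the 50-million mark above is comfortably past it.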
The real reason to care
Most people frame this as a cost play. It is. But that's the boring part.
When you depend on someone else's API, your product moves at their pace. Models get retired without warning. Rate limits kick in during your product launch. Latency doubles because a competitor is running a load test.
Running your own model means you pick a version that works, set your own speed targets, and handle downtime on your terms. You also keep your data on your own machines, which matters more every time a new AI regulation drops.
The catch nobody mentions
Self-hosting is not free. You pay in engineering time.
You will deal with CUDA drivers, memory limits, and monitoring dashboards. If your team does not have someone who finds that kind of work interesting, the hidden tax will eat your savings. This guide covers the real-world infrastructure headaches most Twitter threads skip.
The honest rule of thumb: count your monthly tokens, estimate your growth for two quarters, and check whether you have a technical person who actually wants to manage GPU instances. If all three check out, test it. If not, stay on the API.
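That rule of thumb condenses to a checklist you can run against your own numbers. A sketch, where the 50M-token threshold is the rough break-even from earlier, an assumption rather than a law:

```python
# The self-hosting rule of thumb as code. Thresholds are assumptions;
# tune them to your own cost model.
def should_self_host(tokens_m_now: float, growth_per_quarter: float,
                     has_gpu_owner: bool) -> bool:
    """True if projected volume two quarters out clears break-even
    AND someone on the team actually wants to run the GPUs."""
    tokens_m_in_two_quarters = tokens_m_now * growth_per_quarter ** 2
    return tokens_m_in_two_quarters >= 50 and has_gpu_owner

print(should_self_host(30, 1.5, True))   # True: ~67M projected, owner exists
print(should_self_host(30, 1.5, False))  # False: nobody wants to babysit GPUs
```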
Key Takeaways
Open-weight models like Llama 4 now run at GPT-4-class quality on single-GPU setups costing roughly $1/hour. The hardware barrier has collapsed.
Self-hosting breaks even somewhere between 50 million and a few hundred million tokens per month, depending on your model choice and batching strategy. Use a calculator instead of guessing.
The deeper benefit is control. You eliminate surprise model retirements, provider rate limits, and latency spikes you can't fix.
Do not self-host unless you have the team capacity to manage inference infrastructure. Operational overhead is invisible until it isn't.
If your AI bill is growing faster than your revenue, run the numbers. Then ask whether cheaper is the same as more predictable.
Usually, it is not — and this time, it might be both.