How Many Tokens Per Month Before Self-Hosting Your GPU Becomes Cheaper?
- Apr 14
- 7 min read

If you've been running an AI service for any length of time, you've probably hit this question at some point.
"Is using an API actually the cheaper option? Or would it be better to just buy a GPU and run it ourselves?"
As model performance converges, cost has become the decisive battleground. Teams at every scale are starting to run the numbers on which approach is actually cheaper for their usage volume — and the answer changes significantly depending on how much you're actually using. This post breaks down the real cost structures of three infrastructure options, with concrete numbers, for teams evaluating their AI infrastructure strategy.
Three Infrastructure Approaches and Their Cost Structures
Before comparing costs, it's worth understanding how each option generates costs in the first place. What looks like a similar price can work out very differently depending on your usage scale, and the economics flip as you grow.
1. Serverless API: Pay Only for What You Use
| Cost structure: No fixed costs + per-token billing
You pay per token consumed. The biggest advantages are that you don't need to manage GPU hardware directly, and you don't need to make a large infrastructure investment upfront. If your traffic is hard to predict or your usage volume is still modest, this is the most sensible starting point.
2. Cloud GPU Rental: Rent by the Hour
| Cost structure: Fixed hourly billing (proportional to runtime)
You rent GPU instances by the hour from AWS, GCP, or similar cloud providers. Costs accrue based on how long the instance is running. It's less elastic than a serverless API, but if your traffic is predictable and your usage crosses a certain threshold, it can become more cost-effective than paying per token.
3. Self-Hosting: Buy It Yourself, Run It Yourself
| Cost structure: Upfront hardware purchase + monthly electricity + maintenance
You purchase GPUs directly or build out your own server infrastructure. The upfront investment is significant, but ongoing costs reduce to electricity and maintenance. The caveat: this cost advantage only becomes meaningful at fairly high usage volumes.
The Real Cost of Self-Hosting: It's Not Just the Electricity Bill
The most common trap teams fall into when evaluating self-hosting is thinking "we just pay for electricity." Let's look at the actual cost breakdown for an RTX 4090.
※ Exchange rate reference: 1 USD = ₩1,480
Item | USD | KRW | Notes |
Initial GPU purchase | $1,600 | ~₩2.37M | RTX 4090 consumer model |
Monthly electricity | $46 | ~₩68,000 | ~530W average full-system draw, $0.12/kWh, 24 hrs × 30 days |
24-month total | $2,704 | ~₩4.0M | Hardware + cumulative electricity |
Monthly amortized cost | ~$113/mo | ~₩170K/mo | $1,600 ÷ 24 months + $46 electricity |
The important thing to note here is that this calculation covers only hardware amortization and electricity. Costs that regularly come up in real-world operations — securing GPU availability, maintenance, responding to model updates, buying additional hardware during traffic spikes — are not yet included. In practice, the actual TCO (Total Cost of Ownership) is substantially higher. We'll revisit this later.
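The amortization in the table can be reproduced in a few lines. The wattage below is an assumption: it is a full-system average chosen so the electricity figure works out to roughly the $46/month shown above at $0.12/kWh.

```python
# Back-of-envelope self-hosting cost, using the RTX 4090 figures from the table above.
# Assumption: ~0.53 kW average full-system draw, chosen so that electricity comes out
# near the $46/month figure in the table at $0.12/kWh.
GPU_PRICE_USD = 1600.0
AMORTIZATION_MONTHS = 24
AVG_DRAW_KW = 0.53
PRICE_PER_KWH = 0.12
HOURS_PER_MONTH = 24 * 30

electricity_monthly = AVG_DRAW_KW * HOURS_PER_MONTH * PRICE_PER_KWH   # ~$46/mo
amortized_monthly = GPU_PRICE_USD / AMORTIZATION_MONTHS + electricity_monthly  # ~$113/mo

print(f"electricity: ${electricity_monthly:.2f}/mo, total: ${amortized_monthly:.2f}/mo")
```

Note that the hardware term vanishes after the amortization window, but by then a 24-month-old consumer GPU may itself be due for replacement.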
The Breakeven Point: How Many Tokens Per Month Before Self-Hosting Wins?
Using the market average serverless API price of $2.00/1M tokens (source: devtk.ai), the cost crossover point between self-hosting (RTX 4090, $113/month amortized) and API is approximately 57 million tokens per month ($113 ÷ $2.00 per 1M tokens).
There's an important caveat here though. $2.00/1M tokens is on the expensive end for the open-source model API market. Open-source models are available from a wide range of cloud providers at significantly lower prices than closed models, which means using an open-source model drives the per-call cost down and pushes the breakeven point up. In other words, the range where serverless API remains the cheaper option becomes much wider.
Let's see exactly how much the numbers shift when we apply the Qwen model pricing available through Air API.
Air API Pricing
Model | Input Price ($/1M tokens) | Output Price ($/1M tokens) | Blended Average ($/1M tokens) |
Qwen3.5-9B | $0.05 (₩75) | $0.15 (₩222) | $0.10 (₩148) |
Qwen3.5-35B-A3B | $0.1623 (₩240) | $1.30 (₩1,924) | $0.73 (~₩1,080) |
Breakeven Comparison
API Price | Breakeven (monthly tokens) | Notes |
$2.00/1M tokens (market average) | ~57M tokens | Source: devtk.ai |
$0.73 (~₩1,080)/1M tokens (Qwen3.5-35B-A3B) | ~155M tokens | Air API pricing |
$0.10 (₩148)/1M tokens (Qwen3.5-9B) | ~1.13B tokens | Air API pricing |
With Qwen3.5-9B on Air API, the range where the API remains cheaper than self-hosting extends all the way to approximately 1.13 billion tokens per month. Most AI startups and SaaS teams won't reach that scale for a long time — and until you do, there's no need to invest in infrastructure.
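Every breakeven figure above comes from the same one-line formula: fixed monthly cost divided by the per-million-token price. A minimal sketch, using the $113/month amortized self-hosting cost from the earlier table:

```python
def breakeven_tokens(monthly_fixed_cost_usd: float, api_price_per_million: float) -> float:
    """Monthly token volume at which a fixed-cost option matches per-token API billing."""
    return monthly_fixed_cost_usd / api_price_per_million * 1_000_000

SELF_HOST_MONTHLY = 113.0  # amortized RTX 4090 cost from the earlier table

print(breakeven_tokens(SELF_HOST_MONTHLY, 0.10))  # Qwen3.5-9B blended price, ~1.13B tokens/mo
print(breakeven_tokens(SELF_HOST_MONTHLY, 0.73))  # Qwen3.5-35B-A3B blended price, ~155M tokens/mo
```

The cheaper the API price, the further out the breakeven moves — which is exactly why open-source model pricing widens the range where the API wins.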
Side-by-Side Cost Comparison by Usage Scenario
Rather than working purely from theoretical breakeven points, here's a more intuitive comparison of all three options across actual monthly token usage scenarios.
*Air Cloud (cloud GPU rental) reference: RTX 4090 at $0.50/hr (₩742)
Monthly Token Usage | Air API (Qwen3.5-9B) | Air Cloud RTX 4090 (10 hrs/day fixed) | Self-Hosted RTX 4090 (monthly amortized, fixed) |
10M (10 million) | $1 (₩1,480) | $150 (₩220K) | $113 (₩170K) |
50M (50 million) | $5 (₩7,400) | $150 (₩220K) | $113 (₩170K) |
100M (100 million) | $10 (₩15K) | $150 (₩220K) | $113 (₩170K) |
500M (500 million) | $50 (~₩74K) | $150 (₩220K) | $113 (₩170K) |
1B (1 billion) | $100 (₩150K) | $150 (₩220K) | $113 (₩170K) |
2B (2 billion) | $200 (₩300K) | $150 (₩220K) | $113 (₩170K) |
5B (5 billion) | $500 (~₩740K) | $150* (₩220K*) | $113* (₩170K*) |
*At 5B+ tokens, you're approaching the practical monthly processing limit of a single RTX 4090 instance — multi-GPU configurations would be required.
A few patterns emerge from the table. Below 1.13B tokens per month, Air API is clearly the cheapest option. Because there are no fixed costs, the economics become increasingly favorable the lower your usage is. Between 1.13B and 1.5B, self-hosting is technically the cheapest on pure cost terms, but for teams that want to operate flexibly without upfront hardware investment, Air Cloud becomes the practical alternative. Once you cross 1.5B tokens, Air Cloud becomes cheaper than API — and that's the point to start seriously evaluating a shift to cloud GPU rental or dedicated infrastructure.
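The scenario table above reduces to three simple cost functions — one linear, two flat. A sketch using the assumed prices from this post ($0.10/1M tokens for Air API with Qwen3.5-9B, $0.50/hr × 10 hrs/day for Air Cloud, $113/month amortized self-hosting):

```python
API_PRICE_PER_M = 0.10           # Air API, Qwen3.5-9B blended ($/1M tokens)
CLOUD_MONTHLY = 0.50 * 10 * 30   # Air Cloud RTX 4090, 10 hrs/day -> $150/mo
SELF_HOST_MONTHLY = 113.0        # amortized RTX 4090 from the earlier table

def monthly_costs(tokens: float) -> dict:
    """Monthly cost in USD of each option at a given token volume."""
    return {
        "api": tokens / 1_000_000 * API_PRICE_PER_M,  # linear in usage
        "cloud": CLOUD_MONTHLY,                       # flat (capacity permitting)
        "self_host": SELF_HOST_MONTHLY,               # flat (capacity permitting)
    }

for tokens in (100e6, 1.13e9, 2e9):
    costs = monthly_costs(tokens)
    print(f"{tokens / 1e9:.2f}B tokens -> "
          + ", ".join(f"{name}: ${cost:.0f}" for name, cost in costs.items()))
```

Running this makes the two crossover points visible: the API line passes $113 at 1.13B tokens and $150 at 1.5B tokens.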
The Hidden Costs of Self-Hosting: Recalculating with TCO
Earlier we set the monthly cost of self-hosting at $113, but that only accounts for hardware amortization and electricity. In practice, the picture changes considerably.
For server-grade GPUs (H100, A100), the scale of the cost difference becomes much more significant. One analysis found that 100 H100 units ($3M, ~₩4.4B in hardware) carry an actual 5-year TCO of $8.6M (~₩12.7B) when you factor in power, cooling, networking, personnel, and maintenance — roughly 2.9x the hardware purchase price. (Source: Introl Blog)
The same principle applies to consumer-grade GPUs like the RTX 4090. Several cost categories tend to get left out of the calculation:
Hidden Cost Item | How It's Typically Treated | Does It Actually Occur? |
Networking / switch setup | Not calculated separately | Yes |
Engineering time for model updates | Engineer hourly cost not included | Yes |
Risk of scaling failure | Cannot be calculated | Yes — during traffic spikes |
GPU availability / procurement | Assumes hardware is readily purchasable | Yes — geopolitical risk (conflicts, etc.) can cause GPU/memory shortages and price volatility |
Downtime recovery costs | Excluded | Yes |
The real cost of self-hosting isn't just the $46 electricity bill. When you factor in the time cost of the engineers running the infrastructure, the cost of incident response, and opportunity cost, the TCO is a very different number from a simple electricity calculation.
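The Introl figure cited above implies a TCO multiplier of roughly 2.9x over hardware price. The sketch below derives that multiplier and applies it, purely for illustration, to a single RTX 4090 — keep in mind the multiplier is an empirical observation for H100 data-center deployments, not a universal constant, and consumer setups will differ.

```python
# 5-year TCO estimate from hardware price, using the multiplier implied by the
# Introl H100 analysis ($8.6M TCO on $3.0M of hardware). Illustrative only.
HARDWARE_USD = 3_000_000
OBSERVED_TCO_USD = 8_600_000

multiplier = OBSERVED_TCO_USD / HARDWARE_USD  # ~2.87x

def estimate_5yr_tco(hardware_usd: float, m: float = multiplier) -> float:
    """Crude 5-year TCO estimate: hardware price scaled by an empirical multiplier."""
    return hardware_usd * m

print(f"multiplier: {multiplier:.2f}x")
# Hypothetical: applying the same multiplier to one RTX 4090 ($1,600)
print(f"RTX 4090 5-yr TCO estimate: ${estimate_5yr_tco(1600):.0f}")
```

Even if the consumer-GPU multiplier is only half the data-center figure, the true cost still lands well above a hardware-plus-electricity calculation.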
When Cloud GPU Rental Makes Sense
Between self-hosting and serverless API, there's a clear middle option: cloud GPU rental. If your usage has grown to hundreds of millions of tokens per month, you need to fine-tune a specific model, or you're running a latency-sensitive real-time service, this is worth serious consideration.
Cases where cloud GPU rental is the right choice:
Teams with predictable monthly token usage at scale (hundreds of millions of tokens or more)
Teams that need to customize (fine-tune) specific models or run batch inference
Real-time services where response latency is critical
Teams that want lower per-unit costs than self-hosting without the infrastructure management burden
Air Cloud offers RTX 4090 instances at $0.50/hr (₩742), approximately 40% cheaper than comparable cloud services (AWS A10G). Auto-scaling also lets you optimize resource utilization based on actual demand, reducing wasted spend on idle GPU time for services with variable traffic.
Conclusion: What to Choose, and When
The core of any AI infrastructure decision is simple: choose the cheapest option for your current usage scale.
The graph below shows how the costs of all three options intersect as monthly token usage grows. (Qwen3.5-9B, single RTX 4090, Air Cloud running 10 hours/day)

Legend
Purple line: Air API (Qwen3.5-9B, $0.10/1M tokens average) — linear increase proportional to usage
Green line: Air Cloud RTX 4090 ($0.50/hr × 10 hrs/day × 30 days = $150/mo) — flat fixed cost
Orange line: Self-hosted RTX 4090 ($113/mo amortized) — flat fixed cost
Two crossover points:
1.13B tokens/month: Above this point, self-hosting becomes cheaper than Air API.
1.5B tokens/month: Above this point, Air Cloud GPU rental becomes cheaper than API.
Most AI startups and SaaS teams take a significant amount of time to reach even the first crossover point (1.13B) during early-to-mid stage growth.
Usage Range | Recommended Option | Reason |
Up to 1.13B tokens/month | Serverless API (Air API) | No fixed costs, no infrastructure to manage, competitive pricing |
1.5B–10B tokens/month | Cloud GPU rental (Air Cloud) | Lower per-unit cost vs. API, flexible scaling without upfront investment |
10B+ tokens/month | Dedicated infrastructure (Private Air Cloud / self-hosting) | Unit cost savings at scale — calculate TCO carefully |
When evaluating AI infrastructure, start by identifying exactly which range your current service sits in. The most rational path is to start fast with serverless API, transition to cloud GPU rental once traffic becomes predictable at scale, and then evaluate dedicated infrastructure when truly large-scale operations are needed.
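The decision table above amounts to a threshold lookup. A minimal sketch — the boundaries are approximate, and the 1.13B–1.5B band is a judgment call between self-hosting and cloud rental, as discussed earlier:

```python
def recommend(monthly_tokens: float) -> str:
    """Rough infrastructure recommendation per the table above (thresholds approximate)."""
    if monthly_tokens < 1.13e9:
        return "serverless API"
    if monthly_tokens < 10e9:
        return "cloud GPU rental"
    return "dedicated infrastructure"

print(recommend(500e6))  # serverless API
print(recommend(3e9))    # cloud GPU rental
print(recommend(20e9))   # dedicated infrastructure
```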
If you're not sure where to start, AIEEV's team can help you design the right infrastructure strategy for where you are today. Air Cloud's GPU instance rental (Air Container) also offers up to 25% discount for reservations of 6 months or more — a fast way to improve your cost structure once traffic reaches a certain threshold.
Start with the option that fits where you are right now.
References
Effloow — Self-Hosting LLMs vs Cloud APIs: Cost, Performance & Privacy in 2026
Introl Blog — GPU Infrastructure TCO 5-Year Cost Model
DEV Community — Qwen3-TTS: The Complete 2026 Guide to Open-Source Voice Cloning and AI Speech Generation (https://dev.to/czmilo/qwen3-tts-the-complete-2026-guide-to-open-source-voice-cloning-and-ai-speech-generation-1in6)
AI Tool Analysis — Qwen3-TTS Review (https://aitoolanalysis.com/qwen3-tts-review/)
