the engineering team at an ai startup builds a quality scoring pipeline in march. llm-as-judge runs against a sample of production traces, scores them on a rubric, and surfaces regressions before customers notice. it's a good piece of infrastructure. it saves the company at least one rollback a month. the founder approves the project without asking what it costs to run.

in september, finance pulls the close together and the founder reads the gross margin line and frowns. 71%, not the 76% they've been telling the board. the engineering tools line on the p&l is up 60% year over year and nobody can quite say why. the answer is sitting in a vercel project somebody called eval-runner, processing about 100,000 traces a day at a unit cost that quietly tripled when the team upgraded the judge model in june.

eval cost is the line item nobody budgets for. it's also, at most ai companies that have crossed a million monthly calls, between 12% and 20% of inference cost. it gets booked in engineering tools instead of cogs, the one place where it would tell the founder what the margin really is. it's the same blind spot as the rest of the bill, now that inference is the new payroll.

what's actually happening when evals run

an eval pipeline samples some percentage of production calls and re-runs them against a judge model. the judge gets the original prompt, the production response, a rubric, and instructions to score. the rubric prompt is typically 3-5x longer than the production prompt because it has to explain what good looks like. the judge is usually a higher-tier flagship because the cheap model isn't reliable enough to grade work. and the output is structured json, longer than a typical user-facing response.
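a minimal sketch of the two steps described above: picking a fixed share of traces, then assembling what the judge receives. the function names, the hash-based sampling, and the request fields are all assumptions made for illustration, not any particular vendor's api, and no model call is made here.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """deterministically select roughly `rate` of traces by hashing the trace id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def build_judge_request(prompt: str, response: str, rubric: str) -> dict:
    """assemble the judge input: original prompt, production response, rubric, instructions."""
    return {
        # hypothetical field names; a real pipeline would follow its provider's schema
        "instructions": "score the response against the rubric. reply with json only.",
        "rubric": rubric,
        "original_prompt": prompt,
        "production_response": response,
        "output_schema": {"score": "integer 1-5", "reasons": "list of strings"},
    }
```

hashing the trace id instead of drawing a random number means the same trace is always in or out of the sample, which keeps month-over-month comparisons stable. the rubric and the echoed prompt and response are why the judge's input is several times longer than the production prompt.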

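put numbers on it. a startup making one million production calls a month at $0.02 per call spends $20k on inference. the eval pipeline samples 10% of those calls at 3x the input tokens, a flagship that costs 5x per token, and structured output that runs about 2x the production response length. the multipliers compound within each side of the bill, not across it: input costs 3 x 5 = 15x and output costs 2 x 5 = 10x. the unit cost of an eval is roughly 10-15x the unit cost of the call it grades, depending on how the production call splits between input and output tokens.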
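a small sketch for checking the per-eval multiplier, assuming a simple cost model where a call costs input tokens times input price plus output tokens times output price. `input_share` (the fraction of a production call's cost that comes from input tokens) is an assumption the text does not give; the default ratios are the 3x, 2x, and 5x figures from the paragraph above.

```python
def eval_cost_multiplier(
    input_share: float,
    input_token_ratio: float = 3.0,   # judge input tokens vs production input tokens
    output_token_ratio: float = 2.0,  # judge output tokens vs production output tokens
    price_ratio: float = 5.0,         # judge price per token vs production price per token
) -> float:
    """return eval unit cost as a multiple of the production call's unit cost."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    input_multiplier = input_token_ratio * price_ratio    # applies to the input side only
    output_multiplier = output_token_ratio * price_ratio  # applies to the output side only
    # weighted blend: each side's multiplier counts in proportion to its share of cost
    return input_share * input_multiplier + (1.0 - input_share) * output_multiplier


print(eval_cost_multiplier(0.5))  # a call whose cost is split evenly between input and output
```

under this model the token ratio and the price ratio multiply within the input side and within the output side, and the two sides are then averaged by cost share, so the result always sits between the output-side and input-side multipliers.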

figure: ai cogs, the eval line nobody books (1m calls/mo, 10% sampled, flagship judge). production inference: $20k (1m calls). eval pipeline: $3k (15% of cogs). reported gm: 76% (without evals). true gm: 71% (-5 points).

the eval line is $3,000 a month — not catastrophic in absolute terms, but it's 15% of inference cost and 5 points of gross margin if you book it correctly. the company reporting 76% is operating at 71%. that gap is the difference between a saas-grade multiple and an ai-discount multiple at the next priced round. the partner running diligence will ask which line the eval pipeline sits in.
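a sketch of the restatement itself: gross margin computed once with eval spend left out of cogs and once with it included. revenue is not stated above, so the inputs in the usage line are hypothetical, chosen only so the output lands on a 76% reported margin and a 71% true margin.

```python
def gross_margin(revenue: float, cogs: float) -> float:
    """gross margin as a fraction of revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cogs) / revenue


def restate_with_evals(revenue: float, cogs_excluding_evals: float, eval_spend: float):
    """return (reported margin, true margin, gap in percentage points)."""
    reported = gross_margin(revenue, cogs_excluding_evals)
    true_margin = gross_margin(revenue, cogs_excluding_evals + eval_spend)
    return reported, true_margin, (reported - true_margin) * 100.0


# hypothetical monthly figures, not taken from the article
reported, true_margin, gap_points = restate_with_evals(100_000, 24_000, 5_000)
print(f"reported {reported:.0%}, true {true_margin:.0%}, gap {gap_points:.1f} points")
```

the gap in points is simply eval spend divided by revenue, so the same eval line costs more margin at a smaller company than at a larger one.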

why the line lands in the wrong place

the eval system was built by engineering and lives in the engineering org's cloud account. the api key the eval pipeline uses sits next to internal-tools spend like ci/cd, observability, and the analytics warehouse. by default, the finance team books it where the invoice lands, not where the work belongs.

gaap convention is also unsettled here. ai cogs is a 2024 concept. there's no canonical chart-of-accounts template that says "llm-as-judge spend belongs above the gross margin line." every founder is making this call without precedent. most default to "if customers don't directly consume it, it's not cogs." that's the wrong rule. the right rule is "if it scales with usage and you can't ship the product without it, it's cogs."

and underneath both, founders don't want it in cogs. moving the line moves the gross margin number, and the gross margin number is on the deck.

the rule that actually works

the test for whether eval spend is cogs comes down to two questions: does the spend scale linearly with production usage, and is the product worse without it? for any honest ai company, the answer to both is yes. evals scale with calls (at a fixed sampling rate, more calls means more samples). and a company that turns off evals to save money will ship a regression within a quarter and lose customers.

book it in cogs. the reported margin will be 3-5 points lower. the next investor reviewing your numbers will recognize the discipline immediately. running a defensible 71% beats running an indefensible 76% that gets unwound in diligence.

the question isn't where the invoice lands — it's whether the line scales with usage and whether the product still works without it.

what good ops looks like

a few habits separate the ai companies that have this under control from the ones that don't.

eval cost as a percentage of inference, tracked monthly. target band is 8-12%. above 15%, the sampling rate is too high or the judge model is overspecced. below 5%, the team is probably under-evaluating and shipping regressions. it's a knob worth tuning every quarter.
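a minimal sketch of that monthly check, using the thresholds from the paragraph above. the labels are hypothetical, and the text does not say what to do between 5% and 8% or between 12% and 15%, so those gaps are marked "watch" here as an assumption.

```python
def eval_share(eval_spend: float, inference_spend: float) -> float:
    """eval spend as a fraction of production inference spend."""
    if inference_spend <= 0:
        raise ValueError("inference_spend must be positive")
    return eval_spend / inference_spend


def classify_eval_share(share: float) -> str:
    """map the monthly ratio onto the bands described in the text."""
    if share > 0.15:
        return "high"    # sampling rate too high or judge model overspecced
    if share < 0.05:
        return "low"     # probably under-evaluating
    if 0.08 <= share <= 0.12:
        return "target"  # inside the 8-12% band
    return "watch"       # assumption: between bands, neither flagged nor on target


print(classify_eval_share(eval_share(2_000, 20_000)))  # 10% of inference
```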

sampling rate set by risk, not by budget. high-stakes flows (anything that touches payments, code execution, or contracts) get sampled at 30-50%. low-stakes chat completions get sampled at 1-3%. founders who set a single global rate burn money on the easy stuff and underspend on the dangerous stuff.
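a sketch of risk-tiered sampling and the blended rate it produces. the tier rates are assumptions picked from inside the ranges above (40% for high-stakes, 2% for low-stakes), and the flow names and volumes are hypothetical.

```python
# assumed rates, chosen from within the 30-50% and 1-3% ranges in the text
SAMPLING_RATES = {"high": 0.40, "low": 0.02}


def expected_evals(flows):
    """flows: iterable of (name, monthly_calls, tier). returns (evals per month, blended rate)."""
    total_calls = 0
    total_evals = 0.0
    for _name, monthly_calls, tier in flows:
        total_calls += monthly_calls
        total_evals += monthly_calls * SAMPLING_RATES[tier]
    if total_calls == 0:
        raise ValueError("no calls to sample")
    return total_evals, total_evals / total_calls


# hypothetical product: a small payments flow and a large chat flow
flows = [("payments", 50_000, "high"), ("chat", 950_000, "low")]
evals, blended = expected_evals(flows)
print(f"{evals:.0f} evals per month, blended rate {blended:.1%}")
```

in this made-up mix the payments flow is 5% of calls but about half of the evals, which is the point of setting the rate by risk: the spend concentrates where a regression is expensive.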

judge model selection reviewed at each new model release. the cheap-but-good-enough judge from six months ago may now be the wrong tier. running gpt-5 as the judge on a product that uses gpt-4-mini in production is overkill, and an overspecced judge is where most unexamined eval cost creeps in. it's the quieter cousin of the model upgrade margin shock that kills gross margin without telling you.

how zift handles this

zift breaks out your model-provider invoices by api key and tags them against your category structure. eval pipelines land in a cogs · eval bucket and the monday briefing reports eval-as-percent-of-inference alongside the rest of cogs. when the ratio drifts above 15% in any week, the alert names the api key, the model, and the change in spend.

if you're a finance lead at an ai-first series b team running multiple model providers and a distinct eval infra stack, zift handles that too.

the gross margin you don't report honestly is the one a partner unwinds at the term sheet stage. better to see the truth on the monday briefing than at the diligence meeting.

Eval Cost Is the AI Line Item Nobody Budgets For


Finance reports to you, not the other way.
