Cost Reduction: 91.6% (vs all-heavyweight baseline)
Draft Acceptance: 94.0% (487 of 518 prompts)
Draft Accuracy: 98.2% (LLM-as-judge eval)
P99 Latency: 109ms (draft path, 50 req/s)
Cost Reduction by Threshold
T      ACC      COST REDUCTION
1.00   100%
1.25   99.6%
1.50   98.6%
1.75   98.4%
2.00   98.2%    91.6%
2.25   97.9%
2.50   97.9%
Confusion Matrix at T=2.0
Actual            Accept      Escalate
Draft correct     478 (TN)    29 (FP)
Draft wrong       9 (FN)      2 (TP)
n=518, T=2.0 bits
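The headline rates can be derived from the four counts above. The sketch below is one reading that reproduces the reported 94.0% acceptance and 98.2% draft accuracy; it assumes "positive" means the draft was wrong and should have been escalated, and that draft accuracy is measured over accepted drafts only. The type and method names are illustrative.

```go
package main

// Confusion holds routing outcomes at one threshold.
// Assumption: "positive" means the draft was wrong and needed escalation.
type Confusion struct {
	TN, FP, FN, TP int
}

// Total returns the number of evaluated prompts.
func (c Confusion) Total() int { return c.TN + c.FP + c.FN + c.TP }

// Accepted returns how many prompts were served from the draft path.
func (c Confusion) Accepted() int { return c.TN + c.FN }

// AcceptanceRate is the share of prompts served by the drafter.
func (c Confusion) AcceptanceRate() float64 {
	return float64(c.Accepted()) / float64(c.Total())
}

// EscalationRate is the share of prompts sent to the heavyweight model.
func (c Confusion) EscalationRate() float64 {
	return float64(c.FP+c.TP) / float64(c.Total())
}

// DraftAccuracy is the share of accepted drafts that were correct.
// Assumption: this is how the dashboard's "Draft Accuracy" is defined.
func (c Confusion) DraftAccuracy() float64 {
	return float64(c.TN) / float64(c.Accepted())
}
```

With `Confusion{TN: 478, FP: 29, FN: 9, TP: 2}`, `Accepted()` is 487 of 518, and `DraftAccuracy()` is 478/487, which rounds to 98.2%.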
Threshold Sweep (n=518)

4 categories: factual, reasoning, code generation, ambiguous/creative. LLM-as-judge evaluation. Swept T=1.0..2.5 in 0.25 steps.

T      ESC%    ACC%     COST RED%
1.00   68.9    100.0     8.2
1.25   49.2     99.6    31.0
1.50   30.9     98.6    56.2
1.75   13.9     98.4    81.2
2.00    6.0     98.2    91.6    SEL
2.25    0.4     97.9    99.0
2.50    0.0     97.9    99.2
Cost Model
Baseline (all-heavyweight): $1.591 (518 prompts)
Routed at T=2.0: $0.133 (91.6% reduction)

DRAFTER OUTPUT: $0.80/1M tok
HEAVY OUTPUT: $10.00/1M tok
DRAFTER INPUT: $0.20/1M tok
HEAVY INPUT: $2.50/1M tok

DRAFTER: gpt-4.1-nano
HEAVYWEIGHT: gpt-4.1
EMBEDDINGS: text-embedding-3-small
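The arithmetic behind the cost model can be sketched as follows. The per-token prices are the ones listed above; the `Pricing` type and function names are illustrative, and per-prompt token counts are not given in the source, so the sketch checks only the reduction percentage, not the dollar totals.

```go
package main

// Pricing is a per-model price in USD per one million tokens.
type Pricing struct {
	InputPerM  float64
	OutputPerM float64
}

// Prices as quoted in the cost model.
var (
	DrafterPricing = Pricing{InputPerM: 0.20, OutputPerM: 0.80}
	HeavyPricing   = Pricing{InputPerM: 2.50, OutputPerM: 10.00}
)

// Cost returns the USD cost of one call with the given token counts.
func (p Pricing) Cost(inTok, outTok int) float64 {
	return (float64(inTok)*p.InputPerM + float64(outTok)*p.OutputPerM) / 1e6
}

// Reduction returns the fractional saving of the routed cost
// relative to the all-heavyweight baseline.
func Reduction(baseline, routed float64) float64 {
	if baseline <= 0 {
		return 0 // avoid dividing by zero on an empty baseline
	}
	return 1 - routed/baseline
}
```

`Reduction(1.591, 0.133)` is roughly 0.916, matching the 91.6% figure.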
Entropy Analysis

H(X) = -Σ p(x) log₂ p(x), summed over the top-5 logprobs per token. 10-token sliding window.
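A minimal sketch of this computation follows. It assumes the logprobs arrive as natural-log values, that the top-5 probability mass is used as-is without renormalizing, and that the sliding window is reduced by a mean; none of these three choices is confirmed by the source. `TokenEntropy`, `Window`, and `NewWindow` are illustrative names.

```go
package main

import "math"

// TokenEntropy returns Shannon entropy in bits over the top-k
// alternatives for one token position.
// Assumptions: inputs are natural-log probabilities, and the top-k
// mass is not renormalized to sum to 1.
func TokenEntropy(logprobs []float64) float64 {
	h := 0.0
	for _, lp := range logprobs {
		p := math.Exp(lp)
		if p > 0 { // skip zero-probability entries to avoid 0 * -Inf
			h -= p * (lp / math.Ln2) // log2(p) = ln(p) / ln(2)
		}
	}
	return h
}

// Window keeps the most recent per-token entropies.
type Window struct {
	size int
	vals []float64
}

// NewWindow creates a window of the given size (at least 1).
func NewWindow(size int) *Window {
	if size < 1 {
		size = 1
	}
	return &Window{size: size}
}

// Push adds one entropy value and returns the mean of the values
// currently in the window (assumption: the window statistic is a mean).
func (w *Window) Push(h float64) float64 {
	w.vals = append(w.vals, h)
	if len(w.vals) > w.size {
		w.vals = w.vals[1:] // drop the oldest value
	}
	sum := 0.0
	for _, v := range w.vals {
		sum += v
	}
	return sum / float64(len(w.vals))
}
```

With this definition, two equally likely alternatives give 1 bit and four give 2 bits, so a threshold of T=2.0 bits corresponds to roughly four equally plausible next tokens.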

ARITHMETIC: What is 347 + 892?
ACCEPT (avg 0.028, peak 0.303)

FACTUAL: What is photosynthesis?
ACCEPT (avg 0.210, peak 1.835)

AMBIGUOUS: Define ubiquitous
ESCALATE (avg 0.359, peak 2.198)

Chart legend: T=2.0, mean entropy, peak (under T), peak (over T)
Routing State Machine
DRAFTING (entry)
Drafter generates. Entropy computed per token, 10-token sliding window.
SPECULATING (H(X) > 0.8*T)
Soft threshold crossed. Heavyweight fires in parallel via goroutine.
ACCEPTED (H(X) < T at EOS)
Draft served. Response cached. Speculative heavyweight canceled if running.
ESCALATED (H(X) >= T)
Heavyweight response used. Speculative head start reduces tail latency.
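The four states above can be sketched as a small per-request state machine. This is an illustration under assumptions: the source does not say whether SPECULATING can drop back to DRAFTING when entropy recovers, so the sketch keeps the request in SPECULATING until end of sequence. `Router`, `Observe`, and `Finish` are hypothetical names.

```go
package main

// State is the routing state of a single request.
type State int

const (
	Drafting State = iota
	Speculating
	Accepted
	Escalated
)

// Router tracks routing state for one request.
type Router struct {
	T     float64 // hard threshold in bits, e.g. 2.0
	Soft  float64 // soft threshold ratio, e.g. 0.8
	State State
}

// Observe feeds one windowed entropy value and returns the new state.
// Terminal states (Accepted, Escalated) are never left.
func (r *Router) Observe(h float64) State {
	if r.State == Accepted || r.State == Escalated {
		return r.State
	}
	switch {
	case h >= r.T:
		r.State = Escalated // hard threshold crossed
	case h > r.Soft*r.T && r.State == Drafting:
		r.State = Speculating // soft threshold crossed
	}
	return r.State
}

// Finish is called at end of sequence: a draft that never reached
// the hard threshold is accepted.
func (r *Router) Finish() State {
	if r.State == Drafting || r.State == Speculating {
		r.State = Accepted
	}
	return r.State
}
```

Fed the peak values from the entropy analysis with T=2.0, a peak of 1.835 moves the request to SPECULATING and then ACCEPTED at EOS, while a peak of 2.198 moves it to ESCALATED.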
Key Parameters
HARD THRESHOLD: T=2.0 bits
SOFT THRESHOLD: 0.8*T=1.6 bits
WINDOW: 10 tokens
EARLY EXIT: 10 tokens
CACHE COSINE: >0.95
TOP LOGPROBS: 5
System Architecture
CLIENT -> GATEWAY (Go net/http, goroutines) -> SEMANTIC CACHE (Qdrant + Redis)
  HIT -> RETURN CACHED
  MISS -> DRAFTER (gpt-4.1-nano, logprobs) -> ENTROPY ENGINE (H(X) = -Σ p log₂ p, 10-tok)
    H < T -> ACCEPT (serve draft, cache)
    H ≥ T -> ESCALATE -> HEAVYWEIGHT (gpt-4.1, speculate at 0.8*T)
Known Tradeoffs
CONFIDENT HALLUCINATION

Drafter produces wrong answers with low entropy. 9 FN in 518 prompts. Mitigated by conservative T, periodic audits, downstream feedback.

SPECULATIVE WASTE

Heavyweight fires at 0.8*T, then the drafter recovers and the draft is accepted: the speculative call is wasted. The waste is under 10% of escalation cost, and the latency savings on true escalations justify the overhead.
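One way to bound that waste is to tie the speculative call to a cancelable context, so an accepted draft can abandon it early. The sketch below uses only the standard `context` package; `Speculation`, `StartSpeculation`, and the stand-in calls inside `demo` are hypothetical and do not reflect the project's actual client code.

```go
package main

import "context"

// Result is the outcome of a heavyweight call (illustrative fields).
type Result struct {
	Text string
	Err  error
}

// Speculation wraps an in-flight speculative heavyweight call.
type Speculation struct {
	cancel context.CancelFunc
	done   chan Result
}

// StartSpeculation launches call in a goroutine. The call argument is
// a hypothetical function that performs the heavyweight request and
// honors context cancellation.
func StartSpeculation(parent context.Context, call func(context.Context) (string, error)) *Speculation {
	ctx, cancel := context.WithCancel(parent)
	s := &Speculation{cancel: cancel, done: make(chan Result, 1)}
	go func() {
		text, err := call(ctx)
		s.done <- Result{Text: text, Err: err} // buffered, so this never blocks
	}()
	return s
}

// Cancel abandons the speculative call, e.g. when the draft is accepted.
func (s *Speculation) Cancel() { s.cancel() }

// Wait blocks until the call returns, then releases the context.
func (s *Speculation) Wait() Result {
	defer s.cancel()
	return <-s.done
}

// demo exercises both paths with stand-in calls.
func demo() (canceledErr error, escalatedText string) {
	// Draft accepted: cancel the speculative call.
	s1 := StartSpeculation(context.Background(), func(ctx context.Context) (string, error) {
		<-ctx.Done() // simulates a slow request that stops on cancellation
		return "", ctx.Err()
	})
	s1.Cancel()
	canceledErr = s1.Wait().Err

	// Escalated: use the speculative result.
	s2 := StartSpeculation(context.Background(), func(ctx context.Context) (string, error) {
		return "heavy answer", nil
	})
	escalatedText = s2.Wait().Text
	return canceledErr, escalatedText
}
```

Cancellation stops the local wait immediately; whether the provider still bills for tokens already generated is outside what this sketch can control.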

CACHE CONSERVATISM

Cosine > 0.95 strict by design. Only draft-accepted responses cached. Escalated responses excluded. Re-draft beats stale answer.
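The strict similarity check can be illustrated with a plain cosine computation over embedding vectors. In the described stack the vector search would be handled by Qdrant; this standalone sketch only shows the threshold logic, and `Cosine` and `IsCacheHit` are illustrative names.

```go
package main

import "math"

// CacheThreshold is the minimum cosine similarity for a cache hit.
const CacheThreshold = 0.95

// Cosine returns the cosine similarity of two vectors.
// It returns 0 for mismatched lengths, empty input, or zero vectors.
func Cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// IsCacheHit applies the strict "greater than 0.95" rule.
func IsCacheHit(query, cached []float64) bool {
	return Cosine(query, cached) > CacheThreshold
}
```

A query embedding that points in nearly the same direction as a cached one counts as a hit; anything at or below 0.95 falls through to the drafter.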

ENTROPY vs CLASSIFICATION

Classifiers operate on the prompt. Entropy operates on the generation. Robust to distribution shift across all 4 categories.

GATEWAY: Go net/http
DRAFTER: gpt-4.1-nano
HEAVYWEIGHT: gpt-4.1
CACHE: Qdrant + Redis
OBS: Prometheus + Grafana