- Cost Reduction: 91.6% (vs. all-heavyweight baseline)
- Draft Acceptance: 94.0% (487 of 518 prompts)
- Draft Accuracy: 98.2% (LLM-as-judge eval)
- P99 Latency: 109 ms (draft path, 50 req/s)
Cost Reduction by Threshold

[Chart: accuracy and cost reduction vs. threshold T, from T=1.00 (100% accuracy, minimal savings) to T=2.50 (97.9% accuracy, ~99% savings); full data in the threshold sweep table below.]
Confusion Matrix at T=2.0 (n=518; positive class = escalation)

| Actual \ Routed | Accept | Escalate |
|---|---|---|
| Draft correct | 478 (TN) | 29 (FP) |
| Draft wrong | 9 (FN) | 2 (TP) |
Threshold Sweep (n=518)
Four prompt categories: factual, reasoning, code generation, and ambiguous/creative. Responses scored by LLM-as-judge. Threshold swept from T=1.0 to T=2.5 in 0.25-bit steps.
| T | ESC% | ACC% | COST RED% | ACCEPT | ESCAL |
|---|---|---|---|---|---|
| 1.00 | 68.9 | 100.0 | 8.2 | 161 | 357 |
| 1.25 | 49.2 | 99.6 | 31.0 | 263 | 255 |
| 1.50 | 30.9 | 98.6 | 56.2 | 358 | 160 |
| 1.75 | 13.9 | 98.4 | 81.2 | 446 | 72 |
| 2.00 (selected) | 6.0 | 98.2 | 91.6 | 487 | 31 |
| 2.25 | 0.4 | 97.9 | 99.0 | 516 | 2 |
| 2.50 | 0.0 | 97.9 | 99.2 | 518 | 0 |
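The operating point can be read off the sweep mechanically. A minimal sketch (the report does not state how T=2.0 was selected; `pickThreshold` and the 98% accuracy floor are our illustration):

```go
package main

import "fmt"

// sweepRow mirrors one row of the threshold sweep table above.
type sweepRow struct {
	T       float64
	AccPct  float64 // draft accuracy among served responses, %
	CostRed float64 // cost reduction vs. all-heavyweight baseline, %
}

// pickThreshold returns the row with the highest cost reduction whose
// accuracy stays at or above the floor.
func pickThreshold(rows []sweepRow, accFloor float64) sweepRow {
	best := rows[0]
	for _, r := range rows {
		if r.AccPct >= accFloor && r.CostRed > best.CostRed {
			best = r
		}
	}
	return best
}

func main() {
	rows := []sweepRow{
		{1.00, 100.0, 8.2}, {1.25, 99.6, 31.0}, {1.50, 98.6, 56.2},
		{1.75, 98.4, 81.2}, {2.00, 98.2, 91.6}, {2.25, 97.9, 99.0},
		{2.50, 97.9, 99.2},
	}
	// With a 98% accuracy floor, T=2.0 maximizes cost reduction.
	fmt.Printf("T=%.2f\n", pickThreshold(rows, 98.0).T) // T=2.00
}
```

Under a 98% floor the sweep yields T=2.0, matching the selected row; a 97.5% floor would instead pick T=2.50.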
Cost Model

- Baseline (all-heavyweight): $1.591 for 518 prompts
- Routed at T=2.0: $0.133 (91.6% reduction)

| Model role | Input | Output |
|---|---|---|
| Drafter (gpt-4.1-nano) | $0.20/1M tok | $0.80/1M tok |
| Heavyweight (gpt-4.1) | $2.50/1M tok | $10.00/1M tok |

Embeddings: text-embedding-3-small
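The cost arithmetic can be sketched from the published prices. The average token counts below (200 in, 250 out) are assumptions chosen only to illustrate the computation; the report does not publish per-prompt token counts, so this toy model will not reproduce the reported 91.6% exactly:

```go
package main

import "fmt"

// Published per-token prices from the cost model ($ per token).
const (
	draftIn, draftOut = 0.20e-6, 0.80e-6  // gpt-4.1-nano
	heavyIn, heavyOut = 2.50e-6, 10.00e-6 // gpt-4.1
)

// Assumed average token counts per prompt (illustrative only).
const avgIn, avgOut = 200.0, 250.0

func main() {
	const n, escalated = 518.0, 31.0
	perHeavy := avgIn*heavyIn + avgOut*heavyOut
	perDraft := avgIn*draftIn + avgOut*draftOut

	baseline := n * perHeavy // every prompt served by gpt-4.1
	// Routed: every prompt pays for a draft; escalations also pay heavyweight.
	routed := n*perDraft + escalated*perHeavy
	fmt.Printf("baseline=$%.3f routed=$%.3f reduction=%.1f%%\n",
		baseline, routed, 100*(1-routed/baseline))
}
```

With these toy counts the reduction lands in the high-80s rather than 91.6%; real per-prompt token distributions (and cache hits) shift the exact figure.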
Entropy Analysis

Per-token Shannon entropy H(X) = -Σ p(x) log2 p(x) over the top-5 logprobs, averaged over a 10-token sliding window.

| Category | Prompt | Decision | Avg H | Peak H |
|---|---|---|---|---|
| Arithmetic | "What is 347 + 892?" | Accept | 0.028 | 0.303 |
| Factual | "What is photosynthesis?" | Accept | 0.210 | 1.835 |
| Ambiguous | "Define ubiquitous" | Escalate | 0.359 | 2.198 |

(Decisions at T=2.0: escalate when the peak windowed entropy exceeds T.)
Routing State Machine

- DRAFTING (entry): Drafter generates. Entropy computed per token over a 10-token sliding window.
- SPECULATING (H(X) > 0.8·T): Soft threshold crossed. Heavyweight fires in parallel via goroutine.
- ACCEPTED (H(X) < T at EOS): Draft served and response cached. Any speculative heavyweight call is canceled.
- ESCALATED (H(X) ≥ T): Heavyweight response used. The speculative head start reduces tail latency.
Key Parameters

- Hard threshold: T = 2.0 bits
- Soft threshold: 0.8·T = 1.6 bits
- Entropy window: 10 tokens
- Early exit: 10 tokens
- Cache cosine similarity: > 0.95
- Top logprobs: 5
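Collected as a config struct for reference. This is a sketch: the field names (and the reading of "early exit" as a minimum token count before deciding) are our assumptions, not a published schema:

```go
package main

import "fmt"

// RouterConfig gathers the key parameters above. Field names are
// illustrative; the report does not publish its config schema.
type RouterConfig struct {
	HardThresholdBits float64 // escalate when windowed entropy >= this
	SoftFraction      float64 // speculate at SoftFraction * HardThresholdBits
	WindowTokens      int     // sliding-window length
	EarlyExitTokens   int     // "early exit" parameter (semantics assumed)
	CacheCosineMin    float64 // semantic-cache hit cutoff
	TopLogprobs       int     // logprobs requested per token
}

// Default returns the values reported at the selected operating point.
func Default() RouterConfig {
	return RouterConfig{
		HardThresholdBits: 2.0,
		SoftFraction:      0.8,
		WindowTokens:      10,
		EarlyExitTokens:   10,
		CacheCosineMin:    0.95,
		TopLogprobs:       5,
	}
}

func main() {
	c := Default()
	fmt.Printf("soft threshold = %.1f bits\n", c.SoftFraction*c.HardThresholdBits) // 1.6
}
```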
Known Tradeoffs

- Confident hallucination: the drafter produces wrong answers with low entropy (9 FN out of 518 prompts). Mitigated by a conservative T, periodic audits, and downstream feedback.
- Speculative waste: the heavyweight fires at 0.8·T but the drafter recovers, wasting the call. This costs under 10% of escalation spend; latency savings on true escalations justify the overhead.
- Cache conservatism: the cosine > 0.95 cutoff is strict by design. Only draft-accepted responses are cached; escalated responses are excluded. A fresh draft beats a stale answer.
- Entropy vs. classification: a classifier operates on the prompt; entropy operates on the generation itself, making routing robust to distribution shift across all four categories.

System Architecture

- Gateway: Go net/http
- Drafter: gpt-4.1-nano
- Heavyweight: gpt-4.1
- Cache: Qdrant + Redis
- Observability: Prometheus + Grafana