- Cost Reduction: 91.6% (vs. all-heavyweight baseline)
- Draft Acceptance: 94.0% (487 of 518 prompts)
- Draft Accuracy: 98.2% (LLM-as-judge eval)
- P99 Latency: 109 ms (draft path, 50 req/s)
Cost Reduction by Threshold

[Chart: accuracy and cost reduction vs. threshold T, from T=1.00 (100% accuracy, minimal savings) to T=2.50 (97.9% accuracy, ~99% savings); full data in the threshold sweep table below.]
Confusion Matrix at T=2.0 (n=518; positive class = escalation)

| Actual \ Routed | Accept | Escalate |
|---|---|---|
| Draft correct | 478 (TN) | 29 (FP) |
| Draft wrong | 9 (FN) | 2 (TP) |
Threshold Sweep (n=518)
Four prompt categories: factual, reasoning, code generation, and ambiguous/creative. Responses scored by LLM-as-judge. Threshold swept from T=1.0 to T=2.5 in 0.25-bit steps.
| T | ESC% | ACC% | COST RED% | ACCEPT | ESCAL |
|---|---|---|---|---|---|
| 1.00 | 68.9 | 100.0 | 8.2 | 161 | 357 |
| 1.25 | 49.2 | 99.6 | 31.0 | 263 | 255 |
| 1.50 | 30.9 | 98.6 | 56.2 | 358 | 160 |
| 1.75 | 13.9 | 98.4 | 81.2 | 446 | 72 |
| 2.00 (selected) | 6.0 | 98.2 | 91.6 | 487 | 31 |
| 2.25 | 0.4 | 97.9 | 99.0 | 516 | 2 |
| 2.50 | 0.0 | 97.9 | 99.2 | 518 | 0 |
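The operating point can be read off the sweep mechanically. A minimal sketch (the report does not state how T=2.0 was selected; `pickThreshold` and the 98% accuracy floor are our illustration):

```go
package main

import "fmt"

// sweepRow mirrors one row of the threshold sweep table above.
type sweepRow struct {
	T       float64
	AccPct  float64 // draft accuracy among served responses, %
	CostRed float64 // cost reduction vs. all-heavyweight baseline, %
}

// pickThreshold returns the row with the highest cost reduction whose
// accuracy stays at or above the floor.
func pickThreshold(rows []sweepRow, accFloor float64) sweepRow {
	best := rows[0]
	for _, r := range rows {
		if r.AccPct >= accFloor && r.CostRed > best.CostRed {
			best = r
		}
	}
	return best
}

func main() {
	rows := []sweepRow{
		{1.00, 100.0, 8.2}, {1.25, 99.6, 31.0}, {1.50, 98.6, 56.2},
		{1.75, 98.4, 81.2}, {2.00, 98.2, 91.6}, {2.25, 97.9, 99.0},
		{2.50, 97.9, 99.2},
	}
	// With a 98% accuracy floor, T=2.0 maximizes cost reduction.
	fmt.Printf("T=%.2f\n", pickThreshold(rows, 98.0).T) // T=2.00
}
```

Under a 98% floor the sweep yields T=2.0, matching the selected row; a 97.5% floor would instead pick T=2.50.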
Cost Model

- Baseline (all-heavyweight): $1.591 for 518 prompts
- Routed at T=2.0: $0.133 (91.6% reduction)

| Model role | Input | Output |
|---|---|---|
| Drafter (gpt-4.1-nano) | $0.20/1M tok | $0.80/1M tok |
| Heavyweight (gpt-4.1) | $2.50/1M tok | $10.00/1M tok |

Embeddings: text-embedding-3-small
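The cost arithmetic can be sketched from the published prices. The average token counts below (200 in, 250 out) are assumptions chosen only to illustrate the computation; the report does not publish per-prompt token counts, so this toy model will not reproduce the reported 91.6% exactly:

```go
package main

import "fmt"

// Published per-token prices from the cost model ($ per token).
const (
	draftIn, draftOut = 0.20e-6, 0.80e-6  // gpt-4.1-nano
	heavyIn, heavyOut = 2.50e-6, 10.00e-6 // gpt-4.1
)

// Assumed average token counts per prompt (illustrative only).
const avgIn, avgOut = 200.0, 250.0

func main() {
	const n, escalated = 518.0, 31.0
	perHeavy := avgIn*heavyIn + avgOut*heavyOut
	perDraft := avgIn*draftIn + avgOut*draftOut

	baseline := n * perHeavy // every prompt served by gpt-4.1
	// Routed: every prompt pays for a draft; escalations also pay heavyweight.
	routed := n*perDraft + escalated*perHeavy
	fmt.Printf("baseline=$%.3f routed=$%.3f reduction=%.1f%%\n",
		baseline, routed, 100*(1-routed/baseline))
}
```

With these toy counts the reduction lands in the high-80s rather than 91.6%; real per-prompt token distributions (and cache hits) shift the exact figure.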
Entropy Analysis

Per-token Shannon entropy H(X) = -Σ p(x) log2 p(x) over the top-5 logprobs, averaged over a 10-token sliding window.

| Category | Prompt | Decision | Avg H | Peak H |
|---|---|---|---|---|
| Arithmetic | "What is 347 + 892?" | Accept | 0.028 | 0.303 |
| Factual | "What is photosynthesis?" | Accept | 0.210 | 1.835 |
| Ambiguous | "Define ubiquitous" | Escalate | 0.359 | 2.198 |

(Decisions at T=2.0: escalate when the peak windowed entropy exceeds T.)
Routing State Machine

- DRAFTING (entry): Drafter generates. Entropy computed per token over a 10-token sliding window.
- SPECULATING (H(X) > 0.8·T): Soft threshold crossed. Heavyweight fires in parallel via goroutine.
- ACCEPTED (H(X) < T at EOS): Draft served and response cached. Any speculative heavyweight call is canceled.
- ESCALATED (H(X) ≥ T): Heavyweight response used. The speculative head start reduces tail latency.
Key Parameters

- Hard threshold: T = 2.0 bits
- Soft threshold: 0.8·T = 1.6 bits
- Entropy window: 10 tokens
- Early exit: 10 tokens
- Cache cosine similarity: > 0.95
- Top logprobs: 5
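Collected as a config struct for reference. This is a sketch: the field names (and the reading of "early exit" as a minimum token count before deciding) are our assumptions, not a published schema:

```go
package main

import "fmt"

// RouterConfig gathers the key parameters above. Field names are
// illustrative; the report does not publish its config schema.
type RouterConfig struct {
	HardThresholdBits float64 // escalate when windowed entropy >= this
	SoftFraction      float64 // speculate at SoftFraction * HardThresholdBits
	WindowTokens      int     // sliding-window length
	EarlyExitTokens   int     // "early exit" parameter (semantics assumed)
	CacheCosineMin    float64 // semantic-cache hit cutoff
	TopLogprobs       int     // logprobs requested per token
}

// Default returns the values reported at the selected operating point.
func Default() RouterConfig {
	return RouterConfig{
		HardThresholdBits: 2.0,
		SoftFraction:      0.8,
		WindowTokens:      10,
		EarlyExitTokens:   10,
		CacheCosineMin:    0.95,
		TopLogprobs:       5,
	}
}

func main() {
	c := Default()
	fmt.Printf("soft threshold = %.1f bits\n", c.SoftFraction*c.HardThresholdBits) // 1.6
}
```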
Known Tradeoffs

- Confident hallucination: the drafter produces wrong answers with low entropy (9 FN out of 518 prompts). Mitigated by a conservative T, periodic audits, and downstream feedback.
- Speculative waste: the heavyweight fires at 0.8·T but the drafter recovers, wasting the call. This costs under 10% of escalation spend; latency savings on true escalations justify the overhead.
- Cache conservatism: the cosine > 0.95 cutoff is strict by design. Only draft-accepted responses are cached; escalated responses are excluded. A fresh draft beats a stale answer.
- Entropy vs. classification: a classifier operates on the prompt; entropy operates on the generation itself, making routing robust to distribution shift across all four categories.

System Architecture

- Gateway: Go net/http
- Drafter: gpt-4.1-nano
- Heavyweight: gpt-4.1
- Cache: Qdrant + Redis
- Observability: Prometheus + Grafana