Spec: LLM Provider Interface
- Status: Draft
- Last amended: 2026-06-03 (Antigravity SSE and history sanitation)
- Constrained by: ADR-0004, ADR-0011, ADR-0012, ADR-0014, ADR-0024, ADR-0026, ADR-0028
- Implements: packages/llm/
Purpose
This spec defines @kaged/llm — the package that talks to LLM providers. It takes a resolved route (provider name, model ID, credentials) and a conversation context, and returns a stream of typed events representing the assistant's response.
This package is normative for:
- The provider adapter interface and its contract.
- The streaming event protocol (SSE parsing, event emission, partial-JSON tool arguments).
- The six API shapes supported: anthropic-messages, openai-completions, openai-responses, openai-codex-responses, google-generative-ai, antigravity.
- The message, tool, and context types that cross the package boundary.
- The error taxonomy for provider failures.
- AbortSignal integration for request cancellation.
It is not normative for:
- Model alias resolution or fallback chains (that's provider-router in @kaged/harness).
- Credential storage or operator-local config (that's @kaged/local-config).
- Session state machines or run lifecycle (that's session-manager.md).
- The WebSocket relay from daemon to UI (that's http-api.md).
- Model catalog persistence or operator-local config (that's @kaged/local-config); this package provides the discovery functions (listModels, humanizeModelId) consumed by the daemon's catalog endpoints.
This package is a pure provider interface. It is not a general-purpose LLM framework. It exists to translate kaged's internal context into provider-specific HTTP requests and translate provider-specific SSE responses back into kaged's event stream.
Per ADR-0014, this package is also the single provider path for kaged. It exposes a LanguageModelV2 shim (see § Mastra integration) that Mastra v1.x consumes as Agent.model, so the agent loop and the direct call path both route through the same provider adapters. There is no parallel @ai-sdk/<provider> dependency tree.
Constraints (from ADRs)
| Constraint | Source |
|---|---|
| Runtime is Bun; no Node-isms in production code | ADR-0004 |
| No official SDKs (@anthropic-ai/sdk, openai); pure fetch-based | Operator decision (see § Design rationale) |
| Projects are portable; provider credentials are operator-local | ADR-0011 |
| Single provider path for all LLM calls; expose a LanguageModelV2 shim for Mastra; no @ai-sdk/<provider> deps | ADR-0014 |
Design rationale
pi-ai (the reference implementation in reference/oh-my-pi/packages/ai/) uses official SDKs for Anthropic and imports from @bufbuild/protobuf for Cursor. kaged deliberately avoids these:
- Fewer deps. Official SDKs pull in HTTP clients, retry logic, and polyfills we don't need; Bun's fetch handles everything.
- Smaller surface. We support 4 API shapes, not 11. The SDK abstraction layers add indirection we'd have to fight.
- ARM64 safety. No native deps that wobble on low-resource Linux hosts.
The pi-ai source is the reference for wire protocols (request/response shapes, SSE event formats, header requirements). We rewrite the transport in pure TS with fetch.
Types
All types are kaged's own. They are informed by pi-ai's shapes but are not imported from it.
Message types
interface TextContent {
type: "text";
text: string;
}
interface ThinkingContent {
type: "thinking";
thinking: string;
}
interface ImageContent {
type: "image";
data: string; // base64
mimeType: string; // e.g. "image/png"
}
interface ToolCall {
type: "toolCall";
id: string;
name: string;
arguments: Record<string, unknown>;
}
interface UserMessage {
role: "user";
content: string | (TextContent | ImageContent)[];
timestamp: number;
}
interface AssistantMessage {
role: "assistant";
content: (TextContent | ThinkingContent | ToolCall)[];
provider: string;
model: string;
usage: Usage;
stopReason: StopReason;
errorMessage?: string;
timestamp: number;
duration?: number;
ttft?: number; // time to first token (ms)
}
interface ToolResultMessage {
role: "toolResult";
toolCallId: string;
toolName: string;
content: (TextContent | ImageContent)[];
isError: boolean;
timestamp: number;
}
interface SystemMessage {
role: "system";
content: string;
}
type Message = UserMessage | SystemMessage | AssistantMessage | ToolResultMessage;
Context
interface Tool {
name: string;
description: string;
parameters: Record<string, unknown>; // JSON Schema object
strict?: boolean;
}
interface Context {
systemPrompt?: string[];
messages: Message[];
tools?: Tool[];
}
Usage & Stop
interface Usage {
input: number;
output: number;
cacheRead: number;
cacheWrite: number;
totalTokens: number;
reasoningTokens?: number;
cost: {
input: number;
output: number;
reasoning: number;
cacheRead: number;
cacheWrite: number;
total: number;
};
}
type StopReason = "stop" | "length" | "toolUse" | "error" | "aborted";
Stream events
The event stream is an AsyncIterable<StreamEvent> with a .result() method that resolves to the final AssistantMessage.
type StreamEvent =
| { type: "start"; partial: AssistantMessage }
| { type: "text_start"; contentIndex: number; partial: AssistantMessage }
| { type: "text_delta"; contentIndex: number; delta: string; partial: AssistantMessage }
| { type: "text_end"; contentIndex: number; content: string; partial: AssistantMessage }
| { type: "thinking_start"; contentIndex: number; partial: AssistantMessage }
| { type: "thinking_delta"; contentIndex: number; delta: string; partial: AssistantMessage }
| { type: "thinking_end"; contentIndex: number; content: string; partial: AssistantMessage }
| { type: "toolcall_start"; contentIndex: number; partial: AssistantMessage }
| { type: "toolcall_delta"; contentIndex: number; delta: string; partial: AssistantMessage }
| { type: "toolcall_end"; contentIndex: number; toolCall: ToolCall; partial: AssistantMessage }
| { type: "done"; reason: "stop" | "length" | "toolUse"; message: AssistantMessage }
| { type: "error"; reason: "error" | "aborted"; error: AssistantMessage };
These events match pi-ai's AssistantMessageEvent shape exactly. This is intentional — the event protocol is battle-tested and the daemon's WebSocket relay can forward them without transformation.
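As a rough illustration of how a consumer might fold these events into display text, the sketch below accumulates text deltas per content block. It declares a trimmed-down local event type (text and terminal variants only, without the partial field) so that it stands alone. MiniEvent and collectText are hypothetical names, not part of @kaged/llm, and the sketch assumes text_end carries the complete text of its block.

```typescript
// Trimmed-down local copy of the event union: only the variants this sketch
// needs, and without the `partial` snapshot that real events carry.
type MiniEvent =
  | { type: "text_delta"; contentIndex: number; delta: string }
  | { type: "text_end"; contentIndex: number; content: string }
  | { type: "done"; reason: "stop" | "length" | "toolUse" }
  | { type: "error"; reason: "error" | "aborted" };

// Hypothetical helper: fold events into one string per content block.
function collectText(events: Iterable<MiniEvent>): Map<number, string> {
  const blocks = new Map<number, string>();
  for (const ev of events) {
    if (ev.type === "text_delta") {
      blocks.set(ev.contentIndex, (blocks.get(ev.contentIndex) ?? "") + ev.delta);
    } else if (ev.type === "text_end") {
      // Assumption: text_end holds the full block, so it replaces the deltas.
      blocks.set(ev.contentIndex, ev.content);
    } else {
      break; // "done" or "error" terminates the stream
    }
  }
  return blocks;
}

const sampleEvents: MiniEvent[] = [
  { type: "text_delta", contentIndex: 0, delta: "Hel" },
  { type: "text_delta", contentIndex: 0, delta: "lo" },
  { type: "done", reason: "stop" },
];
const collected = collectText(sampleEvents);
```

A real consumer would iterate the AsyncIterable with for await and could read the partial field instead of accumulating deltas itself.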
Provider adapter interface
/** What the harness hands to @kaged/llm after route resolution. */
interface ProviderRoute {
providerName: string;
modelId: string;
apiKey: string;
baseUrl?: string;
defaultOptions?: Record<string, unknown>;
}
/** Options for a stream request. */
interface StreamOptions {
signal?: AbortSignal;
temperature?: number;
maxTokens?: number;
topP?: number;
stopSequences?: string[];
reasoning?: EffortLevel;
headers?: Record<string, string>;
}
type EffortLevel = "minimal" | "low" | "medium" | "high";
/** The main entry point. */
function streamModel(
route: ProviderRoute,
context: Context,
options?: StreamOptions,
): LlmEventStream;
/** Convenience: await the full response. */
async function completeModel(
route: ProviderRoute,
context: Context,
options?: StreamOptions,
): Promise<AssistantMessage>;
API shape resolution
The provider name determines which API shape to use. The mapping is configured per-provider, not per-model:
| Provider name | API shape | Notes |
|---|---|---|
| anthropic | anthropic-messages | Direct Anthropic API |
| openai | openai-completions | Chat completions (v1/chat/completions) |
| google | google-generative-ai | Gemini generateContent |
| xai | openai-completions | OpenAI-compatible |
| groq | openai-completions | OpenAI-compatible |
| deepseek | openai-completions | OpenAI-compatible |
| mistral | openai-completions | OpenAI-compatible |
| ollama | openai-completions | OpenAI-compatible (/v1/chat/completions) |
| vllm | openai-completions | OpenAI-compatible |
| lm-studio | openai-completions | OpenAI-compatible |
| litellm | openai-completions | OpenAI-compatible |
| openrouter | openai-completions | OpenAI-compatible |
| antigravity | antigravity | Google Cloud Code proxy (Antigravity); OAuth bearer token auth |
| codex | openai-codex-responses | OpenAI Codex (ChatGPT backend); OAuth via auth.openai.com, PKCE flow; uses input array, not messages |
| copilot | openai-completions | GitHub Copilot (multi-provider via Copilot subscription); device-code OAuth via github.com; custom headers |
| zai | anthropic-messages | Z.AI (GLM Coding Plan); Anthropic-compatible proxy at api.z.ai |
This mapping lives in a provider-map.ts file. Operators can extend it via local config (future; v0 ships with the hardcoded map above).
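A minimal sketch of what the lookup in provider-map.ts could look like, assuming the map is held in a plain Map keyed by provider name. The ApiShape union is taken from the six shapes listed in this spec; the PROVIDER_MAP constant and its contents are an illustrative subset, not the actual file.

```typescript
type ApiShape =
  | "anthropic-messages"
  | "openai-completions"
  | "openai-responses"
  | "openai-codex-responses"
  | "google-generative-ai"
  | "antigravity";

// Illustrative subset of the hardcoded v0 map; the real table has more rows.
const PROVIDER_MAP = new Map<string, ApiShape>([
  ["anthropic", "anthropic-messages"],
  ["openai", "openai-completions"],
  ["google", "google-generative-ai"],
  ["ollama", "openai-completions"],
  ["antigravity", "antigravity"],
  ["codex", "openai-codex-responses"],
  ["zai", "anthropic-messages"],
]);

// Resolution is per provider: the model ID plays no part in the lookup.
function resolveApiShape(name: string): ApiShape | undefined {
  return PROVIDER_MAP.get(name);
}

const zaiShape = resolveApiShape("zai");
const unknownShape = resolveApiShape("not-a-provider");
```

Returning undefined for unknown names leaves the caller to decide whether that is a configuration error.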
Driver catalog
@kaged/llm is the single source of truth for driver metadata. The daemon relays the full catalog to the UI via GET /api/v1/local/providers (as known_drivers); the UI renders provider config forms dynamically from it — no hardcoded driver lists in the frontend.
Types
/** Auth modes a driver supports. */
type DriverAuthMode = "api_key" | "oauth" | "none";
/** Full driver metadata — the shape the daemon relays to the UI. */
interface DriverInfo {
/** Driver identifier (e.g. "anthropic", "ollama"). */
name: string;
/** Human-readable label (e.g. "Anthropic", "Ollama"). */
label: string;
/** Which API shape this driver speaks. */
apiShape: ApiShape;
/** Default base URL, if any. */
defaultBaseUrl: string | undefined;
/** Default test model for the provider test endpoint. */
testModel: string;
/** Whether this is a local/self-hosted driver. */
local: boolean;
/** Auth modes this driver supports, ordered by preference. */
authModes: DriverAuthMode[];
}
Public functions
| Function | Returns | Purpose |
|---|---|---|
| listDrivers() | DriverInfo[] | Full catalog with metadata for all known drivers |
| knownProviders() | string[] | Driver names only (legacy; use listDrivers for new code) |
| resolveApiShape(name) | ApiShape \| undefined | API shape for a driver name |
| getDefaultBaseUrl(name) | string \| undefined | Default base URL for a driver |
| getDriverTestModel(name) | string | Test model for provider connectivity checks |
Driver catalog (v0)
| Driver | Label | API shape | Default base URL | Local | Auth modes | Test model |
|---|---|---|---|---|---|---|
| anthropic | Anthropic | anthropic-messages | https://api.anthropic.com | no | api_key | claude-sonnet-4-20250514 |
| openai | OpenAI | openai-completions | https://api.openai.com | no | api_key | gpt-4.1-mini |
| google | Google (Gemini) | google-generative-ai | https://generativelanguage.googleapis.com | no | api_key | gemini-2.0-flash |
| xai | xAI (Grok) | openai-completions | https://api.x.ai | no | api_key | grok-3-mini-fast |
| groq | Groq | openai-completions | https://api.groq.com/openai | no | api_key | llama-3.3-70b-versatile |
| deepseek | DeepSeek | openai-completions | https://api.deepseek.com | no | api_key | deepseek-chat |
| mistral | Mistral | openai-completions | https://api.mistral.ai | no | api_key | mistral-small-latest |
| fireworks | Fireworks AI | openai-completions | https://api.fireworks.ai/inference | no | api_key | — |
| together | Together AI | openai-completions | https://api.together.xyz | no | api_key | — |
| cerebras | Cerebras | openai-completions | https://api.cerebras.ai | no | api_key | — |
| openrouter | OpenRouter | openai-completions | https://openrouter.ai/api | no | api_key | openai/gpt-4.1-mini |
| ollama | Ollama | openai-completions | http://127.0.0.1:11434 | yes | none, api_key | llama3.2 |
| vllm | vLLM | openai-completions | http://127.0.0.1:8000 | yes | none, api_key | — |
| lm-studio | LM Studio | openai-completions | http://127.0.0.1:1234 | yes | none, api_key | — |
| litellm | LiteLLM | openai-completions | http://localhost:4000 | yes | none, api_key | — |
| antigravity | Antigravity | antigravity | https://cloudcode-pa.googleapis.com | no | oauth | gemini-2.5-flash |
| codex | OpenAI Codex | openai-codex-responses | https://chatgpt.com/backend-api | no | oauth | gpt-5-codex |
| copilot | GitHub Copilot | openai-completions | https://api.githubcopilot.com | no | oauth | gpt-4o |
| zai | Z.AI (GLM Coding Plan) | anthropic-messages | https://api.z.ai/api/anthropic | no | api_key | glm-5.1 |
Local drivers list none first in authModes — the UI uses this to suppress the credentials section by default.
UI integration
The daemon's GET /api/v1/local/providers response includes known_drivers: DriverInfo[]. The UI consumes this to:
- Populate driver <select> with d.label as display text, d.name as value.
- Pre-fill base URL from d.defaultBaseUrl when the operator selects a driver.
- Conditionally render credentials: hidden when the driver's authModes does not include api_key.
- Show contextual badges: "local" for d.local === true drivers without a key, red "no key" warning for remote drivers missing credentials.
No driver metadata is hardcoded in the UI. Adding a new driver to provider-map.ts is sufficient — the UI picks it up automatically.
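The rendering rules in this section could be expressed as two small pure functions. This is only a sketch of the stated rules, not the UI's actual code: DriverView, showCredentials, and badgeFor are hypothetical names, and the hasKey flag stands in for whatever the UI uses to know that credentials are stored.

```typescript
type DriverAuthMode = "api_key" | "oauth" | "none";

// Only the DriverInfo fields this sketch reads.
interface DriverView {
  name: string;
  local: boolean;
  authModes: DriverAuthMode[];
}

type Badge = "local" | "no key" | null;

// Credentials section is hidden when the driver cannot take an API key.
function showCredentials(d: DriverView): boolean {
  return d.authModes.includes("api_key");
}

// "local" for keyless local drivers, "no key" for keyless remote drivers.
function badgeFor(d: DriverView, hasKey: boolean): Badge {
  if (hasKey) return null;
  return d.local ? "local" : "no key";
}

const ollamaView: DriverView = { name: "ollama", local: true, authModes: ["none", "api_key"] };
const codexView: DriverView = { name: "codex", local: false, authModes: ["oauth"] };
```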
Model discovery
@kaged/llm provides two functions for model catalog workflows. The daemon's model catalog endpoints (http-api.md § Model catalog) consume these; persistence is handled by @kaged/local-config.
listModels(options)
Fetches the live model list from a provider's API. Supports all five API shapes:
- OpenAI-compatible (openai-completions, openai-responses): GET /v1/models, extracts from data[] array.
- Anthropic (anthropic-messages): GET /v1/models with pagination (after_id, up to 10 pages of 100).
- Google (google-generative-ai): GET /v1beta/models with pagination (pageToken, up to 25 pages), filtered to models supporting generateContent. Strips models/ prefix from IDs.
- Antigravity (antigravity): Returns { ok: false } with an informational error; Antigravity does not expose a model listing endpoint. Models must be configured in the project DSL.
interface ListModelsOptions {
driver: string;
apiKey: string;
baseUrl?: string;
signal?: AbortSignal;
}
interface ListModelsResult {
ok: boolean;
models: ModelInfo[];
error?: string;
}
interface ModelInfo {
id: string;
name: string;
}
function listModels(options: ListModelsOptions): Promise<ListModelsResult>;
Returns { ok: false, error } on unknown driver, missing base URL, HTTP errors, or fetch failures. Never throws — all errors are captured in the result.
Models are sorted alphabetically by id. Names come from provider metadata when available (display_name for Anthropic, displayName for Google); fall back to the raw id.
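To make the normalization step concrete, here is a sketch of how an OpenAI-compatible /v1/models payload might be turned into sorted ModelInfo entries. parseOpenAiModels is a hypothetical helper, and the { data: [{ id }] } payload shape is assumed from the description above; malformed input yields an empty list rather than an exception, in keeping with the never-throws contract.

```typescript
interface ModelInfo {
  id: string;
  name: string;
}

// Hypothetical helper: extract ids from `data[]`, skipping malformed entries.
function parseOpenAiModels(payload: unknown): ModelInfo[] {
  if (typeof payload !== "object" || payload === null) return [];
  const data = (payload as { data?: unknown }).data;
  if (!Array.isArray(data)) return [];
  const models: ModelInfo[] = [];
  for (const entry of data) {
    const id = (entry as { id?: unknown } | null)?.id;
    // OpenAI-style entries carry no display name, so fall back to the raw id.
    if (typeof id === "string") models.push({ id, name: id });
  }
  // Plain code-unit comparison keeps the ordering locale-independent.
  return models.sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
}

const parsedModels = parseOpenAiModels({
  data: [{ id: "gpt-4.1-mini" }, { id: "babbage-002" }, { object: "model" }],
});
```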
humanizeModelId(id)
Generates a human-readable display name from a model identifier. Used as the fallback when no explicit name is stored in the operator's model catalog.
function humanizeModelId(id: string): string;
Behavior:
- Replaces hyphens and underscores with spaces.
- Collapses consecutive separators.
- Title-cases each word (first letter uppercase).
- Preserves dots and version numbers.
Examples:
| Input | Output |
|---|---|
| claude-sonnet-4-20250514 | Claude Sonnet 4 20250514 |
| gpt-4.1-mini | Gpt 4.1 Mini |
| gemini-2.0-flash | Gemini 2.0 Flash |
| deepseek-chat | Deepseek Chat |
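One possible implementation that satisfies the listed behavior and reproduces the four examples is sketched below. It is an assumption about how the rules compose, not the package's actual source.

```typescript
function humanizeModelId(id: string): string {
  return id
    .split(/[-_]+/) // hyphens and underscores are separators; runs collapse
    .filter((word) => word.length > 0) // drop empties from leading/trailing separators
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1)) // dots and digits untouched
    .join(" ");
}

const humanizedExamples = [
  "claude-sonnet-4-20250514",
  "gpt-4.1-mini",
  "gemini-2.0-flash",
  "deepseek-chat",
].map(humanizeModelId);
```

Because only the first character of each word is changed, version fragments such as 4.1 and 2.0 pass through unmodified.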
Model metadata catalog
@kaged/llm is the single source of truth for model metadata — capabilities, pricing, and context limits. The daemon and UI consume this metadata via the shared types; they never parse pricing data themselves.
Data source
The base catalog is the LiteLLM model_prices_and_context_window.json — a community-maintained JSON file covering 1000+ models across all major providers. @kaged/llm ships a bundled snapshot of this file as the base catalog. The snapshot is updated periodically (at release time, not at runtime in v0).
The LiteLLM JSON keys are "provider/model-id" (e.g. "anthropic/claude-sonnet-4-20250514", "openai/gpt-4.1-mini"). Each entry contains a superset of the fields below; @kaged/llm extracts only the fields it needs.
ModelMeta
The extracted metadata type for a single model. This is the shape the daemon and UI consume — never the raw LiteLLM JSON.
interface ModelMeta {
/** LiteLLM key (e.g. "anthropic/claude-sonnet-4-20250514"). */
key: string;
/** LiteLLM provider identifier (e.g. "anthropic", "openai", "vertex_ai"). */
litellmProvider: string;
/** Model mode — only "chat" models are relevant for kaged's agent loop. */
mode: string;
// --- Context limits ---
maxInputTokens: number | null;
maxOutputTokens: number | null;
// --- Pricing (USD per token) ---
pricing: {
input: number; // input_cost_per_token
output: number; // output_cost_per_token
reasoning: number | null; // output_cost_per_reasoning_token (null = same as output)
cacheRead: number | null; // cache_read_input_token_cost
cacheWrite: number | null; // cache_creation_input_token_cost
};
// --- Capabilities ---
capabilities: {
reasoning: boolean; // supports_reasoning
vision: boolean; // supports_vision
functionCalling: boolean; // supports_function_calling
streaming: boolean; // true for all chat models (kaged assumption)
promptCaching: boolean; // supports_prompt_caching
responseSchema: boolean; // supports_response_schema
systemMessages: boolean; // supports_system_messages
webSearch: boolean; // supports_web_search
audioInput: boolean; // supports_audio_input
audioOutput: boolean; // supports_audio_output
pdf: boolean; // supports_pdf_input
};
// --- Deprecation ---
deprecationDate: string | null; // ISO 8601 date string or null
}
Fields not present in the LiteLLM entry default to null (limits, optional pricing) or false (capabilities). pricing.reasoning defaults to null, meaning the caller should fall back to pricing.output for reasoning tokens when computing cost.
Key normalization
LiteLLM keys use a "litellm_provider/model-id" format that doesn't always match kaged's "provider:model" convention (kaged uses : as separator, and provider names are kaged-defined — see § API shape resolution). The catalog provides a lookup function that accepts kaged's native format:
/** Look up model metadata by kaged's "provider:model" identifier. */
function lookupModelMeta(provider: string, modelId: string): ModelMeta | null;
The function maps kaged provider names to LiteLLM provider prefixes internally (e.g. "anthropic" → "anthropic/", "openai" → "openai/", "google" → "gemini/", "xai" → "xai/", "deepseek" → "deepseek/", "groq" → "groq/", "openrouter" → "openrouter/", etc.). If no match is found, returns null — the caller (harness, daemon) proceeds without metadata. Missing metadata is never a fatal error.
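The prefix mapping can be pictured as a small table lookup. The sketch below covers only some of the prefixes quoted above; toLitellmKey is a hypothetical helper, and treating an unmapped provider as "no match" (null) is an assumption about how lookupModelMeta behaves internally.

```typescript
// Prefixes quoted in the text above; the remaining providers are omitted here.
const LITELLM_PREFIX = new Map<string, string>([
  ["anthropic", "anthropic/"],
  ["openai", "openai/"],
  ["google", "gemini/"],
  ["xai", "xai/"],
]);

// Hypothetical helper: build the LiteLLM catalog key, or null when unmapped.
function toLitellmKey(provider: string, modelId: string): string | null {
  const prefix = LITELLM_PREFIX.get(provider);
  return prefix === undefined ? null : prefix + modelId;
}

const geminiKey = toLitellmKey("google", "gemini-2.0-flash");
const unmappedKey = toLitellmKey("my-proxy", "some-model");
```

Note that the kaged name google maps to the LiteLLM prefix gemini/, which is why the lookup cannot simply concatenate provider and model ID.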
calculateCost
A pure utility that computes the dollar cost of a completed LLM call from token counts and pricing metadata.
interface CostInput {
usage: Usage;
meta: ModelMeta | null;
}
interface CostBreakdown {
input: number;
output: number;
reasoning: number;
cacheRead: number;
cacheWrite: number;
total: number;
}
function calculateCost(input: CostInput): CostBreakdown;
If meta is null (unknown model), all costs are 0. If meta.pricing.reasoning is null, reasoning tokens are priced at meta.pricing.output. The function is pure — no side effects, no network calls.
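The arithmetic can be sketched as follows. The function is named calculateCostSketch to make clear it is illustrative; it takes the pricing object directly rather than a full ModelMeta. Two points are assumptions not stated in this spec: null cache prices are treated as zero, and reasoning tokens are billed as a separate bucket rather than being counted inside output.

```typescript
interface Pricing {
  input: number;
  output: number;
  reasoning: number | null;
  cacheRead: number | null;
  cacheWrite: number | null;
}

interface TokenCounts {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  reasoningTokens?: number;
}

interface CostBreakdown {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  total: number;
}

function calculateCostSketch(usage: TokenCounts, pricing: Pricing | null): CostBreakdown {
  if (pricing === null) {
    // Unknown model: every bucket is zero.
    return { input: 0, output: 0, reasoning: 0, cacheRead: 0, cacheWrite: 0, total: 0 };
  }
  const input = usage.input * pricing.input;
  const output = usage.output * pricing.output;
  // A null reasoning price falls back to the output price.
  const reasoning = (usage.reasoningTokens ?? 0) * (pricing.reasoning ?? pricing.output);
  const cacheRead = usage.cacheRead * (pricing.cacheRead ?? 0); // assumption: null means free
  const cacheWrite = usage.cacheWrite * (pricing.cacheWrite ?? 0); // assumption: null means free
  const total = input + output + reasoning + cacheRead + cacheWrite;
  return { input, output, reasoning, cacheRead, cacheWrite, total };
}

// Round prices keep the arithmetic easy to follow; real prices are far smaller.
const demoCost = calculateCostSketch(
  { input: 1000, output: 200, cacheRead: 0, cacheWrite: 0, reasoningTokens: 100 },
  { input: 0.5, output: 2, reasoning: null, cacheRead: null, cacheWrite: null },
);
```

With these illustrative numbers the buckets are 1000 × 0.5 = 500 for input, 200 × 2 = 400 for output, and 100 × 2 = 200 for reasoning (priced at the output rate), for a total of 1100.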
Operator overrides and the resolution pipeline
Per ADR-0026, operators can override any metadata field per provider+model. Overrides are stored in the DB (@kaged/storage, model_overrides table), not in local.toml. The resolution order:
- Operator override from DB (highest priority)
- Bundled LiteLLM snapshot (lowest priority)
This allows operators to correct stale pricing, fix wrong context windows, or add metadata for models not in the LiteLLM catalog (e.g. self-hosted fine-tunes behind Ollama/vLLM).
Override storage shape
interface ModelOverride {
provider: string; // kaged provider name (e.g. "anthropic", "ollama")
modelId: string; // model ID (e.g. "claude-sonnet-4-20250514", "my-llama3")
field: string; // field name from ModelMeta (e.g. "maxInputTokens", "pricing.input")
value: string; // JSON-encoded value (number, boolean, string, or null)
updatedAt: number; // epoch ms
}
Sparse — only overridden fields have rows. The primary key is (provider, modelId, field).
Overridable fields
All scalar fields on ModelMeta:
| Field | Type | Notes |
|---|---|---|
| maxInputTokens | number \| null | Context window for compaction thresholds |
| maxOutputTokens | number \| null | Max output tokens |
| pricing.input | number | USD per input token |
| pricing.output | number | USD per output token |
| pricing.reasoning | number \| null | USD per reasoning token |
| pricing.cacheRead | number \| null | USD per cache read token |
| pricing.cacheWrite | number \| null | USD per cache write token |
| capabilities.reasoning | boolean | |
| capabilities.vision | boolean | |
| capabilities.functionCalling | boolean | |
| capabilities.promptCaching | boolean | |
| capabilities.responseSchema | boolean | |
| capabilities.systemMessages | boolean | |
| capabilities.webSearch | boolean | |
| capabilities.audioInput | boolean | |
| capabilities.audioOutput | boolean | |
| capabilities.pdf | boolean | |
| deprecationDate | string \| null | |
| tokenizer | string | "tiktoken" \| "gemini" \| "llama" \| "unknown" |
Nested fields use dot notation in the field column (e.g. pricing.input, capabilities.vision).
resolveModelMeta
The merge function. Replaces lookupModelMeta for callers that need override-aware metadata (harness, daemon, compaction).
interface ResolvedModelMeta {
meta: ModelMeta; // the merged result
sources: Record<string, "override" | "default">; // per-field origin tracking
}
function resolveModelMeta(
provider: string,
modelId: string,
overrides: ModelOverride[],
): ResolvedModelMeta;
Behavior:
- Start with the LiteLLM default (lookupModelMeta). If no LiteLLM entry exists, start from a null-default ModelMeta (all fields null/false, key and litellmProvider synthesized from inputs).
- For each override, apply the value to the corresponding field. Dot-notation fields set nested values (e.g. pricing.input sets meta.pricing.input).
- The sources map tracks which fields came from overrides vs defaults, enabling the UI to render visual distinction.
Models not in LiteLLM. When lookupModelMeta returns null, the override system builds a ModelMeta entirely from overrides. Missing fields default to null (numeric), false (capabilities), or "unknown" (tokenizer). The key field is set to {provider}/{modelId}, litellmProvider is set to the provider name.
Example: Override context window for a self-hosted model
const overrides: ModelOverride[] = [
{ provider: "ollama", modelId: "llama3.1:70b", field: "maxInputTokens", value: "131072", updatedAt: Date.now() },
{ provider: "ollama", modelId: "llama3.1:70b", field: "pricing.input", value: "0", updatedAt: Date.now() },
{ provider: "ollama", modelId: "llama3.1:70b", field: "pricing.output", value: "0", updatedAt: Date.now() },
];
const resolved = resolveModelMeta("ollama", "llama3.1:70b", overrides);
// resolved.meta.maxInputTokens === 131072 (from override)
// resolved.meta.pricing.input === 0 (from override)
// resolved.meta.capabilities.vision === false (default, no override)
// resolved.sources["maxInputTokens"] === "override"
// resolved.sources["capabilities.vision"] === "default"
Example: Correct stale LiteLLM pricing
const overrides: ModelOverride[] = [
{ provider: "anthropic", modelId: "claude-sonnet-4-20250514", field: "pricing.input", value: "0.000003", updatedAt: Date.now() },
];
const resolved = resolveModelMeta("anthropic", "claude-sonnet-4-20250514", overrides);
// resolved.meta.pricing.input === 0.000003 (override wins)
// resolved.meta.maxInputTokens === <from LiteLLM> (no override, uses default)
// resolved.sources["pricing.input"] === "override"
// resolved.sources["maxInputTokens"] === "default"
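For readers curious how the dot-notation merge might work internally, the following sketch applies override rows to a plain object. mergeOverrides and OverrideRow are hypothetical names, the base object is a cut-down stand-in for ModelMeta, and the sketch only records overridden fields in sources (anything absent is implicitly a default), which is a simplification of the per-field tracking described above.

```typescript
// One row of the sparse override table (only the columns this sketch reads).
interface OverrideRow {
  field: string; // dot-notation path, e.g. "pricing.input"
  value: string; // JSON-encoded value
}

type Source = "override" | "default";

// Hypothetical helper: apply override rows on top of a base metadata object.
// Works on a deep copy so the bundled defaults are never mutated.
function mergeOverrides(
  base: Record<string, unknown>,
  rows: OverrideRow[],
): { meta: Record<string, unknown>; sources: Record<string, Source> } {
  const meta = JSON.parse(JSON.stringify(base)) as Record<string, unknown>;
  const sources: Record<string, Source> = {};
  for (const row of rows) {
    const path = row.field.split(".");
    const leaf = path.pop() ?? row.field; // split() always yields at least one part
    let node = meta;
    for (const key of path) {
      const next = node[key];
      if (typeof next !== "object" || next === null) node[key] = {};
      node = node[key] as Record<string, unknown>;
    }
    node[leaf] = JSON.parse(row.value); // values are stored JSON-encoded
    sources[row.field] = "override";
  }
  return { meta, sources };
}

const baseMeta = { maxInputTokens: null, pricing: { input: 1, output: 2 } };
const merged = mergeOverrides(baseMeta, [
  { field: "maxInputTokens", value: "131072" },
  { field: "pricing.input", value: "0" },
]);
```

A production version would also want to validate field against the list of overridable fields before writing, since the path comes from stored data.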
What this is NOT
- Not a runtime fetcher. v0 does not fetch the LiteLLM JSON at runtime. The bundled snapshot is the base. A future version may add periodic refresh with a configurable interval and local cache file.
- Not exhaustive. The catalog covers models present in LiteLLM's dataset. Local/self-hosted models (Ollama, vLLM, LM Studio) are unlikely to appear; their metadata comes from operator config or defaults to null.
- Not the model list. listModels() fetches live model IDs from provider APIs. lookupModelMeta() enriches those IDs with static metadata from the bundled catalog. They are independent: a model can appear in listModels without metadata, and the catalog can contain models the operator hasn't provisioned.
Token estimation
Per ADR-0024, the harness needs to estimate token usage before each LLM call to decide whether compaction should fire. @kaged/llm exposes the estimator.
estimateTokens
interface EstimateInput {
messages: Message[]; // the candidate message list
systemPrompt: string | string[]; // system prompt(s)
modelMeta: ModelMeta | null; // resolved via lookupModelMeta
reservedOutputTokens?: number; // budget reserved for the LLM's response (default 4096)
}
interface EstimateResult {
inputTokens: number; // estimated input tokens (messages + system)
reservedOutputTokens: number; // echoed back; the harness uses this for the threshold check
totalTokens: number; // inputTokens + reservedOutputTokens
fraction: number; // totalTokens / modelMeta.contextWindow
contextWindow: number | null; // from modelMeta; null if metadata unavailable
algorithm: "tiktoken" | "fallback"; // which estimator was used
}
function estimateTokens(input: EstimateInput): EstimateResult;
Behavior:
- Algorithm preference. When modelMeta.tokenizer === "tiktoken" (most OpenAI and Anthropic models map cleanly), the estimator uses a local tiktoken implementation. When the model uses a different tokenizer (Gemini, Llama, etc.) or modelMeta is null, the estimator falls back to a character-count heuristic (chars / 3.5, rounded up). The algorithm field reports which was used.
- Conservative. The estimator over-estimates rather than under-estimates. Wrong-direction errors (estimating too few tokens, then hitting the provider's context-length limit) are caught by the reactive-fallback path in the harness, but they are expensive, so the estimator errs on the side of caution.
- System prompt counts. All system prompt content (including plugin-injected memory wrapped in <plugin:NAME> blocks) is counted.
- Tool calls and results. Counted as part of the message they belong to. Tool-result bodies can be large; the estimator includes them in full.
- reservedOutputTokens default. 4096. The harness can override per-call based on the operator's expected response length (a one-shot summarizer call might reserve 1500; a long-form coding session might reserve 8000).
fraction is the operative number. The harness compares fraction against the agent's configured upper threshold (default 0.85 per ADR-0024). When fraction >= upper_threshold, the harness triggers compaction.
contextWindow: null handling. When the resolved model is unknown to the metadata catalog (local Ollama model, brand-new release not yet in the bundled snapshot), modelMeta is null and contextWindow is null. The harness falls back to a conservative default (32k tokens) and emits a warning per call. Operators with unknown models should add metadata via local-config overrides (future work; not v0).
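A worked example of the fallback path and the threshold check may help. The sketch below implements only the chars / 3.5 heuristic and the fraction comparison; fallbackTokenCount and shouldCompact are hypothetical names, and the 128,000-token context window is an illustrative figure rather than a value from the catalog.

```typescript
// Fallback heuristic quoted in the text: characters / 3.5, rounded up.
function fallbackTokenCount(text: string): number {
  return Math.ceil(text.length / 3.5);
}

// Hypothetical helper mirroring the harness check: fraction >= threshold.
function shouldCompact(
  inputTokens: number,
  reservedOutputTokens: number,
  contextWindow: number,
  upperThreshold = 0.85,
): { totalTokens: number; fraction: number; compact: boolean } {
  const totalTokens = inputTokens + reservedOutputTokens;
  const fraction = totalTokens / contextWindow;
  return { totalTokens, fraction, compact: fraction >= upperThreshold };
}

// 350,000 characters give 100,000 estimated tokens on the fallback path.
const estimatedInput = fallbackTokenCount("x".repeat(350000));
// 100,000 + 4,096 reserved = 104,096 of a 128,000-token window (about 0.813).
const thresholdCheck = shouldCompact(estimatedInput, 4096, 128000);
```

Here the fraction stays below the default 0.85 threshold, so compaction would not fire; roughly 4,700 more input tokens would tip it over.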
Example use from harness
import { estimateTokens, lookupModelMeta } from "@kaged/llm";
const modelMeta = lookupModelMeta(route.providerName, route.modelId);
const estimate = estimateTokens({
messages: candidateList,
systemPrompt: assembledSystemPrompt,
modelMeta,
reservedOutputTokens: agentConfig.compaction?.reservedOutputTokens ?? 4096,
});
if (estimate.contextWindow !== null && estimate.fraction >= agentConfig.compaction.upper_threshold) {
await runCompactionPipeline({ /* ... */ });
}
Performance
- Tiktoken path: ~1-3ms for a typical session message list (50-200 messages).
- Fallback path: ~0.1ms (string-length arithmetic).
Estimation runs before every LLM call. The cost is acceptable; it does not dominate latency.
tokenizer field on ModelMeta
ModelMeta is extended with an optional tokenizer: "tiktoken" | "gemini" | "llama" | "unknown" field. The bundled LiteLLM snapshot is augmented with this field where it can be determined from the LiteLLM data; where it cannot, the field defaults to "unknown" and the estimator uses the fallback path.
Out of scope
- Exact token counting via the provider's native tokenizer API. v0 estimates locally. Some providers offer a /tokenize endpoint; the harness may use these in the reactive-fallback path in a future amendment.
- Per-prompt warm-cache of token counts. The estimator is stateless; computing the same message list twice does the work twice. A future cache keyed on message content hashes is plausible if profiling shows it matters.
Provider usage reporting
@kaged/llm exposes a usage reporting interface for querying provider quota and consumption data. The daemon calls these fetchers and relays the results to the UI for budget dashboards and routing/backoff decisions.
Types
All usage types are kaged's own. They provide a normalized schema for representing provider quota limits, consumption windows, and budget status.
type UsageUnit = "percent" | "tokens" | "requests" | "usd" | "minutes" | "bytes" | "unknown";
type UsageStatus = "ok" | "warning" | "exhausted" | "unknown";
interface UsageWindow {
id: string; // stable identifier (e.g. "quota", "5h", "7d")
label: string; // human label (e.g. "Quota", "5 Hour", "7 Day")
durationMs?: number; // window duration when known
resetsAt?: number; // absolute reset timestamp (ms since epoch)
}
interface UsageAmount {
used?: number;
limit?: number;
remaining?: number;
usedFraction?: number; // 0..1
remainingFraction?: number; // 0..1
unit: UsageUnit;
}
interface UsageScope {
provider: string;
accountId?: string;
projectId?: string;
modelId?: string;
tier?: string;
windowId?: string;
shared?: boolean; // quota shared across models in the provider
}
interface UsageLimit {
id: string; // unique per limit entry (e.g. "zai:tokens", "gemini-3-flash:free:default")
label: string; // display label
scope: UsageScope;
window?: UsageWindow;
amount: UsageAmount;
status?: UsageStatus;
}
interface UsageReport {
provider: string;
fetchedAt: number; // epoch ms
limits: UsageLimit[];
metadata?: Record<string, unknown>;
}
interface UsageFetchOptions {
apiKey?: string; // for API-key-authenticated providers (zai)
accessToken?: string; // for OAuth providers (antigravity)
projectId?: string; // required by some OAuth providers
baseUrl?: string;
signal?: AbortSignal;
}
UsageReport is the primary shape the UI consumes. A single report contains multiple UsageLimit entries — one per quota window or limit type the provider exposes. The UI renders these as a budget dashboard: progress bars from usedFraction, status badges from status, reset countdowns from window.resetsAt.
Fetchers
Each provider with a quota endpoint gets a dedicated fetcher function. Fetchers are async, return null on failure (non-ok response, invalid payload, missing credentials), and never throw — the daemon handles the null gracefully.
| Fetcher | Provider | Auth | Endpoint |
|---|---|---|---|
| fetchZaiUsage(options) | zai | apiKey | GET /api/monitor/usage/quota/limit on https://api.z.ai |
| fetchAntigravityUsage(options) | antigravity | accessToken + projectId | POST /v1internal:fetchAvailableModels on https://cloudcode-pa.googleapis.com |
fetchZaiUsage
Fetches the Z.AI coding plan quota. The API returns TOKENS_LIMIT and TIME_LIMIT entries, each with usage (limit), currentValue (used), percentage, remaining, and nextResetTime. The fetcher maps these to two UsageLimit entries:
- zai:tokens (token consumption quota, unit "tokens").
- zai:requests (request/time quota, unit "requests").
Both share the same UsageWindow (id: "quota") with resetsAt from the API's nextResetTime. Status classification: exhausted when usedFraction >= 1, warning when >= 0.9, ok otherwise.
The base URL is extracted from the baseUrl origin (strips the /api/anthropic path used for the LLM endpoint). Authorization header carries the raw API key (no Bearer prefix — Z.AI's monitor endpoint expects the key directly).
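The status thresholds and the origin extraction described here are simple enough to sketch. classifyUsed and monitorOrigin are hypothetical names; returning "unknown" for a missing fraction is an assumption, since the text only defines the three numeric cases.

```typescript
type UsageStatus = "ok" | "warning" | "exhausted" | "unknown";

// Thresholds from the text: >= 1 exhausted, >= 0.9 warning, otherwise ok.
function classifyUsed(usedFraction: number | undefined): UsageStatus {
  if (usedFraction === undefined || Number.isNaN(usedFraction)) return "unknown"; // assumption
  if (usedFraction >= 1) return "exhausted";
  if (usedFraction >= 0.9) return "warning";
  return "ok";
}

// The monitor endpoint lives on the origin, so the LLM path is dropped.
function monitorOrigin(baseUrl: string): string {
  return new URL(baseUrl).origin;
}

const zaiOrigin = monitorOrigin("https://api.z.ai/api/anthropic");
const zaiStatus = classifyUsed(0.95);
```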
fetchAntigravityUsage
Fetches per-model quota from Antigravity's fetchAvailableModels endpoint. The response contains a models map where each model has quotaInfo / quotaInfos / quotaInfoByTier — each with remainingFraction (0..1) and resetTime.
The fetcher normalizes the nested quota info into flat UsageLimit entries:
- Per-model, per-tier, per-window. ID format: {modelId}:{tier}:{windowId}. Label includes display name and tier when present.
- unit: "percent". Antigravity reports fractions, not absolute token counts. used and remaining are expressed as percentages (0–100).
- status classification. exhausted when remainingFraction <= 0, warning when <= 0.1, ok otherwise.
- earliestReset. The report's metadata includes the earliest reset time across all limits, so the UI can show a single countdown.
Requires accessToken and projectId in UsageFetchOptions. Returns null if either is missing or the request fails.
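The normalization rules above could look roughly like the following. The type and function names (QuotaInfo, FlatLimit, toLimit, earliestReset) and the rounding to whole percentages are assumptions made for illustration; the real UsageLimit shape lives in usage-types.ts.

```typescript
// Illustrative stand-ins, not the package's actual types.
type Status = "ok" | "warning" | "exhausted";
interface QuotaInfo { remainingFraction: number; resetTime?: string }
interface FlatLimit {
  id: string;
  unit: "percent";
  used: number;
  remaining: number;
  status: Status;
  resetsAt?: string;
}

function classifyRemaining(remainingFraction: number): Status {
  if (remainingFraction <= 0) return "exhausted";
  if (remainingFraction <= 0.1) return "warning";
  return "ok";
}

// Flatten one quota entry into a percent-based limit with the {modelId}:{tier}:{windowId} ID.
function toLimit(modelId: string, tier: string, windowId: string, q: QuotaInfo): FlatLimit {
  const f = Math.min(1, Math.max(0, q.remainingFraction)); // clamp to 0..1
  return {
    id: `${modelId}:${tier}:${windowId}`,
    unit: "percent",
    used: Math.round((1 - f) * 100), // rounding is an assumption of this sketch
    remaining: Math.round(f * 100),
    status: classifyRemaining(f),
    resetsAt: q.resetTime,
  };
}

// Earliest reset time across all limits, for a single UI countdown.
function earliestReset(limits: FlatLimit[]): string | undefined {
  const times = limits
    .map((l) => l.resetsAt)
    .filter((t): t is string => typeof t === "string");
  times.sort((a, b) => Date.parse(a) - Date.parse(b));
  return times[0];
}
```

The tier and window labels passed to toLimit are whatever the response provides; this sketch does not guess at their values.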
Adding new fetchers
To add a fetcher for a new provider:
- Create packages/llm/src/usage/<provider>.ts with an exported async function fetch<Provider>Usage(options: UsageFetchOptions): Promise<UsageReport | null>.
- Export from packages/llm/src/index.ts.
- Add a row to the fetcher table above.
- The daemon wires it to the appropriate credential source and polling interval.
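As a rough template for the first step, a new fetcher might follow the shape below. The provider name "example", the /quota path, the payload shape, and the injectable fetchImpl parameter are all hypothetical; the local interfaces are stand-ins for the real types in usage-types.ts. The point of the sketch is the contract: return null on any failure and never throw.

```typescript
// Local stand-ins; the real UsageFetchOptions and UsageReport are richer.
interface UsageFetchOptions { apiKey?: string; baseUrl?: string }
interface UsageReport { provider: string; limits: unknown[] }
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

async function fetchExampleUsage(
  options: UsageFetchOptions,
  fetchImpl: FetchLike = fetch as unknown as FetchLike, // injectable for tests
): Promise<UsageReport | null> {
  // Validate this fetcher's own credential requirements first.
  if (!options.apiKey || !options.baseUrl) return null;
  try {
    const res = await fetchImpl(`${options.baseUrl}/quota`, {
      headers: { Authorization: options.apiKey },
    });
    if (!res.ok) return null; // non-ok response
    const payload = await res.json();
    const limits = (payload as { limits?: unknown } | null)?.limits;
    if (!Array.isArray(limits)) return null; // invalid payload
    return { provider: "example", limits };
  } catch {
    return null; // network or parse failure: the daemon handles null
  }
}
```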
What this is NOT
- Not a local accumulator. These fetchers query the provider's own quota API. They don't track usage locally by counting tokens in SQLite — that's a separate concern the daemon may add later for providers without quota endpoints.
- Not polled automatically. The daemon decides when to fetch (on session start, on a timer, after a 429). The fetcher is a pure query function.
- Not a routing gate. The daemon/harness may use usage data to inform fallback decisions, but the fetcher itself doesn't block or reroute calls.
Provider adapter contract
Each API shape is implemented by a single adapter function:
type ProviderStreamFn = (
  route: ProviderRoute,
  context: Context,
  options: StreamOptions,
) => LlmEventStream;
Four adapters ship in v0:
1. anthropic-messages: POST /v1/messages with stream: true. SSE events: message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop.
2. openai-completions: POST /v1/chat/completions with stream: true. SSE data: lines with choices[0].delta.
3. openai-responses: POST /v1/responses with stream: true. SSE events: response.created, response.output_item.added, response.content_part.delta, response.output_item.done, response.completed. Known gap (tech debt): the generic openai-responses adapter is a stub that requests reasoning (body.reasoning = { effort }) but does not parse the model's reasoning output. It has no handling for response.reasoning_summary_text.* events or reasoning output items, so thinking from a reasoning model routed through this adapter is silently dropped (it never enters partial.content, live or persisted). The openai-codex-responses adapter (item 6) implements reasoning capture correctly and is the reference for closing this gap. See STATUS.md § Known tech debt.
4. google-generative-ai: POST /v1beta/models/{model}:streamGenerateContent?alt=sse. SSE data: lines with JSON chunks.
Two further adapters ship alongside v0, one for Antigravity (Google Cloud Code proxy) and one for ChatGPT-account Codex access:
5. antigravity: POST /v1internal:streamGenerateContent?alt=sse. Model ID in body (not path). Request wrapped in Antigravity envelope ({ model, request: { ...googleGenAiBody } }). Bearer token auth (OAuth). Antigravity-specific headers (User-Agent, X-Goog-Api-Client, Client-Metadata). SSE payloads follow google-generative-ai, but framing is line-based rather than \n\n-delimited (see the 2026-06-03 amendment). Rate-limit-aware error handling: extracts RateLimitInfo from 429 responses (headers, structured error body, Go-style duration parsing). Usage extracted from every SSE frame for real-time consumption tracking. Thinking budget differs by model family (Claude vs Gemini). Tool declarations stay in Antigravity functionDeclarations form for all model families, but schema normalization is model-family-specific: Gemini schemas are reduced to the Google Schema proto subset, while Claude schemas use the battle-tested Antigravity JSON Schema subset from reference/opencode-antigravity-auth (unsupported constraints become description hints, invalid unions are flattened, empty object parameters receive a placeholder property).
6. openai-codex-responses: POST /codex/responses on https://chatgpt.com/backend-api with stream: true. OAuth via ChatGPT accounts (auth.openai.com, client ID app_EMoamEEZ73f0CkXaXp7hrann, PKCE + device-code flows). The access token is a JWT; the chatgpt_account_id claim is extracted and sent as the mandatory chatgpt-account-id header on every request. Uses input array (not messages) with OpenAI Responses API event types. Custom headers: OpenAI-Beta: responses=experimental, originator, session_id. Supports reasoning config (effort levels none/minimal/low/medium/high/xhigh + summary mode). Rate limits extracted from response headers (x-codex-primary-used-percent, x-codex-primary-window-minutes, x-codex-primary-reset-at, and secondary equivalents). Error parsing decodes usage_limit_reached and rate_limit_exceeded with friendly messages and reset times. Tool calls disable parallel execution (parallel_tool_calls: false). Session-based prompt caching via prompt_cache_key + prompt_cache_retention.
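The Go-style duration parsing mentioned for Antigravity 429 responses can be illustrated as follows. This is a sketch of the general technique, not the adapter's code; parseGoDurationMs is a hypothetical name, and returning null for unparseable input is an assumption.

```typescript
// Milliseconds per Go duration unit.
const UNIT_MS: Record<string, number> = {
  ns: 1e-6, us: 1e-3, "µs": 1e-3, ms: 1, s: 1000, m: 60000, h: 3600000,
};

// Parse a Go-style compound duration such as "1h16m0.667s" into milliseconds.
function parseGoDurationMs(input: string): number | null {
  // Longer unit names are listed first so "ms" is not read as "m" followed by "s".
  const re = /(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)/g;
  let total = 0;
  let consumed = 0;
  let match: RegExpExecArray | null;
  while ((match = re.exec(input)) !== null) {
    if (match.index !== consumed) return null; // stray characters between segments
    total += parseFloat(match[1]) * UNIT_MS[match[2]];
    consumed = re.lastIndex;
  }
  // Reject empty input and trailing garbage.
  if (consumed === 0 || consumed !== input.length) return null;
  return Math.round(total);
}
```

A value parsed this way would naturally populate a field like RateLimitInfo.retryAfterMs.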
Each adapter:
- Builds the provider-specific request body from Context + StreamOptions.
- Sends fetch() with Accept: text/event-stream.
- Parses the SSE response into StreamEvent values pushed onto the LlmEventStream.
- Tracks Usage from provider-reported token counts.
- Maps provider stop reasons to kaged's StopReason.
- Handles AbortSignal by aborting the underlying fetch.
SSE parser
A shared SSE line parser handles the raw HTTP response body for every adapter except antigravity, which uses its own line-based parser (see the 2026-06-03 amendment). It:
- Reads the ReadableStream<Uint8Array> from the fetch response body.
- Splits on \n\n boundaries (SSE frame delimiter).
- Extracts event:, data:, and id: fields per frame.
- Yields { event: string | null, data: string } objects.
- Handles the [DONE] sentinel (OpenAI) and empty keepalive frames.
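The framing steps above can be sketched as a small buffer that accepts decoded text chunks. SseFrameBuffer and parseFrame are hypothetical names; the real parser reads bytes from the response body (a TextDecoder in streaming mode would feed a buffer like this) and also extracts id:, which this sketch skips.

```typescript
interface SseFrame { event: string | null; data: string }

// Parse one raw frame (the text between two blank-line delimiters).
function parseFrame(raw: string): SseFrame | null {
  let event: string | null = null;
  const dataLines: string[] = [];
  for (const line of raw.split("\n")) {
    if (line === "" || line.startsWith(":")) continue; // blank or comment line
    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.startsWith(" ")) value = value.slice(1); // one optional leading space
    if (field === "event") event = value;
    else if (field === "data") dataLines.push(value);
    // id: and retry: are ignored in this sketch.
  }
  if (dataLines.length === 0) return null; // keepalive frame
  const data = dataLines.join("\n");
  if (data === "[DONE]") return null; // OpenAI end-of-stream sentinel
  return { event, data };
}

class SseFrameBuffer {
  private buffer = "";

  // Feed decoded text; returns every frame completed by this chunk.
  push(chunk: string): SseFrame[] {
    // Normalize CRLF on the whole buffer so a "\r" split across chunks is still handled.
    this.buffer = (this.buffer + chunk).replace(/\r\n/g, "\n");
    const frames: SseFrame[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const frame = parseFrame(this.buffer.slice(0, boundary));
      this.buffer = this.buffer.slice(boundary + 2);
      if (frame) frames.push(frame);
      boundary = this.buffer.indexOf("\n\n");
    }
    return frames;
  }
}
```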
This is a rewrite of pi-ai's readSseEvents utility in pure TS with no external deps.
Partial-JSON parser
Tool call arguments stream incrementally. A partial-JSON parser provides best-effort parsing of incomplete JSON so the UI can show tool arguments as they arrive (e.g., file path before content is complete).
Behavior:
- Returns {} when no valid JSON prefix exists.
- Closes unclosed strings, arrays, and objects.
- Never throws; always returns the best parse possible.
- The final toolcall_end event uses standard JSON.parse (must succeed; error → StopReason: "error").
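One way to get this behavior is to balance the open delimiters and, if the result still does not parse, retry with progressively shorter prefixes. The sketch below takes that approach; closeJson and parsePartialJson are hypothetical names, the backtracking loop is quadratic in the worst case, and the package's actual algorithm may differ.

```typescript
// Append the closers needed to balance an incomplete JSON prefix.
function closeJson(prefix: string): string {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (let i = 0; i < prefix.length; i++) {
    const ch = prefix[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }
  return prefix + (inString ? '"' : "") + closers.reverse().join("");
}

// Best-effort parse of streamed tool arguments; never throws.
function parsePartialJson(input: string): Record<string, unknown> {
  for (let end = input.length; end > 0; end--) {
    try {
      const value: unknown = JSON.parse(closeJson(input.slice(0, end)));
      // Tool arguments are objects, so anything else is treated as "no valid prefix".
      if (typeof value === "object" && value !== null && !Array.isArray(value)) {
        return value as Record<string, unknown>;
      }
    } catch {
      // Not parseable at this length (dangling key, comma, or escape); try a shorter prefix.
    }
  }
  return {};
}
```

With this shape, a UI can show a file path as soon as its string closes, while the content argument is still streaming.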
Error taxonomy
All provider errors are surfaced as StreamEvent with type: "error" and the error detail in error.errorMessage. The package does not throw exceptions for provider failures — errors are events.
| Error class | Cause | errorMessage contains |
|---|---|---|
| auth_failed | 401/403 from provider | HTTP status + provider error body |
| rate_limited | 429 from provider | HTTP status + Retry-After if present |
| context_too_long | 400 with context-length signal | Provider's error message |
| model_not_found | 404 or model-not-available | Provider + model ID |
| provider_error | 500/502/503 from provider | HTTP status + body excerpt |
| network_error | DNS failure, connection refused, timeout | Error message from fetch |
| aborted | AbortSignal triggered | "Request aborted" |
| parse_error | Malformed SSE or JSON from provider | Raw data excerpt |
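For the HTTP-derived rows of the table, classification could be sketched as below. classifyHttpError is a hypothetical name, the regular expression used as the "context-length signal" is a guess (providers word this differently), and falling back to provider_error for statuses the table does not list is an assumption.

```typescript
type HttpErrorClass =
  | "auth_failed"
  | "rate_limited"
  | "context_too_long"
  | "model_not_found"
  | "provider_error";

// Map an HTTP status and response body to one of the taxonomy's classes.
function classifyHttpError(status: number, body: string): HttpErrorClass {
  if (status === 401 || status === 403) return "auth_failed";
  if (status === 429) return "rate_limited";
  if (status === 404) return "model_not_found";
  // Assumed heuristic for the context-length signal on 400 responses.
  if (status === 400 && /context|too long|maximum.*tokens/i.test(body)) {
    return "context_too_long";
  }
  return "provider_error"; // 500/502/503 and anything unlisted
}
```

The remaining classes (network_error, aborted, parse_error) arise from fetch failures, the AbortSignal, and the parsers rather than from a status code.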
Retry policy
v0 ships with no automatic retry. The caller (harness/daemon) decides whether to retry, switch providers via fallback chain, or surface the error to the operator. This keeps the package simple and the retry policy visible.
Auth model
Credentials arrive via ProviderRoute.apiKey (resolved by @kaged/harness from @kaged/local-config). The package never reads environment variables or config files directly.
API key resolution (in @kaged/local-config, not here)
The ProviderSchema in local config already has api_key and api_key_env. The resolution order:
1. api_key (literal value in local.toml, for dev/testing).
2. api_key_env (env var name; harness reads Bun.env[name] at resolve time).
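The resolution order can be sketched as a pure function. resolveApiKey and ProviderCredentialConfig are hypothetical names, and the environment is passed in as a plain map so the sketch does not depend on a runtime; the harness would pass Bun.env.

```typescript
// Stand-in for the two credential fields of ProviderSchema.
interface ProviderCredentialConfig { api_key?: string; api_key_env?: string }

// Literal api_key wins; otherwise look up the named environment variable.
function resolveApiKey(
  provider: ProviderCredentialConfig,
  env: Record<string, string | undefined>,
): string | null {
  if (provider.api_key) return provider.api_key;
  if (provider.api_key_env) return env[provider.api_key_env] ?? null;
  return null; // no credential configured
}
```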
OAuth providers (kaged's distinctive path)
A class of providers exists that Mastra and Vercel (publishers of @ai-sdk/<provider> packages) do not ship: OAuth-based access to consumer subscriptions (Claude Pro, ChatGPT Plus, etc.). The terms of service for programmatic use of those subscriptions are in a gray area that corporate vendors choose to avoid. kaged is operator-owned, self-hosted; whether to use an OAuth-backed personal subscription with kaged is a decision the operator makes about their own account and provider relationship.
Per ADR-0014, kaged makes this an explicit architectural slot. @kaged/llm is the provider layer Mastra calls into via the LanguageModelV2 shim — and @kaged/llm is operator-owned code that may ship OAuth provider adapters Mastra's ecosystem cannot. The operator owns the TOS choice; kaged provides the capability.
OAuth token storage extends ProviderSchema with optional OAuth fields. The provider's type field signals the variant:
[providers.claude-pro]
type = "anthropic-oauth" # signals the OAuth provider variant
[providers.claude-pro.oauth]
access_token = "..."
refresh_token = "..."
expires_at = 1716000000
token_url = "https://..."
client_id = "..."
The harness is responsible for refreshing expired tokens before passing apiKey to @kaged/llm. The LLM package never does OAuth flows — it receives a ready-to-use bearer token (or whatever credential shape the OAuth variant requires).
v0 status. v0 does not ship OAuth provider adapters. The schema extension and the architectural slot are documented here for forward compatibility. API keys are the only supported auth in v0. A follow-up ADR + amendment will land OAuth when scheduled — see local-config.md for the credential storage shape that will land alongside.
Mastra integration (LanguageModelV2 shim)
Per ADR-0014, @kaged/llm exposes a LanguageModelV2 factory that lets Mastra v1.x consume any provider this package supports.
Public API
import { kagedModel } from "@kaged/llm/mastra"; // separate entry point
import type { LanguageModelV2 } from "@ai-sdk/provider-v5";
function kagedModel(
  route: ProviderRoute,
  options?: StreamOptions,
): LanguageModelV2;
kagedModel(route) returns a Vercel-AI-SDK-shaped LanguageModelV2 whose doStream and doGenerate methods wrap streamModel / completeModel. The returned object is the value the harness passes to Mastra's Agent.model field.
Mapping at the boundary
The shim translates in both directions:
Inbound (Mastra → kaged):
- LanguageModelV2CallOptions.prompt (Vercel's message array) → kaged Context.messages
- LanguageModelV2CallOptions.tools → kaged Context.tools (Mastra's tool shape is JSON-Schema-aligned with kaged's)
- LanguageModelV2CallOptions.abortSignal → kaged StreamOptions.signal
- LanguageModelV2CallOptions.temperature / maxTokens / topP / stopSequences / toolChoice → corresponding StreamOptions fields
- LanguageModelV2CallOptions.headers → merged into kaged's outbound request headers
Outbound (kaged → Mastra):
- kaged StreamEvent.text_delta → Vercel { type: "text-delta", textDelta }
- kaged StreamEvent.thinking_delta → Vercel { type: "reasoning", textDelta }
- kaged StreamEvent.toolcall_end → Vercel { type: "tool-call", toolCallId, toolName, args }
- kaged StreamEvent.done → Vercel { type: "finish", finishReason, usage }
- kaged StreamEvent.error → Vercel { type: "error", error } (or terminal finish with error reason, depending on what Mastra's stream consumer expects)
The shim never throws. Provider errors that arrive as kaged error events become Vercel error parts; the LanguageModelV2 contract surfaces them to Mastra, which routes them through the Processor pipeline's error hooks.
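The outbound mapping can be pictured as a total function over the event union. In the sketch below, the part shapes are copied from the list in this section and are not checked against the real @ai-sdk/provider-v5 types; the payload field names on the kaged side (delta, id, name, args, stopReason) are assumptions, since only the event type names appear here.

```typescript
// Assumed kaged-side payload fields; only the type names come from the spec.
type KagedEvent =
  | { type: "text_delta"; delta: string }
  | { type: "thinking_delta"; delta: string }
  | { type: "toolcall_end"; id: string; name: string; args: Record<string, unknown> }
  | { type: "done"; stopReason: string; usage: { input: number; output: number } }
  | { type: "error"; errorMessage: string };

// Part shapes as listed in this section (not the verified library types).
type StreamPart =
  | { type: "text-delta"; textDelta: string }
  | { type: "reasoning"; textDelta: string }
  | { type: "tool-call"; toolCallId: string; toolName: string; args: Record<string, unknown> }
  | { type: "finish"; finishReason: string; usage: { input: number; output: number } }
  | { type: "error"; error: Error };

// Translate one event; errors become parts instead of being thrown.
function toStreamPart(event: KagedEvent): StreamPart {
  switch (event.type) {
    case "text_delta":
      return { type: "text-delta", textDelta: event.delta };
    case "thinking_delta":
      return { type: "reasoning", textDelta: event.delta };
    case "toolcall_end":
      return { type: "tool-call", toolCallId: event.id, toolName: event.name, args: event.args };
    case "done":
      return { type: "finish", finishReason: event.stopReason, usage: event.usage };
    case "error":
      return { type: "error", error: new Error(event.errorMessage) };
  }
}
```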
Why a separate entry point (@kaged/llm/mastra)
Mastra is optional at the @kaged/llm boundary. The shim lives behind a separate entry point so a downstream consumer that only wants the raw streamModel API doesn't pay for the @ai-sdk/provider-v5 type import. The main @kaged/llm exports have no Vercel-shaped types.
@kaged/harness imports @kaged/llm/mastra. The daemon's provider test endpoint imports the main @kaged/llm.
Integration with harness and daemon
Per ADR-0014 and agent.md, @kaged/llm is consumed two ways. The same provider adapters run in both:
Primary path — Mastra agent loop (via the LanguageModelV2 shim)
daemon (handlePostMessage)
→ harness (runPrimary)
→ harness (routeModel → ProviderRoute)
→ harness (kagedModel(route) → LanguageModelV2)
→ Mastra (new Agent({ model: ... }).stream(messages))
→ Mastra calls LanguageModelV2.doStream(opts)
→ shim translates → @kaged/llm.streamModel(route, context, options)
→ @kaged/llm returns LlmEventStream
→ shim translates StreamEvent → LanguageModelV2StreamPart
→ Mastra emits ChunkType on fullStream
→ harness maps ChunkType → WsFrame
→ daemon publishes WsFrame to session subscribers
The agent loop, tool dispatch, supervisor / sub-agent topology, Processor pipeline, and suspend / resume checkpoints are all Mastra's responsibility. @kaged/llm is the provider layer Mastra calls into.
Direct path — provider test, ad-hoc calls (no agent loop)
daemon (handleTestProvider, etc.)
→ @kaged/llm.completeModel(route, context, options)
→ returns AssistantMessage
The provider test endpoint and any future "I just want to ping the provider" call path uses completeModel / streamModel directly. Same code, no Mastra involvement.
Why the same code in both paths
Per ADR-0014, kaged maintains one provider implementation, not two. The shim is a translation layer; the underlying HTTP + SSE work happens in the same provider adapter (packages/llm/src/providers/*) regardless of which path called it. Custom headers, retry policy, OAuth refresh, telemetry hooks — all live in @kaged/llm and apply to both paths.
Package structure
packages/llm/
package.json
tsconfig.json
src/
index.ts # public API: streamModel, completeModel, types re-export
mastra.ts # separate entry point: kagedModel (LanguageModelV2 shim)
types.ts # Message, Context, Tool, StreamEvent, Usage, etc.
stream.ts # LlmEventStream class (AsyncIterable + result())
provider-map.ts # providerName → API shape mapping
models.ts # listModels, humanizeModelId — model discovery
model-meta.ts # ModelMeta type, lookupModelMeta, calculateCost
dispatch.ts # streamModel/completeModel — resolves shape, delegates
mastra-model.ts # the LanguageModelV2 shim implementation (re-exported via mastra.ts)
usage-types.ts # UsageReport, UsageLimit, UsageWindow, UsageFetchOptions, etc.
sse-parser.ts # shared SSE line parser
partial-json.ts # best-effort incomplete JSON parser
data/
litellm-pricing.json # bundled LiteLLM snapshot (updated at release time)
providers/
anthropic.ts # anthropic-messages adapter
openai-completions.ts # openai chat completions adapter
openai-responses.ts # openai responses API adapter
google.ts # google generative AI adapter
antigravity.ts # antigravity (Google Cloud Code proxy) adapter
codex/
constants.ts # Codex URLs, headers, JWT claim extraction
request-transformer.ts # kaged messages → Codex input array format
response-handler.ts # rate limit parsing, error formatting
usage/
zai.ts # Z.AI coding plan quota fetcher
antigravity.ts # Antigravity per-model quota fetcher
__tests__/
types.test.ts
sse-parser.test.ts
partial-json.test.ts
dispatch.test.ts
model-meta.test.ts # lookupModelMeta, calculateCost, key normalization
mastra-model.test.ts # shim translation tests
providers/
anthropic.test.ts
openai-completions.test.ts
openai-responses.test.ts
google.test.ts
antigravity.test.ts # antigravity adapter tests (text, tools, thinking, rate-limit, usage)
codex/
constants.test.ts # JWT extraction, URL constants
request-transformer.test.ts # message → Codex input transform
response-handler.test.ts # error parsing, rate limit extraction
Testing notes
- Unit tests for SSE parser: feed raw byte chunks, assert parsed frames.
- Unit tests for partial-JSON parser: incomplete objects, arrays, strings, nested structures.
- Unit tests for each provider adapter: mock fetch to return canned SSE streams, assert StreamEvent sequence and final AssistantMessage shape.
- Integration tests for streamModel dispatch: mock fetch, verify correct provider adapter is selected by provider name.
- Error tests: mock fetch returning 401, 429, 500, network errors, malformed SSE; assert correct error events.
- Abort tests: fire AbortController.abort() mid-stream, assert aborted event and partial AssistantMessage.
- Mastra shim tests (mastra-model.test.ts): construct a LanguageModelV2 via kagedModel(route). Feed it canned LanguageModelV2CallOptions. Mock the underlying streamModel call. Assert: (a) inbound mapping translates Vercel-shape options into kaged Context + StreamOptions correctly; (b) outbound mapping translates kaged StreamEvents into Vercel-shape LanguageModelV2StreamParts; (c) errors arriving as kaged error events surface as Vercel error parts without throwing; (d) abortSignal passed in is forwarded into StreamOptions.signal.
- Model metadata tests (model-meta.test.ts):
  - lookupModelMeta key normalization. Assert lookupModelMeta("anthropic", "claude-sonnet-4-20250514") returns the correct ModelMeta from the bundled catalog. Assert lookupModelMeta("google", "gemini-2.0-flash") maps to the "gemini/" prefix correctly.
  - Unknown model. Assert lookupModelMeta("anthropic", "nonexistent-model") returns null.
  - Capability extraction. Assert a known reasoning model (e.g. claude-3-7-sonnet) has capabilities.reasoning === true. Assert a non-reasoning model has capabilities.reasoning === false.
  - Pricing extraction. Assert pricing.input, pricing.output are non-zero for a known paid model. Assert pricing.reasoning is non-null for models with output_cost_per_reasoning_token.
  - calculateCost with metadata. Provide token counts and a ModelMeta. Assert the cost breakdown matches expected values (input tokens × input price, output tokens × output price, reasoning tokens × reasoning price).
  - calculateCost without metadata. Pass meta: null. Assert all costs are 0.
  - calculateCost reasoning fallback. For a model where pricing.reasoning is null, assert reasoning tokens are priced at pricing.output.
All tests use bun:test. No live provider calls in unit tests. Integration tests against real providers are manual/operator-initiated (not in CI).
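The three calculateCost cases listed above pin down its arithmetic fairly tightly. A sketch consistent with them follows; the type names, the total field, and the assumption that reasoning tokens are counted separately from output tokens are illustrative, not taken from the real Usage or ModelMeta definitions.

```typescript
// Illustrative stand-ins; prices are USD per token.
interface Pricing { input: number; output: number; reasoning: number | null }
interface ModelMetaLike { pricing: Pricing }
interface TokenCounts { input: number; output: number; reasoning: number }
interface CostBreakdown { input: number; output: number; reasoning: number; total: number }

function calculateCostSketch(tokens: TokenCounts, meta: ModelMetaLike | null): CostBreakdown {
  // Missing metadata is never fatal: every cost is zero.
  if (meta === null) return { input: 0, output: 0, reasoning: 0, total: 0 };
  const { pricing } = meta;
  // Reasoning tokens fall back to the output price when no reasoning price exists.
  const reasoningPrice = pricing.reasoning ?? pricing.output;
  const input = tokens.input * pricing.input;
  const output = tokens.output * pricing.output;
  const reasoning = tokens.reasoning * reasoningPrice;
  return { input, output, reasoning, total: input + output + reasoning };
}
```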
Open questions
None. All four architectural questions resolved:
1. ✅ OAuth stored in local config by extending ProviderSchema; OAuth providers are an architectural slot per ADR-0014. v0 ships API keys only.
2. ✅ Streaming: dual call path, primary via Mastra (LanguageModelV2 shim) and direct via streamModel / completeModel. Same provider adapters in both. See § Integration with harness and daemon, and ADR-0014.
3. ✅ Package scope: pure provider interface, not general-purpose.
4. ✅ Tool surface: yes, pi-ai's Tool shape (JSON Schema parameters).
Amendments
2026-06-03 — openai-responses reasoning capture noted as a known gap
- openai-responses adapter documented as a reasoning stub. The v0 adapter list now records that the generic openai-responses adapter requests reasoning but never parses reasoning output events (response.reasoning_summary_text.* / reasoning items), so reasoning content is dropped for any reasoning model routed through it. This is captured as tech debt in STATUS.md; the openai-codex-responses adapter is the reference implementation for closing it. No code change accompanies this amendment; it documents existing behavior surfaced during the reasoning-ordering fixes across the streaming adapters.
2026-05-31 — GitHub Copilot driver + device-code OAuth flow
- New copilot driver added. GitHub Copilot joins the driver catalog as an openai-completions provider with default base URL https://api.githubcopilot.com, auth mode oauth, and default test model gpt-4o.
- Device-code OAuth flow added. @kaged/llm/oauth now supports providers that authenticate via device code instead of PKCE browser redirect. ProviderOAuthConfig gains optional deviceCode configuration and login start results can return userCode / verificationUri for daemon→UI relay.
- Copilot-specific request headers documented. Requests may add X-Initiator, Copilot-Vision-Request, and Openai-Intent dynamically based on the conversation and image input.
- Enterprise URL resolution documented. Copilot keeps the public default host (api.githubcopilot.com) but can derive enterprise hosts as https://copilot-api.{ghe-domain} when the operator authenticated against GitHub Enterprise.
- Post-login model policy activation added. Copilot runs a best-effort post-login hook that enables known models requiring policy acceptance before first use.
- Token exchange behavior documented. Login stores the long-lived GitHub OAuth token. At request time, the OpenAI-compatible adapter exchanges it against GET https://api.github.com/copilot_internal/v2/token to obtain the short-lived Copilot API bearer token.
- Catalog tables updated. Both the API-shape resolution table and the v0 driver catalog now include copilot.
- Package structure updated. Added Copilot provider constants/helpers (src/providers/copilot/) plus device-code OAuth flow support in src/oauth/device-code.ts and matching tests under __tests__/oauth/ and __tests__/providers/copilot/.
2026-05-27 — ADR-0024: estimateTokens API for pre-call compaction threshold check
Per ADR-0024:
- New § Token estimation added (under § ModelMeta). Documents estimateTokens(), the function the harness calls before every LLM call to compute the current token usage and compare against the agent's configured compaction threshold.
- EstimateInput and EstimateResult types defined. Inputs: messages, system prompt, model metadata, optional reserved output budget. Outputs: input tokens, reserved output tokens, total, fraction-of-context-window, context window size, algorithm used.
- Algorithm selection. Tiktoken when the model uses an OpenAI-compatible tokenizer; character-count fallback otherwise. The estimator over-estimates rather than under-estimates (reactive fallback in the harness catches the wrong-direction cases).
- tokenizer field added to ModelMeta. Values: "tiktoken" | "gemini" | "llama" | "unknown". Used by the estimator to choose the algorithm; populated from the bundled LiteLLM snapshot where determinable, defaults to "unknown".
- Unknown-model handling. When lookupModelMeta returns null (model not in catalog), the estimator falls back to a 32k conservative default for contextWindow and emits a per-call warning. Operators with unknown models will eventually have a local-config override path (future work; not v0).
- Constrained-by list extended with ADR-0024.
2026-05-23 — LanguageModelV2 shim + dual call-path + OAuth provider role
Driven by ADR-0014:
- LanguageModelV2 shim added (new § Mastra integration). kagedModel(route) is a factory exported from @kaged/llm/mastra that returns a Vercel-AI-SDK-shaped LanguageModelV2. Mastra v1.x uses this as Agent.model. The shim is the only Mastra-aware code in @kaged/llm.
- Integration with harness rewritten. Replaced the previous single-path call chain with the dual-path description: primary path (agent loop via Mastra) and direct path (provider test, ad-hoc calls). Same provider adapters run in both.
- OAuth providers section rewritten. Was "OAuth (future)" with a forward-compat note. Now "OAuth providers (kaged's distinctive path)": @kaged/llm's ability to ship OAuth / subscription adapters Mastra / Vercel won't is the reason the dual-path architecture exists, not an afterthought. Operator-owns-TOS-choice stance documented. v0 status unchanged: API keys only ship in v0.
- Package structure updated. Added mastra.ts (entry point) and mastra-model.ts (implementation) to the src/ listing, plus mastra-model.test.ts to __tests__/.
- Constraint table + Constrained-by list updated. New row pointing at ADR-0014. Constrained by list now includes ADR-0012 and ADR-0014.
- Open questions cross-referenced. Items #1 and #2 cite ADR-0014 for the now-concrete resolutions.
2026-05-23 — Driver catalog spec
- New § Driver catalog added (under § API shape resolution). Documents DriverInfo, DriverAuthMode, listDrivers(), and the full v0 driver table with labels, auth modes, base URLs, local flags, and test models.
- UI integration contract documented. Specifies how the daemon relays known_drivers: DriverInfo[] and how the UI consumes it (driver select rendering, base URL pre-fill, conditional credentials, contextual badges). No driver metadata hardcoded in the UI.
2026-05-23 — Model discovery functions
- New § Model discovery added (under § Driver catalog). Documents listModels() and humanizeModelId(), the two functions the daemon's model catalog endpoints consume.
- listModels() fetches live model lists from provider APIs. Covers all four API shapes with per-shape extraction logic (OpenAI /v1/models, Anthropic paginated /v1/models, Google paginated /v1beta/models with generateContent filter). Returns { ok, models, error? } and never throws.
- humanizeModelId() generates display names from model IDs (hyphen/underscore → space, title-case). Used as fallback when operators haven't set an explicit name in their model catalog.
- "Not normative for" list updated. Replaced the stale "deferred" model catalogs note with accurate scope: this package provides discovery functions; persistence is @kaged/local-config's responsibility.
2026-05-24 — Model metadata catalog + pricing
Driven by streaming-first enrichment work (provider:model labels, post-message stats bar, cost tracking in UI):
- Usage.cost.reasoning field added. The cost object inside Usage gains a reasoning: number field to separately track the dollar cost of reasoning/thinking tokens. Previously, reasoning tokens were silently lumped into the output cost; now callers can display them distinctly.
- New § Model metadata catalog added (under § Model discovery). Defines:
  - ModelMeta: the extracted metadata type for a single model (context limits, pricing per-token, capability flags, deprecation date). Sourced from LiteLLM's community-maintained JSON; @kaged/llm ships a bundled snapshot updated at release time.
  - lookupModelMeta(provider, modelId): maps kaged's "provider:model" convention to LiteLLM keys and returns ModelMeta | null. Missing metadata is never fatal.
  - calculateCost(usage, meta): pure utility computing dollar cost from token counts and pricing metadata. Falls back to output pricing for reasoning tokens when pricing.reasoning is null. Returns all-zero when metadata is unavailable.
  - Operator overrides: local config can override pricing and capabilities per model, taking precedence over the bundled catalog.
- Package structure updated. Added model-meta.ts (types + functions), data/litellm-pricing.json (bundled snapshot), and model-meta.test.ts.
- Testing notes updated. Added model metadata test cases: key normalization, unknown model, capability extraction, pricing extraction, calculateCost with/without metadata, reasoning price fallback.
2026-05-25 — Antigravity provider adapter
Adds Antigravity (Google Cloud Code proxy) as the fifth API shape and provider adapter:
- New antigravity API shape added. ApiShape union extended. Antigravity gets its own shape rather than reusing google-generative-ai: URL structure (/v1internal:streamGenerateContent, model in body not path), auth mechanism (Bearer token / OAuth, not API key), request envelope ({ model, request: { ...innerBody } }), and rate-limit semantics all differ.
- streamAntigravity adapter implemented (providers/antigravity.ts). Full streaming adapter with: Antigravity envelope wrapping, Bearer token auth, Antigravity-specific headers (User-Agent, X-Goog-Api-Client, Client-Metadata), line-based SSE parsing (Antigravity-specific wire format; see 2026-06-03 amendment), thinking/text/toolCall streaming, per-frame usage extraction, abort handling, and rate-limit-aware 429 error handling.
- RateLimitInfo type exported. Structured rate-limit info (retryAfterMs, reason, quotaResetTime, message) extracted from 429 responses. Parses Go-style compound durations (1h16m0.667s), retry-after-ms / retry-after headers, and structured error body details (RetryInfo, ErrorInfo, QuotaFailure). Surfaced in errorMessage for upstream rotation/backoff decisions.
- Thinking budget differs by model family. Claude models (detected by modelId.includes("claude")) use different budget ranges than Gemini models: Claude omits includeThoughts and uses higher budgets at each effort level.
- Thinking blocks stripped from outgoing requests. Assistant message history omits thinking blocks when building contents; Antigravity generates fresh thinking each turn (matching the reference plugin's approach). If stripping leaves a history turn with no valid parts, the adapter omits that turn instead of sending an empty contents entry.
- listModels returns informational error for antigravity. Antigravity does not expose a model listing endpoint; models must be configured in the project DSL.
- Driver catalog updated. New antigravity entry: label "Antigravity", base URL https://cloudcode-pa.googleapis.com, auth mode oauth, test model gemini-2.5-flash.
- Dispatch wired. dispatch.ts routes antigravity shape to streamAntigravity.
- Package structure updated. Added providers/antigravity.ts and __tests__/providers/antigravity.test.ts.
- 31 new tests. Text streaming, per-frame usage tracking, tool calls, thinking, rate-limit extraction (structured body, message fallback, header), safety filter, request format (URL, Bearer auth, envelope, headers, thinking budgets per model family, thinking block stripping).
2026-06-03 — Antigravity SSE and history sanitation
- Antigravity uses a line-based SSE parser, not the shared parseSseStream. Antigravity's wire format sends data: {json}\n lines where each line is a complete event; it does not use \n\n double-newline framing like standard SSE. The adapter uses a local parseAntigravityStream generator that splits on \n, processes each data:-prefixed line individually, and buffers partial lines across network chunks. This matches the battle-tested OpenCode reference plugin (createStreamingTransformer in reference/opencode-antigravity-auth/). The shared parseSseStream (which waits for \n\n boundaries) must NOT be used for Antigravity; doing so causes silent output loss.
- Antigravity strips empty sanitized history turns. During contents construction, empty text parts are omitted. If a user or assistant history message has no remaining valid parts after provider-specific sanitation (for example, an assistant turn containing only stripped thinking), the adapter omits the entire history turn rather than sending { parts: [] }.
- Testing notes updated. Antigravity provider tests cover split/coalesced SSE frames carrying thinking/text output and empty sanitized history turns.
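The line-based framing described in the first item of this amendment can be illustrated with a small buffer. This is a sketch of the technique, not the adapter's parseAntigravityStream; LineEventBuffer is a hypothetical name and it works on already-decoded text rather than on the response byte stream.

```typescript
// Emits the payload of each complete "data:" line; keeps partial lines buffered.
class LineEventBuffer {
  private buffer = "";

  push(chunk: string): string[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last piece has no terminating newline yet, so hold it for the next chunk.
    this.buffer = lines.pop() ?? "";
    const events: string[] = [];
    for (const line of lines) {
      const trimmed = line.trim(); // also drops a trailing "\r"
      if (!trimmed.startsWith("data:")) continue;
      const payload = trimmed.slice("data:".length).trim();
      if (payload !== "") events.push(payload);
    }
    return events;
  }
}
```

Because each event is released on a single newline, output is not held back waiting for a blank line that never arrives.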
2026-05-30 — ADR-0028: OAuth provider module
Per ADR-0028:
- OAuth provider module added (src/oauth/). Generic framework for any 3rd-party OAuth-backed LLM provider. 12 modules: types.ts (ProviderOAuthConfig, ProviderTokens), pkce.ts (PKCE via crypto.subtle), token-store.ts (per-provider JSON at $XDG_CONFIG_HOME/kaged/oauth/<provider>-tokens.json), authorize.ts (auth URL construction from config), callback-server.ts (temporary Bun.serve()), token-exchange.ts (code exchange with post-login hook support), refresh.ts (proactive + reactive refresh, resolveOAuthCredentials), login.ts, logout.ts, status.ts, index.ts (barrel).
- Driver catalog extended. ProviderOAuthConfig declarations in PROVIDER_OAUTH_CONFIGS alongside existing DRIVER_AUTH_MODES. DriverInfo gains optional oauth field. resolveOAuthConfig(driverName) export.
- Antigravity config registered. Full Google OAuth config with post-login hook for project ID resolution (fetchProjectId + onboardUser logic migrated from daemon).
- Package export added. @kaged/llm/oauth entry point for daemon consumption.
- Constrained-by list extended with ADR-0028.
- Package structure updated. Added src/oauth/ (12 files) and __tests__/oauth/ (3 test files).
2026-05-30 — ADR-0026: Model metadata overrides + cost management + usage pipeline
Per ADR-0026:
- § Operator overrides rewritten (under § Model metadata catalog). Replaced the local.toml override description with the full DB-backed override system: ModelOverride storage shape, sparse key-value schema, overridable fields table with dot-notation for nested fields, resolveModelMeta function with ResolvedModelMeta return type (merged result + per-field source tracking), examples for self-hosted models and stale-pricing corrections.
- resolveModelMeta replaces lookupModelMeta for override-aware callers. The harness, compaction, and token estimator call resolveModelMeta instead of lookupModelMeta. The merge path is: LiteLLM default → apply overrides → return. Models not in LiteLLM get a synthetic ModelMeta built from overrides only.
- Context window overrides feed compaction. Per the ADR-0024 amendment, maxInputTokens is overridable. The compaction system uses the effective (merged) context window for threshold calculation.
- Constrained-by list extended with ADR-0026.
- Provider usage pipeline documented (§ Provider usage reporting extended). On-demand fetch with DB cache (provider_usage_cache table), cache invalidated after every LLM call to that provider, manual refresh endpoint for out-of-band usage.
- Cost accumulation per provider. New provider_spend_events table records cost per LLM call. Daemon sums events per rolling window (5h, 7d) and compares against configured limits before each call.
- Spend limit enforcement. provider_spend_limits table stores per-provider limits (absolute USD per window, percentage of rolling window for quota-based providers). Enforcement is a hard block before LLM dispatch, not a soft warning.
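
The merge path listed above (LiteLLM default → apply overrides → return) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the ModelMetaSketch and ResolvedSketch shapes, the pricing field, and the resolveModelMetaSketch name are all hypothetical stand-ins, and only maxInputTokens comes from the spec text. The normative ModelMeta and ResolvedModelMeta types are defined in § Model metadata catalog.

```typescript
// Hypothetical shapes; pricing.* is an assumed nested field for illustration.
interface ModelMetaSketch {
  maxInputTokens?: number;
  pricing?: { input?: number; output?: number };
}
type FieldSource = "litellm" | "override";
interface ResolvedSketch {
  meta: ModelMetaSketch;
  sources: Record<string, FieldSource>; // dot-notation path -> origin of value
}

/** Collect dot-notation paths of every leaf value in a plain object. */
function leafPaths(obj: Record<string, unknown>, prefix = ""): string[] {
  const paths: string[] = [];
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (typeof value === "object" && value !== null && !Array.isArray(value)) {
      paths.push(...leafPaths(value as Record<string, unknown>, path));
    } else {
      paths.push(path);
    }
  }
  return paths;
}

/** Write a value at a dot-notation path, creating parent objects as needed. */
function setPath(target: Record<string, unknown>, path: string, value: unknown): void {
  const keys = path.split(".");
  const last = keys.pop() as string; // split() always yields at least one key
  let node = target;
  for (const key of keys) {
    const child = node[key];
    if (typeof child !== "object" || child === null) node[key] = {};
    node = node[key] as Record<string, unknown>;
  }
  node[last] = value;
}

/** Default -> apply sparse overrides -> return merged meta plus field sources. */
function resolveModelMetaSketch(
  base: ModelMetaSketch | undefined,
  overrides: Record<string, unknown>,
): ResolvedSketch {
  // Clone so the catalog default is never mutated; a missing default yields
  // a synthetic meta built from overrides only.
  const meta = JSON.parse(JSON.stringify(base ?? {})) as Record<string, unknown>;
  const sources: Record<string, FieldSource> = {};
  for (const path of leafPaths(meta)) sources[path] = "litellm";
  for (const [path, value] of Object.entries(overrides)) {
    setPath(meta, path, value);
    sources[path] = "override";
  }
  return { meta: meta as ModelMetaSketch, sources };
}
```

Keeping the overrides as a flat dot-notation map means a stale-pricing correction touches one key and leaves every other default, and its source tag, intact.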
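
The rolling-window check described in the cost accumulation and spend limit entries above could look something like the sketch below. It is an in-memory illustration, not the daemon's implementation: the SpendEventSketch and SpendLimitSketch shapes and both function names are hypothetical, only absolute-USD limits are covered (percentage-based limits are left out), and blocking once spend reaches the limit (>=) is an assumption.

```typescript
// Hypothetical row shapes standing in for provider_spend_events and
// provider_spend_limits; the real table columns are not shown here.
interface SpendEventSketch {
  provider: string;
  costUsd: number;
  at: number; // epoch milliseconds of the LLM call
}
interface SpendLimitSketch {
  provider: string;
  windowMs: number;
  maxUsd: number;
}

const FIVE_HOURS_MS = 5 * 60 * 60 * 1000;
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

/** Sum one provider's cost over the window (now - windowMs, now]. */
function spendInWindow(
  events: SpendEventSketch[],
  provider: string,
  windowMs: number,
  now: number,
): number {
  const cutoff = now - windowMs;
  let total = 0;
  for (const e of events) {
    if (e.provider === provider && e.at > cutoff && e.at <= now) {
      total += e.costUsd;
    }
  }
  return total;
}

/** Hard block: throw before dispatch when any configured window is used up. */
function assertWithinSpendLimits(
  events: SpendEventSketch[],
  limits: SpendLimitSketch[],
  provider: string,
  now: number,
): void {
  for (const limit of limits) {
    if (limit.provider !== provider) continue;
    const spent = spendInWindow(events, provider, limit.windowMs, now);
    if (spent >= limit.maxUsd) {
      throw new Error(
        `spend limit reached for ${provider}: ${spent} USD of ${limit.maxUsd} USD`,
      );
    }
  }
}
```

Throwing rather than returning a flag mirrors the "hard block" wording: a caller cannot forget to check the result before dispatching.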
2026-05-29 — Z.AI driver + provider usage reporting
- New zai driver added. Maps to anthropic-messages API shape with base URL https://api.z.ai/api/anthropic. Label "Z.AI (GLM Coding Plan)". Test model glm-5.1. Auth mode api_key. The existing streamAnthropic adapter handles non-Anthropic base URLs by switching from X-Api-Key to Authorization: Bearer, so no new provider adapter is needed.
- New § Provider usage reporting added (under § Driver catalog). Defines the normalized types (UsageReport, UsageLimit, UsageWindow, UsageAmount, UsageScope, UsageFetchOptions) and two provider-specific fetchers:
  - fetchZaiUsage: queries GET /api/monitor/usage/quota/limit on https://api.z.ai. Parses TOKENS_LIMIT and TIME_LIMIT entries into UsageLimit entries with usedFraction/remainingFraction, status classification (ok/warning/exhausted), and resetsAt from the API's nextResetTime.
  - fetchAntigravityUsage: queries POST /v1internal:fetchAvailableModels on Antigravity's endpoint. Parses per-model, per-tier quota info (fraction-based) into flat UsageLimit entries. Requires OAuth accessToken + projectId.
- UsageFetchOptions supports both auth patterns. apiKey for API-key providers (zai), accessToken + projectId for OAuth providers (antigravity). Each fetcher validates its own requirements at the top and returns null if missing.
- API shape resolution table updated. New zai row.
- Driver catalog table updated. New zai entry with full metadata.
- Package structure updated. Added usage-types.ts, usage/zai.ts, usage/antigravity.ts.
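
To make the usage normalization above more concrete, here is a minimal sketch of the credential guard and the fraction/status derivation. Everything in it is an assumption made for illustration: the type names carry a Sketch suffix because the real UsageLimit and UsageFetchOptions definitions live in § Provider usage reporting, the 0.8 warning threshold is invented, and the provider response parsing is omitted because the wire shapes are not reproduced here.

```typescript
// Hypothetical, trimmed-down versions of the normalized usage types.
type UsageStatusSketch = "ok" | "warning" | "exhausted";
interface UsageLimitSketch {
  id: string;
  usedFraction: number;
  remainingFraction: number;
  status: UsageStatusSketch;
}
interface UsageFetchOptionsSketch {
  apiKey?: string; // API-key providers (zai)
  accessToken?: string; // OAuth providers (antigravity)
  projectId?: string;
}

const WARNING_THRESHOLD = 0.8; // assumed value, not taken from the spec

/** Map a used fraction in [0, 1] onto the three status buckets. */
function classifyUsage(usedFraction: number): UsageStatusSketch {
  if (usedFraction >= 1) return "exhausted";
  if (usedFraction >= WARNING_THRESHOLD) return "warning";
  return "ok";
}

/** Build a normalized entry from raw used/total counters. */
function toUsageLimit(id: string, used: number, total: number): UsageLimitSketch {
  // Assumption: a non-positive total is treated as fully used.
  const raw = total > 0 ? used / total : 1;
  const usedFraction = Math.min(1, Math.max(0, raw));
  return {
    id,
    usedFraction,
    remainingFraction: 1 - usedFraction,
    status: classifyUsage(usedFraction),
  };
}

/** Guard a zai-style fetcher would run first; null means "cannot fetch". */
function zaiCredentials(opts: UsageFetchOptionsSketch): { apiKey: string } | null {
  return opts.apiKey ? { apiKey: opts.apiKey } : null;
}

/** Guard for the OAuth pattern: both token and project ID are required. */
function antigravityCredentials(
  opts: UsageFetchOptionsSketch,
): { accessToken: string; projectId: string } | null {
  if (!opts.accessToken || !opts.projectId) return null;
  return { accessToken: opts.accessToken, projectId: opts.projectId };
}
```

Returning null from the guard, rather than throwing, lets a caller treat "no credentials configured" as "no usage data" without wrapping every fetch in error handling.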
References
- ADR-0004: Runtime — Bun + TypeScript
- ADR-0011: Project portability
- ADR-0012: Agentic substrate is Mastra v1.x
- ADR-0014: All LLM providers route through @kaged/llm; Mastra integrates via a LanguageModelV2 shim
- ADR-0024: Context compaction is kaged-owned, layered, observable, and operator-tunable
- ADR-0026: Cost management, model metadata overrides, and provider usage tracking
- ADR-0028: 3rd-party OAuth provider auth
- Spec: Agent harness (Mastra integration, LanguageModelV2 consumer)
- Spec: Local config (credential storage)
- Spec: HTTP API (WebSocket relay)
- Spec: Session manager (run state machine)
- reference/oh-my-pi/packages/ai/: pi-ai reference implementation (wire protocol source)