Spec: LLM Provider Interface

Purpose

This spec defines @kaged/llm — the package that talks to LLM providers. It takes a resolved route (provider name, model ID, credentials) and a conversation context, and returns a stream of typed events representing the assistant's response.

This package is normative for:

  • The provider adapter interface and its contract.
  • The streaming event protocol (SSE parsing, event emission, partial-JSON tool arguments).
  • The six API shapes supported: anthropic-messages, openai-completions, openai-responses, openai-codex-responses, google-generative-ai, antigravity.
  • The message, tool, and context types that cross the package boundary.
  • The error taxonomy for provider failures.
  • AbortSignal integration for request cancellation.

It is not normative for:

  • Model alias resolution or fallback chains (that's provider-router in @kaged/harness).
  • Credential storage or operator-local config (that's @kaged/local-config).
  • Session state machines or run lifecycle (that's session-manager.md).
  • The WebSocket relay from daemon to UI (that's http-api.md).
  • Model catalog persistence (that's @kaged/local-config); this package only provides the discovery functions (listModels, humanizeModelId) consumed by the daemon's catalog endpoints.

This package is a pure provider interface. It is not a general-purpose LLM framework. It exists to translate kaged's internal context into provider-specific HTTP requests and translate provider-specific SSE responses back into kaged's event stream.

Per ADR-0014, this package is also the single provider path for kaged. It exposes a LanguageModelV2 shim (see § Mastra integration) that Mastra v1.x consumes as Agent.model, so the agent loop and the direct call path both route through the same provider adapters. There is no parallel @ai-sdk/<provider> dependency tree.

Constraints (from ADRs)

Constraint                                                                                                    Source
Runtime is Bun; no Node-isms in production code                                                               ADR-0004
No official SDKs (@anthropic-ai/sdk, openai); pure fetch-based                                                Operator decision (see § Design rationale)
Projects are portable; provider credentials are operator-local                                                ADR-0011
Single provider path for all LLM calls; expose a LanguageModelV2 shim for Mastra; no @ai-sdk/<provider> deps  ADR-0014

Design rationale

pi-ai (the reference implementation in reference/oh-my-pi/packages/ai/) uses official SDKs for Anthropic and imports from @bufbuild/protobuf for Cursor. kaged deliberately avoids these:

  1. Fewer deps. Official SDKs pull in HTTP clients, retry logic, and polyfills we don't need — Bun's fetch handles everything.
  2. Smaller surface. We support six API shapes, not 11. The SDK abstraction layers add indirection we'd have to fight.
  3. ARM64 safety. No native deps that wobble on low-resource Linux hosts.

The pi-ai source is the reference for wire protocols (request/response shapes, SSE event formats, header requirements). We rewrite the transport in pure TS with fetch.


Types

All types are kaged's own. They are informed by pi-ai's shapes but are not imported from it.

Message types

interface TextContent {
  type: "text";
  text: string;
}

interface ThinkingContent {
  type: "thinking";
  thinking: string;
}

interface ImageContent {
  type: "image";
  data: string;        // base64
  mimeType: string;    // e.g. "image/png"
}

interface ToolCall {
  type: "toolCall";
  id: string;
  name: string;
  arguments: Record<string, unknown>;
}

interface UserMessage {
  role: "user";
  content: string | (TextContent | ImageContent)[];
  timestamp: number;
}

interface AssistantMessage {
  role: "assistant";
  content: (TextContent | ThinkingContent | ToolCall)[];
  provider: string;
  model: string;
  usage: Usage;
  stopReason: StopReason;
  errorMessage?: string;
  timestamp: number;
  duration?: number;
  ttft?: number;     // time to first token (ms)
}

interface ToolResultMessage {
  role: "toolResult";
  toolCallId: string;
  toolName: string;
  content: (TextContent | ImageContent)[];
  isError: boolean;
  timestamp: number;
}

interface SystemMessage {
  role: "system";
  content: string;
}

type Message = UserMessage | SystemMessage | AssistantMessage | ToolResultMessage;

Context

interface Tool {
  name: string;
  description: string;
  parameters: Record<string, unknown>;  // JSON Schema object
  strict?: boolean;
}

interface Context {
  systemPrompt?: string[];
  messages: Message[];
  tools?: Tool[];
}

Usage & Stop

interface Usage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  reasoningTokens?: number;
  cost: {
    input: number;
    output: number;
    reasoning: number;
    cacheRead: number;
    cacheWrite: number;
    total: number;
  };
}

type StopReason = "stop" | "length" | "toolUse" | "error" | "aborted";

Stream events

The event stream is an AsyncIterable<StreamEvent> with a .result() method that resolves to the final AssistantMessage.

type StreamEvent =
  | { type: "start"; partial: AssistantMessage }
  | { type: "text_start"; contentIndex: number; partial: AssistantMessage }
  | { type: "text_delta"; contentIndex: number; delta: string; partial: AssistantMessage }
  | { type: "text_end"; contentIndex: number; content: string; partial: AssistantMessage }
  | { type: "thinking_start"; contentIndex: number; partial: AssistantMessage }
  | { type: "thinking_delta"; contentIndex: number; delta: string; partial: AssistantMessage }
  | { type: "thinking_end"; contentIndex: number; content: string; partial: AssistantMessage }
  | { type: "toolcall_start"; contentIndex: number; partial: AssistantMessage }
  | { type: "toolcall_delta"; contentIndex: number; delta: string; partial: AssistantMessage }
  | { type: "toolcall_end"; contentIndex: number; toolCall: ToolCall; partial: AssistantMessage }
  | { type: "done"; reason: "stop" | "length" | "toolUse"; message: AssistantMessage }
  | { type: "error"; reason: "error" | "aborted"; error: AssistantMessage };

These events match pi-ai's AssistantMessageEvent shape exactly. This is intentional — the event protocol is battle-tested and the daemon's WebSocket relay can forward them without transformation.
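A consumer of this protocol can be sketched as a small reducer over events. The block below is illustrative only: TextEvent is a trimmed subset of StreamEvent with the partial field omitted, and Folded, foldEvent and collect are hypothetical names that are not part of @kaged/llm. It also assumes that text_end carries the full text of its block.

```typescript
// Trimmed subset of the StreamEvent union; `partial` is left out for brevity.
type TextEvent =
  | { type: "text_start"; contentIndex: number }
  | { type: "text_delta"; contentIndex: number; delta: string }
  | { type: "text_end"; contentIndex: number; content: string }
  | { type: "done"; reason: "stop" | "length" | "toolUse" }
  | { type: "error"; reason: "error" | "aborted" };

interface Folded {
  blocks: Map<number, string>; // text accumulated per contentIndex
  finished: boolean;
  reason?: string;
}

// Hypothetical helper: apply one event to the accumulated state.
function foldEvent(state: Folded, ev: TextEvent): Folded {
  switch (ev.type) {
    case "text_start":
      state.blocks.set(ev.contentIndex, "");
      break;
    case "text_delta":
      state.blocks.set(ev.contentIndex, (state.blocks.get(ev.contentIndex) ?? "") + ev.delta);
      break;
    case "text_end":
      // Assumption: the end event carries the whole block, so it replaces the deltas.
      state.blocks.set(ev.contentIndex, ev.content);
      break;
    case "done":
    case "error":
      state.finished = true;
      state.reason = ev.reason;
      break;
  }
  return state;
}

// Works with any AsyncIterable of events, which is the shape the spec gives the stream.
async function collect(stream: AsyncIterable<TextEvent>): Promise<Folded> {
  let state: Folded = { blocks: new Map(), finished: false };
  for await (const ev of stream) state = foldEvent(state, ev);
  return state;
}
```

Keeping the reducer synchronous means the same function can drive a UI store, a test fixture, or the async loop in collect.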


Provider adapter interface

/** What the harness hands to @kaged/llm after route resolution. */
interface ProviderRoute {
  providerName: string;
  modelId: string;
  apiKey: string;
  baseUrl?: string;
  defaultOptions?: Record<string, unknown>;
}

/** Options for a stream request. */
interface StreamOptions {
  signal?: AbortSignal;
  temperature?: number;
  maxTokens?: number;
  topP?: number;
  stopSequences?: string[];
  reasoning?: EffortLevel;
  headers?: Record<string, string>;
}

type EffortLevel = "minimal" | "low" | "medium" | "high";

/** The main entry point. */
function streamModel(
  route: ProviderRoute,
  context: Context,
  options?: StreamOptions,
): LlmEventStream;

/** Convenience: await the full response. */
async function completeModel(
  route: ProviderRoute,
  context: Context,
  options?: StreamOptions,
): Promise<AssistantMessage>;

API shape resolution

The provider name determines which API shape to use. The mapping is configured per-provider, not per-model:

Provider name  API shape               Notes
anthropic      anthropic-messages      Direct Anthropic API
openai         openai-completions      Chat completions (v1/chat/completions)
google         google-generative-ai    Gemini generateContent
xai            openai-completions      OpenAI-compatible
groq           openai-completions      OpenAI-compatible
deepseek       openai-completions      OpenAI-compatible
mistral        openai-completions      OpenAI-compatible
ollama         openai-completions      OpenAI-compatible (/v1/chat/completions)
vllm           openai-completions      OpenAI-compatible
lm-studio      openai-completions      OpenAI-compatible
litellm        openai-completions      OpenAI-compatible
openrouter     openai-completions      OpenAI-compatible
antigravity    antigravity             Google Cloud Code proxy (Antigravity); OAuth bearer token auth
codex          openai-codex-responses  OpenAI Codex (ChatGPT backend); OAuth via auth.openai.com, PKCE flow; uses input array, not messages
copilot        openai-completions      GitHub Copilot (multi-provider via Copilot subscription); device-code OAuth via github.com; custom headers
zai            anthropic-messages      Z.AI (GLM Coding Plan); Anthropic-compatible proxy at api.z.ai

This mapping lives in a provider-map.ts file. Operators can extend it via local config (future; v0 ships with the hardcoded map above).
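As a rough illustration of the lookup, the sketch below resolves a provider name to an API shape from an excerpt of the v0 map. The PROVIDER_MAP constant and its layout are assumptions made for this example; the actual provider-map.ts may be organised differently.

```typescript
type ApiShape =
  | "anthropic-messages"
  | "openai-completions"
  | "openai-responses"
  | "openai-codex-responses"
  | "google-generative-ai"
  | "antigravity";

// Excerpt of the v0 mapping; a Map avoids accidental hits on Object.prototype keys.
const PROVIDER_MAP = new Map<string, ApiShape>([
  ["anthropic", "anthropic-messages"],
  ["openai", "openai-completions"],
  ["google", "google-generative-ai"],
  ["ollama", "openai-completions"],
  ["antigravity", "antigravity"],
  ["codex", "openai-codex-responses"],
  ["zai", "anthropic-messages"],
]);

// The shape depends only on the provider name, never on the model ID.
function resolveApiShape(name: string): ApiShape | undefined {
  return PROVIDER_MAP.get(name);
}
```

Unknown names resolve to undefined rather than throwing, which matches the return type listed for resolveApiShape in the driver catalog.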

Driver catalog

@kaged/llm is the single source of truth for driver metadata. The daemon relays the full catalog to the UI via GET /api/v1/local/providers (as known_drivers); the UI renders provider config forms dynamically from it — no hardcoded driver lists in the frontend.

Types

/** Auth modes a driver supports. */
type DriverAuthMode = "api_key" | "oauth" | "none";

/** Full driver metadata — the shape the daemon relays to the UI. */
interface DriverInfo {
  /** Driver identifier (e.g. "anthropic", "ollama"). */
  name: string;
  /** Human-readable label (e.g. "Anthropic", "Ollama"). */
  label: string;
  /** Which API shape this driver speaks. */
  apiShape: ApiShape;
  /** Default base URL, if any. */
  defaultBaseUrl: string | undefined;
  /** Default test model for the provider test endpoint. */
  testModel: string;
  /** Whether this is a local/self-hosted driver. */
  local: boolean;
  /** Auth modes this driver supports, ordered by preference. */
  authModes: DriverAuthMode[];
}

Public functions

Function                  Returns               Purpose
listDrivers()             DriverInfo[]          Full catalog with metadata for all known drivers
knownProviders()          string[]              Driver names only (legacy; use listDrivers for new code)
resolveApiShape(name)     ApiShape | undefined  API shape for a driver name
getDefaultBaseUrl(name)   string | undefined    Default base URL for a driver
getDriverTestModel(name)  string                Test model for provider connectivity checks

Driver catalog (v0)

Driver       Label                   API shape               Default base URL                           Local  Auth modes     Test model
anthropic    Anthropic               anthropic-messages      https://api.anthropic.com                  no     api_key        claude-sonnet-4-20250514
openai       OpenAI                  openai-completions      https://api.openai.com                     no     api_key        gpt-4.1-mini
google       Google (Gemini)         google-generative-ai    https://generativelanguage.googleapis.com  no     api_key        gemini-2.0-flash
xai          xAI (Grok)              openai-completions      https://api.x.ai                           no     api_key        grok-3-mini-fast
groq         Groq                    openai-completions      https://api.groq.com/openai                no     api_key        llama-3.3-70b-versatile
deepseek     DeepSeek                openai-completions      https://api.deepseek.com                   no     api_key        deepseek-chat
mistral      Mistral                 openai-completions      https://api.mistral.ai                     no     api_key        mistral-small-latest
fireworks    Fireworks AI            openai-completions      https://api.fireworks.ai/inference         no     api_key
together     Together AI             openai-completions      https://api.together.xyz                   no     api_key
cerebras     Cerebras                openai-completions      https://api.cerebras.ai                    no     api_key
openrouter   OpenRouter              openai-completions      https://openrouter.ai/api                  no     api_key        openai/gpt-4.1-mini
ollama       Ollama                  openai-completions      http://127.0.0.1:11434                     yes    none, api_key  llama3.2
vllm         vLLM                    openai-completions      http://127.0.0.1:8000                      yes    none, api_key
lm-studio    LM Studio               openai-completions      http://127.0.0.1:1234                      yes    none, api_key
litellm      LiteLLM                 openai-completions      http://localhost:4000                      yes    none, api_key
antigravity  Antigravity             antigravity             https://cloudcode-pa.googleapis.com        no     oauth          gemini-2.5-flash
codex        OpenAI Codex            openai-codex-responses  https://chatgpt.com/backend-api            no     oauth          gpt-5-codex
copilot      GitHub Copilot          openai-completions      https://api.githubcopilot.com              no     oauth          gpt-4o
zai          Z.AI (GLM Coding Plan)  anthropic-messages      https://api.z.ai/api/anthropic             no     api_key        glm-5.1

Local drivers list none first in authModes — the UI uses this to suppress the credentials section by default.

UI integration

The daemon's GET /api/v1/local/providers response includes known_drivers: DriverInfo[]. The UI consumes this to:

  1. Populate driver <select> with d.label as display text, d.name as value.
  2. Pre-fill base URL from d.defaultBaseUrl when the operator selects a driver.
  3. Conditionally render credentials — hidden when the driver's authModes does not include api_key.
  4. Show contextual badges — "local" for d.local === true drivers without a key, red "no key" warning for remote drivers missing credentials.

No driver metadata is hardcoded in the UI. Adding a new driver to provider-map.ts is sufficient — the UI picks it up automatically.

Model discovery

@kaged/llm provides two functions for model catalog workflows. The daemon's model catalog endpoints (http-api.md § Model catalog) consume these; persistence is handled by @kaged/local-config.

listModels(options)

Fetches the live model list from a provider's API. Supports all five API shapes:

  • OpenAI-compatible (openai-completions, openai-responses): GET /v1/models, extracts from data[] array.
  • Anthropic (anthropic-messages): GET /v1/models with pagination (after_id, up to 10 pages of 100).
  • Google (google-generative-ai): GET /v1beta/models with pagination (pageToken, up to 25 pages), filtered to models supporting generateContent. Strips models/ prefix from IDs.
  • Antigravity (antigravity): Returns { ok: false } with an informational error, because Antigravity does not expose a model listing endpoint. Models must be configured in the project DSL.

interface ListModelsOptions {
  driver: string;
  apiKey: string;
  baseUrl?: string;
  signal?: AbortSignal;
}

interface ListModelsResult {
  ok: boolean;
  models: ModelInfo[];
  error?: string;
}

interface ModelInfo {
  id: string;
  name: string;
}

function listModels(options: ListModelsOptions): Promise<ListModelsResult>;

Returns { ok: false, error } on unknown driver, missing base URL, HTTP errors, or fetch failures. Never throws — all errors are captured in the result.

Models are sorted alphabetically by id. Names come from provider metadata when available (display_name for Anthropic, displayName for Google) and otherwise fall back to the raw id.
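The never-throws contract and the sort order can be sketched for the OpenAI-compatible case as follows. This is a minimal sketch under stated assumptions: listOpenAiCompatibleModels and FetchLike are hypothetical names, the fetch function is injected so the example can run without a network, and Bearer authorization is assumed for the listing endpoint.

```typescript
interface ModelInfo { id: string; name: string }
interface ListModelsResult { ok: boolean; models: ModelInfo[]; error?: string }

// Minimal structural type for a fetch-like function (assumption for testability).
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function listOpenAiCompatibleModels(
  baseUrl: string,
  apiKey: string,
  fetchFn: FetchLike,
): Promise<ListModelsResult> {
  try {
    const res = await fetchFn(`${baseUrl.replace(/\/+$/, "")}/v1/models`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    if (!res.ok) return { ok: false, models: [], error: `HTTP ${res.status}` };

    const body = (await res.json()) as { data?: { id?: unknown }[] };
    const models = (body.data ?? [])
      .filter((m): m is { id: string } => typeof m.id === "string")
      .map((m) => ({ id: m.id, name: m.id })) // no display name in this shape: fall back to id
      .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
    return { ok: true, models };
  } catch (err) {
    // Network and JSON failures are folded into the result; nothing propagates to the caller.
    return { ok: false, models: [], error: err instanceof Error ? err.message : String(err) };
  }
}
```

Because every failure path returns a result object, the daemon can branch on ok without wrapping the call in its own try/catch.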

humanizeModelId(id)

Generates a human-readable display name from a model identifier. Used as the fallback when no explicit name is stored in the operator's model catalog.

function humanizeModelId(id: string): string;

Behavior:

  • Replaces hyphens and underscores with spaces.
  • Collapses consecutive separators.
  • Title-cases each word (first letter uppercase).
  • Preserves dots and version numbers.

Examples:

Input                     Output
claude-sonnet-4-20250514  Claude Sonnet 4 20250514
gpt-4.1-mini              Gpt 4.1 Mini
gemini-2.0-flash          Gemini 2.0 Flash
deepseek-chat             Deepseek Chat
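One way to get the behavior listed above is a split, capitalize, join pipeline. This is a sketch that reproduces the four examples in the table, not necessarily the shipped implementation.

```typescript
function humanizeModelId(id: string): string {
  return id
    .split(/[-_]+/)               // hyphens and underscores separate words; runs collapse into one
    .filter((word) => word.length > 0)
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1)) // dots and digits pass through
    .join(" ");
}
```

Only the first character of each word is touched, so version fragments such as 4.1 or 20250514 are preserved as written.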

Model metadata catalog

@kaged/llm is the single source of truth for model metadata — capabilities, pricing, and context limits. The daemon and UI consume this metadata via the shared types; they never parse pricing data themselves.

Data source

The base catalog is the LiteLLM model_prices_and_context_window.json — a community-maintained JSON file covering 1000+ models across all major providers. @kaged/llm ships a bundled snapshot of this file as the base catalog. The snapshot is updated periodically (at release time, not at runtime in v0).

The LiteLLM JSON keys are "provider/model-id" (e.g. "anthropic/claude-sonnet-4-20250514", "openai/gpt-4.1-mini"). Each entry contains a superset of the fields below; @kaged/llm extracts only the fields it needs.

ModelMeta

The extracted metadata type for a single model. This is the shape the daemon and UI consume — never the raw LiteLLM JSON.

interface ModelMeta {
  /** LiteLLM key (e.g. "anthropic/claude-sonnet-4-20250514"). */
  key: string;

  /** LiteLLM provider identifier (e.g. "anthropic", "openai", "vertex_ai"). */
  litellmProvider: string;

  /** Model mode — only "chat" models are relevant for kaged's agent loop. */
  mode: string;

  // --- Context limits ---
  maxInputTokens: number | null;
  maxOutputTokens: number | null;

  // --- Pricing (USD per token) ---
  pricing: {
    input: number;                     // input_cost_per_token
    output: number;                    // output_cost_per_token
    reasoning: number | null;          // output_cost_per_reasoning_token (null = same as output)
    cacheRead: number | null;          // cache_read_input_token_cost
    cacheWrite: number | null;         // cache_creation_input_token_cost
  };

  // --- Capabilities ---
  capabilities: {
    reasoning: boolean;                // supports_reasoning
    vision: boolean;                   // supports_vision
    functionCalling: boolean;          // supports_function_calling
    streaming: boolean;                // true for all chat models (kaged assumption)
    promptCaching: boolean;            // supports_prompt_caching
    responseSchema: boolean;           // supports_response_schema
    systemMessages: boolean;           // supports_system_messages
    webSearch: boolean;                // supports_web_search
    audioInput: boolean;               // supports_audio_input
    audioOutput: boolean;              // supports_audio_output
    pdf: boolean;                      // supports_pdf_input
  };

  // --- Deprecation ---
  deprecationDate: string | null;      // ISO 8601 date string or null
}

Fields not present in the LiteLLM entry default to null (limits, optional pricing) or false (capabilities). pricing.reasoning defaults to null, meaning the caller should fall back to pricing.output for reasoning tokens when computing cost.

Key normalization

LiteLLM keys use a "litellm_provider/model-id" format that doesn't always match kaged's "provider:model" convention (kaged uses : as separator, and provider names are kaged-defined — see § API shape resolution). The catalog provides a lookup function that accepts kaged's native format:

/** Look up model metadata by kaged's "provider:model" identifier. */
function lookupModelMeta(provider: string, modelId: string): ModelMeta | null;

The function maps kaged provider names to LiteLLM provider prefixes internally (e.g. "anthropic" → "anthropic/", "openai" → "openai/", "google" → "gemini/", "xai" → "xai/", "deepseek" → "deepseek/", "groq" → "groq/", "openrouter" → "openrouter/", etc.). If no match is found, it returns null and the caller (harness, daemon) proceeds without metadata. Missing metadata is never a fatal error.
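The prefix mapping might look like the sketch below. LITELLM_PREFIX and toLiteLlmKey are hypothetical names for this example, and the table contains only the pairs quoted in this section.

```typescript
// kaged provider name -> LiteLLM key prefix (only the pairs listed in this spec).
const LITELLM_PREFIX = new Map<string, string>([
  ["anthropic", "anthropic/"],
  ["openai", "openai/"],
  ["google", "gemini/"],
  ["xai", "xai/"],
  ["deepseek", "deepseek/"],
  ["groq", "groq/"],
  ["openrouter", "openrouter/"],
]);

// Hypothetical helper: build the LiteLLM catalog key, or null when the provider has no prefix.
function toLiteLlmKey(provider: string, modelId: string): string | null {
  const prefix = LITELLM_PREFIX.get(provider);
  return prefix === undefined ? null : prefix + modelId;
}
```

A lookupModelMeta built on this helper would return null both when the provider has no prefix and when the resulting key is absent from the bundled snapshot.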

calculateCost

A pure utility that computes the dollar cost of a completed LLM call from token counts and pricing metadata.

interface CostInput {
  usage: Usage;
  meta: ModelMeta | null;
}

interface CostBreakdown {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  total: number;
}

function calculateCost(input: CostInput): CostBreakdown;

If meta is null (unknown model), all costs are 0. If meta.pricing.reasoning is null, reasoning tokens are priced at meta.pricing.output. The function is pure — no side effects, no network calls.
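The rules above can be sketched as follows. Two assumptions are made that the spec does not state: usage.output is treated as excluding reasoning tokens (otherwise they would be billed twice), and null cache prices are treated as 0. The types are trimmed to the fields the calculation reads.

```typescript
interface Pricing {
  input: number;
  output: number;
  reasoning: number | null;
  cacheRead: number | null;
  cacheWrite: number | null;
}

interface TokenCounts {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  reasoningTokens?: number;
}

interface CostBreakdown {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  total: number;
}

function calculateCost(input: { usage: TokenCounts; meta: { pricing: Pricing } | null }): CostBreakdown {
  const { usage, meta } = input;
  // Unknown model: every component is zero.
  if (meta === null) return { input: 0, output: 0, reasoning: 0, cacheRead: 0, cacheWrite: 0, total: 0 };

  const p = meta.pricing;
  const inCost = usage.input * p.input;
  const outCost = usage.output * p.output;
  // A null reasoning price means "same as output".
  const reasoning = (usage.reasoningTokens ?? 0) * (p.reasoning ?? p.output);
  const cacheRead = usage.cacheRead * (p.cacheRead ?? 0);   // assumption: null price counts as 0
  const cacheWrite = usage.cacheWrite * (p.cacheWrite ?? 0); // assumption: null price counts as 0
  return {
    input: inCost,
    output: outCost,
    reasoning,
    cacheRead,
    cacheWrite,
    total: inCost + outCost + reasoning + cacheRead + cacheWrite,
  };
}
```

The function reads nothing but its argument, which keeps it pure in the sense described above.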

Operator overrides and the resolution pipeline

Per ADR-0026, operators can override any metadata field per provider+model. Overrides are stored in the DB (@kaged/storage, model_overrides table), not in local.toml. The resolution order:

  1. Operator override from DB (highest priority)
  2. Bundled LiteLLM snapshot (lowest priority)

This allows operators to correct stale pricing, fix wrong context windows, or add metadata for models not in the LiteLLM catalog (e.g. self-hosted fine-tunes behind Ollama/vLLM).

Override storage shape

interface ModelOverride {
  provider: string;        // kaged provider name (e.g. "anthropic", "ollama")
  modelId: string;         // model ID (e.g. "claude-sonnet-4-20250514", "my-llama3")
  field: string;           // field name from ModelMeta (e.g. "maxInputTokens", "pricing.input")
  value: string;           // JSON-encoded value (number, boolean, string, or null)
  updatedAt: number;       // epoch ms
}

Sparse — only overridden fields have rows. The primary key is (provider, modelId, field).

Overridable fields

All scalar fields on ModelMeta:

Field                         Type           Notes
maxInputTokens                number | null  Context window for compaction thresholds
maxOutputTokens               number | null  Max output tokens
pricing.input                 number         USD per input token
pricing.output                number         USD per output token
pricing.reasoning             number | null  USD per reasoning token
pricing.cacheRead             number | null  USD per cache read token
pricing.cacheWrite            number | null  USD per cache write token
capabilities.reasoning        boolean
capabilities.vision           boolean
capabilities.functionCalling  boolean
capabilities.promptCaching    boolean
capabilities.responseSchema   boolean
capabilities.systemMessages   boolean
capabilities.webSearch        boolean
capabilities.audioInput       boolean
capabilities.audioOutput      boolean
capabilities.pdf              boolean
deprecationDate               string | null
tokenizer                     string         "tiktoken" | "gemini" | "llama" | "unknown"

Nested fields use dot notation in the field column (e.g. pricing.input, capabilities.vision).

resolveModelMeta

The merge function. Replaces lookupModelMeta for callers that need override-aware metadata (harness, daemon, compaction).

interface ResolvedModelMeta {
  meta: ModelMeta;                         // the merged result
  sources: Record<string, "override" | "default">;  // per-field origin tracking
}

function resolveModelMeta(
  provider: string,
  modelId: string,
  overrides: ModelOverride[],
): ResolvedModelMeta;

Behavior:

  • Start with the LiteLLM default (lookupModelMeta). If no LiteLLM entry exists, start from a null-default ModelMeta (all fields null/false, key and litellmProvider synthesized from inputs).
  • For each override, apply the value to the corresponding field. Dot-notation fields set nested values (e.g. pricing.input sets meta.pricing.input).
  • The sources map tracks which fields came from overrides vs defaults, enabling the UI to render visual distinction.

Models not in LiteLLM. When lookupModelMeta returns null, the override system builds a ModelMeta entirely from overrides. Missing fields default to null (numeric), false (capabilities), or "unknown" (tokenizer). The key field is set to {provider}/{modelId}, litellmProvider is set to the provider name.
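The merge step can be sketched with a dot-notation setter. This is a simplified illustration, not the shipped resolveModelMeta: setPath and applyOverrides are hypothetical names, the base metadata is treated as a plain JSON object, and sources records only overridden fields (a caller would treat any absent field as "default").

```typescript
interface ModelOverride {
  provider: string;
  modelId: string;
  field: string;     // dot-notation path, e.g. "pricing.input"
  value: string;     // JSON-encoded value
  updatedAt: number; // epoch ms
}

type Json = Record<string, unknown>;
type Sources = Record<string, "override" | "default">;

// Hypothetical helper: write `value` at a dot-notation path, creating nested objects as needed.
function setPath(target: Json, path: string, value: unknown): void {
  const parts = path.split(".");
  const leaf = parts.pop() ?? path;
  let node = target;
  for (const part of parts) {
    const next = node[part];
    if (typeof next !== "object" || next === null) node[part] = {};
    node = node[part] as Json;
  }
  node[leaf] = value;
}

function applyOverrides(
  base: Json,
  provider: string,
  modelId: string,
  overrides: ModelOverride[],
): { meta: Json; sources: Sources } {
  const meta = JSON.parse(JSON.stringify(base)) as Json; // deep copy; the default is never mutated
  const sources: Sources = {};
  for (const o of overrides) {
    if (o.provider !== provider || o.modelId !== modelId) continue; // rows for other models are ignored
    // JSON.parse throws on malformed values; a real implementation would validate first.
    setPath(meta, o.field, JSON.parse(o.value));
    sources[o.field] = "override";
  }
  return { meta, sources };
}
```

Copying the base before applying overrides keeps the bundled snapshot immutable, so repeated resolutions for different models cannot leak values into each other.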

Example: Override context window for a self-hosted model

const overrides: ModelOverride[] = [
  { provider: "ollama", modelId: "llama3.1:70b", field: "maxInputTokens", value: "131072", updatedAt: Date.now() },
  { provider: "ollama", modelId: "llama3.1:70b", field: "pricing.input", value: "0", updatedAt: Date.now() },
  { provider: "ollama", modelId: "llama3.1:70b", field: "pricing.output", value: "0", updatedAt: Date.now() },
];

const resolved = resolveModelMeta("ollama", "llama3.1:70b", overrides);
// resolved.meta.maxInputTokens === 131072  (from override)
// resolved.meta.pricing.input === 0        (from override)
// resolved.meta.capabilities.vision === false (default, no override)
// resolved.sources["maxInputTokens"] === "override"
// resolved.sources["capabilities.vision"] === "default"

Example: Correct stale LiteLLM pricing

const overrides: ModelOverride[] = [
  { provider: "anthropic", modelId: "claude-sonnet-4-20250514", field: "pricing.input", value: "0.000003", updatedAt: Date.now() },
];

const resolved = resolveModelMeta("anthropic", "claude-sonnet-4-20250514", overrides);
// resolved.meta.pricing.input === 0.000003  (override wins)
// resolved.meta.maxInputTokens === <from LiteLLM>  (no override, uses default)
// resolved.sources["pricing.input"] === "override"
// resolved.sources["maxInputTokens"] === "default"

What this is NOT

  • Not a runtime fetcher. v0 does not fetch the LiteLLM JSON at runtime. The bundled snapshot is the base. A future version may add periodic refresh with a configurable interval and local cache file.
  • Not exhaustive. The catalog covers models present in LiteLLM's dataset. Local/self-hosted models (Ollama, vLLM, LM Studio) are unlikely to appear; their metadata comes from operator config or defaults to null.
  • Not the model list. listModels() fetches live model IDs from provider APIs. lookupModelMeta() enriches those IDs with static metadata from the bundled catalog. They are independent — a model can appear in listModels without metadata, and the catalog can contain models the operator hasn't provisioned.

Token estimation

Per ADR-0024, the harness needs to estimate token usage before each LLM call to decide whether compaction should fire. @kaged/llm exposes the estimator.

estimateTokens

interface EstimateInput {
  messages: Message[];                   // the candidate message list
  systemPrompt: string | string[];       // system prompt(s)
  modelMeta: ModelMeta | null;           // resolved via lookupModelMeta
  reservedOutputTokens?: number;         // budget reserved for the LLM's response (default 4096)
}

interface EstimateResult {
  inputTokens: number;                   // estimated input tokens (messages + system)
  reservedOutputTokens: number;          // echoed back; the harness uses this for the threshold check
  totalTokens: number;                   // inputTokens + reservedOutputTokens
  fraction: number;                      // totalTokens / contextWindow
  contextWindow: number | null;          // from modelMeta.maxInputTokens; null if metadata unavailable
  algorithm: "tiktoken" | "fallback";    // which estimator was used
}

function estimateTokens(input: EstimateInput): EstimateResult;

Behavior:

  • Algorithm preference. When modelMeta.tokenizer === "tiktoken" (most OpenAI and Anthropic models map cleanly), the estimator uses a local tiktoken implementation. When the model uses a different tokenizer (Gemini, Llama, etc.) or modelMeta is null, the estimator falls back to a character-count heuristic (chars / 3.5, rounded up). The algorithm field reports which was used.
  • Conservative. The estimator over-estimates rather than under-estimates. Wrong-direction errors (estimating too few tokens, then hitting the context-length limit at the provider) are caught by the reactive-fallback path in the harness, but they are expensive, so the estimator errs on the side of caution.
  • System prompt counts. All system prompt content (including plugin-injected memory wrapped in <plugin:NAME> blocks) is counted.
  • Tool calls and results. Counted as part of the message they belong to. Tool-result bodies can be large; the estimator includes them in full.
  • reservedOutputTokens default. 4096. The harness can override per-call based on the operator's expected response length (a one-shot summarizer call might reserve 1500; a long-form coding session might reserve 8000).

fraction is the operative number. The harness compares fraction against the agent's configured upper threshold (default 0.85 per ADR-0024). When fraction >= upper_threshold, the harness triggers compaction.

contextWindow: null handling. When the resolved model is unknown to the metadata catalog (local Ollama model, brand-new release not yet in the bundled snapshot), modelMeta is null and contextWindow is null. The harness falls back to a conservative default (32k tokens) and emits a warning per call. Operators with unknown models should add metadata via local-config overrides (future work; not v0).
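The fallback path and the threshold check reduce to a few lines of arithmetic. The sketch below is illustrative: estimateFallback and shouldCompact are hypothetical names, the input is flattened to a list of strings, and fraction is set to null when the context window is unknown, which is an assumption because the spec does not say what fraction holds in that case.

```typescript
interface FallbackEstimate {
  inputTokens: number;
  reservedOutputTokens: number;
  totalTokens: number;
  fraction: number | null;      // assumption: null when contextWindow is unknown
  contextWindow: number | null;
  algorithm: "fallback";
}

// Character-count heuristic: chars / 3.5, rounded up.
function estimateFallback(
  texts: string[],
  contextWindow: number | null,
  reservedOutputTokens = 4096,
): FallbackEstimate {
  const chars = texts.reduce((sum, t) => sum + t.length, 0);
  const inputTokens = Math.ceil(chars / 3.5);
  const totalTokens = inputTokens + reservedOutputTokens;
  return {
    inputTokens,
    reservedOutputTokens,
    totalTokens,
    fraction: contextWindow === null ? null : totalTokens / contextWindow,
    contextWindow,
    algorithm: "fallback",
  };
}

// Compaction fires when the fraction reaches the upper threshold (0.85 by default).
function shouldCompact(estimate: FallbackEstimate, upperThreshold = 0.85): boolean {
  return estimate.fraction !== null && estimate.fraction >= upperThreshold;
}
```

For instance, 8 characters of input give ceil(8 / 3.5) = 3 tokens; with 5 tokens reserved and a 10-token window, the fraction is 0.8 and compaction does not fire at the default threshold.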

Example use from harness

import { estimateTokens, lookupModelMeta } from "@kaged/llm";

const modelMeta = lookupModelMeta(route.providerName, route.modelId);
const estimate = estimateTokens({
  messages: candidateList,
  systemPrompt: assembledSystemPrompt,
  modelMeta,
  reservedOutputTokens: agentConfig.compaction?.reservedOutputTokens ?? 4096,
});

if (estimate.contextWindow !== null && estimate.fraction >= agentConfig.compaction.upper_threshold) {
  await runCompactionPipeline({ /* ... */ });
}

Performance

  • Tiktoken path: ~1-3ms for a typical session message list (50-200 messages).
  • Fallback path: ~0.1ms (string-length arithmetic).

Estimation runs before every LLM call. The cost is acceptable; it does not dominate latency.

tokenizer field on ModelMeta

ModelMeta is extended with an optional tokenizer: "tiktoken" | "gemini" | "llama" | "unknown" field. The bundled LiteLLM snapshot is augmented with this field where it can be determined from the LiteLLM data; where it cannot, the field defaults to "unknown" and the estimator uses the fallback path.

Out of scope

  • Exact token counting via the provider's native tokenizer API. v0 estimates locally. Some providers offer a /tokenize endpoint; the harness may use these in the reactive-fallback path in a future amendment.
  • Per-prompt warm-cache of token counts. The estimator is stateless; computing the same message list twice does the work twice. A future cache keyed on message content hashes is plausible if profiling shows it matters.

Provider usage reporting

@kaged/llm exposes a usage reporting interface for querying provider quota and consumption data. The daemon calls these fetchers and relays the results to the UI for budget dashboards and routing/backoff decisions.

Types

All usage types are kaged's own. They provide a normalized schema for representing provider quota limits, consumption windows, and budget status.

type UsageUnit = "percent" | "tokens" | "requests" | "usd" | "minutes" | "bytes" | "unknown";

type UsageStatus = "ok" | "warning" | "exhausted" | "unknown";

interface UsageWindow {
  id: string;                       // stable identifier (e.g. "quota", "5h", "7d")
  label: string;                    // human label (e.g. "Quota", "5 Hour", "7 Day")
  durationMs?: number;              // window duration when known
  resetsAt?: number;                // absolute reset timestamp (ms since epoch)
}

interface UsageAmount {
  used?: number;
  limit?: number;
  remaining?: number;
  usedFraction?: number;            // 0..1
  remainingFraction?: number;       // 0..1
  unit: UsageUnit;
}

interface UsageScope {
  provider: string;
  accountId?: string;
  projectId?: string;
  modelId?: string;
  tier?: string;
  windowId?: string;
  shared?: boolean;                 // quota shared across models in the provider
}

interface UsageLimit {
  id: string;                       // unique per limit entry (e.g. "zai:tokens", "gemini-3-flash:free:default")
  label: string;                    // display label
  scope: UsageScope;
  window?: UsageWindow;
  amount: UsageAmount;
  status?: UsageStatus;
}

interface UsageReport {
  provider: string;
  fetchedAt: number;                // epoch ms
  limits: UsageLimit[];
  metadata?: Record<string, unknown>;
}

interface UsageFetchOptions {
  apiKey?: string;                  // for API-key-authenticated providers (zai)
  accessToken?: string;             // for OAuth providers (antigravity)
  projectId?: string;               // required by some OAuth providers
  baseUrl?: string;
  signal?: AbortSignal;
}

UsageReport is the primary shape the UI consumes. A single report contains multiple UsageLimit entries — one per quota window or limit type the provider exposes. The UI renders these as a budget dashboard: progress bars from usedFraction, status badges from status, reset countdowns from window.resetsAt.

Fetchers

Each provider with a quota endpoint gets a dedicated fetcher function. Fetchers are async, return null on failure (non-ok response, invalid payload, missing credentials), and never throw — the daemon handles the null gracefully.

Fetcher                         Provider     Auth                     Endpoint
fetchZaiUsage(options)          zai          apiKey                   GET /api/monitor/usage/quota/limit on https://api.z.ai
fetchAntigravityUsage(options)  antigravity  accessToken + projectId  POST /v1internal:fetchAvailableModels on https://cloudcode-pa.googleapis.com

fetchZaiUsage

Fetches the Z.AI coding plan quota. The API returns TOKENS_LIMIT and TIME_LIMIT entries, each with usage (limit), currentValue (used), percentage, remaining, and nextResetTime. The fetcher maps these to two UsageLimit entries:

  • zai:tokens — token consumption quota, unit "tokens".
  • zai:requests — request/time quota, unit "requests".

Both share the same UsageWindow (id: "quota") with resetsAt from the API's nextResetTime. Status classification: exhausted when usedFraction >= 1, warning when >= 0.9, ok otherwise.

The monitor endpoint's base URL is the origin of baseUrl, which strips the /api/anthropic path used for the LLM endpoint. The Authorization header carries the raw API key with no Bearer prefix, since Z.AI's monitor endpoint expects the key directly.
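The status thresholds for Z.AI limits can be written as a small classifier. classifyUsed is a hypothetical name, and returning "unknown" for a missing or non-numeric fraction is an assumption of this sketch; the spec only defines the three numeric cases.

```typescript
type UsageStatus = "ok" | "warning" | "exhausted" | "unknown";

// exhausted at >= 1, warning at >= 0.9, ok otherwise.
function classifyUsed(usedFraction: number | undefined): UsageStatus {
  if (usedFraction === undefined || Number.isNaN(usedFraction)) return "unknown"; // assumption
  if (usedFraction >= 1) return "exhausted";
  if (usedFraction >= 0.9) return "warning";
  return "ok";
}
```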

fetchAntigravityUsage

Fetches per-model quota from Antigravity's fetchAvailableModels endpoint. The response contains a models map where each model has quotaInfo / quotaInfos / quotaInfoByTier — each with remainingFraction (0..1) and resetTime.

The fetcher normalizes the nested quota info into flat UsageLimit entries:

  • Per-model, per-tier, per-window. ID format: {modelId}:{tier}:{windowId}. Label includes display name and tier when present.
  • unit: "percent". Antigravity reports fractions, not absolute token counts. used and remaining are expressed as percentages (0–100).
  • status classification. exhausted when remainingFraction <= 0, warning when <= 0.1, ok otherwise.
  • earliestReset. The report's metadata includes the earliest reset time across all limits, so the UI can show a single countdown.

Requires accessToken and projectId in UsageFetchOptions. Returns null if either is missing or the request fails.
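
The normalization rules above might look like the following sketch. QuotaEntry and FlatLimit are hypothetical, simplified shapes (the real UsageLimit type is defined in usage-types.ts), and treating resetTime as an ISO-8601 string is an assumption.

```typescript
type UsageStatus = "ok" | "warning" | "exhausted";

// Hypothetical input: one quota entry after the nested quotaInfo maps have been walked.
interface QuotaEntry {
  modelId: string;
  tier: string;
  windowId: string;
  remainingFraction: number; // 0..1, as reported by the API
  resetTime?: string;        // assumed ISO-8601
}

// Hypothetical, trimmed-down stand-in for UsageLimit.
interface FlatLimit {
  id: string;
  unit: "percent";
  used: number;      // 0..100
  remaining: number; // 0..100
  status: UsageStatus;
  resetsAt: number | null; // epoch ms
}

function toFlatLimit(q: QuotaEntry): FlatLimit {
  const remainingFraction = Math.min(1, Math.max(0, q.remainingFraction));
  const status: UsageStatus =
    remainingFraction <= 0 ? "exhausted" : remainingFraction <= 0.1 ? "warning" : "ok";
  const parsed = q.resetTime ? Date.parse(q.resetTime) : NaN;
  return {
    id: `${q.modelId}:${q.tier}:${q.windowId}`,
    unit: "percent",
    used: (1 - remainingFraction) * 100,
    remaining: remainingFraction * 100,
    status,
    resetsAt: Number.isNaN(parsed) ? null : parsed,
  };
}

// Earliest reset across all limits, for the report's metadata.
function earliestReset(limits: FlatLimit[]): number | null {
  const times = limits.map((l) => l.resetsAt).filter((t): t is number => t !== null);
  return times.length > 0 ? Math.min(...times) : null;
}
```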

Adding new fetchers

To add a fetcher for a new provider:

  1. Create packages/llm/src/usage/<provider>.ts with an exported async function fetch<Provider>Usage(options: UsageFetchOptions): Promise<UsageReport | null>.
  2. Export from packages/llm/src/index.ts.
  3. Add a row to the fetcher table above.
  4. The daemon wires it to the appropriate credential source and polling interval.
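
As a starting point for step 1, here is a hedged skeleton of a fetcher that honors the "return null, never throw" contract. The provider name "example", the /usage path, the Bearer header, and the payload shape are all placeholders. The injected fetchImpl parameter is also an addition for testability; the spec's signature takes only options.

```typescript
// Abbreviated local stand-ins for the types in usage-types.ts.
interface UsageFetchOptions { apiKey?: string; baseUrl?: string; signal?: AbortSignal }
interface UsageReport { provider: string; fetchedAt: number; limits: unknown[] }

// Injected so the sketch can run without network access (not part of the spec's signature).
type FetchLike = (
  url: string,
  init: { headers: Record<string, string>; signal?: AbortSignal },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

async function fetchExampleUsage(
  options: UsageFetchOptions,
  fetchImpl: FetchLike,
): Promise<UsageReport | null> {
  if (!options.apiKey || !options.baseUrl) return null; // missing credentials
  try {
    const res = await fetchImpl(`${options.baseUrl}/usage`, {
      headers: { Authorization: `Bearer ${options.apiKey}` },
      signal: options.signal,
    });
    if (!res.ok) return null; // non-ok response
    const body = await res.json();
    if (typeof body !== "object" || body === null) return null; // invalid payload
    const limits = (body as { limits?: unknown }).limits;
    if (!Array.isArray(limits)) return null;
    return { provider: "example", fetchedAt: Date.now(), limits };
  } catch {
    return null; // network errors, aborts, and JSON failures never escape the fetcher
  }
}
```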

What this is NOT

  • Not a local accumulator. These fetchers query the provider's own quota API. They don't track usage locally by counting tokens in SQLite — that's a separate concern the daemon may add later for providers without quota endpoints.
  • Not polled automatically. The daemon decides when to fetch (on session start, on a timer, after a 429). The fetcher is a pure query function.
  • Not a routing gate. The daemon/harness may use usage data to inform fallback decisions, but the fetcher itself doesn't block or reroute calls.

Provider adapter contract

Each API shape is implemented by a single adapter function:

type ProviderStreamFn = (
  route: ProviderRoute,
  context: Context,
  options: StreamOptions,
) => LlmEventStream;

Four adapters ship in v0:

  1. anthropic-messages: POST /v1/messages with stream: true. SSE events: message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop.
  2. openai-completions: POST /v1/chat/completions with stream: true. SSE data: lines with choices[0].delta.
  3. openai-responses: POST /v1/responses with stream: true. SSE events: response.created, response.output_item.added, response.content_part.delta, response.output_item.done, response.completed. Known gap (tech debt): the generic openai-responses adapter is a stub that requests reasoning (body.reasoning = { effort }) but does not parse the model's reasoning output. It has no handling for response.reasoning_summary_text.* events or reasoning output items, so thinking from a reasoning model routed through this adapter is silently dropped: it never enters partial.content, either live or persisted. The openai-codex-responses adapter (described below) implements reasoning capture correctly and is the reference for closing this gap. See STATUS.md § Known tech debt.
  4. google-generative-ai: POST /v1beta/models/{model}:streamGenerateContent?alt=sse. SSE data: lines with JSON chunks.

Two further adapters ship alongside v0, one for Antigravity (Google Cloud Code proxy) and one for Codex over ChatGPT OAuth:

  5. antigravity: POST /v1internal:streamGenerateContent?alt=sse. Model ID in body (not path). Request wrapped in Antigravity envelope ({ model, request: { ...googleGenAiBody } }). Bearer token auth (OAuth). Antigravity-specific headers (User-Agent, X-Goog-Api-Client, Client-Metadata). SSE data: lines carry JSON chunks as in google-generative-ai, but each line is a complete event with no \n\n framing, so the adapter uses a local line-based parser rather than the shared SSE parser (see the 2026-06-03 amendment). Rate-limit-aware error handling: extracts RateLimitInfo from 429 responses (headers, structured error body, Go-style duration parsing). Usage extracted from every SSE frame for real-time consumption tracking. Thinking budget differs by model family (Claude vs Gemini). Tool declarations stay in Antigravity functionDeclarations form for all model families, but schema normalization is model-family-specific: Gemini schemas are reduced to the Google Schema proto subset, while Claude schemas use the battle-tested Antigravity JSON Schema subset from reference/opencode-antigravity-auth (unsupported constraints become description hints, invalid unions are flattened, empty object parameters receive a placeholder property).

  6. openai-codex-responses: POST /codex/responses on https://chatgpt.com/backend-api with stream: true. OAuth via ChatGPT accounts (auth.openai.com, client ID app_EMoamEEZ73f0CkXaXp7hrann, PKCE + device-code flows). The access token is a JWT; the chatgpt_account_id claim is extracted and sent as the mandatory chatgpt-account-id header on every request. Uses input array (not messages) with OpenAI Responses API event types. Custom headers: OpenAI-Beta: responses=experimental, originator, session_id. Supports reasoning config (effort levels none/minimal/low/medium/high/xhigh + summary mode). Rate limits extracted from response headers (x-codex-primary-used-percent, x-codex-primary-window-minutes, x-codex-primary-reset-at, and secondary equivalents). Error parsing decodes usage_limit_reached and rate_limit_exceeded into friendly messages with reset times. Tool calls disable parallel execution (parallel_tool_calls: false). Session-based prompt caching via prompt_cache_key + prompt_cache_retention.

Each adapter:

  • Builds the provider-specific request body from Context + StreamOptions.
  • Sends fetch() with Accept: text/event-stream.
  • Parses the SSE response into StreamEvent values pushed onto the LlmEventStream.
  • Tracks Usage from provider-reported token counts.
  • Maps provider stop reasons to kaged's StopReason.
  • Handles AbortSignal by aborting the underlying fetch.

SSE parser

A shared SSE line parser handles the raw HTTP response body for every adapter except antigravity, which uses its own line-based parser (see the 2026-06-03 amendment). It:

  1. Reads the ReadableStream<Uint8Array> from fetch response body.
  2. Splits on \n\n boundaries (SSE frame delimiter).
  3. Extracts event:, data:, and id: fields per frame.
  4. Yields { event: string | null, data: string } objects.
  5. Handles [DONE] sentinel (OpenAI) and empty keepalive frames.

This is a rewrite of pi-ai's readSseEvents utility in pure TS with no external deps.
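
The framing steps above can be sketched as a small stateful splitter. This is an illustration, not the shipped parser: createSseSplitter is a hypothetical name, and the sketch takes already-decoded text chunks, so reading and decoding the ReadableStream<Uint8Array> (step 1) is left to the caller.

```typescript
interface SseFrame { event: string | null; data: string }

function createSseSplitter() {
  let buffer = "";

  function parseFrame(raw: string): SseFrame | null {
    let event: string | null = null;
    const data: string[] = [];
    for (const line of raw.split("\n")) {
      if (line.startsWith("event:")) event = line.slice(6).trim();
      else if (line.startsWith("data:")) data.push(line.slice(5).replace(/^ /, ""));
      // id: fields and ":" comment lines are read past; this sketch does not surface them.
    }
    if (data.length === 0) return null;   // empty keepalive frame
    const joined = data.join("\n");
    if (joined === "[DONE]") return null; // OpenAI end-of-stream sentinel
    return { event, data: joined };
  }

  return {
    // Feed one decoded text chunk; returns every frame that chunk completed.
    push(chunk: string): SseFrame[] {
      // Normalize on the whole buffer so a \r\n split across chunks is still handled.
      buffer = (buffer + chunk).replace(/\r\n/g, "\n");
      const frames: SseFrame[] = [];
      let idx = buffer.indexOf("\n\n");
      while (idx !== -1) {
        const frame = parseFrame(buffer.slice(0, idx));
        buffer = buffer.slice(idx + 2);
        if (frame) frames.push(frame);
        idx = buffer.indexOf("\n\n");
      }
      return frames;
    },
  };
}
```

Keeping the splitter synchronous and chunk-driven makes it easy to test with arbitrary chunk boundaries, which is the case the unit tests in § Testing notes call out.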

Partial-JSON parser

Tool call arguments stream incrementally. A partial-JSON parser provides best-effort parsing of incomplete JSON so the UI can show tool arguments as they arrive (e.g., file path before content is complete).

Behavior:

  • Returns {} when no valid JSON prefix exists.
  • Closes unclosed strings, arrays, and objects.
  • Never throws — always returns the best parse possible.
  • Final toolcall_end event uses standard JSON.parse (must succeed; error → StopReason: "error").
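
One way to get the behavior listed above is "close what is open, and if that still does not parse, shorten the prefix and retry". The sketch below takes that approach; it is not necessarily how partial-json.ts works, both function names are hypothetical, and the retry loop is quadratic in the worst case, which a production parser would want to avoid.

```typescript
// Append whatever closers the prefix still needs: an open string first, then brackets.
function closeOpenStructures(prefix: string): string {
  const stack: string[] = [];
  let inString = false;
  let escaped = false;
  for (let i = 0; i < prefix.length; i++) {
    const ch = prefix.charAt(i);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") stack.push("}");
    else if (ch === "[") stack.push("]");
    else if (ch === "}" || ch === "]") stack.pop();
  }
  return prefix + (inString ? '"' : "") + stack.reverse().join("");
}

// Never throws: returns the parse of the longest closable prefix, or {} if none exists.
function parsePartialJson(input: string): unknown {
  for (let end = input.length; end > 0; end--) {
    try {
      return JSON.parse(closeOpenStructures(input.slice(0, end)));
    } catch {
      // Prefix ends mid-token (dangling comma, key without value, ...): shorten and retry.
    }
  }
  return {};
}
```

With this sketch, a tool call streaming `{"path":"/tmp/a.ts","content":"hel` already exposes the complete path and the partial content, while `{"a":` falls back to an empty object until the value starts to arrive.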

Error taxonomy

All provider errors are surfaced as StreamEvent with type: "error" and the error detail in error.errorMessage. The package does not throw exceptions for provider failures — errors are events.

| Error class | Cause | errorMessage contains |
| --- | --- | --- |
| auth_failed | 401/403 from provider | HTTP status + provider error body |
| rate_limited | 429 from provider | HTTP status + Retry-After if present |
| context_too_long | 400 with context-length signal | Provider's error message |
| model_not_found | 404 or model-not-available | Provider + model ID |
| provider_error | 500/502/503 from provider | HTTP status + body excerpt |
| network_error | DNS failure, connection refused, timeout | Error message from fetch |
| aborted | AbortSignal triggered | "Request aborted" |
| parse_error | Malformed SSE or JSON from provider | Raw data excerpt |
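
The HTTP-driven rows of the taxonomy can be sketched as a status classifier. The function name and the context-length regular expression are assumptions (providers word that signal differently), and mapping every otherwise-unlisted status to provider_error is a choice of this sketch rather than something the table specifies. The network_error, aborted, and parse_error classes arise outside the HTTP status path and are not covered here.

```typescript
type HttpErrorClass =
  | "auth_failed"
  | "rate_limited"
  | "context_too_long"
  | "model_not_found"
  | "provider_error";

// Assumed heuristic for a "context-length signal" in a 400 body.
const CONTEXT_LENGTH_HINT = /context.{0,20}(length|window)|too many tokens|maximum context/i;

function classifyHttpError(status: number, body: string): HttpErrorClass {
  if (status === 401 || status === 403) return "auth_failed";
  if (status === 429) return "rate_limited";
  if (status === 400 && CONTEXT_LENGTH_HINT.test(body)) return "context_too_long";
  if (status === 404) return "model_not_found";
  return "provider_error"; // 500/502/503, plus anything unlisted (sketch's choice)
}
```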

Retry policy

v0 ships with no automatic retry. The caller (harness/daemon) decides whether to retry, switch providers via fallback chain, or surface the error to the operator. This keeps the package simple and the retry policy visible.


Auth model

Credentials arrive via ProviderRoute.apiKey (resolved by @kaged/harness from @kaged/local-config). The package never reads environment variables or config files directly.

API key resolution (in @kaged/local-config, not here)

The ProviderSchema in local config already has api_key and api_key_env. The resolution order:

  1. api_key (literal value in local.toml — for dev/testing).
  2. api_key_env (env var name; harness reads Bun.env[name] at resolve time).
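
A minimal sketch of this resolution order. The helper name is hypothetical, and the environment is passed in as a plain record (for example Bun.env) so the function stays pure and testable.

```typescript
interface ProviderKeyConfig { api_key?: string; api_key_env?: string }

// Hypothetical helper mirroring the two-step order above.
function resolveApiKey(
  cfg: ProviderKeyConfig,
  env: Record<string, string | undefined>,
): string | null {
  if (cfg.api_key) return cfg.api_key;                      // 1. literal value wins
  if (cfg.api_key_env) return env[cfg.api_key_env] ?? null; // 2. named env var
  return null;                                              // nothing configured
}
```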

OAuth providers (kaged's distinctive path)

A class of providers exists that Mastra and Vercel (publishers of @ai-sdk/<provider> packages) do not ship: OAuth-based access to consumer subscriptions (Claude Pro, ChatGPT Plus, etc.). The terms of service for programmatic use of those subscriptions are in a gray area that corporate vendors choose to avoid. kaged is operator-owned and self-hosted; whether to use an OAuth-backed personal subscription with kaged is a decision the operator makes about their own account and provider relationship.

Per ADR-0014, kaged makes this an explicit architectural slot. @kaged/llm is the provider layer Mastra calls into via the LanguageModelV2 shim — and @kaged/llm is operator-owned code that may ship OAuth provider adapters Mastra's ecosystem cannot. The operator owns the TOS choice; kaged provides the capability.

OAuth token storage extends ProviderSchema with optional OAuth fields. The provider's type field signals the variant:

[providers.claude-pro]
type = "anthropic-oauth"             # signals the OAuth provider variant

[providers.claude-pro.oauth]
access_token = "..."
refresh_token = "..."
expires_at = 1716000000
token_url = "https://..."
client_id = "..."

The harness is responsible for refreshing expired tokens before passing apiKey to @kaged/llm. The LLM package never does OAuth flows — it receives a ready-to-use bearer token (or whatever credential shape the OAuth variant requires).

v0 status. v0 does not ship OAuth provider adapters. The schema extension and the architectural slot are documented here for forward compatibility. API keys are the only supported auth in v0. A follow-up ADR + amendment will land OAuth when scheduled — see local-config.md for the credential storage shape that will land alongside.


Mastra integration (LanguageModelV2 shim)

Per ADR-0014, @kaged/llm exposes a LanguageModelV2 factory that lets Mastra v1.x consume any provider this package supports.

Public API

import { kagedModel } from "@kaged/llm/mastra";       // separate entry point
import type { LanguageModelV2 } from "@ai-sdk/provider-v5";

function kagedModel(
  route: ProviderRoute,
  options?: StreamOptions,
): LanguageModelV2;

kagedModel(route) returns a Vercel-AI-SDK-shaped LanguageModelV2 whose doStream and doGenerate methods wrap streamModel / completeModel. The returned object is the value the harness passes to Mastra's Agent.model field.

Mapping at the boundary

The shim translates in both directions:

Inbound (Mastra → kaged):

  • LanguageModelV2CallOptions.prompt (Vercel's message array) → kaged Context.messages
  • LanguageModelV2CallOptions.tools → kaged Context.tools (Mastra's tool shape is JSON-Schema-aligned with kaged's)
  • LanguageModelV2CallOptions.abortSignal → kaged StreamOptions.signal
  • LanguageModelV2CallOptions.temperature / maxTokens / topP / stopSequences / toolChoice → corresponding StreamOptions fields
  • LanguageModelV2CallOptions.headers → merged into kaged's outbound request headers

Outbound (kaged → Mastra):

  • kaged StreamEvent.text_delta → Vercel { type: "text-delta", textDelta }
  • kaged StreamEvent.thinking_delta → Vercel { type: "reasoning", textDelta }
  • kaged StreamEvent.toolcall_end → Vercel { type: "tool-call", toolCallId, toolName, args }
  • kaged StreamEvent.done → Vercel { type: "finish", finishReason, usage }
  • kaged StreamEvent.error → Vercel { type: "error", error } (or terminal finish with error reason, depending on what Mastra's stream consumer expects)

The shim never throws. Provider errors that arrive as kaged error events become Vercel error parts; the LanguageModelV2 contract surfaces them to Mastra, which routes them through the Processor pipeline's error hooks.
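
The outbound mapping can be pictured as one exhaustive switch. The event field names below (delta, id, name, args, reason, usage) are assumptions made for illustration; the real StreamEvent union lives in types.ts and the real part types come from @ai-sdk/provider-v5. Only the type tags and the Vercel-side field names are taken from the list above.

```typescript
// Assumed, simplified kaged event shapes.
type KagedEvent =
  | { type: "text_delta"; delta: string }
  | { type: "thinking_delta"; delta: string }
  | { type: "toolcall_end"; id: string; name: string; args: unknown }
  | { type: "done"; reason: string; usage: unknown }
  | { type: "error"; errorMessage: string };

// Local stand-ins for the Vercel-shaped stream parts named in the mapping.
type VercelPart =
  | { type: "text-delta"; textDelta: string }
  | { type: "reasoning"; textDelta: string }
  | { type: "tool-call"; toolCallId: string; toolName: string; args: unknown }
  | { type: "finish"; finishReason: string; usage: unknown }
  | { type: "error"; error: unknown };

function toVercelPart(ev: KagedEvent): VercelPart {
  switch (ev.type) {
    case "text_delta":
      return { type: "text-delta", textDelta: ev.delta };
    case "thinking_delta":
      return { type: "reasoning", textDelta: ev.delta };
    case "toolcall_end":
      return { type: "tool-call", toolCallId: ev.id, toolName: ev.name, args: ev.args };
    case "done":
      return { type: "finish", finishReason: ev.reason, usage: ev.usage };
    case "error":
      // Errors become parts rather than exceptions, matching "the shim never throws".
      return { type: "error", error: new Error(ev.errorMessage) };
  }
}
```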

Why a separate entry point (@kaged/llm/mastra)

Mastra is optional at the @kaged/llm boundary. The shim lives behind a separate entry point so a downstream consumer that only wants the raw streamModel API doesn't pay for the @ai-sdk/provider-v5 type import. The main @kaged/llm exports have no Vercel-shaped types.

@kaged/harness imports @kaged/llm/mastra. The daemon's provider test endpoint imports the main @kaged/llm.


Integration with harness and daemon

Per ADR-0014 and agent.md, @kaged/llm is consumed two ways. The same provider adapters run in both:

Primary path — Mastra agent loop (via the LanguageModelV2 shim)

daemon (handlePostMessage)
  → harness (runPrimary)
  → harness (routeModel → ProviderRoute)
  → harness (kagedModel(route) → LanguageModelV2)
  → Mastra (new Agent({ model: ... }).stream(messages))
  → Mastra calls LanguageModelV2.doStream(opts)
  → shim translates → @kaged/llm.streamModel(route, context, options)
  → @kaged/llm returns LlmEventStream
  → shim translates StreamEvent → LanguageModelV2StreamPart
  → Mastra emits ChunkType on fullStream
  → harness maps ChunkType → WsFrame
  → daemon publishes WsFrame to session subscribers

The agent loop, tool dispatch, supervisor / sub-agent topology, Processor pipeline, and suspend / resume checkpoints are all Mastra's responsibility. @kaged/llm is the provider layer Mastra calls into.

Direct path — provider test, ad-hoc calls (no agent loop)

daemon (handleTestProvider, etc.)
  → @kaged/llm.completeModel(route, context, options)
  → returns AssistantMessage

The provider test endpoint and any future "I just want to ping the provider" call path use completeModel / streamModel directly. Same code, no Mastra involvement.

Why the same code in both paths

Per ADR-0014, kaged maintains one provider implementation, not two. The shim is a translation layer; the underlying HTTP + SSE work happens in the same provider adapter (packages/llm/src/providers/*) regardless of which path called it. Custom headers, retry policy, OAuth refresh, telemetry hooks — all live in @kaged/llm and apply to both paths.


Package structure

packages/llm/
  package.json
  tsconfig.json
  src/
    index.ts                 # public API: streamModel, completeModel, types re-export
    mastra.ts                # separate entry point: kagedModel (LanguageModelV2 shim)
    types.ts                 # Message, Context, Tool, StreamEvent, Usage, etc.
    stream.ts                # LlmEventStream class (AsyncIterable + result())
    provider-map.ts          # providerName → API shape mapping
    models.ts                # listModels, humanizeModelId — model discovery
    model-meta.ts            # ModelMeta type, lookupModelMeta, calculateCost
    dispatch.ts              # streamModel/completeModel — resolves shape, delegates
    mastra-model.ts          # the LanguageModelV2 shim implementation (re-exported via mastra.ts)
    usage-types.ts           # UsageReport, UsageLimit, UsageWindow, UsageFetchOptions, etc.
    sse-parser.ts            # shared SSE line parser
    partial-json.ts          # best-effort incomplete JSON parser
    data/
      litellm-pricing.json   # bundled LiteLLM snapshot (updated at release time)
    providers/
      anthropic.ts           # anthropic-messages adapter
      openai-completions.ts  # openai chat completions adapter
      openai-responses.ts    # openai responses API adapter
      google.ts              # google generative AI adapter
      antigravity.ts         # antigravity (Google Cloud Code proxy) adapter
      codex/
        constants.ts         # Codex URLs, headers, JWT claim extraction
        request-transformer.ts # kaged messages → Codex input array format
        response-handler.ts  # rate limit parsing, error formatting
    usage/
      zai.ts                 # Z.AI coding plan quota fetcher
      antigravity.ts         # Antigravity per-model quota fetcher
  __tests__/
    types.test.ts
    sse-parser.test.ts
    partial-json.test.ts
    dispatch.test.ts
    model-meta.test.ts       # lookupModelMeta, calculateCost, key normalization
    mastra-model.test.ts     # shim translation tests
    providers/
      anthropic.test.ts
      openai-completions.test.ts
      openai-responses.test.ts
      google.test.ts
      antigravity.test.ts    # antigravity adapter tests (text, tools, thinking, rate-limit, usage)
      codex/
        constants.test.ts       # JWT extraction, URL constants
        request-transformer.test.ts # message → Codex input transform
        response-handler.test.ts   # error parsing, rate limit extraction

Testing notes

  • Unit tests for SSE parser: feed raw byte chunks, assert parsed frames.
  • Unit tests for partial-JSON parser: incomplete objects, arrays, strings, nested structures.
  • Unit tests for each provider adapter: mock fetch to return canned SSE streams, assert StreamEvent sequence and final AssistantMessage shape.
  • Integration tests for streamModel dispatch: mock fetch, verify correct provider adapter is selected by provider name.
  • Error tests: mock fetch returning 401, 429, 500, network errors, malformed SSE; assert correct error events.
  • Abort tests: fire AbortController.abort() mid-stream, assert aborted event and partial AssistantMessage.
  • Mastra shim tests (mastra-model.test.ts): construct a LanguageModelV2 via kagedModel(route). Feed it canned LanguageModelV2CallOptions. Mock the underlying streamModel call. Assert: (a) inbound mapping translates Vercel-shape options into kaged Context + StreamOptions correctly; (b) outbound mapping translates kaged StreamEvents into Vercel-shape LanguageModelV2StreamParts; (c) errors arriving as kaged error events surface as Vercel error parts without throwing; (d) abortSignal passed in is forwarded into StreamOptions.signal.
  • Model metadata tests (model-meta.test.ts):
    • lookupModelMeta key normalization. Assert lookupModelMeta("anthropic", "claude-sonnet-4-20250514") returns the correct ModelMeta from the bundled catalog. Assert lookupModelMeta("google", "gemini-2.0-flash") maps to the "gemini/" prefix correctly.
    • Unknown model. Assert lookupModelMeta("anthropic", "nonexistent-model") returns null.
    • Capability extraction. Assert a known reasoning model (e.g. claude-3-7-sonnet) has capabilities.reasoning === true. Assert a non-reasoning model has capabilities.reasoning === false.
    • Pricing extraction. Assert pricing.input, pricing.output are non-zero for a known paid model. Assert pricing.reasoning is non-null for models with output_cost_per_reasoning_token.
    • calculateCost with metadata. Provide token counts and a ModelMeta. Assert the cost breakdown matches expected values (input tokens × input price, output tokens × output price, reasoning tokens × reasoning price).
    • calculateCost without metadata. Pass meta: null. Assert all costs are 0.
    • calculateCost reasoning fallback. For a model where pricing.reasoning is null, assert reasoning tokens are priced at pricing.output.

All tests use bun:test. No live provider calls in unit tests. Integration tests against real providers are manual/operator-initiated (not in CI).


Open questions

None. All four architectural questions resolved:

  1. ✅ OAuth stored in local config by extending ProviderSchema; OAuth providers are an architectural slot per ADR-0014. v0 ships API keys only.
  2. ✅ Streaming: dual call path — primary via Mastra (LanguageModelV2 shim), direct via streamModel / completeModel. Same provider adapters in both. See § Integration with harness and daemon, and ADR-0014.
  3. ✅ Package scope: pure provider interface, not general-purpose.
  4. ✅ Tool surface: yes, pi-ai's Tool shape (JSON Schema parameters).

Amendments

2026-06-03 — openai-responses reasoning capture noted as a known gap

  1. openai-responses adapter documented as a reasoning stub. The v0 adapter list now records that the generic openai-responses adapter requests reasoning but never parses reasoning output events (response.reasoning_summary_text.* / reasoning items), so reasoning content is dropped for any reasoning model routed through it. This is captured as tech debt in STATUS.md; the openai-codex-responses adapter is the reference implementation for closing it. No code change accompanies this amendment — it documents existing behavior surfaced during the reasoning-ordering fixes across the streaming adapters.

2026-05-31 — GitHub Copilot driver + device-code OAuth flow

  1. New copilot driver added. GitHub Copilot joins the driver catalog as an openai-completions provider with default base URL https://api.githubcopilot.com, auth mode oauth, and default test model gpt-4o.
  2. Device-code OAuth flow added. @kaged/llm/oauth now supports providers that authenticate via device code instead of PKCE browser redirect. ProviderOAuthConfig gains optional deviceCode configuration and login start results can return userCode / verificationUri for daemon→UI relay.
  3. Copilot-specific request headers documented. Requests may add X-Initiator, Copilot-Vision-Request, and Openai-Intent dynamically based on the conversation and image input.
  4. Enterprise URL resolution documented. Copilot keeps the public default host (api.githubcopilot.com) but can derive enterprise hosts as https://copilot-api.{ghe-domain} when the operator authenticated against GitHub Enterprise.
  5. Post-login model policy activation added. Copilot runs a best-effort post-login hook that enables known models requiring policy acceptance before first use.
  6. Token exchange behavior documented. Login stores the long-lived GitHub OAuth token. At request time, the OpenAI-compatible adapter exchanges it against GET https://api.github.com/copilot_internal/v2/token to obtain the short-lived Copilot API bearer token.
  7. Catalog tables updated. Both the API-shape resolution table and the v0 driver catalog now include copilot.
  8. Package structure updated. Added Copilot provider constants/helpers (src/providers/copilot/) plus device-code OAuth flow support in src/oauth/device-code.ts and matching tests under __tests__/oauth/ and __tests__/providers/copilot/.

2026-05-27 — ADR-0024: estimateTokens API for pre-call compaction threshold check

Per ADR-0024:

  1. New § Token estimation added (under § ModelMeta). Documents estimateTokens() — the function the harness calls before every LLM call to compute the current token usage and compare against the agent's configured compaction threshold.
  2. EstimateInput and EstimateResult types defined. Inputs: messages, system prompt, model metadata, optional reserved output budget. Outputs: input tokens, reserved output tokens, total, fraction-of-context-window, context window size, algorithm used.
  3. Algorithm selection. Tiktoken when the model uses an OpenAI-compatible tokenizer; character-count fallback otherwise. The estimator over-estimates rather than under-estimates (reactive fallback in the harness catches the wrong-direction cases).
  4. tokenizer field added to ModelMeta. Values: "tiktoken" | "gemini" | "llama" | "unknown". Used by the estimator to choose the algorithm; populated from the bundled LiteLLM snapshot where determinable, defaults to "unknown".
  5. Unknown-model handling. When lookupModelMeta returns null (model not in catalog), the estimator falls back to a 32k conservative default for contextWindow and emits a per-call warning. Operators with unknown models will eventually have a local-config override path (future work; not v0).
  6. Constrained-by list extended with ADR-0024.

2026-05-23 — LanguageModelV2 shim + dual call-path + OAuth provider role

Driven by ADR-0014:

  1. LanguageModelV2 shim added (new § Mastra integration). kagedModel(route) is a factory exported from @kaged/llm/mastra that returns a Vercel-AI-SDK-shaped LanguageModelV2. Mastra v1.x uses this as Agent.model. The shim is the only Mastra-aware code in @kaged/llm.
  2. Integration with harness rewritten. Replaced the previous single-path call chain with the dual-path description: primary path (agent loop via Mastra) and direct path (provider test, ad-hoc calls). Same provider adapters run in both.
  3. OAuth providers section rewritten. Was "OAuth (future)" with a forward-compat note. Now "OAuth providers (kaged's distinctive path)" — @kaged/llm's ability to ship OAuth / subscription adapters Mastra / Vercel won't is the reason the dual-path architecture exists, not an afterthought. Operator-owns-TOS-choice stance documented. v0 status unchanged: API keys only ship in v0.
  4. Package structure updated. Added mastra.ts (entry point) and mastra-model.ts (implementation) to the src/ listing, plus mastra-model.test.ts to __tests__/.
  5. Constraint table + Constrained-by list updated. New row pointing at ADR-0014. Constrained by list now includes ADR-0012 and ADR-0014.
  6. Open questions cross-referenced. Items #1 and #2 cite ADR-0014 for the now-concrete resolutions.

2026-05-23 — Driver catalog spec

  1. New § Driver catalog added (under § API shape resolution). Documents DriverInfo, DriverAuthMode, listDrivers(), and the full v0 driver table with labels, auth modes, base URLs, local flags, and test models.
  2. UI integration contract documented. Specifies how the daemon relays known_drivers: DriverInfo[] and how the UI consumes it (driver select rendering, base URL pre-fill, conditional credentials, contextual badges). No driver metadata hardcoded in the UI.

2026-05-23 — Model discovery functions

  1. New § Model discovery added (under § Driver catalog). Documents listModels() and humanizeModelId() — the two functions the daemon's model catalog endpoints consume.
  2. listModels() fetches live model lists from provider APIs. Covers all four API shapes with per-shape extraction logic (OpenAI /v1/models, Anthropic paginated /v1/models, Google paginated /v1beta/models with generateContent filter). Returns { ok, models, error? } — never throws.
  3. humanizeModelId() generates display names from model IDs (hyphen/underscore → space, title-case). Used as fallback when operators haven't set an explicit name in their model catalog.
  4. "Not normative for" list updated. Replaced the stale "deferred" model catalogs note with accurate scope: this package provides discovery functions; persistence is @kaged/local-config's responsibility.
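
The humanizeModelId rule in item 3 (hyphen/underscore to space, then title-case) can be illustrated with the sketch below. It is a hypothetical re-implementation; the shipped function may treat digits, dots, or vendor prefixes differently.

```typescript
// Illustrative only: split on runs of "-" or "_", capitalize each part, join with spaces.
function humanizeSketch(modelId: string): string {
  return modelId
    .split(/[-_]+/)
    .filter((part) => part.length > 0)
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join(" ");
}
```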

2026-05-24 — Model metadata catalog + pricing

Driven by streaming-first enrichment work (provider:model labels, post-message stats bar, cost tracking in UI):

  1. Usage.cost.reasoning field added. The cost object inside Usage gains a reasoning: number field to separately track the dollar cost of reasoning/thinking tokens. Previously, reasoning tokens were silently lumped into the output cost; now callers can display them distinctly.
  2. New § Model metadata catalog added (under § Model discovery). Defines:
    • ModelMeta — the extracted metadata type for a single model (context limits, pricing per-token, capability flags, deprecation date). Sourced from LiteLLM's community-maintained JSON; @kaged/llm ships a bundled snapshot updated at release time.
    • lookupModelMeta(provider, modelId) — maps kaged's "provider:model" convention to LiteLLM keys and returns ModelMeta | null. Missing metadata is never fatal.
    • calculateCost(usage, meta) — pure utility computing dollar cost from token counts and pricing metadata. Falls back to output pricing for reasoning tokens when pricing.reasoning is null. Returns all-zero when metadata is unavailable.
    • Operator overrides — local config can override pricing and capabilities per model, taking precedence over the bundled catalog.
  3. Package structure updated. Added model-meta.ts (types + functions), data/litellm-pricing.json (bundled snapshot), and model-meta.test.ts.
  4. Testing notes updated. Added model metadata test cases: key normalization, unknown model, capability extraction, pricing extraction, calculateCost with/without metadata, reasoning price fallback.
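
The calculateCost rules in item 2 (reasoning falls back to output pricing, missing metadata yields all zeros) can be sketched as follows. The type and function names are hypothetical, the sketch takes pricing directly instead of a full ModelMeta, and it assumes reasoning tokens are counted separately from output tokens.

```typescript
interface PricingSketch { input: number; output: number; reasoning: number | null } // USD per token
interface TokenCounts { input: number; output: number; reasoning: number }
interface CostBreakdown { input: number; output: number; reasoning: number; total: number }

function calculateCostSketch(tokens: TokenCounts, pricing: PricingSketch | null): CostBreakdown {
  // Missing metadata is never fatal: every cost is reported as 0.
  if (!pricing) return { input: 0, output: 0, reasoning: 0, total: 0 };
  const input = tokens.input * pricing.input;
  const output = tokens.output * pricing.output;
  // Reasoning tokens fall back to the output price when no reasoning price exists.
  const reasoning = tokens.reasoning * (pricing.reasoning ?? pricing.output);
  return { input, output, reasoning, total: input + output + reasoning };
}
```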

2026-05-25 — Antigravity provider adapter

Adds Antigravity (Google Cloud Code proxy) as the fifth API shape and provider adapter:

  1. New antigravity API shape added. ApiShape union extended. Antigravity gets its own shape rather than reusing google-generative-ai — URL structure (/v1internal:streamGenerateContent, model in body not path), auth mechanism (Bearer token / OAuth, not API key), request envelope ({ model, request: { ...innerBody } }), and rate-limit semantics all differ.
  2. streamAntigravity adapter implemented (providers/antigravity.ts). Full streaming adapter with: Antigravity envelope wrapping, Bearer token auth, Antigravity-specific headers (User-Agent, X-Goog-Api-Client, Client-Metadata), line-based SSE parsing (Antigravity-specific wire format; see 2026-06-03 amendment), thinking/text/toolCall streaming, per-frame usage extraction, abort handling, and rate-limit-aware 429 error handling.
  3. RateLimitInfo type exported. Structured rate-limit info (retryAfterMs, reason, quotaResetTime, message) extracted from 429 responses. Parses Go-style compound durations (1h16m0.667s), retry-after-ms/retry-after headers, and structured error body details (RetryInfo, ErrorInfo, QuotaFailure). Surfaced in errorMessage for upstream rotation/backoff decisions.
  4. Thinking budget differs by model family. Claude models (detected by modelId.includes("claude")) use different budget ranges than Gemini models — Claude omits includeThoughts and uses higher budgets at each effort level.
  5. Thinking blocks stripped from outgoing requests. Assistant message history omits thinking blocks when building contents — Antigravity generates fresh thinking each turn (matching the reference plugin's approach). If stripping leaves a history turn with no valid parts, the adapter omits that turn instead of sending an empty contents entry.
  6. listModels returns informational error for antigravity. Antigravity does not expose a model listing endpoint; models must be configured in the project DSL.
  7. Driver catalog updated. New antigravity entry: label "Antigravity", base URL https://cloudcode-pa.googleapis.com, auth mode oauth, test model gemini-2.5-flash.
  8. Dispatch wired. dispatch.ts routes antigravity shape to streamAntigravity.
  9. Package structure updated. Added providers/antigravity.ts and __tests__/providers/antigravity.test.ts.
  10. 31 new tests. Text streaming, per-frame usage tracking, tool calls, thinking, rate-limit extraction (structured body, message fallback, header), safety filter, request format (URL, Bearer auth, envelope, headers, thinking budgets per model family, thinking block stripping).
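
Item 3 mentions Go-style compound durations such as 1h16m0.667s. A hedged sketch of parsing them into milliseconds follows; the function name is hypothetical, and only the h, m, s, and ms units are handled (Go's microsecond and nanosecond units are omitted).

```typescript
const UNIT_MS: Record<string, number> = { h: 3600000, m: 60000, s: 1000, ms: 1 };

// Returns milliseconds, or null when the string is not a duration this sketch understands.
function parseGoDurationMs(text: string): number | null {
  const token = /(\d+(?:\.\d+)?)(ms|h|m|s)/g; // "ms" is listed before "m" so it wins
  let total = 0;
  let consumed = 0;
  let match = token.exec(text);
  while (match !== null) {
    if (match.index !== consumed) return null; // gap or unexpected characters
    total += parseFloat(match[1]) * UNIT_MS[match[2]];
    consumed = token.lastIndex;
    match = token.exec(text);
  }
  // The whole string must be consumed, and at least one component must be present.
  return consumed > 0 && consumed === text.length ? Math.round(total) : null;
}
```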

2026-06-03 — Antigravity SSE and history sanitation

  1. Antigravity uses a line-based SSE parser, not the shared parseSseStream. Antigravity's wire format sends data: {json}\n lines where each line is a complete event — it does not use \n\n double-newline framing like standard SSE. The adapter uses a local parseAntigravityStream generator that splits on \n, processes each data:-prefixed line individually, and buffers partial lines across network chunks. This matches the battle-tested OpenCode reference plugin (createStreamingTransformer in reference/opencode-antigravity-auth/). The shared parseSseStream (which waits for \n\n boundaries) must NOT be used for Antigravity — doing so causes silent output loss.
  2. Antigravity strips empty sanitized history turns. During contents construction, empty text parts are omitted. If a user or assistant history message has no remaining valid parts after provider-specific sanitation (for example, an assistant turn containing only stripped thinking), the adapter omits the entire history turn rather than sending { parts: [] }.
  3. Testing notes updated. Antigravity provider tests cover split/coalesced SSE frames carrying thinking/text output and empty sanitized history turns.
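
The line-based framing in item 1 can be sketched as below. This is an illustration of the technique, not the adapter's parseAntigravityStream: the name createLineSplitter is hypothetical, and the sketch works on decoded text chunks and returns raw JSON payload strings.

```typescript
function createLineSplitter() {
  let partial = "";
  return {
    // Feed one decoded text chunk; returns the payload of every "data:" line it completed.
    push(chunk: string): string[] {
      partial += chunk;
      const lines = partial.split("\n");
      partial = lines.pop() ?? ""; // the last piece has no newline yet, so keep buffering it
      const payloads: string[] = [];
      for (const raw of lines) {
        const line = raw.replace(/\r$/, "");
        if (!line.startsWith("data:")) continue; // blank lines and other fields are skipped
        const payload = line.slice(5).trim();
        if (payload.length > 0) payloads.push(payload);
      }
      return payloads;
    },
  };
}
```

Because each newline-terminated data: line is emitted immediately, nothing waits for a \n\n boundary that this wire format never sends.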

2026-05-30 — ADR-0028: OAuth provider module

Per ADR-0028:

  1. OAuth provider module added (src/oauth/). Generic framework for any 3rd-party OAuth-backed LLM provider. 12 modules: types.ts (ProviderOAuthConfig, ProviderTokens), pkce.ts (PKCE via crypto.subtle), token-store.ts (per-provider JSON at $XDG_CONFIG_HOME/kaged/oauth/<provider>-tokens.json), authorize.ts (auth URL construction from config), callback-server.ts (temporary Bun.serve()), token-exchange.ts (code exchange with post-login hook support), refresh.ts (proactive + reactive refresh, resolveOAuthCredentials), login.ts, logout.ts, status.ts, index.ts (barrel).
  2. Driver catalog extended. ProviderOAuthConfig declarations in PROVIDER_OAUTH_CONFIGS alongside existing DRIVER_AUTH_MODES. DriverInfo gains optional oauth field. resolveOAuthConfig(driverName) export.
  3. Antigravity config registered. Full Google OAuth config with post-login hook for project ID resolution (fetchProjectId + onboardUser logic migrated from daemon).
  4. Package export added. @kaged/llm/oauth entry point for daemon consumption.
  5. Constrained-by list extended with ADR-0028.
  6. Package structure updated. Added src/oauth/ (12 files) and __tests__/oauth/ (3 test files).
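
As an illustration of what "auth URL construction from config" in item 1 might look like, the sketch below builds an authorization-code request with a PKCE challenge. The query parameter names are the standard OAuth 2.0 and PKCE ones; everything else is assumed for illustration: OAuthConfigSketch and its fields are not the real ProviderOAuthConfig shape, buildAuthorizeUrl is a hypothetical name, and the endpoint, client ID, and challenge are placeholders.

```typescript
// Hypothetical subset of a provider OAuth config; field names are assumptions.
interface OAuthConfigSketch {
  authorizeUrl: string;
  clientId: string;
  scopes: string[];
}

// Builds the browser-facing authorization URL for the authorization-code flow.
// `codeChallenge` is the S256 challenge derived from the PKCE verifier.
function buildAuthorizeUrl(
  config: OAuthConfigSketch,
  redirectUri: string,
  state: string,
  codeChallenge: string,
): string {
  const url = new URL(config.authorizeUrl);
  url.searchParams.set("response_type", "code");
  url.searchParams.set("client_id", config.clientId);
  url.searchParams.set("redirect_uri", redirectUri);
  url.searchParams.set("scope", config.scopes.join(" "));
  url.searchParams.set("state", state);
  url.searchParams.set("code_challenge", codeChallenge);
  url.searchParams.set("code_challenge_method", "S256");
  return url.toString();
}

// Placeholder values only; the redirect targets a temporary local callback server.
const exampleUrl = buildAuthorizeUrl(
  { authorizeUrl: "https://auth.example.com/authorize", clientId: "example-client", scopes: ["openid", "email"] },
  "http://127.0.0.1:8123/callback",
  "random-state",
  "example-challenge",
);
```

Keeping the builder a pure function of the config is what would let one generic module serve any provider declared in PROVIDER_OAUTH_CONFIGS.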

2026-05-30 — ADR-0026: Model metadata overrides + cost management + usage pipeline

Per ADR-0026:

  1. § Operator overrides rewritten (under § Model metadata catalog). Replaced the local.toml override description with the full DB-backed override system: ModelOverride storage shape, sparse key-value schema, overridable fields table with dot-notation for nested fields, resolveModelMeta function with ResolvedModelMeta return type (merged result + per-field source tracking), examples for self-hosted models and stale-pricing corrections.
  2. resolveModelMeta replaces lookupModelMeta for override-aware callers. The harness, compaction, and token estimator call resolveModelMeta instead of lookupModelMeta. The merge path is: LiteLLM default → apply overrides → return. Models not in LiteLLM get a synthetic ModelMeta built from overrides only.
  3. Context window overrides feed compaction. Per the ADR-0024 amendment, maxInputTokens is overridable. The compaction system uses the effective (merged) context window for threshold calculation.
  4. Constrained-by list extended with ADR-0026.
  5. Provider usage pipeline documented (§ Provider usage reporting extended). On-demand fetch with DB cache (provider_usage_cache table), cache invalidated after every LLM call to that provider, manual refresh endpoint for out-of-band usage.
  6. Cost accumulation per provider. New provider_spend_events table records cost per LLM call. Daemon sums events per rolling window (5h, 7d) and compares against configured limits before each call.
  7. Spend limit enforcement. provider_spend_limits table stores per-provider limits (absolute USD per window, percentage of rolling window for quota-based providers). Enforcement is a hard block before LLM dispatch — not a soft warning.
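
The merge path in item 2 (LiteLLM default, then overrides, then return) with dot-notation keys and per-field source tracking can be sketched as follows. This is a simplified model under stated assumptions: resolveModelMetaSketch, the plain-object shapes, and the cost.input / cost.output field names are hypothetical. Only maxInputTokens and the dot-notation idea come from the text above.

```typescript
type FieldSource = "litellm" | "override";
type PlainObject = Record<string, unknown>;

// Lists every leaf field as a dot-notation path, e.g. "cost.input".
function collectLeafPaths(obj: PlainObject, prefix = ""): string[] {
  const paths: string[] = [];
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === "object") {
      paths.push(...collectLeafPaths(value as PlainObject, path));
    } else {
      paths.push(path);
    }
  }
  return paths;
}

// Writes `value` at a dot-notation path, creating intermediate objects as needed.
function setPath(target: PlainObject, path: string, value: unknown): void {
  const keys = path.split(".");
  const last = keys.pop() as string;
  let node = target;
  for (const key of keys) {
    const next = node[key];
    if (next === null || typeof next !== "object") node[key] = {};
    node = node[key] as PlainObject;
  }
  node[last] = value;
}

// Hypothetical stand-in for resolveModelMeta: default -> apply overrides -> return.
// With no default, the result is a synthetic record built from overrides only.
function resolveModelMetaSketch(
  base: PlainObject | undefined,
  overrides: Record<string, unknown>,
): { meta: PlainObject; sources: Record<string, FieldSource> } {
  const meta = JSON.parse(JSON.stringify(base ?? {})) as PlainObject; // copy, never mutate the default
  const sources: Record<string, FieldSource> = {};
  for (const path of collectLeafPaths(meta)) sources[path] = "litellm";
  for (const [path, value] of Object.entries(overrides)) {
    setPath(meta, path, value);
    sources[path] = "override";
  }
  return { meta, sources };
}

// Stale-pricing correction plus a context window override (illustrative numbers).
const resolved = resolveModelMetaSketch(
  { maxInputTokens: 128000, cost: { input: 3, output: 15 } },
  { maxInputTokens: 200000, "cost.input": 2.5 },
);
// Self-hosted model with no default entry: metadata comes from overrides alone.
const synthetic = resolveModelMetaSketch(undefined, { maxInputTokens: 32768 });
```

In this model, compaction would read resolved.meta.maxInputTokens (the effective, merged value) rather than the default, which is the behaviour item 3 describes.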

2026-05-29 — Z.AI driver + provider usage reporting

  1. New zai driver added. Maps to anthropic-messages API shape with base URL https://api.z.ai/api/anthropic. Label "Z.AI (GLM Coding Plan)". Test model glm-5.1. Auth mode api_key. The existing streamAnthropic adapter handles non-Anthropic base URLs by switching from X-Api-Key to Authorization: Bearer — no new provider adapter needed.
  2. New § Provider usage reporting added (under § Driver catalog). Defines the normalized types (UsageReport, UsageLimit, UsageWindow, UsageAmount, UsageScope, UsageFetchOptions) and two provider-specific fetchers:
    • fetchZaiUsage — queries GET /api/monitor/usage/quota/limit on https://api.z.ai. Parses TOKENS_LIMIT and TIME_LIMIT entries into UsageLimit entries with usedFraction/remainingFraction, status classification (ok/warning/exhausted), and resetsAt from the API's nextResetTime.
    • fetchAntigravityUsage — queries POST /v1internal:fetchAvailableModels on Antigravity's endpoint. Parses per-model, per-tier quota info (fraction-based) into flat UsageLimit entries. Requires OAuth accessToken + projectId.
  3. UsageFetchOptions supports both auth patterns. apiKey for API-key providers (zai), accessToken + projectId for OAuth providers (antigravity). Each fetcher validates its own requirements at the top and returns null if missing.
  4. API shape resolution table updated. New zai row.
  5. Driver catalog table updated. New zai entry with full metadata.
  6. Package structure updated. Added usage-types.ts, usage/zai.ts, usage/antigravity.ts.
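
The header switch in item 1 is what lets zai reuse the existing adapter. A minimal sketch of the idea follows; it assumes the decision is made by comparing the base URL's hostname against api.anthropic.com, which the text above does not specify, and anthropicShapeAuthHeaders is a hypothetical name rather than a function in streamAnthropic.

```typescript
// Assumption: "non-Anthropic" is detected by hostname; the real check may differ.
const ANTHROPIC_HOST = "api.anthropic.com";

// Picks the credential header for a request using the anthropic-messages shape.
function anthropicShapeAuthHeaders(baseUrl: string, apiKey: string): Record<string, string> {
  const headers: Record<string, string> = {};
  if (new URL(baseUrl).hostname === ANTHROPIC_HOST) {
    headers["X-Api-Key"] = apiKey;
  } else {
    // Compatible third-party endpoints such as Z.AI take a Bearer token instead.
    headers["Authorization"] = `Bearer ${apiKey}`;
  }
  return headers;
}

// Placeholder key; the base URL is the zai driver's documented endpoint.
const zaiHeaders = anthropicShapeAuthHeaders("https://api.z.ai/api/anthropic", "example-key");
const anthropicHeaders = anthropicShapeAuthHeaders("https://api.anthropic.com", "example-key");
```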

References