ADR-0029: Structured operational logging
- Status: Proposed
- Date: 2026-05-31
- Deciders: @karasu
- Supersedes: —
- Superseded by: —
Context
kaged has three logging-shaped things that don't talk to each other:
- @kaged/utils/logger.ts: a JSON-structured, daily-rotating file logger. Supports levels (debug, info, warn, error), configurable retention, and console mirroring. It exists, but the daemon does not use it: every daemon module writes raw console.error() to stderr instead (42 calls across main.ts and task-recovery.ts).
- Audit log (/api/v1/audit): an append-only event trail stored in SQLite (audit_events table). Covers auth events, plugin installs, compaction lifecycle, and prompt edits. Queried by the UI's audit screen. This is not operational logging: it answers "what happened?" at the policy level, not "what went wrong?" at the runtime level.
- UI log drawer: a slide-up panel with filter chips (daemon, session, subagent, audit) and a LogEntry type. Currently empty at runtime: the drawer renders const entries: LogEntry[] = [] because there is no API endpoint feeding it. The component structure is ready; the data pipeline is not.
The pain
When compaction fails with {"error":{"code":"internal","message":"Failed to run compaction."}}, there is no way for the operator to see why without SSH-ing into the host and reading tmux output. The error was swallowed because the daemon has no structured logging — only scattered console.error calls that vanish into the terminal scrollback.
ADR-0013 says "kaged ships sane defaults if Langfuse is not configured: structured logs to stdout (and to the kaged log viewer in the UI), enough to debug most runs without external infra." This commitment exists on paper but not in code. This ADR delivers it.
What this ADR is not
- Not a replacement for Langfuse tracing (ADR-0013). Operational logs answer "what is the daemon doing / what went wrong?" Langfuse answers "what did the model do, with what tokens, at what cost?" Different data, different consumers, different retention.
- Not about audit logging. Audit events are policy-level (who did what). Operational logs are runtime-level (what happened inside the daemon). Both coexist.
- Not about UI emitting logs. The UI is a consumer of logs via the log drawer. UI-side log emission is deferred.
Decision
The daemon emits structured operational logs to two sinks: SQLite (queryable by the UI via HTTP API) and rotating flat files (survivable across DB resets, grep-friendly). The existing @kaged/utils/logger.ts is adopted as the file sink. A new logs table in SQLite provides the queryable sink. Plugins log through a structured log method on their JSON-RPC channel; the daemon captures plugin stderr as unstructured daemon logs with plugin context. Retention and level are configurable in local.toml.
Sink 1: SQLite logs table
Operational logs go into a logs table alongside sessions, runs, and audit events. This gives the UI a paginated, filterable, full-text-searchable log source without extra infrastructure.
CREATE TABLE IF NOT EXISTS logs (
  id          TEXT PRIMARY KEY,  -- ULID
  ts          INTEGER NOT NULL,  -- epoch ms
  level       TEXT NOT NULL,     -- debug | info | warn | error
  source      TEXT NOT NULL,     -- daemon | plugin | session | subagent
  message     TEXT NOT NULL,
  project_id  TEXT,              -- nullable: daemon-level logs have no project
  session_id  TEXT,              -- nullable: non-session logs have no session
  context     TEXT,              -- JSON blob for structured fields
  plugin_name TEXT               -- nullable: set when source = 'plugin'
);
Index on (level, ts) for filtered queries, and (project_id, ts) for project-scoped views.
Retention: prune rows older than the configured retention window on daemon boot (background task, not blocking startup). Default: 7 days, max 10 000 rows (whichever limit is hit first).
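The retention rule above can be sketched as a pure function. This is a minimal illustration, not the daemon's implementation: LogRowRef and selectRowsToPrune are hypothetical names, and the sketch reads "whichever limit is hit first" as "a row is pruned if either limit excludes it".

```typescript
// Hypothetical minimal row shape: only the columns the retention rule needs.
interface LogRowRef {
  id: string; // ULID
  ts: number; // epoch ms
}

const DAY_MS = 24 * 60 * 60 * 1000;

/**
 * Returns the ids of rows to delete: every row older than the retention
 * window, plus the oldest remaining rows beyond maxEntries.
 */
function selectRowsToPrune(
  rows: readonly LogRowRef[],
  nowMs: number,
  retentionDays: number,
  maxEntries: number,
): string[] {
  const cutoff = nowMs - retentionDays * DAY_MS;
  // Newest first, so once maxEntries rows are kept the rest are the oldest.
  const newestFirst = [...rows].sort((a, b) => b.ts - a.ts);
  const doomed: string[] = [];
  let kept = 0;
  for (const row of newestFirst) {
    if (row.ts < cutoff || kept >= maxEntries) {
      doomed.push(row.id);
    } else {
      kept += 1;
    }
  }
  return doomed;
}
```

In the real table the same effect would more likely come from DELETE statements driven by ts and a row count; the function only makes the intended semantics explicit.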
Sink 2: Rotating flat files (existing @kaged/utils/logger.ts)
Adopt the existing file logger as-is. The daemon configures it on boot from local.toml [logging] section. File logs are the survival copy — they outlive DB resets, can be grep'd, and can be shipped to external collectors (Loki, Datadog) by the operator.
Both sinks receive the same entries. File logs are the authoritative record; SQLite is the queryable index for the UI.
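A minimal sketch of the dual-sink fan-out, under stated assumptions: LogSink, OperationalLogEntry, MemorySink, and fanOut are hypothetical names, MemorySink stands in for both the SQLite and file sinks, and isolating sink failures from one another is an assumption this ADR does not spell out.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";
type LogSource = "daemon" | "plugin" | "session" | "subagent";

// Mirrors the columns of the logs table; optional fields are the nullable ones.
interface OperationalLogEntry {
  id: string;
  ts: number;
  level: LogLevel;
  source: LogSource;
  message: string;
  project_id?: string;
  session_id?: string;
  context?: Record<string, unknown>;
  plugin_name?: string;
}

interface LogSink {
  write(entry: OperationalLogEntry): void;
}

// Stand-in sink for the sketch; a real sink would insert a row or append a line.
class MemorySink implements LogSink {
  entries: OperationalLogEntry[] = [];
  write(entry: OperationalLogEntry): void {
    this.entries.push(entry);
  }
}

/**
 * Delivers the same entry to every sink and returns how many accepted it.
 * Assumption: one failing sink must not prevent delivery to the others.
 */
function fanOut(sinks: readonly LogSink[], entry: OperationalLogEntry): number {
  let delivered = 0;
  for (const sink of sinks) {
    try {
      sink.write(entry);
      delivered += 1;
    } catch {
      // Swallowed in this sketch; a real pipeline would need a last-resort channel.
    }
  }
  return delivered;
}
```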
Levels and defaults
| Level | Meaning | Production default | Development default |
|---|---|---|---|
| error | Something failed, operator action may be needed | Always on | Always on |
| warn | Something unexpected but recovered | Always on | Always on |
| info | Normal operational events (startup, shutdown, plugin loaded, session created) | Always on | Always on |
| debug | Detailed internals (hook firing, tool registration, context resolution) | Off | On |
Production default minimum level: warn (7 days, 10k entries).
Development default minimum level: debug (7 days, 50k entries) — bun test is not affected; this is the running daemon's log level.
The daemon checks NODE_ENV or a KAGED_ENV env var. If "development", debug is enabled. Otherwise, production defaults apply. The operator can override both via local.toml.
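The resolution order described above could look like the following sketch. It follows the stated default minimum levels (warn in production, debug in development). The function names are hypothetical, and letting KAGED_ENV take precedence over NODE_ENV is an assumption, since the text does not say which wins when both are set.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Numeric ranks so that levels can be compared against a minimum.
const LEVEL_ORDER: Record<LogLevel, number> = { debug: 10, info: 20, warn: 30, error: 40 };

function isLogLevel(value: unknown): value is LogLevel {
  return typeof value === "string" && Object.prototype.hasOwnProperty.call(LEVEL_ORDER, value);
}

/**
 * Picks the minimum level: an explicit local.toml value wins, otherwise the
 * environment decides. Assumption: KAGED_ENV is consulted before NODE_ENV.
 */
function resolveMinLevel(
  env: Record<string, string | undefined>,
  configured?: string,
): LogLevel {
  if (isLogLevel(configured)) return configured;
  const mode = env["KAGED_ENV"] ?? env["NODE_ENV"];
  return mode === "development" ? "debug" : "warn";
}

// True when an entry at `level` passes the configured minimum.
function shouldLog(level: LogLevel, min: LogLevel): boolean {
  return LEVEL_ORDER[level] >= LEVEL_ORDER[min];
}
```

Passing the environment in as a plain record, rather than reading process.env inside the function, keeps the rule testable without touching the real process.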
Sources (log categories)
The source field maps to the UI's existing LogFilterKind. Existing kinds are retained and extended:
| Source | What emits it | UI filter chip |
|---|---|---|
| daemon | Daemon core: startup, shutdown, gates, config loading, internal errors | daemon |
| plugin | Plugin lifecycle: load, hook firing, tool registration, errors. Includes plugin stderr captures. | daemon (or a future plugin chip) |
| session | Session lifecycle: create, state transitions, compaction, idle | session |
| subagent | Subagent invocations: spawn, cage setup, exit, errors | subagent |
| audit | Audit events (already served by /api/v1/audit) | audit |
The audit source is special: audit events continue to flow through the existing /api/v1/audit endpoint and the audit_events table. The log viewer can show audit events alongside operational logs, but they are stored in their own table with their own schema. Two options remain open: store source: "audit" entries in the logs table as lightweight references (not duplicates), or have the UI query both tables and merge by timestamp. The spec will settle this.
Plugin logging
Project plugins (subprocesses, ADR-0008) get a structured log method in their JSON-RPC protocol:
{
  "jsonrpc": "2.0",
  "method": "log",
  "params": {
    "level": "error",
    "message": "Failed to preserve messages during compaction",
    "context": { "compaction_id": "01JX...", "retained_count": 0 }
  }
}
The daemon writes these to both sinks with source: "plugin" and plugin_name set.
Plugin stderr is captured line-by-line and written as source: "daemon" logs with plugin_name set and context.capture: "stderr". This catches unstructured errors from plugin processes without the plugin needing to use the structured protocol.
System plugins (in-process, @kaged/plugin-types) already have PluginLogger in their context. That interface is wired to the same dual-sink pipeline.
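A sketch of how the daemon side might map the two plugin channels onto log records. All names here are hypothetical (PluginLogRecord, recordFromNotification, recordFromStderrLine). The level assigned to stderr captures is an assumption, because this ADR does not give one.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface PluginLogRecord {
  level: LogLevel;
  source: "plugin" | "daemon";
  message: string;
  plugin_name: string;
  context?: Record<string, unknown>;
}

const LEVELS: readonly string[] = ["debug", "info", "warn", "error"];

/**
 * Maps one line of plugin stdout to a record if it is a JSON-RPC 2.0 "log"
 * notification with valid params; returns null for anything else.
 */
function recordFromNotification(pluginName: string, raw: string): PluginLogRecord | null {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof msg !== "object" || msg === null) return null;
  const { jsonrpc, method, params } = msg as Record<string, unknown>;
  if (jsonrpc !== "2.0" || method !== "log") return null;
  if (typeof params !== "object" || params === null) return null;
  const { level, message, context } = params as Record<string, unknown>;
  if (typeof level !== "string" || !LEVELS.includes(level)) return null;
  if (typeof message !== "string") return null;
  const record: PluginLogRecord = {
    level: level as LogLevel,
    source: "plugin",
    message,
    plugin_name: pluginName,
  };
  if (typeof context === "object" && context !== null) {
    record.context = context as Record<string, unknown>;
  }
  return record;
}

// Wraps one captured stderr line. Assumption: captures are logged at "warn".
function recordFromStderrLine(pluginName: string, line: string): PluginLogRecord {
  return {
    level: "warn",
    source: "daemon",
    message: line,
    plugin_name: pluginName,
    context: { capture: "stderr" },
  };
}
```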
Configuration in local.toml
New [logging] section:
[logging]
level = "warn" # minimum level: debug | info | warn | error
retention_days = 7 # prune logs older than this
max_entries = 10000 # prune oldest rows when exceeded
dir = "/var/log/kaged" # override file log directory (optional)
console = true # mirror to stderr (default: false in production)
All fields optional. Defaults applied when absent.
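A hand-rolled sketch of applying defaults to a parsed [logging] table. It is not LocalConfigSchema, whose shape this ADR does not show. The names are hypothetical, the defaults are the production values stated above, and rejecting invalid values (rather than silently falling back) is an assumption.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface LoggingConfig {
  level: LogLevel;
  retention_days: number;
  max_entries: number;
  dir?: string; // left unset when absent; the file logger's own default applies
  console: boolean;
}

const LEVELS: readonly string[] = ["debug", "info", "warn", "error"];

// Production defaults as stated in this ADR.
const PRODUCTION_DEFAULTS: LoggingConfig = {
  level: "warn",
  retention_days: 7,
  max_entries: 10000,
  console: false,
};

function positiveInt(value: unknown, field: string, fallback: number): number {
  if (value === undefined) return fallback;
  if (typeof value !== "number" || !Number.isInteger(value) || value < 1) {
    throw new Error(`[logging] ${field} must be a positive integer`);
  }
  return value;
}

// Fills absent fields with defaults and rejects values of the wrong type.
function normalizeLoggingConfig(raw: Record<string, unknown> = {}): LoggingConfig {
  const config: LoggingConfig = { ...PRODUCTION_DEFAULTS };
  const level = raw["level"];
  if (level !== undefined) {
    if (typeof level !== "string" || !LEVELS.includes(level)) {
      throw new Error("[logging] level must be debug, info, warn, or error");
    }
    config.level = level as LogLevel;
  }
  config.retention_days = positiveInt(raw["retention_days"], "retention_days", config.retention_days);
  config.max_entries = positiveInt(raw["max_entries"], "max_entries", config.max_entries);
  const dir = raw["dir"];
  if (dir !== undefined) {
    if (typeof dir !== "string") throw new Error("[logging] dir must be a string");
    config.dir = dir;
  }
  const mirror = raw["console"];
  if (mirror !== undefined) {
    if (typeof mirror !== "boolean") throw new Error("[logging] console must be a boolean");
    config.console = mirror;
  }
  return config;
}
```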
HTTP API
New endpoints for the log drawer:
- GET /api/v1/logs for global daemon logs (no project scope)
- GET /api/v1/projects/:id/logs for project-scoped logs
- GET /api/v1/sessions/:id/logs for session-scoped logs
Query parameters: level, source, since (epoch ms), until (epoch ms), q (string search on message), limit (default 100, max 500), cursor (ULID-based pagination).
Response: { "entries": LogEntry[], "cursor": string | null } — most recent first, cursor-based pagination for the UI's "load more on scroll" pattern.
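The query and pagination contract can be illustrated over an in-memory array. This is a sketch under assumptions, not the endpoint implementation: the names are hypothetical, level and source are treated as exact-match filters, since and until are treated as inclusive, and q is a case-insensitive substring match. It relies on ULIDs sorting lexicographically in time order, so "older than the cursor" is a plain string comparison on id.

```typescript
interface LogEntryLike {
  id: string; // ULID
  ts: number; // epoch ms
  level: string;
  source: string;
  message: string;
}

interface LogQuery {
  level?: string;
  source?: string;
  since?: number;
  until?: number;
  q?: string;
  limit?: number;
  cursor?: string;
}

const DEFAULT_LIMIT = 100;
const MAX_LIMIT = 500;

// Applies the stated default (100) and ceiling (500).
function clampLimit(limit?: number): number {
  if (limit === undefined || !Number.isFinite(limit) || limit < 1) return DEFAULT_LIMIT;
  return Math.min(Math.floor(limit), MAX_LIMIT);
}

// Returns one page, most recent first, plus the cursor for the next page.
function pageLogs(
  all: readonly LogEntryLike[],
  query: LogQuery,
): { entries: LogEntryLike[]; cursor: string | null } {
  const limit = clampLimit(query.limit);
  const needle = query.q?.toLowerCase();
  const matches = all
    .filter((e) => query.level === undefined || e.level === query.level)
    .filter((e) => query.source === undefined || e.source === query.source)
    .filter((e) => query.since === undefined || e.ts >= query.since)
    .filter((e) => query.until === undefined || e.ts <= query.until)
    .filter((e) => needle === undefined || e.message.toLowerCase().includes(needle))
    .filter((e) => query.cursor === undefined || e.id < query.cursor)
    .sort((a, b) => (a.id < b.id ? 1 : a.id > b.id ? -1 : 0));
  const entries = matches.slice(0, limit);
  const last = entries[entries.length - 1];
  // A cursor is returned only when older matches remain beyond this page.
  const cursor = matches.length > limit && last !== undefined ? last.id : null;
  return { entries, cursor };
}
```

Returning the id of the last entry on the page as the cursor means the next request needs no offset: it asks for entries strictly older than that id.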
UI log drawer behavior
- Opening the drawer requests the last N entries (via limit) for the current scope (project or session). Most recent at the top, so there is no initial scroll.
- Scrolling to the bottom triggers a load-more request using the cursor from the previous response.
- Adding/changing a filter re-requests with the filter applied, maintaining the N-entry window.
- String search is server-side (q parameter): the UI sends the query, and the daemon does LIKE or FTS on message.
- Real-time updates: initially request/response. A future iteration can add a WebSocket subscription for live log tailing. Not in scope for this ADR.
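On the UI side, the drawer behavior above mostly reduces to building the right request URL for the current scope, filters, and cursor. A minimal sketch follows, with hypothetical names (LogScope, DrawerFilters, buildLogsUrl). It assumes the drawer can also show the global scope.

```typescript
type LogScope =
  | { kind: "global" }
  | { kind: "project"; id: string }
  | { kind: "session"; id: string };

interface DrawerFilters {
  level?: string;
  source?: string;
  q?: string;
  limit?: number;
}

// Maps the drawer's scope to one of the three endpoints.
function logsPath(scope: LogScope): string {
  switch (scope.kind) {
    case "global":
      return "/api/v1/logs";
    case "project":
      return `/api/v1/projects/${encodeURIComponent(scope.id)}/logs`;
    case "session":
      return `/api/v1/sessions/${encodeURIComponent(scope.id)}/logs`;
  }
}

/**
 * Builds the request URL. Pass the cursor from the previous response to load
 * more; omit it after a filter change so the window restarts at the newest entry.
 */
function buildLogsUrl(scope: LogScope, filters: DrawerFilters, cursor?: string | null): string {
  const pairs: string[] = [];
  const add = (key: string, value: string | number | undefined | null): void => {
    if (value !== undefined && value !== null && value !== "") {
      pairs.push(`${key}=${encodeURIComponent(String(value))}`);
    }
  };
  add("level", filters.level);
  add("source", filters.source);
  add("q", filters.q);
  add("limit", filters.limit);
  add("cursor", cursor);
  const path = logsPath(scope);
  return pairs.length > 0 ? `${path}?${pairs.join("&")}` : path;
}
```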
Consequences
What this commits us to
- Migrating all 42 console.error calls in the daemon to structured logger calls. Mechanical but noisy.
- A logs table in the storage schema (bumps SCHEMA_VERSION).
- A [logging] section in local.toml schema (extends LocalConfigSchema).
- Three new HTTP endpoints in the daemon.
- The UI log drawer stops being a placeholder and starts fetching real data.
- Plugin JSON-RPC protocol gains a log notification method.
- The existing @kaged/utils/logger.ts gets adopted into the daemon's startup path.
What this forecloses
- No third-party log sinks in this ADR. The operator can point external tools at the flat files or the SQLite DB. A future plugin could add Loki/Elasticsearch forwarding, but that's not spec'd here.
- No log forwarding to Langfuse. Langfuse is for LLM traces (ADR-0013). Operational logs are a separate concern.
- No UI log emission. The UI is a consumer only. Frontend errors go to the browser console, not to the daemon's log pipeline.
What becomes easier
- Debugging runtime errors without SSH/tmux. The operator opens the log drawer, filters to error, and sees the actual failure message with context.
- Plugin debugging. Plugin authors can emit structured logs that appear in the same drawer as daemon logs, filterable by source.
- Audit trail for operational events. "Did the plugin fire?" is answerable from the log, not just from the compaction result.
- External integration. Flat files are grep-friendly, shippable to any log collector the operator already runs.
What becomes harder
- Storage growth. The logs table needs retention enforcement. The daemon prunes on boot and periodically (configurable interval). The operator must be aware that increasing max_entries or retention_days increases DB size.
- Migration. All console.error calls need converting. It's mechanical but it touches many files.
Alternatives considered
Alternative A — File logs only, no SQLite
Why tempting: Simpler. No schema change, no new endpoint. Operator greps files.
Why rejected: The UI log drawer needs paginated, filterable, searchable access to logs. Implementing that over flat files means re-implementing a query engine. SQLite already does this. The whole point is to make the UI drawer work.
Alternative B — SQLite only, no flat files
Why tempting: Single source of truth. No dual-write complexity.
Why rejected: Flat files survive DB corruption, are grep-friendly, and are the standard interface for external log collectors. A DROP TABLE logs or a corrupted SQLite file should not be enough to wipe out operational history. Dual-sink is worth the minor write overhead.
Alternative C — Use the existing StructuredLogEntry from @kaged/harness
Why tempting: Type already exists in packages/harness/src/types.ts.
Why rejected: That type is harness-scoped (level: "info" | "warn" | "error", no debug, no source, no project_id). The operational log schema needs broader fields. Extend rather than conflate.
Alternative D — Winston / Pino / Bunyan
Why tempting: Battle-tested, feature-rich.
Why rejected: ADR-0004 mandates Bun-native runtime. @kaged/utils/logger.ts already implements the core features (levels, rotation, JSON structured output) using Bun built-ins. Adding an npm logging dependency contradicts the "Bun built-ins first" posture for something this fundamental.
References
- ADR-0004 — Runtime is Bun + TypeScript
- ADR-0005 — Storage default is SQLite
- ADR-0008 — Plugins are subprocesses over JSON-RPC on stdio
- ADR-0013 — Observability substrate is Langfuse (operational logging is the fallback)
- ADR-0023 — Project-plugin lifecycle hooks
- ADR-0024 — Context compaction (the feature that triggered this ADR — the compaction error was invisible without structured logging)