ADR-0013: Observability substrate is Langfuse, self-hosted, optional

Status: Accepted
Date: 2026-05-22
Deciders: @karasu
Supersedes: —
Superseded by: —

Context

kaged needs observability for two distinct concerns that often get conflated:

Runtime tracing — what did the model do? Which tools were called? With what arguments? What did they return? How many tokens? What was the latency? This is debugging data; operators need it when an agent run goes sideways.
Prompt management — versioning, editing, A/B testing, and lifecycle management of the prompts that shape agent behavior.

Mainstream LLM observability platforms (Langfuse, LangSmith, Helicone, Arize, Weights & Biases) generally bundle both, plus eval/dataset features. Langfuse is the strongest self-hostable option: MIT-licensed core, Docker Compose deployment, ~21k GitHub stars as of early 2026, a framework-agnostic data model, and a native Mastra exporter (@mastra/langfuse), so kaged does not have to hand-roll span creation.

The question is load-bearing because:

Tracing visibility maps directly to the manifesto principle "the operator can see what the model is doing."
Prompt management touches the manifesto principle "every system prompt is yours."
Adding a required infrastructure component (Langfuse stack: Postgres + ClickHouse + Redis + S3) is a heavy ask for operators who just want to spin up kaged on a Pi.

The decision needs to land observability properly without either (a) reinventing it from scratch or (b) forcing operators to deploy multi-container infra they don't want.

Decision

Langfuse is the recommended observability platform for kaged. Langfuse is OPTIONAL — kaged runs without it. When configured, kaged exports traces to a self-hosted Langfuse via Mastra's native @mastra/observability + @mastra/langfuse pipeline, which produces proper hierarchical traces (agent → generation → tool → generation) automatically. kaged does NOT maintain a manual Langfuse SDK integration — the Mastra observability layer handles span creation, nesting, and metadata enrichment. Prompt management does NOT use Langfuse; prompts live in the project config (file-watcher driven hot-reload), and operators version control them via git as they see fit.

Two halves to this decision, kept distinct on purpose:

Half 1 — Tracing: Langfuse, self-hosted, opt-in

Operators who want LLM tracing run Langfuse themselves (Docker Compose, on the same homelab as kaged, or a different host — operator's call).
kaged uses Mastra's native observability pipeline (@mastra/observability with @mastra/langfuse exporter). The daemon initializes a Mastra instance with an Observability config pointing to the operator's Langfuse instance. Before each agent run, agent.__registerMastra(mastra) injects the observability context into the agent. Mastra then automatically creates correctly nested spans: invoke_agent (AGENT) → chat model (GENERATION) → model_step → tool execution spans → additional generation spans for multi-step runs. No manual span/generation/trace creation in kaged code.
When Langfuse is not configured, kaged runs fine without tracing; it just won't have rich post-hoc debugging.
kaged ships sane defaults if Langfuse is not configured: structured logs to stdout (and to the kaged log viewer in the UI), enough to debug most runs without external infra.
Self-hosting is the recommendation, not a requirement. Operators who prefer Langfuse Cloud or any other observability platform compatible with Mastra's exporters can choose that. kaged does not forbid it; it just doesn't ship integrations for alternatives.
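
The opt-in behavior in Half 1 can be sketched as a small decision function. This is a minimal illustration, not kaged's actual config loader: the environment variable names, the LangfuseSettings shape, and the choice to treat partial credentials as "not configured" are all assumptions. Handing the resolved settings to Mastra's exporter is deliberately left out.

```typescript
// Hypothetical shape of the Langfuse section in kaged config (field names are assumptions).
interface LangfuseSettings {
  publicKey: string;
  secretKey: string;
  baseUrl: string;
}

type TracingMode =
  | { kind: "langfuse"; settings: LangfuseSettings }
  | { kind: "logs-only" };

// Decide the tracing mode from environment-style input.
// Assumption: partial credentials count as "not configured", so the daemon
// never starts with a half-working exporter.
function resolveTracingMode(env: Record<string, string | undefined>): TracingMode {
  const publicKey = env["LANGFUSE_PUBLIC_KEY"]?.trim();
  const secretKey = env["LANGFUSE_SECRET_KEY"]?.trim();
  const baseUrl = env["LANGFUSE_BASE_URL"]?.trim();

  if (!publicKey || !secretKey || !baseUrl) {
    return { kind: "logs-only" };
  }
  return { kind: "langfuse", settings: { publicKey, secretKey, baseUrl } };
}
```

With no credentials set, the function returns the logs-only mode, which matches the default described above: structured logs and no exporter registered.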

Half 2 — Prompt management: project config, file-watched

All prompts that shape agent behavior — primary agent system prompt, subagent system prompts, tool descriptions kaged overrides, kaged-internal prompts — live in the project config (or files referenced from it).
kaged's daemon watches the project config and any referenced prompt files. On change, the affected primary/subagent reloads its system prompt at the next message boundary. No restart required.
The operator versions their project config and prompts using whatever tool they like — git is the obvious choice but kaged does not enforce it.
kaged does not stand up a prompt-management database, a UI editor with versioning, or any other prompt-lifecycle infrastructure. That's heavy and duplicates what every operator already has (a file system + git).
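
The reload rule above (a changed prompt takes effect at the next message boundary, with no restart) can be sketched as a slot that stages updates and applies them only when the next message starts. PromptSlot and its method names are hypothetical; this is one possible shape, not the daemon's actual implementation.

```typescript
// Holds the active system prompt for one agent plus at most one staged update.
class PromptSlot {
  private active: string;
  private pending: string | null = null;

  constructor(initial: string) {
    this.active = initial;
  }

  // Called by the (hypothetical) file watcher when the prompt file changes.
  // Only stages the new text; a generation already in flight keeps the
  // prompt string it was started with.
  stage(next: string): void {
    this.pending = next;
  }

  // Called at a message boundary, just before a new generation starts.
  // Applies the most recent staged prompt, if any, and returns the result.
  promptForNextMessage(): string {
    if (this.pending !== null) {
      this.active = this.pending;
      this.pending = null;
    }
    return this.active;
  }

  // The prompt that the current or most recent message was started with.
  current(): string {
    return this.active;
  }
}
```

If the file changes several times during one generation, only the latest staged version is applied, so intermediate saves never reach the model.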

Consequences

What this commits us to

A file-watcher subsystem in the daemon, watching the project config and referenced prompt files. Probably chokidar or equivalent; the basic watcher is simple to implement, though the edge cases listed under "What becomes harder" need care.
Documented hot-reload semantics: when a system prompt changes, when does the running agent pick it up? Answer (provisional): at the next message boundary, never mid-generation. Spec to be written.
A clean integration point in kaged config for Langfuse credentials. If unset, no tracing exporter is registered. When set, a Mastra instance with Observability is initialized and injected into each agent run via __registerMastra.
Documentation that walks operators through self-hosting Langfuse on Docker Compose, including the recommended way to run it alongside kaged. Not as an install step, but as a reference doc for those who want it.
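
One way the file-watcher subsystem could avoid redundant reloads is to gate on file content rather than on raw filesystem events, since editors often emit several events per save. The sketch below is an assumption about how such a gate might look: ReloadGate is a hypothetical name, and the actual watcher (chokidar or equivalent) would be the thing feeding it file contents.

```typescript
// Remembers the last accepted content per watched file so that duplicate
// events (rename-and-replace, repeated change notifications) reload only once.
class ReloadGate {
  private readonly lastContent = new Map<string, string>();

  // Returns true when `content` should trigger a reload for `filePath`.
  accept(filePath: string, content: string): boolean {
    // Assumption: an empty read is a partial write in progress. A deliberately
    // empty prompt file would be ignored under this rule.
    if (content.length === 0) return false;

    // Same bytes as last time: the event was noise, not an edit.
    if (this.lastContent.get(filePath) === content) return false;

    this.lastContent.set(filePath, content);
    return true;
  }
}
```

Storing the full text is fine for prompt-sized files; a hash would serve the same purpose for larger ones.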

What this forecloses

Operators cannot edit prompts through a hosted UI (Langfuse Cloud, Langfuse self-hosted, Mastra Studio, or any other). Prompts are files. The operator's editor edits them.
We won't ship A/B testing of prompts as a first-party feature. If the operator wants to A/B, they version the prompt file, run two project configs, and compare traces in Langfuse.
No first-party prompt-management database means no centralized prompt audit log inside kaged. The git log is the audit log.

What becomes easier

Operators can install kaged and use it with zero observability infrastructure. The barrier to first agent run is low.
Prompt edits are file edits. No API call, no UI step, no signup, no propagation lag. The operator's editor of choice (Vim, VS Code, Helix, whatever) is the prompt editor.
Self-hosting story is honest: "kaged is self-hosted. Langfuse is self-hosted. They are separate deployments." No hosted-service entanglement.
The manifesto principle "every system prompt is yours" is delivered literally: prompts are files on the operator's filesystem.

What becomes harder

Operators who want Langfuse have to deploy a multi-container stack themselves. We document it, but we don't automate it (yet). A future plugin/preset could ship a kaged-bundled Langfuse Compose file as a one-click install.
Prompts referenced from project config need clear path resolution semantics — absolute, relative to project, relative to kaged data dir. To be specified in specs/project-dsl.md.
File-watcher edge cases (atomic writes, editor-induced rename-and-replace, partial writes during save) need handling. These are solved problems, but handling them is real engineering effort, not zero.
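
The path-resolution question above could be settled along these lines. This is only one possible scheme, sketched to make the three cases concrete: the "@data/" prefix, the ResolveContext shape, and the function name are invented for illustration and are not taken from specs/project-dsl.md. It uses POSIX path rules so the results do not depend on the host platform.

```typescript
import * as path from "node:path";

interface ResolveContext {
  projectDir: string; // absolute directory containing the project config
  dataDir: string;    // absolute kaged data directory
}

// Hypothetical marker for "relative to the kaged data dir"; not part of any kaged spec.
const DATA_PREFIX = "@data/";

// Resolve a prompt reference from project config to an absolute path.
function resolvePromptPath(ref: string, ctx: ResolveContext): string {
  // Case 1: absolute paths are used as written (normalized only).
  if (path.posix.isAbsolute(ref)) {
    return path.posix.normalize(ref);
  }
  // Case 2: explicitly marked references resolve against the data dir.
  if (ref.startsWith(DATA_PREFIX)) {
    return path.posix.resolve(ctx.dataDir, ref.slice(DATA_PREFIX.length));
  }
  // Case 3: everything else is relative to the project directory.
  return path.posix.resolve(ctx.projectDir, ref);
}
```

Making project-relative the unmarked default keeps a project directory portable: moving the directory moves its prompts with it. A real spec would also need to decide whether references may escape their base directory via "..".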

Why not Langfuse for prompt management

This is the substantive part. Langfuse can manage prompts — versioning, A/B routing, prompt-as-API. It's a real feature. Reasons to not use it for kaged anyway:

It forces an install. To use the prompt-management feature, the operator must run Langfuse. Even if Langfuse is otherwise optional for tracing, gating prompt edits behind it makes Langfuse de-facto mandatory. That contradicts Half 1's "Langfuse is optional."
The git log is already the audit log. Operators of this product are running self-hosted infra; they have git. A separate audit history of prompt changes is duplicative.
File-watched config is the lowest-friction edit path. No round-trip to a UI, no API call, no propagation lag. The editor is the prompt manager.
Operators already version control their project config. Prompts are part of the project. They version together. Splitting prompts into a separate database fragments the unit of operator thinking.
The kaged manifesto reads "we don't force shit." Forcing a Postgres + ClickHouse + Redis + S3 stack to edit a system prompt is the platonic ideal of forcing shit.

This does not preclude an optional future plugin that syncs prompts to/from Langfuse for operators who want that workflow. It just means the default — and the spec-supported path — is files on disk.

Why Langfuse and not LangSmith / Helicone / Arize / Weave

LangSmith — proprietary core, LangChain-first. We're not using LangChain (see ADR-0012).
Helicone — proxy-based, requires routing LLM traffic through it. kaged makes provider calls directly via Mastra; proxy-based doesn't fit.
Arize / Weights & Biases — enterprise focus, less aligned with self-hosted homelab posture.
Weave (W&B) — SDK-based and capable, but tied to the W&B ecosystem.

Langfuse: MIT core, self-hostable, framework-agnostic, native Mastra integration via @mastra/langfuse exporter with automatic hierarchical tracing, strongest match for the kaged posture.

Consequences for ADR-0012

Tracing is emitted via Mastra's native observability pipeline (@mastra/observability + @mastra/langfuse exporter). kaged does not maintain a separate direct Langfuse SDK integration — Mastra handles span hierarchy, nesting, and metadata enrichment automatically when the agent is registered with __registerMastra().

This means kaged's tracing transport is now coupled to Mastra's observability API surface. If Mastra changes its observability internals between versions, kaged inherits that change. This is an acceptable tradeoff: Mastra's integrated tracing produces correct hierarchical spans with tool definitions, usage metadata, and multi-step visibility — features that would require substantial manual code to replicate with the direct SDK. Using the platform's built-in capability is preferred over maintaining a parallel manual integration (see "Reinforced principle" in Amendments).

Alternatives considered

Alternative A — Bundle Langfuse into the kaged install

Why tempting: Zero-friction observability. Operators get tracing automatically.

Why rejected: Heavy. Langfuse is Postgres + ClickHouse + Redis + S3 + two app containers. Forcing this onto every kaged install (including operators who just want to run on a single Pi) contradicts "runs on your Pi" and "we don't force shit." Better to make it opt-in.

Alternative B — Build kaged-native observability from scratch

Why tempting: No external dependency. Full control over the data model.

Why rejected: Reinventing the wheel for no manifesto-relevant gain. LLM tracing is a solved problem, and Langfuse solves it. kaged should build the things that don't exist (DSL routing, sandbox supervision, mobile-first UI), not duplicate what does.

Alternative C — Use Langfuse for both tracing and prompt management

Why tempting: Single platform for both observability concerns.

Why rejected: See "Why not Langfuse for prompt management" above. Forces operators into a heavy install for a feature that file-watched config delivers for free.

Alternative D — Sentry / Datadog / generic APM

Why tempting: Operators may already run one.

Why rejected: LLM-specific signals (token counts, prompt versions, model selection, tool call sequences, generation latency vs. tool latency) are not first-class in general-purpose APM. Operators who want both can run both — kaged's structured logs work fine with Loki, Datadog, etc. — but kaged's recommended observability is LLM-aware.

References

ADR-0011 — project portability (prompts in project config supports portability)
ADR-0012 — Mastra is the substrate; tracing remains kaged-owned at the harness boundary
Langfuse: https://langfuse.com
Langfuse GitHub: https://github.com/langfuse/langfuse
Langfuse self-hosting docs: https://langfuse.com/self-hosting
Mastra + Langfuse integration: https://langfuse.com/integrations/frameworks/mastra
Prior operational experience running Langfuse in a customer-support agent project

Amendments

2026-06-02 — Tracing migrated from direct Langfuse SDK to Mastra native observability

What changed: kaged removed its manual Langfuse SDK integration (langfuse npm package, manual trace/generation/span creation in runtime.ts) and replaced it with Mastra's native observability pipeline (@mastra/observability + @mastra/langfuse).

Why: The manual integration produced flat, partial traces — a single generation span per run, no tool definitions in metadata, no proper nesting between agent calls and tool execution, silent flush errors, and missing provider/model metadata. Mastra's integrated pipeline produces correct hierarchical traces automatically: invoke_agent → chat model → model_step → tool execution → subsequent generation spans for multi-step runs. Tool definitions, usage metadata, and provider information are all enriched by Mastra without manual code.
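
The difference between the old flat traces and the hierarchical ones can be pictured with a small tree model. The Span type below is an illustration only, not Mastra's or Langfuse's data model; the span names come from the description above, and the exact parent/child arrangement shown is one plausible reading of it.

```typescript
// Illustrative span node; `kind` is set only where the text above names one.
interface Span {
  name: string;
  kind?: "AGENT" | "GENERATION";
  children: Span[];
}

// Render a span tree as indented lines, one line per span, parents first.
function renderTree(span: Span, depth = 0, out: string[] = []): string[] {
  const label = span.kind ? `${span.name} (${span.kind})` : span.name;
  out.push("  ".repeat(depth) + label);
  for (const child of span.children) {
    renderTree(child, depth + 1, out);
  }
  return out;
}

// What the manual integration produced: a single generation span per run.
const flat: Span = { name: "generation", kind: "GENERATION", children: [] };

// Assumed nesting for a single-step run with one tool call.
const hierarchical: Span = {
  name: "invoke_agent",
  kind: "AGENT",
  children: [
    {
      name: "chat model",
      kind: "GENERATION",
      children: [
        {
          name: "model_step",
          children: [{ name: "tool execution", children: [] }],
        },
      ],
    },
  ],
};
```

Rendering the flat trace yields one line, while the hierarchical one yields four lines at increasing indentation, which is the structure an operator needs in order to see which generation triggered which tool call.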

Reinforced principle: When a platform provides a built-in, well-maintained integration for a capability (observability, auth, streaming), kaged uses it rather than maintaining a parallel manual implementation. This reduces maintenance burden, ensures correct behavior across Mastra version upgrades, and avoids the class of bugs where manual tracing drifts from the actual execution flow. The same principle applies to using @kaged/llm's LanguageModelV2 shim instead of @ai-sdk/<provider> packages directly (ADR-0014): use the integrated path, don't reinvent alongside it.

2026-06-25 — ADR-0014 example above is inverted by ADR-0049

The "Reinforced principle" sidebar above (2026-06-02) used ADR-0014 as its illustrative example: "use the integrated @kaged/llm shim instead of @ai-sdk/<provider> packages directly." ADR-0049 inverts that example. Post-ADR-0049, kaged does use @ai-sdk/<provider> packages directly, loaded dynamically from $KAGED_HOME/providers per the catalog, and the @kaged/llm shim is deleted along with its six hand-written adapters and their parsers.

The principle itself survives and is in fact reinforced by the inversion. ADR-0013's principle is "use the maintained integration, don't reinvent alongside it." For Langfuse (the ADR-0013 case), the maintained integration is Mastra's observability pipeline, and that hasn't changed. For LLM provider wire protocols (the ADR-0014 case), the maintained integration is the @ai-sdk/* package family, which is exactly what ADR-0049 adopts. The 2026-06-02 sidebar was right about the principle and wrong about which direction the LLM case would resolve; ADR-0049 corrects the direction, and the principle stands.

No code or decision change in this ADR — this amendment records the inversion for institutional memory.

Implementation (2026-06-02 amendment): initMastraObservability() in packages/harness/src/observability.ts creates a Mastra instance with Observability config + LangfuseExporter. agent.__registerMastra(mastra) is called before agent.stream() in runtime.ts. The langfuse direct dependency was removed from @kaged/harness. All 3812 tests pass.