
Monitoring

This page covers observability for Sapari in production.

Logging

We use structured logging via Logfire (Pydantic's observability platform). Logs include:

  • Request/response for API calls
  • Task execution for workers
  • Error tracebacks with context
from src.infrastructure.logging import get_logger

logger = get_logger()
logger.info("Processing clip", clip_uuid=str(clip.uuid), status=clip.status)

Metrics

Key metrics to monitor:

| Metric | What to Watch |
| --- | --- |
| API latency (p99) | Should be < 500ms for most endpoints |
| RabbitMQ queue depth | Growing queue = workers can't keep up (check per priority level) |
| Task duration | Transcription ~1-2min per minute of video |
| Error rate | Spike = something broken |
| Storage usage | R2 bucket sizes |

Alerts

Set up alerts for:

  • API error rate > 1%
  • Task failure rate > 5%
  • RabbitMQ queue depth > 100 for any broker (check via management API at :15672)
  • RabbitMQ consumer count drops to 0 for any queue
  • Database connection pool exhaustion
  • Redis memory > 80% (cache + sessions + result backend)
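The RabbitMQ queue-depth alert can be checked by polling the management API on port 15672. A minimal sketch, assuming the default management credentials and the 100-message threshold from the list above; real alerting belongs in your monitoring stack, and the function names here are illustrative:

```python
import base64
import json
import urllib.request


def queues_over_threshold(queues: list[dict], threshold: int = 100) -> list[tuple[str, int]]:
    """Return (queue_name, depth) pairs whose backlog exceeds the threshold."""
    return [
        (q["name"], q["messages"])
        for q in queues
        if q.get("messages", 0) > threshold
    ]


def fetch_queues(host: str = "localhost", user: str = "guest", password: str = "guest") -> list[dict]:
    """Fetch per-queue stats from the RabbitMQ management API (placeholder credentials)."""
    req = urllib.request.Request(f"http://{host}:15672/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Usage: for name, depth in queues_over_threshold(fetch_queues()):
#            alert(f"queue {name} depth {depth} > 100")
```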

Tracing

Logfire provides distributed tracing across API requests and background tasks. Each request/task gets a trace ID that links:

  • API handler execution
  • Database queries
  • External API calls (Whisper, LLM)
  • Task queue operations

Per-worker service names

Each process reports a distinct service.name so the Logfire UI can filter spans by worker role. The FastAPI web process is sapari-api; each TaskIQ worker gets its own name via startup_taskiq_worker(state, service_name):

| Process | service.name | Configured in |
| --- | --- | --- |
| FastAPI web | sapari-api | backend/.env (LOGFIRE_SERVICE_NAME) |
| Email worker | sapari-email | infrastructure/taskiq/app.py:80 |
| Analysis worker | sapari-analysis | infrastructure/taskiq/app.py:81 |
| Render worker | sapari-render | infrastructure/taskiq/app.py:82 |
| Download worker | sapari-download | infrastructure/taskiq/app.py:83 |
| Proxy worker | sapari-proxy | infrastructure/taskiq/app.py:84 |
| Asset-edit worker | sapari-asset-edit | infrastructure/taskiq/app.py:85 |
| Scheduler | sapari-scheduler | infrastructure/taskiq/scheduler.py |

Add service.name != 'sapari-api' to any Logfire query to scope to worker traffic. Filter to a specific worker via service.name = 'sapari-render' for per-queue latency investigations.
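The role-to-name mapping in the table can be expressed as a small lookup. This is an illustrative sketch, not the actual wiring in infrastructure/taskiq/app.py; the helper name and dict are assumptions:

```python
# Worker-role to service.name mapping, mirroring the table above.
WORKER_SERVICE_NAMES: dict[str, str] = {
    "email": "sapari-email",
    "analysis": "sapari-analysis",
    "render": "sapari-render",
    "download": "sapari-download",
    "proxy": "sapari-proxy",
    "asset-edit": "sapari-asset-edit",
    "scheduler": "sapari-scheduler",
}


def service_name_for(role: str) -> str:
    """Resolve the Logfire service.name for a worker role; anything else is the web process."""
    return WORKER_SERVICE_NAMES.get(role, "sapari-api")


# startup_taskiq_worker(state, service_name) would then hand the resolved name
# to Logfire configuration, roughly:
#   logfire.configure(service_name=service_name_for("render"))
```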

Span taxonomy (six categories)

Manual spans follow a six-category naming convention so Logfire filters, saved queries, and ad-hoc grouping stay legible. Every new manual span must fit one of these:

| Category | Pattern | Example | When to use |
| --- | --- | --- | --- |
| Pipeline parent | <worker>.pipeline | analysis.pipeline, render.pipeline | Wraps the full DAG invocation inside a task body. Carries project_uuid / export_uuid correlation keys. |
| Pipeline step | step.<step_id> (auto) | step.download_clips, step.apply_cuts | Emitted automatically by fastroai.LogfireTracer — don't hand-write. |
| Task | taskiq.<task_name> (auto) | taskiq.analyze_project, taskiq.cleanup_expired_exports | Emitted automatically by LogfireSpanMiddleware — don't hand-write. Cron tasks carry a task_type="cron" attribute (see Convention #15). |
| Service-layer | <module>.<method> | credits.reserve, entitlement.grant, payment.handle_webhook | Wrap business operations that span multiple CRUD calls, so per-operation latency is queryable distinct from any single SQL query. |
| External call | ext.<service>.<op> | ext.whisper.transcribe, ext.ffmpeg.render, ext.ytdlp.download, ext.stripe.modify_subscription | Wrap every outbound call to a non-owned service so its p99 can be isolated from surrounding app code. |
| Cron | taskiq.<task_name> with task_type="cron" attr | — | Saved query: filter task_type = 'cron' to isolate scheduled runs from on-demand queue traffic. |
| SSE | sse.<event> | sse.publish, sse.clip_ready | Wrap publish paths so event-emission latency is distinct from the originating task. |

Attribute conventions. Every ext.* span should carry a domain correlation key at span entry (clip_file_uuid, asset_file_uuid, user_asset_uuid, subscription_id, etc.) so a slow call can be traced back to the originating resource without reading surrounding code. <module>.<method> spans carry the operation's primary identifier (user_id, event_type, project_uuid) plus a post-attr via span.set_attribute() that captures the outcome (reserved=true/false, entitlements_drained=N, payments_materialized=N).

Why manual spans over pure auto-instrumentation. Logfire auto-instruments SQLAlchemy, Redis, Pydantic AI, and FastAPI. Those are always on and produce the "zoom in" view. Manual spans produce the "zoom out" view — the thing a dashboard wants ("p99 of credits.reserve") rather than the raw SQL or Redis calls that composed it. Auto-instrumentation is necessary but not sufficient for app-level SLO work.

Auto-instrumentor defaults

Not every Logfire auto-instrumentor earns its keep. Current defaults:

| Auto-instrumentor | Web (sapari-api) | Workers (sapari-*) | Env var | Rationale |
| --- | --- | --- | --- | --- |
| instrument_pydantic_ai | ✅ ON | ✅ ON | LOGFIRE_INSTRUMENT_PYDANTIC_AI | Highest-value auto-signal. LLM calls are the biggest unit cost in the app — every invocation needs token counts, cost, model, prompt hash for debugging + billing reconciliation. Keep on unconditionally. |
| instrument_sqlalchemy | ✅ ON | ✅ ON | LOGFIRE_INSTRUMENT_SQLALCHEMY / LOGFIRE_INSTRUMENT_SQLALCHEMY_WORKERS | Per-query spans let you debug "why is this endpoint p99 slow" without adding manual spans during the incident. Neon's query insights cover aggregate patterns, but don't correlate back to user-visible request latency. Kept on despite span-volume cost because the debugging value wins during incidents. |
| instrument_redis | ❌ OFF | ❌ OFF | LOGFIRE_INSTRUMENT_REDIS / LOGFIRE_INSTRUMENT_REDIS_WORKERS | Redis ops in Sapari are uniformly sub-millisecond (cache, session, rate-limit, pub/sub). Real-world problems surface at the request-span level before they're visible at the Redis-op level. Span volume dominates (Redis is called per request × multiple ops) without proportional signal. Turn on temporarily via env var for Redis-specific investigations. |
| instrument_fastapi | ✅ ON | N/A | — | Always on when the app is instrumented. One span per HTTP request is the backbone — this is what every other span hangs off. |

Rationale for keeping SQLAlchemy despite the cost: when staging hit an OOM + 100% CPU incident (2026-04-23), the time-to-diagnose was dominated by not having per-query visibility into what the hot worker was doing. Neon showed nothing unusual (the bug was in a pubsub spin loop, not a query). SQL spans wouldn't have caught that one specifically, but the adjacent question "is a runaway query involved" is the first thing you ask when CPU pins, and having SQL spans present answers it in 30 seconds instead of 30 minutes. That asymmetry justifies the span volume.

Rationale for turning Redis off: the SSE subscriber spin-loop bug (fixed 2026-04-23) was emitting ~50-100k Redis get_message spans per second per open SSE connection. Redis instrumentation was effectively amplifying an already-pathological CPU pattern with per-call span overhead. Redis ops are sufficiently cheap and uniform that production issues surface at the request-level span first; there's no class of bug where "per-Redis-op span" is the load-bearing signal.

How to flip them temporarily. Each env var maps directly to an on/off toggle. For a focused investigation:

# On the staging or prod server, edit .env:
LOGFIRE_INSTRUMENT_REDIS=true
LOGFIRE_INSTRUMENT_REDIS_WORKERS=true

# Restart backend + affected workers
docker compose up -d sapari-backend sapari-taskiq-<worker>

Remember to flip back after the investigation.
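The toggle plumbing can be sketched as an env-flag parser. The helper below is illustrative (its name and accepted values are assumptions), with the actual Logfire instrumentation calls shown as comments:

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Parse a boolean env toggle like LOGFIRE_INSTRUMENT_REDIS ('true'/'1'/'yes'/'on')."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Defaults from the table: Redis off, SQLAlchemy on. Wiring would look roughly like:
#   if env_flag("LOGFIRE_INSTRUMENT_REDIS"):
#       logfire.instrument_redis()
#   if env_flag("LOGFIRE_INSTRUMENT_SQLALCHEMY", default=True):
#       logfire.instrument_sqlalchemy(engine=engine)
```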

Media proxy observability (R2 Stage 5)

Two instrumented signals feed the R2 Stage 5 dashboards. See OBSERVABILITY.md for the full plan.

Backend mint latency (Logfire)

MediaTokenService.mint() is wrapped in a logfire.span("media_token.mint", ...) with attributes bucket, user_id, kid, expires_in. Span timing captures latency automatically. Exceptions raised inside the span are recorded as the span's error.

Saved queries to create in the Logfire UI (Workspaces → Saved queries):

| Query name | Filter | Purpose |
| --- | --- | --- |
| media_token.mint p50/p95/p99 (24h) | span_name = 'media_token.mint', time = last 24h | Stage 5 mint-latency gate |
| media_token.mint error rate (24h) | span_name = 'media_token.mint' AND status = 'error' | Spikes = mint path broken |

Plan decision rule (OBSERVABILITY.md Phase 6): if p99 mint < 500ms AND p99 Worker verify < 100ms, drop PROXY_URL.REFRESH_BUFFER_SECONDS from 30 to 15.
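Expressed as code, the decision rule reads as follows (a sketch; the function name is illustrative, and the thresholds are the ones stated above):

```python
def refresh_buffer_seconds(p99_mint_ms: float, p99_verify_ms: float) -> int:
    """Pick PROXY_URL.REFRESH_BUFFER_SECONDS from observed p99 latencies.

    Per the Phase 6 rule: drop from 30 to 15 only when both the backend mint
    p99 (< 500ms) and the Worker verify p99 (< 100ms) are comfortable.
    """
    if p99_mint_ms < 500 and p99_verify_ms < 100:
        return 15
    return 30
```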

Worker reject-reason logs (structured JSON)

/media/v1/* rejections emit single-line JSON to the Worker log stream. Shape:

{"type": "media_verify_rejected", "reason": "expired", "url": "https://staging.sapari.io/media/v1/<redacted>"}

Reasons are an enum (worker/src/constants.ts:RejectReason):

  • malformed — JWT header couldn't be parsed.
  • unknown_kid — kid missing or not in the active registry.
  • expired — token expired. Expected on long sessions; frontend refreshes via 401.
  • invalid_signature — signature verification failed or issuer/claims mismatch. Real attacks land here — investigate if rate exceeds a trickle.
  • unknown_bucket — bkt claim doesn't resolve to a known R2 binding. Config drift.
  • r2_miss — token verified but R2 has no object at key. Often a deleted-file + dangling-DB-reference bug.

To aggregate by reason on staging:

wrangler tail --env staging --format json \
  | jq -c 'select(.logs[]?.message[0] | fromjson? | .type == "media_verify_rejected")' \
  | jq -c '.logs[].message[0] | fromjson' \
  | jq -s 'group_by(.reason) | map({reason: .[0].reason, count: length})'

Tail for 24h after a deploy, then diff against the cf-analytics-*.md baseline — the sum of counts should ≈ the baseline's 4xx. If a large fraction lands in invalid_signature or r2_miss, root-cause before Stage 6.
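That baseline diff can be automated with a small check. This is a sketch: the 10% tolerance and the 20% fraction threshold below are illustrative choices, not values from the plan:

```python
def check_reject_counts(counts: dict[str, int], baseline_4xx: int, tolerance: float = 0.1) -> list[str]:
    """Flag anomalies in aggregated reject-reason counts vs. the cf-analytics baseline."""
    findings: list[str] = []
    total = sum(counts.values())
    # The summed reject count should roughly equal the baseline's 4xx volume.
    if baseline_4xx and abs(total - baseline_4xx) / baseline_4xx > tolerance:
        findings.append(f"total {total} deviates more than {tolerance:.0%} from baseline 4xx {baseline_4xx}")
    # A large fraction of signature failures or R2 misses warrants root-causing before Stage 6.
    for reason in ("invalid_signature", "r2_miss"):
        if total and counts.get(reason, 0) / total > 0.2:
            findings.append(f"{reason} is {counts[reason] / total:.0%} of rejects")
    return findings
```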

Key Files

| Component | Location |
| --- | --- |
| Logging setup | backend/src/infrastructure/logging/__init__.py |
| Logfire config | backend/src/infrastructure/observability/__init__.py |
| Media token span | backend/src/infrastructure/media_proxy/service.py:mint |
| Worker reject enum | worker/src/constants.ts:RejectReason |
| Worker reject log | worker/src/media.ts:logRejected |
