Overview
PassAgent tracks service-level objectives (SLOs), request metrics, and infrastructure health through a centralized monitoring system. Metrics feed into SLO computations in real time and are exported to your observability backend (Sentry, Datadog, Prometheus, or StatsD) for dashboards and alerting.

SLO targets
The following targets define the service-level objectives for the PassAgent API.

| Objective | Target | Budget |
|---|---|---|
| Availability | 99.9% | 43 minutes of downtime per 30-day window |
| p95 latency | < 1,200 ms | Measured across all API routes |
| p99 latency | < 3,000 ms | Measured across all API routes |
SLO computations use a 30-day rolling window. The window resets automatically after 30 days. Latency samples are capped at 100,000 entries per window to bound memory usage.
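The rolling window described above can be sketched as follows. The names (SloWindow, record, availability, percentile) are illustrative, not the actual slo.ts API; only the status >= 500 failure rule and the 100,000-sample cap come from this document.

```typescript
// Illustrative sketch of the 30-day rolling SLO window (names are assumptions).
const MAX_SAMPLES = 100_000; // latency buffer cap from the docs

class SloWindow {
  private total = 0;
  private failures = 0; // responses with status >= 500
  private latencies: number[] = [];

  record(status: number, durationMs: number): void {
    this.total++;
    if (status >= 500) this.failures++;
    // Cap the sample buffer to bound memory usage.
    if (this.latencies.length < MAX_SAMPLES) this.latencies.push(durationMs);
  }

  availability(): number {
    return this.total === 0 ? 1 : (this.total - this.failures) / this.total;
  }

  // Nearest-rank percentile over the buffered samples.
  percentile(p: number): number {
    if (this.latencies.length === 0) return 0;
    const sorted = [...this.latencies].sort((a, b) => a - b);
    const idx = Math.min(
      sorted.length - 1,
      Math.max(0, Math.ceil((p / 100) * sorted.length) - 1),
    );
    return sorted[idx];
  }
}
```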
SLI data flow
Every API request feeds a service-level indicator (SLI) data point into the SLO module.

Request completes
Your API route handler finishes execution. The middleware captures the HTTP status code and response duration in milliseconds.
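The capture step can be sketched as a thin wrapper that times the handler and reports the status code. This is an illustration only; withRequestMetrics and the Handler type are assumptions, not the real middleware, and treating a thrown error as a 500 is likewise an assumption.

```typescript
// Hypothetical sketch of the capture step (names and error handling assumed).
type Handler = () => Promise<{ status: number }>;

async function withRequestMetrics(
  handler: Handler,
  onComplete: (status: number, durationMs: number) => void,
): Promise<{ status: number }> {
  const start = Date.now();
  try {
    const res = await handler();
    onComplete(res.status, Date.now() - start);
    return res;
  } catch (err) {
    // Assumption: unhandled errors count as 500s for SLI purposes.
    onComplete(500, Date.now() - start);
    throw err;
  }
}
```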
recordRequest() fires
The recordRequest() function in lib/monitoring/metrics.ts emits the http.request.count and http.request.duration metrics with route, method, and status tags.

recordSli() updates the window
The SLO module (lib/monitoring/slo.ts) receives the status and duration. Requests with status >= 500 increment the failure counter. The duration is appended to the latency sample buffer.

Burn-rate alerting
PassAgent implements multi-window burn-rate alerting based on the Google SRE methodology. Burn rate measures how fast you are consuming your monthly error budget relative to a steady-state baseline.

| Severity | Burn rate | Time to budget exhaustion | Recommended action |
|---|---|---|---|
| Critical | > 14x | ~2 days | Page on-call immediately |
| Warning | > 6x | ~5 days | Investigate within 30 minutes |
| Info | > 1x (with < 50% budget remaining) | Before month end | Review during business hours |
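The thresholds in the table above can be expressed as a small classifier. This is a sketch of the table's logic, not the actual implementation; the function and type names are assumptions.

```typescript
// Severity classification matching the burn-rate table (illustrative names).
type Severity = "critical" | "warning" | "info" | null;

function classifyBurnRate(rate: number, budgetRemaining: number): Severity {
  if (rate > 14) return "critical";
  if (rate > 6) return "warning";
  // Info fires only when less than half the monthly budget remains.
  if (rate > 1 && budgetRemaining < 0.5) return "info";
  return null; // burning at or below the steady-state rate
}
```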
How burn rate is calculated
The burn rate is the ratio of actual error consumption per hour to the expected error consumption per hour:

burn rate = (observed failure rate) / (1 - SLO target)

A burn rate of 1.0 means you are consuming your error budget at exactly the expected rate. A burn rate of 14 means you will exhaust the entire monthly budget in approximately 2 days at the current failure rate.
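The ratio above reduces to a one-line function. For the 99.9% availability SLO, the budgeted failure rate is 0.1%, so a window in which 1.4% of requests fail burns budget at 14x. The function name is illustrative.

```typescript
// Burn rate = observed failure rate divided by the budgeted failure rate.
const SLO_TARGET = 0.999;           // 99.9% availability target
const BUDGET_RATE = 1 - SLO_TARGET; // 0.1% of requests may fail

function burnRate(failures: number, total: number): number {
  if (total === 0) return 0;
  return failures / total / BUDGET_RATE;
}
```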
Alert response format
Each burn-rate alert returned by checkBurnRateAlerts() contains:
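The exact fields are not listed in this excerpt. As an illustration only, such an alert might be typed as follows; every field name here is an assumption, not the actual checkBurnRateAlerts() contract.

```typescript
// Hypothetical alert shape (all field names are assumptions).
interface BurnRateAlert {
  severity: "critical" | "warning" | "info";
  burnRate: number;        // current burn-rate multiplier
  budgetRemaining: number; // fraction of the monthly error budget left
  message: string;         // human-readable summary for the alert channel
}
```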
Metrics catalog
All metrics are emitted through the Sentry SDK metrics API and are compatible with Datadog, Prometheus, and StatsD backends.

Request metrics
| Metric | Type | Tags | Description |
|---|---|---|---|
| http.request.count | Counter | route, method, status_class | Total request count by route, HTTP method, and status class (2xx, 4xx, 5xx) |
| http.request.duration | Distribution | route, method | Request duration in milliseconds |
| http.error.count | Counter | route, status | Server error count (5xx responses) |
Security metrics
| Metric | Type | Tags | Description |
|---|---|---|---|
| rate_limit.hits | Counter | action, route | Rate limit enforcement events |
| auth.failure | Counter | route, reason | Authentication failures (401 responses) |
| security.spray_detected | Counter | type, accounts | Credential spray detection events (IP spray, password spray, subnet spray) |
| security.access_anomaly | Counter | subtype | Access anomaly detection events |
| security.anomaly_score | Distribution | subtype | Anomaly detection confidence scores |
| security.spray_ip_blocked | Counter | type | IP addresses blocked by spray detection |
| ai_guard.event | Counter | type, category, blocked | AI input/output guard trigger events |
Infrastructure metrics
| Metric | Type | Tags | Description |
|---|---|---|---|
| infrastructure.failure | Counter | type (redis, database, automation) | Infrastructure component failures |
| db.critical_write_failure | Counter | route, table, operation | Critical database write failures |
| webhook.failure | Counter | event_type, reason | Stripe webhook processing failures |
Business metrics
| Metric | Type | Tags | Description |
|---|---|---|---|
| reset_flow.success | Counter | service | Password reset automation successes |
| reset_flow.failure | Counter | service, reason | Password reset automation failures |
Route normalization
API routes are normalized before being used as metric tags to prevent high-cardinality label explosion. Dynamic path segments (UUIDs, database IDs) are replaced with :id.
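A minimal sketch of this normalization, assuming a per-segment check (the function name is illustrative; the length and hex/UUID criteria come from the rule described in this section):

```typescript
// Replace long or hex/UUID-looking path segments with ":id" (illustrative).
function normalizeRoute(path: string): string {
  return path
    .split("/")
    .map((seg) =>
      seg.length > 20 || /^[0-9a-f-]{20,}$/.test(seg) ? ":id" : seg,
    )
    .join("/");
}
```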
Segments are replaced with :id when they are longer than 20 characters or match the pattern of a hex/UUID string ([0-9a-f-]{20,}).

Health check endpoint
Call checkSloHealth() to retrieve a snapshot of all SLO metrics. The response shape is:
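The concrete response shape is not reproduced in this excerpt. As an illustration only, a snapshot covering the SLOs defined above might look like this; every field name is an assumption.

```typescript
// Hypothetical checkSloHealth() snapshot (all field names are assumptions).
interface SloHealth {
  availability: number;          // e.g. 0.9993 over the current window
  p95LatencyMs: number;          // compare against the 1,200 ms target
  p99LatencyMs: number;          // compare against the 3,000 ms target
  errorBudgetRemaining: number;  // fraction of the monthly budget left
  windowStart: string;           // ISO timestamp of the 30-day window start
}
```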
Integration with observability backends
- Sentry
- Datadog
- Prometheus
- StatsD
PassAgent uses the Sentry SDK (@sentry/nextjs) metrics API by default. Set the SENTRY_DSN or NEXT_PUBLIC_SENTRY_DSN environment variable to enable metric emission. No additional configuration is required.

Alerting rules engine
Each record*() function in the metrics module also increments a Redis counter for the server-side alerting rules engine (lib/monitoring/alerting-rules.ts). The /api/cron/check-alerts endpoint evaluates these counters on a schedule.
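The increment mechanism can be sketched as follows. This is an illustration only: a Map stands in for Redis, the function name is an assumption, and the key format shown is hypothetical, not the pattern actually defined in lib/monitoring/alerting-rules.ts.

```typescript
// Illustration of time-bucketed alert counters; a Map stands in for Redis.
const counters = new Map<string, number>();

function incrementAlertCounter(metric: string, windowMinutes: number): void {
  // Bucket timestamps so the cron evaluator sees per-window counts.
  const bucket = Math.floor(Date.now() / (windowMinutes * 60_000));
  const key = `alerts:${metric}:${bucket}`; // hypothetical key format
  counters.set(key, (counters.get(key) ?? 0) + 1);
}
```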
Alert counter keys follow the pattern:
Key files
| File | Purpose |
|---|---|
| lib/monitoring/slo.ts | SLO computation, burn-rate alerting, sliding window management |
| lib/monitoring/metrics.ts | Metric recording functions, Sentry SDK integration, route normalization |
| lib/monitoring/alerting-rules.ts | Redis-backed alerting counter management |
| __tests__/monitoring/slo.test.ts | Unit tests for SLO computations |