Overview

PassAgent tracks service-level objectives (SLOs), request metrics, and infrastructure health through a centralized monitoring system. Metrics feed into SLO computations in real time and are exported to your observability backend (Sentry, Datadog, Prometheus, or StatsD) for dashboards and alerting.

SLO targets

The following targets define the service-level objectives for the PassAgent API.
| Objective | Target | Budget / notes |
| --- | --- | --- |
| Availability | 99.9% | 43 minutes of downtime per 30-day window |
| p95 latency | < 1,200 ms | Measured across all API routes |
| p99 latency | < 3,000 ms | Measured across all API routes |
SLO computations use a 30-day rolling window. The window resets automatically after 30 days. Latency samples are capped at 100,000 entries per window to bound memory usage.
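The 43-minute figure follows directly from the target and the window length. A quick sketch (the function name is illustrative, not part of the PassAgent API):

```typescript
// Downtime budget in minutes for an availability target over a window.
// Illustrative helper, not part of the PassAgent API.
function downtimeBudgetMinutes(target: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60; // 43,200 minutes in a 30-day window
  return totalMinutes * (1 - target);        // the fraction you are allowed to fail
}

downtimeBudgetMinutes(0.999, 30); // ≈ 43.2 minutes, the "43 minutes" budget above
```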

SLI data flow

Every API request feeds a service-level indicator (SLI) data point into the SLO module.
1. Request completes. Your API route handler finishes execution. The middleware captures the HTTP status code and response duration in milliseconds.
2. recordRequest() fires. The recordRequest() function in lib/monitoring/metrics.ts emits the http.request.count and http.request.duration metrics with route, method, and status tags.
3. recordSli() updates the window. The SLO module (lib/monitoring/slo.ts) receives the status and duration. Requests with status >= 500 increment the failure counter. The duration is appended to the latency sample buffer.
4. Dashboards query checkSloHealth(). The checkSloHealth() function returns current availability, p95/p99 latencies, error budget remaining, and burn rate. Wire this to your dashboard or alerting cron.
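The flow above can be sketched as a wrapper around a route handler. This is an illustrative shape, not PassAgent's actual middleware; recordRequest and recordSli below are stand-ins for the real functions in lib/monitoring/metrics.ts and lib/monitoring/slo.ts:

```typescript
// Illustrative wrapper showing the SLI data flow; recordRequest and recordSli
// are stand-ins for the real functions described above.
const emitted: Array<{ metric: string; status: number; durationMs: number }> = [];

function recordRequest(route: string, method: string, status: number, durationMs: number): void {
  emitted.push({ metric: `http.request.count ${method} ${route}`, status, durationMs }); // step 2
}

function recordSli(status: number, durationMs: number): void {
  emitted.push({ metric: "sli", status, durationMs }); // step 3: failures are status >= 500
}

async function withSli(
  route: string,
  method: string,
  handler: () => Promise<{ status: number }>,
) {
  const start = Date.now();
  const res = await handler();           // step 1: the route handler finishes
  const durationMs = Date.now() - start;
  recordRequest(route, method, res.status, durationMs);
  recordSli(res.status, durationMs);
  return res;                            // step 4: dashboards later read the window via checkSloHealth()
}
```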

Burn-rate alerting

PassAgent implements multi-window burn-rate alerting based on the Google SRE methodology. Burn rate measures how fast you are consuming your monthly error budget relative to a steady-state baseline.
| Severity | Burn rate | Time to budget exhaustion | Recommended action |
| --- | --- | --- | --- |
| Critical | > 14x | ~2 days | Page on-call immediately |
| Warning | > 6x | ~5 days | Investigate within 30 minutes |
| Info | > 1x (with < 50% budget remaining) | Before month end | Review during business hours |
The burn rate is the ratio of actual error consumption per hour to the expected error consumption per hour:
expectedBurnPerHour = errorBudgetTotal / (30 * 24)
actualBurnPerHour   = errorBudgetUsed / windowAgeInHours
burnRate            = actualBurnPerHour / expectedBurnPerHour
A burn rate of 1.0 means you are consuming your error budget at exactly the expected rate. A burn rate of 14 means you will exhaust the entire monthly budget in roughly two days (30 days / 14) at the current failure rate.
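The ratio can be implemented directly from the three lines above; burnRate here is an illustrative helper (the budget units cancel, so counting failures or minutes both work):

```typescript
// Burn rate as defined above: actual vs. expected error-budget consumption per hour.
function burnRate(
  errorBudgetTotal: number,   // total budget for the 30-day window
  errorBudgetUsed: number,    // budget consumed so far in this window
  windowAgeInHours: number,   // how long the current window has been open
): number {
  const expectedBurnPerHour = errorBudgetTotal / (30 * 24);
  const actualBurnPerHour = errorBudgetUsed / windowAgeInHours;
  return actualBurnPerHour / expectedBurnPerHour;
}

burnRate(720, 1, 1);  // 1.0: consuming exactly on pace (1 unit/hour of a 720-unit budget)
burnRate(720, 14, 1); // 14.0: critical-severity pace
```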

Alert response format

Each burn-rate alert returned by checkBurnRateAlerts() contains:
{
  severity: "critical" | "warning" | "info"
  window: "1h" | "6h" | "30d"
  burnRate: number
  message: string // Includes availability %, failure count, total count
}
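A consumer might fan alerts out by severity, following the recommended actions in the table above. The routing targets here are placeholders; only the alert shape comes from the response format:

```typescript
interface BurnRateAlert {
  severity: "critical" | "warning" | "info";
  window: "1h" | "6h" | "30d";
  burnRate: number;
  message: string;
}

// Map each severity to a delivery channel (placeholder strings; wire these
// to your pager, ticketing system, and log sink).
const channels = { critical: "page", warning: "ticket", info: "log" } as const;

function routeAlerts(alerts: BurnRateAlert[]): string[] {
  return alerts.map((a) => `${channels[a.severity]}: [${a.window}] ${a.message}`);
}
```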

Metrics catalog

All metrics are emitted through the Sentry SDK metrics API and are compatible with Datadog, Prometheus, and StatsD backends.

Request metrics

| Metric | Type | Tags | Description |
| --- | --- | --- | --- |
| http.request.count | Counter | route, method, status_class | Total request count by route, HTTP method, and status class (2xx, 4xx, 5xx) |
| http.request.duration | Distribution | route, method | Request duration in milliseconds |
| http.error.count | Counter | route, status | Server error count (5xx responses) |

Security metrics

| Metric | Type | Tags | Description |
| --- | --- | --- | --- |
| rate_limit.hits | Counter | action, route | Rate limit enforcement events |
| auth.failure | Counter | route, reason | Authentication failures (401 responses) |
| security.spray_detected | Counter | type, accounts | Credential spray detection events (IP spray, password spray, subnet spray) |
| security.access_anomaly | Counter | subtype | Access anomaly detection events |
| security.anomaly_score | Distribution | subtype | Anomaly detection confidence scores |
| security.spray_ip_blocked | Counter | type | IP addresses blocked by spray detection |
| ai_guard.event | Counter | type, category, blocked | AI input/output guard trigger events |

Infrastructure metrics

| Metric | Type | Tags | Description |
| --- | --- | --- | --- |
| infrastructure.failure | Counter | type (redis, database, automation) | Infrastructure component failures |
| db.critical_write_failure | Counter | route, table, operation | Critical database write failures |
| webhook.failure | Counter | event_type, reason | Stripe webhook processing failures |

Business metrics

| Metric | Type | Tags | Description |
| --- | --- | --- | --- |
| reset_flow.success | Counter | service | Password reset automation successes |
| reset_flow.failure | Counter | service, reason | Password reset automation failures |

Route normalization

API routes are normalized before being used as metric tags to prevent high-cardinality label explosion. Dynamic path segments (UUIDs, database IDs) are replaced with :id.
/api/passwords/a1b2c3d4-e5f6-7890  -->  /api/passwords/:id
/api/sharing/abc123def456abcdef01   -->  /api/sharing/:id
/api/passwords                      -->  /api/passwords (unchanged)
Segments are replaced with :id when they are longer than 20 characters or match the pattern of a hex/UUID string ([0-9a-f-]{20,}).
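One plausible implementation of that rule (illustrative; the real logic lives in lib/monitoring/metrics.ts and may differ in detail):

```typescript
// Replace dynamic path segments with :id to keep metric-tag cardinality bounded.
// A segment is considered dynamic when it is longer than 20 characters or looks
// like a hex/UUID string.
function normalizeRoute(path: string): string {
  return path
    .split("/")
    .map((segment) =>
      segment.length > 20 || /^[0-9a-f-]{20,}$/i.test(segment) ? ":id" : segment,
    )
    .join("/");
}

normalizeRoute("/api/passwords/550e8400-e29b-41d4-a716-446655440000"); // "/api/passwords/:id"
normalizeRoute("/api/passwords");                                      // unchanged
```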

Health check endpoint

Call checkSloHealth() to retrieve a snapshot of all SLO metrics. The response shape is:
{
  availability: {
    current: number    // e.g., 0.999
    target: number     // 0.999
    withinBudget: boolean
  }
  p95: {
    currentMs: number  // e.g., 850
    targetMs: number   // 1200
    withinBudget: boolean
  }
  p99: {
    currentMs: number  // e.g., 2100
    targetMs: number   // 3000
    withinBudget: boolean
  }
  errorBudget: {
    remainingPercent: number  // 0-100
    burnRate: number          // multiplier (1.0 = on track)
  }
  window: {
    totalRequests: number
    failedRequests: number
    windowAgeMs: number
  }
}
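A dashboard tile or cron job can collapse the snapshot into a single status. The SloHealth interface below mirrors the shape above; the traffic-light policy itself is an illustrative choice, not part of PassAgent:

```typescript
interface SloHealth {
  availability: { current: number; target: number; withinBudget: boolean };
  p95: { currentMs: number; targetMs: number; withinBudget: boolean };
  p99: { currentMs: number; targetMs: number; withinBudget: boolean };
  errorBudget: { remainingPercent: number; burnRate: number };
  window: { totalRequests: number; failedRequests: number; windowAgeMs: number };
}

// Collapse the snapshot into a red/yellow/green status (illustrative policy).
function overallStatus(h: SloHealth): "ok" | "degraded" | "breached" {
  if (!h.availability.withinBudget) return "breached";
  if (!h.p95.withinBudget || !h.p99.withinBudget || h.errorBudget.burnRate > 1) {
    return "degraded";
  }
  return "ok";
}
```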

Integration with observability backends

PassAgent uses the Sentry SDK (@sentry/nextjs) metrics API by default. Set the SENTRY_DSN or NEXT_PUBLIC_SENTRY_DSN environment variable to enable metric emission. No additional configuration is required.
SENTRY_DSN=https://examplePublicKey@o0.ingest.sentry.io/0

Alerting rules engine

Each record*() function in the metrics module also increments a Redis counter for the server-side alerting rules engine (lib/monitoring/alerting-rules.ts). The /api/cron/check-alerts endpoint evaluates these counters on a schedule. Alert counter keys follow the pattern:
alert:http.error.count
alert:rate_limit.hits
alert:auth.failure
alert:reset_flow.failure
alert:infrastructure.failure.redis
alert:infrastructure.failure.database
alert:security.spray_detected
alert:security.access_anomaly
alert:ai_guard.blocked
alert:webhook.failure
alert:db.critical_write_failure
The metrics module is designed to never throw exceptions. All metric emission is wrapped in try/catch blocks with silent failure. A metrics backend outage will not affect request handling.
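The never-throw guarantee amounts to wrapping every emission in a try/catch. A minimal sketch, where emit stands in for whatever backend call might fail:

```typescript
// Swallow any metrics-backend error so it can never surface on the request path.
function safeEmit(emit: () => void): void {
  try {
    emit();
  } catch {
    // Intentionally silent: dropping a metric beats failing a request.
  }
}

safeEmit(() => { throw new Error("metrics backend down"); }); // no exception escapes
```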

Key files

| File | Purpose |
| --- | --- |
| lib/monitoring/slo.ts | SLO computation, burn-rate alerting, sliding window management |
| lib/monitoring/metrics.ts | Metric recording functions, Sentry SDK integration, route normalization |
| lib/monitoring/alerting-rules.ts | Redis-backed alerting counter management |
| `__tests__/monitoring/slo.test.ts` | Unit tests for SLO computations |