Overview
PassAgent tracks service-level objectives (SLOs), request metrics, and infrastructure health through a centralized monitoring system. Metrics feed into SLO computations in real time and are exported to your observability backend (Sentry, Datadog, Prometheus, or StatsD) for dashboards and alerting.

SLO targets
The following targets define the service-level objectives for the PassAgent API.

| Objective | Target | Budget |
|---|---|---|
| Availability | 99.9% | 43 minutes of downtime per 30-day window |
| p95 latency | < 1,200 ms | Measured across all API routes |
| p99 latency | < 3,000 ms | Measured across all API routes |
SLO computations use a 30-day rolling window. The window resets automatically after 30 days. Latency samples are capped at 100,000 entries per window to bound memory usage.
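The rolling window described above can be sketched as follows. The names (SloWindow, record, availability, percentile) are illustrative, not the actual slo.ts API; only the status >= 500 failure rule and the 100,000-sample cap come from this document.

```typescript
// Illustrative sketch of the 30-day rolling SLO window (names are assumptions).
const MAX_SAMPLES = 100_000; // latency buffer cap from the docs

class SloWindow {
  private total = 0;
  private failures = 0; // responses with status >= 500
  private latencies: number[] = [];

  record(status: number, durationMs: number): void {
    this.total++;
    if (status >= 500) this.failures++;
    // Cap the sample buffer to bound memory usage.
    if (this.latencies.length < MAX_SAMPLES) this.latencies.push(durationMs);
  }

  availability(): number {
    return this.total === 0 ? 1 : (this.total - this.failures) / this.total;
  }

  // Nearest-rank percentile over the buffered samples.
  percentile(p: number): number {
    if (this.latencies.length === 0) return 0;
    const sorted = [...this.latencies].sort((a, b) => a - b);
    const idx = Math.min(
      sorted.length - 1,
      Math.max(0, Math.ceil((p / 100) * sorted.length) - 1),
    );
    return sorted[idx];
  }
}
```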
SLI data flow
Every API request feeds a service-level indicator (SLI) data point into the SLO module.

Request completes
Your API route handler finishes execution. The middleware captures the HTTP status code and response duration in milliseconds.
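The capture step can be sketched as a thin wrapper that times the handler and reports the status code. This is an illustration only; withRequestMetrics and the Handler type are assumptions, not the real middleware, and treating a thrown error as a 500 is likewise an assumption.

```typescript
// Hypothetical sketch of the capture step (names and error handling assumed).
type Handler = () => Promise<{ status: number }>;

async function withRequestMetrics(
  handler: Handler,
  onComplete: (status: number, durationMs: number) => void,
): Promise<{ status: number }> {
  const start = Date.now();
  try {
    const res = await handler();
    onComplete(res.status, Date.now() - start);
    return res;
  } catch (err) {
    // Assumption: unhandled errors count as 500s for SLI purposes.
    onComplete(500, Date.now() - start);
    throw err;
  }
}
```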
recordRequest() fires
The recordRequest() function in lib/monitoring/metrics.ts emits the http.request.count and http.request.duration metrics with route, method, and status tags.

recordSli() updates the window
The SLO module (lib/monitoring/slo.ts) receives the status and duration. Requests with status >= 500 increment the failure counter. The duration is appended to the latency sample buffer.

Burn-rate alerting
PassAgent implements multi-window burn-rate alerting based on the Google SRE methodology. Burn rate measures how fast you are consuming your monthly error budget relative to a steady-state baseline.

| Severity | Burn rate | Time to budget exhaustion | Recommended action |
|---|---|---|---|
| Critical | > 14x | ~2 days | Page on-call immediately |
| Warning | > 6x | ~5 days | Investigate within 30 minutes |
| Info | > 1x (with < 50% budget remaining) | Before month end | Review during business hours |
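The thresholds in the table above can be expressed as a small classifier. This is a sketch of the table's logic, not the actual implementation; the function and type names are assumptions.

```typescript
// Severity classification matching the burn-rate table (illustrative names).
type Severity = "critical" | "warning" | "info" | null;

function classifyBurnRate(rate: number, budgetRemaining: number): Severity {
  if (rate > 14) return "critical";
  if (rate > 6) return "warning";
  // Info fires only when less than half the monthly budget remains.
  if (rate > 1 && budgetRemaining < 0.5) return "info";
  return null; // burning at or below the steady-state rate
}
```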
How burn rate is calculated
The burn rate is the ratio of actual error consumption per hour to the expected error consumption per hour:

burn rate = (observed failure rate) / (1 - SLO target)

A burn rate of 1.0 means you are consuming your error budget at exactly the expected rate. A burn rate of 14 means you will exhaust the entire monthly budget in approximately 2 days at the current failure rate.
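The ratio above reduces to a one-line function. For the 99.9% availability SLO, the budgeted failure rate is 0.1%, so a window in which 1.4% of requests fail burns budget at 14x. The function name is illustrative.

```typescript
// Burn rate = observed failure rate divided by the budgeted failure rate.
const SLO_TARGET = 0.999;           // 99.9% availability target
const BUDGET_RATE = 1 - SLO_TARGET; // 0.1% of requests may fail

function burnRate(failures: number, total: number): number {
  if (total === 0) return 0;
  return failures / total / BUDGET_RATE;
}
```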
Alert response format
Each burn-rate alert returned by checkBurnRateAlerts() contains:
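The exact fields are not listed in this excerpt. As an illustration only, such an alert might be typed as follows; every field name here is an assumption, not the actual checkBurnRateAlerts() contract.

```typescript
// Hypothetical alert shape (all field names are assumptions).
interface BurnRateAlert {
  severity: "critical" | "warning" | "info";
  burnRate: number;        // current burn-rate multiplier
  budgetRemaining: number; // fraction of the monthly error budget left
  message: string;         // human-readable summary for the alert channel
}
```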
Metrics catalog
All metrics are emitted through the Sentry SDK metrics API and are compatible with Datadog, Prometheus, and StatsD backends.

Request metrics
| Metric | Type | Tags | Description |
|---|---|---|---|
| http.request.count | Counter | route, method, status_class | Total request count by route, HTTP method, and status class (2xx, 4xx, 5xx) |
| http.request.duration | Distribution | route, method | Request duration in milliseconds |
| http.error.count | Counter | route, status | Server error count (5xx responses) |
Security metrics
| Metric | Type | Tags | Description |
|---|---|---|---|
| rate_limit.hits | Counter | action, route | Rate limit enforcement events |
| auth.failure | Counter | route, reason | Authentication failures (401 responses) |
| security.spray_detected | Counter | type, accounts | Credential spray detection events (IP spray, password spray, subnet spray) |
| security.access_anomaly | Counter | subtype | Access anomaly detection events |
| security.anomaly_score | Distribution | subtype | Anomaly detection confidence scores |
| security.spray_ip_blocked | Counter | type | IP addresses blocked by spray detection |
| ai_guard.event | Counter | type, category, blocked | AI input/output guard trigger events |
Infrastructure metrics
| Metric | Type | Tags | Description |
|---|---|---|---|
| infrastructure.failure | Counter | type (redis, database, automation) | Infrastructure component failures |
| db.critical_write_failure | Counter | route, table, operation | Critical database write failures |
| webhook.failure | Counter | event_type, reason | Stripe webhook processing failures |
Business metrics
| Metric | Type | Tags | Description |
|---|---|---|---|
| reset_flow.success | Counter | service | Password reset automation successes |
| reset_flow.failure | Counter | service, reason | Password reset automation failures |
Route normalization
API routes are normalized before being used as metric tags to prevent high-cardinality label explosion. Dynamic path segments (UUIDs, database IDs) are replaced with :id.
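A minimal sketch of this normalization, assuming a per-segment check (the function name is illustrative; the length and hex/UUID criteria come from the rule described in this section):

```typescript
// Replace long or hex/UUID-looking path segments with ":id" (illustrative).
function normalizeRoute(path: string): string {
  return path
    .split("/")
    .map((seg) =>
      seg.length > 20 || /^[0-9a-f-]{20,}$/.test(seg) ? ":id" : seg,
    )
    .join("/");
}
```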
Segments are replaced with :id when they are longer than 20 characters or match the pattern of a hex/UUID string ([0-9a-f-]{20,}).

Health check endpoint
Call checkSloHealth() to retrieve a snapshot of all SLO metrics. The response shape is:
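The concrete response shape is not reproduced in this excerpt. As an illustration only, a snapshot covering the SLOs defined above might look like this; every field name is an assumption.

```typescript
// Hypothetical checkSloHealth() snapshot (all field names are assumptions).
interface SloHealth {
  availability: number;          // e.g. 0.9993 over the current window
  p95LatencyMs: number;          // compare against the 1,200 ms target
  p99LatencyMs: number;          // compare against the 3,000 ms target
  errorBudgetRemaining: number;  // fraction of the monthly budget left
  windowStart: string;           // ISO timestamp of the 30-day window start
}
```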
Integration with observability backends
- Sentry
- Datadog
- Prometheus
- StatsD
PassAgent uses the Sentry SDK (@sentry/nextjs) metrics API by default. Set the SENTRY_DSN or NEXT_PUBLIC_SENTRY_DSN environment variable to enable metric emission. No additional configuration is required.

Alerting rules engine
Each record*() function in the metrics module also increments a Redis counter for the server-side alerting rules engine (lib/monitoring/alerting-rules.ts). The /api/cron/check-alerts endpoint evaluates these counters on a schedule.
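The increment mechanism can be sketched as follows. This is an illustration only: a Map stands in for Redis, the function name is an assumption, and the key format shown is hypothetical, not the pattern actually defined in lib/monitoring/alerting-rules.ts.

```typescript
// Illustration of time-bucketed alert counters; a Map stands in for Redis.
const counters = new Map<string, number>();

function incrementAlertCounter(metric: string, windowMinutes: number): void {
  // Bucket timestamps so the cron evaluator sees per-window counts.
  const bucket = Math.floor(Date.now() / (windowMinutes * 60_000));
  const key = `alerts:${metric}:${bucket}`; // hypothetical key format
  counters.set(key, (counters.get(key) ?? 0) + 1);
}
```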
Alert counter keys follow the pattern:
Key files
| File | Purpose |
|---|---|
| lib/monitoring/slo.ts | SLO computation, burn-rate alerting, sliding window management |
| lib/monitoring/metrics.ts | Metric recording functions, Sentry SDK integration, route normalization |
| lib/monitoring/alerting-rules.ts | Redis-backed alerting counter management |
| __tests__/monitoring/slo.test.ts | Unit tests for SLO computations |