HTTP REST API - Monitoring, Metrics & Alerting Guide
HTTP REST API - Monitoring, Metrics & Alerting Guide
1. HTTP API Health and User Experience Indicators
What Metrics Are Useful?
Request Metrics
- Request Rate: Requests per second/minute (throughput)
- Request Duration: Response time percentiles (p50, p95, p99, p999)
- Request Size: Payload sizes (request/response bytes)
- Request Count by Status: HTTP status code distribution (2xx, 4xx, 5xx)
Error Metrics
- Error Rate: Percentage of failed requests (4xx + 5xx)
- Error Count by Type: Breakdown by status code (404, 500, 503, etc.)
- Error Rate by Endpoint: Which endpoints are failing most
Availability Metrics
- Uptime: Service availability percentage
- Health Check Status: Internal health endpoint status
- Dependency Health: Database, cache, external API availability
Performance Metrics
- Latency Percentiles: p50, p95, p99, p999 response times
- Slow Requests: Requests exceeding thresholds (e.g., >1s, >5s)
- Timeout Rate: Requests timing out
- Throughput: Successful requests per second
Resource Metrics
- CPU Usage: CPU utilization percentage
- Memory Usage: Heap, non-heap, off-heap memory
- Thread Pool: Active threads, queue size, rejections
- Connection Pool: Active connections, idle connections, wait time
- GC Metrics: GC frequency, duration, pause times
Business Metrics
- Active Users: Concurrent or active users
- API Usage by Client: Mobile vs web client distribution
- Feature Usage: Endpoint popularity, feature adoption
What Indicates This System is Healthy?
Green Signals (Healthy System)
-
High Availability
- Uptime > 99.9% (less than 43 minutes downtime/month)
- Health checks passing consistently
- All critical dependencies available
-
Good Performance
- p95 latency < SLA threshold (e.g., <500ms for most endpoints)
- p99 latency < 2x p95 (no tail latency issues)
- Error rate < 0.1% (99.9% success rate)
- No timeout errors
-
Stable Resource Usage
- CPU usage < 70% average, < 90% peak
- Memory usage stable (no memory leaks)
- GC pause times < 100ms, frequency reasonable
- Thread pool utilization < 80%
- Connection pool healthy (no connection exhaustion)
-
Consistent Throughput
- Request rate within expected range
- No sudden spikes or drops
- Graceful handling of traffic patterns
-
Low Error Rates
- 4xx errors < 1% (mostly client errors)
- 5xx errors < 0.1% (server errors rare)
- No error spikes or cascading failures
What Tells You Users Are Having a Bad Experience?
Red Flags (Poor User Experience)
-
High Latency
- p95 latency > 1 second (users notice delay)
- p99 latency > 5 seconds (users frustrated)
- Tail latency spikes (some users hit very slow requests)
- Increasing latency trends over time
-
High Error Rates
- Error rate > 1% (users seeing failures)
- 5xx errors > 0.5% (server-side failures)
- Specific endpoints failing (feature broken)
- Error rate increasing over time
-
Timeouts
- Timeout rate > 0.1% (requests not completing)
- Increasing timeout frequency
- Timeouts concentrated on specific endpoints
-
Availability Issues
- Health checks failing intermittently
- Service unavailable errors (503)
- Circuit breakers opening
- Degraded mode activations
-
Resource Exhaustion
- Thread pool rejections (requests queued too long)
- Connection pool exhaustion
- Out of memory errors
- CPU throttling
-
Client-Specific Issues
- Mobile clients experiencing higher error rates
- Specific geographic regions with issues
- Certain user segments affected
What Helps You Understand Capacity and Scaling Needs?
Capacity Indicators
-
Throughput Trends
- Request rate growth over time
- Peak vs average traffic patterns
- Traffic growth rate (week-over-week, month-over-month)
-
Resource Utilization Trends
- CPU usage trending upward
- Memory usage increasing
- Connection pool approaching limits
- Thread pool utilization increasing
-
Performance Degradation Under Load
- Latency increasing with load
- Error rate increasing with load
- Resource saturation points
- Breaking points (when system fails)
-
Saturation Metrics
- Queue lengths increasing
- Wait times increasing
- Resource contention
- Backpressure indicators
-
Scaling Triggers
- Approaching resource limits (CPU > 70%, Memory > 80%)
- Latency degrading under load
- Error rate increasing with traffic
- Approaching connection/thread limits
2. Design Prometheus Metrics
Metric Types
Counter
- Use for: Cumulative values that only increase
- Examples: Total requests, total errors, total bytes transferred
- Operations:
rate(),increase(),irate()
Gauge
- Use for: Values that can go up or down
- Examples: Current CPU usage, active connections, queue size, memory usage
- Operations: Direct value,
delta(),deriv()
Histogram
- Use for: Distribution of measurements (latency, sizes)
- Examples: Request duration, response size, payload size
- Operations:
histogram_quantile(),rate() - Benefits: Captures percentiles, distribution shape
Summary
- Use for: Pre-computed quantiles (less common)
- Examples: Similar to histogram but with pre-computed quantiles
- Note: Less flexible than histogram, rarely used
Metric Naming Conventions
Prometheus Naming Rules
- Use
snake_case(not camelCase) - Use base unit (seconds, bytes, not milliseconds, kilobytes)
- Include metric type suffix:
_total,_count,_sum,_bucket,_bytes,_seconds - Use descriptive names:
http_requests_totalnotrequests
Naming Patterns
<namespace>_<subsystem>_<metric_name>_<unit>_<type_suffix>
Examples:
http_requests_total(Counter)http_request_duration_seconds(Histogram)http_request_size_bytes(Histogram)http_active_connections(Gauge)jvm_memory_used_bytes(Gauge)jvm_gc_duration_seconds(Summary/Histogram)
Labels/Dimensions
Essential Labels
method: HTTP method (GET, POST, PUT, DELETE)endpoint: API endpoint path (normalized, e.g.,/api/v1/users/{id})status: HTTP status code (200, 404, 500, etc.)status_class: Status class (2xx, 4xx, 5xx)
Useful Labels
client_type: Client type (mobile, web, api)user_agent: User agent category (browser, app version)region: Geographic regionenvironment: Environment (prod, staging, dev)instance: Server instance identifierversion: API version (v1, v2)
Cardinality Considerations
- Low Cardinality: status_class (3 values), method (4-5 values), client_type (2-3 values)
- Medium Cardinality: endpoint (10-100 values), status (10-20 values)
- High Cardinality: user_agent (1000s), instance (10-100s), full endpoint with IDs (unbounded)
Cardinality Risk
What is Cardinality?
Cardinality = number of unique time series = product of all label value combinations
Example:
http_requests_total{method="GET", endpoint="/users", status="200"} = 1000
http_requests_total{method="GET", endpoint="/users", status="404"} = 50
http_requests_total{method="POST", endpoint="/users", status="201"} = 200
If you have:
- 5 methods × 50 endpoints × 10 status codes = 2,500 time series
- Add client_type (3) = 7,500 time series
- Add instance (10) = 75,000 time series
Cardinality Risks
- High Memory Usage: Each time series consumes memory
- Query Performance: High cardinality slows queries
- Storage Costs: More time series = more storage
- Prometheus Limits: Default limit ~1M active time series
Best Practices
-
Avoid High-Cardinality Labels
- Don't use user IDs, session IDs, request IDs as labels
- Don't use full URLs with dynamic parameters
- Use normalized endpoint paths
-
Use Label Filtering
- Filter out high-cardinality labels in recording rules
- Use
keep()ordrop()relabel configs
-
Limit Label Combinations
- Only include labels that add value
- Remove labels that don't help with alerting/querying
-
Monitor Cardinality
- Track
prometheus_tsdb_head_seriesmetric - Alert on high cardinality
- Track
Example Metric Definitions
Request Counter
# Type: Counter
# Name: http_requests_total
# Labels: method, endpoint, status_class, status
# Cardinality: ~500 (5 methods × 50 endpoints × 2 status_classes)
http_requests_total{method="GET", endpoint="/api/v1/users", status_class="2xx", status="200"} 15234
Request Duration Histogram
# Type: Histogram
# Name: http_request_duration_seconds
# Labels: method, endpoint, status_class
# Buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
# Cardinality: ~250 (5 methods × 50 endpoints × 1 status_class × 11 buckets)
http_request_duration_seconds_bucket{method="GET", endpoint="/api/v1/users", status_class="2xx", le="0.1"} 14500
http_request_duration_seconds_bucket{method="GET", endpoint="/api/v1/users", status_class="2xx", le="0.5"} 15100
http_request_duration_seconds_bucket{method="GET", endpoint="/api/v1/users", status_class="2xx", le="+Inf"} 15234
http_request_duration_seconds_sum{method="GET", endpoint="/api/v1/users", status_class="2xx"} 1234.56
http_request_duration_seconds_count{method="GET", endpoint="/api/v1/users", status_class="2xx"} 15234
Active Connections Gauge
# Type: Gauge
# Name: http_active_connections
# Labels: instance
# Cardinality: ~10 (number of instances)
http_active_connections{instance="api-server-1"} 45
Error Rate (Derived)
# Derived metric using rate() and division
# rate(http_requests_total{status_class="5xx"}[5m]) / rate(http_requests_total[5m])
3. Define Alerts
What Deserves to Wake Someone Up at 3am?
Critical Alerts (Page On-Call)
These indicate immediate user impact or service degradation:
-
Service Down
- Service completely unavailable
- Health checks failing across all instances
- All endpoints returning 5xx errors
-
High Error Rate
- 5xx error rate > 1% for 5+ minutes
- Critical endpoints failing (>10% error rate)
- Cascading failures detected
-
Complete Outage
- Database unavailable
- All instances down
- Critical dependency failure
-
Security Incident
- Unusual authentication failures spike
- Potential DDoS attack
- Unauthorized access attempts
-
Data Loss Risk
- Database connection pool exhausted
- Write failures to persistent storage
- Replication lag critical
Warning Alerts (Investigate During Business Hours)
These indicate potential issues or degradation:
-
Elevated Error Rate
- 5xx error rate > 0.5% for 15+ minutes
- Specific endpoints showing errors
- Error rate trending upward
-
Performance Degradation
- p95 latency > SLA threshold
- Latency trending upward
- Slow query detection
-
Resource Pressure
- CPU usage > 80% for 15+ minutes
- Memory usage > 85%
- Connection pool > 80% utilized
-
Capacity Concerns
- Approaching resource limits
- Traffic spike detected
- Queue lengths increasing
Alert Thresholds
Error Rate Alerts
Critical: High 5xx Error Rate
# Alert when 5xx error rate exceeds 1% for 5 minutes
(
rate(http_requests_total{status_class="5xx"}[5m])
/
rate(http_requests_total[5m])
) > 0.01
- Threshold: > 1% (1 in 100 requests failing)
- Duration: 5 minutes
- Severity: Critical
Warning: Elevated 5xx Error Rate
# Alert when 5xx error rate exceeds 0.5% for 15 minutes
(
rate(http_requests_total{status_class="5xx"}[5m])
/
rate(http_requests_total[5m])
) > 0.005
- Threshold: > 0.5% (1 in 200 requests failing)
- Duration: 15 minutes
- Severity: Warning
Critical: Endpoint-Specific Failures
# Alert when specific critical endpoint has >10% error rate
(
rate(http_requests_total{endpoint="/api/v1/payments", status_class="5xx"}[5m])
/
rate(http_requests_total{endpoint="/api/v1/payments"}[5m])
) > 0.10
- Threshold: > 10% for critical endpoints
- Duration: 5 minutes
- Severity: Critical
Latency Alerts
Critical: High p95 Latency
# Alert when p95 latency exceeds 2 seconds for 5 minutes
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 2
- Threshold: p95 > 2 seconds
- Duration: 5 minutes
- Severity: Critical
Warning: Elevated p95 Latency
# Alert when p95 latency exceeds 1 second for 15 minutes
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 1
- Threshold: p95 > 1 second
- Duration: 15 minutes
- Severity: Warning
Critical: Tail Latency (p99)
# Alert when p99 latency exceeds 5 seconds for 5 minutes
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
) > 5
- Threshold: p99 > 5 seconds
- Duration: 5 minutes
- Severity: Critical
Availability Alerts
Critical: Service Down
# Alert when health check fails
up{job="api-server"} == 0
- Threshold: Health check down
- Duration: 2 minutes (immediate)
- Severity: Critical
Critical: High Unavailability
# Alert when >50% of instances are down
avg(up{job="api-server"}) < 0.5
- Threshold: < 50% instances available
- Duration: 2 minutes
- Severity: Critical
Resource Alerts
Warning: High CPU Usage
# Alert when CPU usage exceeds 80% for 15 minutes
100 - (avg(irate(process_cpu_seconds_total[5m])) * 100) < 20
# Or using node_exporter:
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
- Threshold: > 80% CPU
- Duration: 15 minutes
- Severity: Warning
Critical: Very High CPU Usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
- Threshold: > 95% CPU
- Duration: 5 minutes
- Severity: Critical
Warning: High Memory Usage
# Alert when memory usage exceeds 85%
(jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}) > 0.85
- Threshold: > 85% memory
- Duration: 15 minutes
- Severity: Warning
Critical: Memory Exhaustion Risk
(jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}) > 0.95
- Threshold: > 95% memory
- Duration: 5 minutes
- Severity: Critical
Warning: Thread Pool Exhaustion
# Alert when thread pool utilization exceeds 80%
(thread_pool_active_threads / thread_pool_max_threads) > 0.80
- Threshold: > 80% thread pool utilization
- Duration: 10 minutes
- Severity: Warning
Critical: Thread Pool Rejections
# Alert when thread pool rejections occur
rate(thread_pool_rejected_tasks_total[5m]) > 0
- Threshold: Any rejections
- Duration: 5 minutes
- Severity: Critical
Capacity Alerts
Warning: High Request Rate
# Alert when request rate exceeds expected capacity (e.g., 1000 req/s)
rate(http_requests_total[5m]) > 1000
- Threshold: > 1000 requests/second
- Duration: 15 minutes
- Severity: Warning
Warning: Connection Pool Pressure
# Alert when connection pool utilization exceeds 80%
(connection_pool_active / connection_pool_max) > 0.80
- Threshold: > 80% connection pool utilization
- Duration: 15 minutes
- Severity: Warning
Alert Duration (Avoiding Flapping)
Flapping Prevention Strategies
-
Use Appropriate Windows
- Critical alerts: 2-5 minutes (fast response needed)
- Warning alerts: 10-15 minutes (avoid noise)
- Capacity alerts: 15-30 minutes (trends, not spikes)
-
Use Rate Functions
rate()over 5-minute windows smooths spikesavg_over_time()for additional smoothing- Avoid instant values that fluctuate
-
Hysteresis
- Different thresholds for alerting vs recovery
- Example: Alert at >80%, recover at <70%
-
Grouping and Aggregation
- Aggregate across instances to avoid per-instance noise
- Use
avg(),max(), orsum()appropriately
Example: Flapping Prevention
# Bad: Instant value, will flap
http_requests_total{status="500"} > 100
# Good: Rate over time window
rate(http_requests_total{status="500"}[5m]) > 10
# Better: Rate with percentage threshold
(
rate(http_requests_total{status_class="5xx"}[5m])
/
rate(http_requests_total[5m])
) > 0.01
Warning vs Critical Alerts
Critical Alerts
- Purpose: Immediate action required
- Impact: Users affected right now
- Response: Page on-call, immediate investigation
- Examples:
- Service down
- High error rate (>1%)
- Complete outage
- Security incident
- Data loss risk
Warning Alerts
- Purpose: Investigate during business hours
- Impact: Potential issues, degradation
- Response: Review during next business day
- Examples:
- Elevated error rate (0.5-1%)
- Performance degradation
- Resource pressure
- Capacity concerns
- Trending issues
Alert Escalation
- Warning → Monitor for 15-30 minutes
- If persists or worsens → Escalate to Critical
- Critical → Page on-call immediately
Complete Alert Rule Example (Prometheus)
groups:
- name: http_api_critical
interval: 30s
rules:
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status_class="5xx"}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.01
for: 5m
labels:
severity: critical
annotations:
summary: "High 5xx error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
- alert: ServiceDown
expr: up{job="api-server"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "API service is down"
description: "Instance {{ $labels.instance }} has been down for more than 2 minutes"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) > 2
for: 5m
labels:
severity: critical
annotations:
summary: "High p95 latency detected"
description: "p95 latency is {{ $value }}s (threshold: 2s)"
- name: http_api_warning
interval: 30s
rules:
- alert: ElevatedErrorRate
expr: |
(
sum(rate(http_requests_total{status_class="5xx"}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.005
for: 15m
labels:
severity: warning
annotations:
summary: "Elevated 5xx error rate"
description: "Error rate is {{ $value | humanizePercentage }} (threshold: 0.5%)"
- alert: HighCPUUsage
expr: |
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 15m
labels:
severity: warning
annotations:
summary: "High CPU usage"
description: "CPU usage is {{ $value }}% (threshold: 80%)"
Summary
Key Takeaways
-
Monitor the Four Golden Signals
- Latency (response time)
- Traffic (request rate)
- Errors (error rate)
- Saturation (resource usage)
-
Use Appropriate Metric Types
- Counters for cumulative values
- Gauges for current state
- Histograms for distributions (latency, sizes)
-
Control Cardinality
- Avoid high-cardinality labels (user IDs, request IDs)
- Use normalized endpoint paths
- Monitor cardinality growth
-
Design Alerts Carefully
- Critical: Immediate user impact (page on-call)
- Warning: Potential issues (investigate during business hours)
- Use appropriate durations to avoid flapping
- Set thresholds based on SLA and user experience
-
Focus on User Experience
- High latency = bad UX
- High error rate = bad UX
- Timeouts = bad UX
- Resource exhaustion = bad UX
-
Plan for Capacity
- Monitor trends, not just current values
- Alert before hitting limits
- Understand scaling triggers