Prometheus Metrics and Monitoring Guide

Overview

OmniTAS exports comprehensive operational metrics in Prometheus format for monitoring, alerting, and observability. This guide covers all available metrics, their usage, troubleshooting, and monitoring best practices.

Metrics Endpoint

All metrics are exposed at: http://<tas-ip>:8080/metrics
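
Once Prometheus is scraping this endpoint (see the Configuration section below), a quick way to confirm the target is healthy is the standard up metric (this assumes the job name omnitas used in the example scrape configuration):

up{job="omnitas"} == 1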

Important: Metric Time Unit Configuration

All duration metrics in this system use duration_unit: false in their Histogram declarations. This is critical because:

  1. The Prometheus Elixir library automatically detects metric names ending in _milliseconds
  2. By default, it converts native Erlang time units to milliseconds automatically
  3. Our code already converts time to milliseconds using System.convert_time_unit/3
  4. Without duration_unit: false, the library treats the already-converted millisecond values as native (nanosecond-scale) units and divides them by ~1,000,000 again, so recorded values end up a million times too small

Example:

# Correct configuration
Histogram.declare(
  name: :http_dialplan_request_duration_milliseconds,
  help: "Duration of HTTP dialplan requests in milliseconds",
  labels: [:call_type],
  buckets: [100, 250, 500, 750, 1000, 1500, 2000, 3000, 5000],
  duration_unit: false  # REQUIRED to prevent double conversion
)

# Measuring time correctly
start_time = System.monotonic_time()
# ... do work ...
end_time = System.monotonic_time()
duration_ms = System.convert_time_unit(end_time - start_time, :native, :millisecond)
Histogram.observe([name: :http_dialplan_request_duration_milliseconds], duration_ms)

Complete Metric Reference

Diameter Metrics

diameter_response_duration_milliseconds

Type: Histogram
Labels: application (ro, sh), command (ccr, cca, etc), result (success, error, timeout)
Buckets: 10, 50, 100, 250, 500, 1000, 2500, 5000, 10000 ms
Description: Duration of Diameter requests in milliseconds

Usage:

# Average Diameter Response Time
rate(diameter_response_duration_milliseconds_sum[5m]) /
rate(diameter_response_duration_milliseconds_count[5m])

# P95 Diameter latency
histogram_quantile(0.95, rate(diameter_response_duration_milliseconds_bucket[5m]))

Alert When:

  • P95 > 1000ms - Slow Diameter responses

diameter_requests_total

Type: Counter
Labels: application (ro, sh), command (ccr, udr, etc)
Description: Total number of Diameter requests sent

Usage:

# Request rate
rate(diameter_requests_total[5m])

diameter_responses_total

Type: Counter
Labels: application (ro, sh), command (ccr, udr, etc), result_code (2001, 3002, 5xxx, etc)
Description: Total number of Diameter responses received

Usage:

# Success rate
sum(rate(diameter_responses_total{result_code="2001"}[5m])) /
sum(rate(diameter_responses_total[5m])) * 100

diameter_peer_state

Type: Gauge
Labels: peer_host, peer_realm, application (ro, sh)
Description: State of Diameter peers (1=up, 0=down)
Update interval: Every 10 seconds

Usage:

# Check for down peers
diameter_peer_state == 0
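
# Number of down peers per realm (example)
count by (peer_realm) (diameter_peer_state == 0)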

Alert When:

  • Any peer down for > 1 minute

Dialplan Generation Metrics

1. HTTP Request Metrics

http_dialplan_request_duration_milliseconds

Type: Histogram
Labels: call_type (mt, mo, emergency, unknown)
Description: End-to-end HTTP request duration from when the dialplan HTTP request is received to when the response is sent. This includes all processing: parameter parsing, authorization, Diameter lookups (Sh/Ro), HLR lookups (SS7 MAP), and XML generation.

Usage:

# Average end-to-end HTTP request time
rate(http_dialplan_request_duration_milliseconds_sum[5m]) /
rate(http_dialplan_request_duration_milliseconds_count[5m])

# P95 by call type
histogram_quantile(0.95,
  sum by (le, call_type) (rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
)

# Compare MT vs MO performance
histogram_quantile(0.95,
  rate(http_dialplan_request_duration_milliseconds_bucket{call_type="mt"}[5m])
)
# vs
histogram_quantile(0.95,
  rate(http_dialplan_request_duration_milliseconds_bucket{call_type="mo"}[5m])
)

Alert When:

  • P95 > 2000ms - Slow HTTP response times
  • P95 > 3000ms - Critical performance issue
  • P99 > 5000ms - Severe performance degradation
  • Any requests showing call_type="unknown" - Call type detection failure

Insights:

  • This is the most important metric for understanding user-facing latency
  • Typical values: P50: 100-500ms, P95: 500-2000ms, P99: 1000-3000ms
  • Includes all component timings (Sh + HLR + OCS + processing)
  • If this is slow, drill down into component metrics (subscriber_data, hlr_data, ocs_authorization)
  • Expected range: 100ms (fast local calls) to 5000ms (slow with retries/timeouts)

Important Notes:

  • Replaces the older dialplan_generation_duration_milliseconds metric which only measured XML generation
  • Accurately reflects what FreeSWITCH/SBC experiences
  • Use this for SLA monitoring and capacity planning

2. Subscriber Data Metrics

subscriber_data_duration_milliseconds

Type: Histogram
Labels: result (success, error)
Description: Time taken to retrieve subscriber data from the Sh interface (HSS)

Usage:

# Average Sh lookup time
rate(subscriber_data_duration_milliseconds_sum[5m]) /
rate(subscriber_data_duration_milliseconds_count[5m])

# 95th percentile Sh lookup time
histogram_quantile(0.95,
rate(subscriber_data_duration_milliseconds_bucket[5m])
)

Alert When:

  • P95 > 100ms - Slow HSS responses
  • P95 > 500ms - Critical HSS performance issue

subscriber_data_lookups_total

Type: Counter
Labels: result (success, error)
Description: Total number of subscriber data lookups

Usage:

# Sh lookup rate
rate(subscriber_data_lookups_total[5m])

# Sh error rate
rate(subscriber_data_lookups_total{result="error"}[5m])

# Sh success rate percentage
(sum(rate(subscriber_data_lookups_total{result="success"}[5m])) /
sum(rate(subscriber_data_lookups_total[5m]))) * 100

Alert When:

  • Error rate > 5% - HSS connectivity issues
  • Error rate > 20% - Critical HSS failure

3. HLR Data Metrics

hlr_data_duration_milliseconds

Type: Histogram
Labels: result (success, error)
Description: Time taken to retrieve HLR data via SS7 MAP

Usage:

# Average HLR lookup time
rate(hlr_data_duration_milliseconds_sum[5m]) /
rate(hlr_data_duration_milliseconds_count[5m])

# 95th percentile HLR lookup time
histogram_quantile(0.95,
rate(hlr_data_duration_milliseconds_bucket[5m])
)

Alert When:

  • P95 > 500ms - Slow SS7 MAP responses
  • P95 > 2000ms - Critical SS7 MAP issue

hlr_lookups_total

Type: Counter
Labels: result_type (msrn, forwarding, error, unknown)
Description: Total HLR lookups by result type

Usage:

# HLR lookup rate by type
rate(hlr_lookups_total[5m])

# MSRN discovery rate (roaming subscribers)
rate(hlr_lookups_total{result_type="msrn"}[5m])

# Call forwarding discovery rate
rate(hlr_lookups_total{result_type="forwarding"}[5m])

# HLR error rate
rate(hlr_lookups_total{result_type="error"}[5m])

Alert When:

  • Error rate > 10% - SS7 MAP issues
  • Sudden drop in MSRN rate - Possible roaming issue

Insights:

  • High MSRN rate indicates many roaming subscribers
  • High forwarding rate indicates many forwarded calls
  • Compare to call volume for roaming percentage

4. OCS Authorization Metrics

ocs_authorization_duration_milliseconds

Type: Histogram
Labels: result (success, error)
Description: Time taken for OCS authorization

Usage:

# Average OCS auth time
rate(ocs_authorization_duration_milliseconds_sum[5m]) /
rate(ocs_authorization_duration_milliseconds_count[5m])

# 95th percentile OCS auth time
histogram_quantile(0.95,
rate(ocs_authorization_duration_milliseconds_bucket[5m])
)

Alert When:

  • P95 > 1000ms - Slow OCS responses
  • P95 > 5000ms - Critical OCS performance issue

ocs_authorization_attempts_total

Type: Counter
Labels: result (success, error), skipped (yes, no)
Description: Total OCS authorization attempts

Usage:

# OCS authorization rate
rate(ocs_authorization_attempts_total{skipped="no"}[5m])

# OCS error rate
rate(ocs_authorization_attempts_total{result="error",skipped="no"}[5m])

# OCS skip rate (emergency, voicemail, etc.)
rate(ocs_authorization_attempts_total{skipped="yes"}[5m])

# OCS success rate percentage
(sum(rate(ocs_authorization_attempts_total{result="success",skipped="no"}[5m])) /
sum(rate(ocs_authorization_attempts_total{skipped="no"}[5m]))) * 100

Alert When:

  • Error rate > 5% - OCS connectivity issues
  • Success rate < 95% - OCS declining too many calls

Insights:

  • High skip rate indicates many emergency/free calls
  • Error rate spikes indicate OCS outages
  • Compare success rate to business expectations

5. Call Processing Metrics

call_param_errors_total

Type: Counter
Labels: error_type (parse_failed, missing_required_params)
Description: Call parameter parsing errors

Usage:

# Parameter error rate
rate(call_param_errors_total[5m])

# Errors by type
sum by (error_type) (rate(call_param_errors_total[5m]))

Alert When:

  • Any errors > 0 - Indicates malformed call parameter requests
  • Errors > 1% of call volume - Critical issue

authorization_decisions_total

Type: Counter
Labels: disposition (mt, mo, emergency, unauthorized), result (success, error)
Description: Authorization decisions by call type

Usage:

# Authorization rate by disposition
sum by (disposition) (rate(authorization_decisions_total[5m]))

# MT call rate
rate(authorization_decisions_total{disposition="mt"}[5m])

# MO call rate
rate(authorization_decisions_total{disposition="mo"}[5m])

# Emergency call rate
rate(authorization_decisions_total{disposition="emergency"}[5m])

# Unauthorized call rate
rate(authorization_decisions_total{disposition="unauthorized"}[5m])

Alert When:

  • Unauthorized rate > 1% - Possible attack or misconfiguration
  • Sudden spike in emergency calls - Possible emergency event
  • Unexpected change in MT/MO ratio - Possible issue

Insights:

  • MT/MO ratio indicates traffic patterns
  • Emergency call rate indicates service usage
  • Unauthorized rate indicates security posture

freeswitch_variable_set_duration_milliseconds

Type: Histogram
Labels: batch_size (1, 5, 10, 25, 50, 100)
Description: Time to set dialplan variables

Usage:

# Average variable set time
rate(freeswitch_variable_set_duration_milliseconds_sum[5m]) /
rate(freeswitch_variable_set_duration_milliseconds_count[5m])

# Variable set time by batch size
histogram_quantile(0.95,
  sum by (le, batch_size) (rate(freeswitch_variable_set_duration_milliseconds_bucket[5m]))
)

Alert When:

  • P95 > 100ms - Slow variable set performance
  • Growing trend - Possible system performance issue

6. Module Processing Metrics

dialplan_module_duration_milliseconds

Type: Histogram
Labels: module (MT, MO, Emergency, CallParams, etc.), call_type
Description: Processing time for each dialplan module

Usage:

# Processing time by module
histogram_quantile(0.95,
  sum by (le, module) (rate(dialplan_module_duration_milliseconds_bucket[5m]))
)

# MT module processing time
histogram_quantile(0.95,
rate(dialplan_module_duration_milliseconds_bucket{module="MT"}[5m])
)

Alert When:

  • Any module P95 > 500ms - Performance issue
  • Growing trend in any module - Potential leak or issue

Insights:

  • Identify which module is slowest
  • Optimize the slowest modules first
  • Compare module times across call types

7. Call Volume Metrics

call_attempts_total

Type: Counter
Labels: call_type (mt, mo, emergency, unauthorized), result (success, rejected)
Description: Total call attempts

Usage:

# Call attempt rate
rate(call_attempts_total[5m])

# Success rate by call type
sum by (call_type) (rate(call_attempts_total{result="success"}[5m])) /
sum by (call_type) (rate(call_attempts_total[5m])) * 100

# Rejected call rate
rate(call_attempts_total{result="rejected"}[5m])

Alert When:

  • Rejected rate > 5% - Possible issue
  • Sudden drop in call volume - Service outage
  • Sudden spike in call volume - Possible attack

active_calls

Type: Gauge
Labels: call_type (mt, mo, emergency)
Description: Currently active calls

Usage:

# Current active calls
active_calls

# Active calls by type
sum by (call_type) (active_calls)

# Peak active calls (last hour)
max_over_time(active_calls[1h])

Alert When:

  • Active calls > capacity - Overload
  • Active calls = 0 for extended time - Service down

8. Simulation Metrics

call_simulations_total

Type: Counter
Labels: call_type (mt, mo, emergency, unauthorized), source (web, api)
Description: Call simulations run

Usage:

# Simulation rate
rate(call_simulations_total[5m])

# Simulations by type
sum by (call_type) (rate(call_simulations_total[5m]))

Insights:

  • Track diagnostic tool usage
  • Identify heavy users
  • Correlate with troubleshooting activity

9. SS7 MAP Metrics

ss7_map_http_duration_milliseconds

Type: Histogram
Labels: operation (sri, prn), result (success, error, timeout)
Buckets: 10, 50, 100, 250, 500, 1000, 2500, 5000, 10000 ms
Description: Duration of SS7 MAP HTTP requests in milliseconds

Usage:

# SS7 MAP Error Rate
sum(rate(ss7_map_operations_total{result="error"}[5m])) /
sum(rate(ss7_map_operations_total[5m])) * 100
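
# P95 SS7 MAP latency by operation (example)
histogram_quantile(0.95,
  sum by (le, operation) (rate(ss7_map_http_duration_milliseconds_bucket[5m]))
)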

Alert When:

  • P95 > 500ms - Slow SS7 MAP responses
  • Error rate > 50% - Critical SS7 MAP issue

ss7_map_operations_total

Type: Counter
Labels: operation (sri, prn), result (success, error)
Description: Total number of SS7 MAP operations
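
Usage (example queries based on the operation and result labels above):

# SS7 MAP operation rate by type
sum by (operation) (rate(ss7_map_operations_total[5m]))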

10. Online Charging Metrics

online_charging_events_total

Type: Counter
Labels: event_type (authorize, answer, reauth, hangup), result (success, nocredit, error, timeout)
Description: Total number of online charging events

Usage:

# OCS Credit Failures
rate(online_charging_events_total{result="nocredit"}[5m])

Alert When:

  • High rate of credit failures

11. System State Metrics

tracked_registrations

Type: Gauge
Description: Number of currently active SIP registrations (from the FreeSWITCH Sofia registration database)
Update interval: Every 10 seconds

Notes:

  • Automatically decrements when registrations expire (FreeSWITCH manages expiration)

tracked_call_sessions

Type: Gauge
Description: Number of currently tracked call sessions in ETS
Update interval: Every 10 seconds
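
Usage (example queries for these state gauges):

# Current SIP registrations and tracked call sessions
tracked_registrations
tracked_call_sessions

# Peak registrations over the last 24 hours
max_over_time(tracked_registrations[24h])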

12. HTTP Request Metrics

http_requests_total

Type: Counter
Labels: endpoint (dialplan, call_event, directory, voicemail, sms_ccr, metrics), status_code (200, 400, 500, etc)
Description: Total number of HTTP requests by endpoint

Usage:

# HTTP Error Rate
sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

Alert When:

  • HTTP 5xx error rate > 10%

13. Call Rejection Metrics

call_rejections_total

Type: Counter
Labels: call_type (mo, mt, emergency, unknown), reason (nocredit, unauthorized, parse_failed, missing_params, hlr_error, etc)
Description: Total number of call rejections by reason

Usage:

# Call Rejection Rate by Reason
sum by (reason) (rate(call_rejections_total[5m]))

Alert When:

  • Rejection rate > 1/sec - Investigation needed

14. Event Socket Connection Metrics

event_socket_connected

Type: Gauge
Labels: connection_type (main, log_listener)
Description: Event Socket connection state (1=connected, 0=disconnected)
Update interval: Real-time on connection state changes

Usage:

# Event Socket Connection Status
event_socket_connected

Alert When:

  • Connection down for > 30 seconds

event_socket_reconnections_total

Type: Counter
Labels: connection_type (main, log_listener), result (attempting, success, failed)
Description: Total number of Event Socket reconnection attempts
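
Usage (example queries):

# Failed reconnection attempts in the last hour
increase(event_socket_reconnections_total{result="failed"}[1h])

# Reconnection attempts by connection type
sum by (connection_type) (rate(event_socket_reconnections_total{result="attempting"}[5m]))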

Grafana Dashboard Integration

The metrics can be visualized in Grafana using the Prometheus data source. Recommended panels:

Dashboard 1: Call Volume

  • Active calls gauge
  • Call attempts rate by type (MO/MT/Emergency)
  • Call rejection rate

Dashboard 2: Diameter Performance

  • Response time heatmap
  • Request/response rates
  • Peer status table
  • Error rate by result code

Dashboard 3: Online Charging Health

  • Credit authorization success rate
  • "No credit" event rate
  • OCS timeout rate

Dashboard 4: System Performance

  • Dialplan generation latency (P50/P95/P99)
  • SS7 MAP response times
  • Overall system availability

Alternatively, a single overview dashboard can be organized into rows:

Row 1: Call Volume

  • Call attempts rate (by type)
  • Active calls gauge
  • Success rate percentage

Row 2: Performance

  • P95 HTTP dialplan request time (by call type) - PRIMARY METRIC
  • P95 Sh lookup time
  • P95 HLR lookup time
  • P95 OCS authorization time
  • P95 dialplan module processing time (by module)

Row 3: Success Rates

  • Sh lookup success rate
  • HLR lookup success rate
  • OCS authorization success rate
  • Call attempt success rate

Row 4: Module Performance

  • P95 processing time by module
  • Module call counts

Row 5: Errors

  • Parameter errors
  • Unauthorized attempts
  • Sh errors
  • HLR errors
  • OCS errors

Critical Alerts

Priority 1 (Page immediately):

# Dialplan completely down
rate(call_attempts_total[5m]) == 0

# HSS completely down
sum(rate(subscriber_data_lookups_total{result="error"}[5m])) /
sum(rate(subscriber_data_lookups_total[5m])) > 0.9

# OCS completely down
sum(rate(ocs_authorization_attempts_total{result="error"}[5m])) /
sum(rate(ocs_authorization_attempts_total[5m])) > 0.9

Priority 2 (Alert):

# Slow dialplan processing (end-to-end HTTP request)
histogram_quantile(0.95,
  sum by (le) (rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
) > 2000

# High HSS error rate
sum(rate(subscriber_data_lookups_total{result="error"}[5m])) /
sum(rate(subscriber_data_lookups_total[5m])) > 0.2

# High OCS error rate
sum(rate(ocs_authorization_attempts_total{result="error"}[5m])) /
sum(rate(ocs_authorization_attempts_total[5m])) > 0.1

Priority 3 (Warning):

# Elevated HSS latency
histogram_quantile(0.95,
rate(subscriber_data_duration_milliseconds_bucket[5m])
) > 100

# Elevated OCS latency
histogram_quantile(0.95,
rate(ocs_authorization_duration_milliseconds_bucket[5m])
) > 1000

# Moderate error rate
sum(rate(call_attempts_total{result="rejected"}[5m])) /
sum(rate(call_attempts_total[5m])) > 0.05

Alerting Examples

Diameter Peer Down

alert: DiameterPeerDown
expr: diameter_peer_state == 0
for: 1m
annotations:
summary: "Diameter peer {{ $labels.peer_host }} is down"

High Diameter Latency

alert: HighDiameterLatency
expr: histogram_quantile(0.95, rate(diameter_response_duration_milliseconds_bucket[5m])) > 1000
for: 5m
annotations:
summary: "Diameter P95 latency above 1s"

OCS Credit Failures

alert: HighOCSCreditFailures
expr: rate(online_charging_events_total{result="nocredit"}[5m]) > 0.1
for: 2m
annotations:
summary: "High rate of OCS credit failures"

SS7 MAP Gateway Errors

alert: SS7MapErrors
expr: sum(rate(ss7_map_operations_total{result="error"}[5m])) / sum(rate(ss7_map_operations_total[5m])) > 0.5
for: 3m
annotations:
  summary: "SS7 MAP error rate above 50%"

Event Socket Disconnected

alert: EventSocketDown
expr: event_socket_connected == 0
for: 30s
annotations:
summary: "Event Socket {{ $labels.connection_type }} disconnected"

High Call Rejection Rate

alert: HighCallRejectionRate
expr: rate(call_rejections_total[5m]) > 1
for: 2m
annotations:
summary: "High call rejection rate: {{ $value }} rejections/sec"

HTTP Error Rate High

alert: HighHTTPErrorRate
expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
for: 3m
annotations:
  summary: "HTTP 5xx error rate above 10%"

Troubleshooting with Metrics

Problem: Metrics showing unrealistic values (nanoseconds instead of milliseconds)

Symptoms:

  • Histogram _sum values are extremely small (e.g., 0.000315 instead of 315)
  • All requests showing in the lowest bucket (< 5ms) when they should be slower
  • Values appear to be 1,000,000x smaller than expected

Root Cause: The Prometheus Elixir library automatically converts time units when metric names end in _milliseconds, _seconds, etc. If duration_unit: false is not set, the library treats your already-converted millisecond values as native time units and divides them by ~1,000,000 again, which is why the stored values are a million times too small.

Investigation:

  1. Check the metric declaration in lib/metrics.ex
  2. Verify duration_unit: false is present:
    Histogram.declare(
      name: :some_duration_milliseconds,
      help: "...",
      buckets: [...],
      duration_unit: false  # Must be present!
    )
  3. Check the measurement code uses proper time conversion:
    start = System.monotonic_time()
    # ... work ...
    duration_ms = System.convert_time_unit(
      System.monotonic_time() - start,
      :native,
      :millisecond
    )
    Histogram.observe([name: :some_duration_milliseconds], duration_ms)

Resolution:

  1. Add duration_unit: false to the histogram declaration
  2. Restart the application (required for metric declarations to reload)
  3. Verify metrics show realistic values after the fix

Example Fix:

# Before (WRONG - observed millisecond values get divided by ~1,000,000)
Histogram.declare(
  name: :http_dialplan_request_duration_milliseconds,
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500]
)

# After (CORRECT - values recorded as milliseconds)
Histogram.declare(
  name: :http_dialplan_request_duration_milliseconds,
  buckets: [100, 250, 500, 750, 1000, 1500, 2000, 3000, 5000],
  duration_unit: false
)

Problem: Call type showing as "unknown"

Symptoms:

  • All metrics show call_type="unknown" instead of mt, mo, or emergency
  • Cannot differentiate performance between call types

Root Cause: The call type extraction is failing or not being properly passed through the processing pipeline.

Investigation:

  1. Check logs for "HTTP dialplan request" messages - they should show the correct call type
  2. Verify process_call/1 returns {xml, call_type} tuple, not just xml
  3. Verify fsapi_conn/1 extracts call type from the tuple: {xml, call_type} = process_call(body)

Resolution: Ensure the dialplan processing pipeline properly threads call type through all functions.

Problem: Calls are slow

Investigation:

  1. Check http_dialplan_request_duration_milliseconds P95 - START HERE
  2. If high, check component timings (see the example queries below):
    • Check subscriber_data_duration_milliseconds for Sh delays
    • Check hlr_data_duration_milliseconds for HLR delays
    • Check ocs_authorization_duration_milliseconds for OCS delays
    • Check dialplan_module_duration_milliseconds for module-specific delays
  3. Check if call_type="unknown" - indicates call type detection failure
  4. Compare MT vs MO vs Emergency processing times
  5. Correlate with system logs for detailed error messages

Resolution: Optimize the slowest component
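
The component drill-down in step 2 can be done with queries like the following (P95 per component, using the histograms documented earlier in this guide):

histogram_quantile(0.95, sum by (le) (rate(subscriber_data_duration_milliseconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(hlr_data_duration_milliseconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(ocs_authorization_duration_milliseconds_bucket[5m])))
histogram_quantile(0.95, sum by (le, module) (rate(dialplan_module_duration_milliseconds_bucket[5m])))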

Problem: Calls are failing

Investigation:

  1. Check call_attempts_total{result="rejected"} rate
  2. Check subscriber_data_lookups_total{result="error"} for Sh issues
  3. Check hlr_lookups_total{result_type="error"} for HLR issues
  4. Check ocs_authorization_attempts_total{result="error"} for OCS issues
  5. Check authorization_decisions_total{disposition="unauthorized"} for auth issues

Resolution: Fix the failing component

Problem: High load

Investigation:

  1. Check active_calls current value
  2. Check call_attempts_total rate
  3. Check if rate matches expected traffic
  4. Compare MT vs MO ratio
  5. Check for unusual patterns (spikes, steady growth)

Resolution: Scale up or investigate unusual traffic

Problem: Roaming issues

Investigation:

  1. Check hlr_lookups_total{result_type="msrn"} rate
  2. Check hlr_data_duration_milliseconds for delays
  3. Use HLR Lookup tool for specific subscribers
  4. Check if MSRN is being retrieved correctly

Resolution: Fix HLR connectivity or configuration

Performance Baselines

Typical Values (Well-Tuned System)

  • HTTP dialplan request (end-to-end): P50: 100-500ms, P95: 500-2000ms, P99: 1000-3000ms
  • Sh lookup time: P50: 15ms, P95: 50ms, P99: 100ms
  • HLR lookup time: P50: 100ms, P95: 300ms, P99: 800ms
  • OCS auth time: P50: 150ms, P95: 500ms, P99: 1500ms
  • Dialplan module processing: P50: 1-5ms, P95: 10-25ms, P99: 50ms
  • Sh success rate: > 99%
  • HLR success rate: > 95% (lower is normal due to offline subscribers)
  • OCS success rate: > 98%
  • Call success rate: > 99%

Note: HTTP dialplan request time is the sum of all component times plus overhead. It should roughly equal: Sh lookup + HLR lookup + OCS auth + dialplan module processing + network/parsing overhead. Minimum expected time is ~100ms (when only Sh lookup is needed), maximum typical time is ~2000ms (with all lookups and retries).
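
As a rough sanity check, the average per-request overhead can be estimated by subtracting the component averages from the end-to-end average. This is only an approximation, since not every request performs every lookup:

sum(rate(http_dialplan_request_duration_milliseconds_sum[5m])) / sum(rate(http_dialplan_request_duration_milliseconds_count[5m]))
-
(
  sum(rate(subscriber_data_duration_milliseconds_sum[5m])) / sum(rate(subscriber_data_duration_milliseconds_count[5m]))
+ sum(rate(hlr_data_duration_milliseconds_sum[5m])) / sum(rate(hlr_data_duration_milliseconds_count[5m]))
+ sum(rate(ocs_authorization_duration_milliseconds_sum[5m])) / sum(rate(ocs_authorization_duration_milliseconds_count[5m]))
)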

Capacity Planning

Monitor these trends:

  • Growth in call_attempts_total rate
  • Growth in active_calls peak
  • Stable or improving P95 latencies
  • Stable or improving success rates

Plan for scaling when:

  • Active calls approaching 80% of capacity
  • P95 latencies growing despite stable load
  • Success rates declining despite stable external systems
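
Example trend queries for these checks (the subquery syntax requires Prometheus 2.7 or later; adjust the windows to your traffic patterns):

# Peak concurrent calls over the past week
max_over_time(sum(active_calls)[7d:1m])

# Peak call attempt rate (per second) over the past week
max_over_time(sum(rate(call_attempts_total[5m]))[7d:5m])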

Integration with Logging

Correlate metrics with logs:

  1. High error rate in metrics → Search logs for ERROR messages
  2. Slow response times → Search logs for WARNING messages about timeouts
  3. Specific call issues → Search logs by call ID or phone number
  4. Use simulation tool to reproduce and debug

Best Practices

  1. Set up dashboards before issues occur
  2. Define alert thresholds based on your baseline
  3. Test alerts by using Call Simulator
  4. Review metrics weekly to identify trends
  5. Correlate metrics with business events (campaigns, outages, etc.)
  6. Use metrics to justify infrastructure investments
  7. Share dashboards with operations team
  8. Document your alert response procedures

Configuration

Metrics collection is automatically enabled when the application starts. The metrics endpoint is exposed on the same port as the API (default: 8080).

To configure Prometheus to scrape metrics, add this job to your prometheus.yml:

scrape_configs:
  - job_name: 'omnitas'
    static_configs:
      - targets: ['<tas-ip>:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s

Metric Cardinality

The metrics are designed with controlled cardinality to avoid overwhelming Prometheus:

  • Peer labels: Limited to configured peers only
  • Call types: Fixed set (mo, mt, emergency, unauthorized)
  • Result codes: Limited to actual Diameter/OCS result codes received
  • Operations: Fixed set per interface (sri/prn for MAP, ccr/cca for Diameter)

Total estimated time series: ~200-500 depending on number of configured peers and active result codes.
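
To verify the actual number of series being scraped (assuming the omnitas job name from the example configuration above):

count({job="omnitas"})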

Metric Retention

Recommended retention periods:

  • Raw metrics: 30 days (high resolution)
  • 5-minute aggregates: 90 days
  • 1-hour aggregates: 1 year
  • Daily aggregates: 5 years

This supports:

  • Real-time troubleshooting (raw metrics)
  • Weekly/monthly analysis (5-min/1-hour aggregates)
  • Capacity planning (daily aggregates)
  • Historical comparison (yearly aggregates)