Prometheus Metrics and Monitoring Guide
Overview
OmniTAS exports comprehensive operational metrics in Prometheus format for monitoring, alerting, and observability. This guide covers all available metrics, their usage, troubleshooting, and monitoring best practices.
Metrics Endpoint
All metrics are exposed at: http://<tas-ip>:8080/metrics
Important: Metric Time Unit Configuration
All duration metrics in this system use duration_unit: false in their Histogram declarations. This is critical because:
- The Prometheus Elixir library automatically detects metric names ending in _milliseconds
- By default, it converts native Erlang time units to milliseconds automatically
- Our code already converts time to milliseconds using System.convert_time_unit/3
- Without duration_unit: false, the library would treat our already-converted milliseconds as native time units and divide by ~1,000,000 again, shrinking every value a million-fold
Example:
# Correct configuration
Histogram.declare(
name: :http_dialplan_request_duration_milliseconds,
help: "Duration of HTTP dialplan requests in milliseconds",
labels: [:call_type],
buckets: [100, 250, 500, 750, 1000, 1500, 2000, 3000, 5000],
duration_unit: false # REQUIRED to prevent double conversion
)
# Measuring time correctly
start_time = System.monotonic_time()
# ... do work ...
end_time = System.monotonic_time()
duration_ms = System.convert_time_unit(end_time - start_time, :native, :millisecond)
Histogram.observe([name: :http_dialplan_request_duration_milliseconds], duration_ms)
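To see why the double conversion matters, the arithmetic can be sketched outside the application (Python used purely for illustration; the 315 ms figure is hypothetical):

```python
# Sketch of the double-conversion failure mode described above.
# Assumption: one native Erlang time unit corresponds to one nanosecond,
# so the library divides by 1,000,000 when converting to milliseconds.
NATIVE_UNITS_PER_MS = 1_000_000

measured_ms = 315  # value our code already converted with System.convert_time_unit/3

# With duration_unit: false the library stores the observation as-is:
stored_correct = measured_ms  # 315

# Without duration_unit: false the library assumes the value is in native
# units and converts it to milliseconds a second time:
stored_wrong = measured_ms / NATIVE_UNITS_PER_MS  # 0.000315

print(stored_correct, stored_wrong)
```

This is exactly the symptom covered in the troubleshooting section: _sum values a million times smaller than reality.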
Complete Metric Reference
Diameter Metrics
diameter_response_duration_milliseconds
Type: Histogram
Labels: application (ro, sh), command (ccr, cca, etc), result (success, error, timeout)
Buckets: 10, 50, 100, 250, 500, 1000, 2500, 5000, 10000 ms
Description: Duration of Diameter requests in milliseconds
Usage:
# Average Diameter Response Time
rate(diameter_response_duration_milliseconds_sum[5m]) /
rate(diameter_response_duration_milliseconds_count[5m])
# P95 Diameter latency
histogram_quantile(0.95, rate(diameter_response_duration_milliseconds_bucket[5m]))
Alert When:
- P95 > 1000ms - Slow Diameter responses
diameter_requests_total
Type: Counter
Labels: application (ro, sh), command (ccr, udr, etc)
Description: Total number of Diameter requests sent
Usage:
# Request rate
rate(diameter_requests_total[5m])
diameter_responses_total
Type: Counter
Labels: application (ro, sh), command (ccr, udr, etc), result_code (2001, 3002, 5xxx, etc)
Description: Total number of Diameter responses received
Usage:
# Success rate
sum(rate(diameter_responses_total{result_code="2001"}[5m])) /
sum(rate(diameter_responses_total[5m])) * 100
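If dashboards query this success ratio frequently, it can be precomputed with a recording rule. A sketch (the group and record names are illustrative, not part of OmniTAS):

```yaml
groups:
  - name: omnitas_diameter
    rules:
      # result_code 2001 = DIAMETER_SUCCESS
      - record: omnitas:diameter_success_ratio:rate5m
        expr: |
          sum(rate(diameter_responses_total{result_code="2001"}[5m]))
          /
          sum(rate(diameter_responses_total[5m]))
```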
diameter_peer_state
Type: Gauge
Labels: peer_host, peer_realm, application (ro, sh)
Description: State of Diameter peers (1=up, 0=down)
Update interval: Every 10 seconds
Usage:
# Check for down peers
diameter_peer_state == 0
Alert When:
- Any peer down for > 1 minute
Dialplan Generation Metrics
1. HTTP Request Metrics
http_dialplan_request_duration_milliseconds
Type: Histogram
Labels: call_type (mt, mo, emergency, unknown)
Description: End-to-end HTTP request duration from when the dialplan HTTP request is received to when the response is sent. This includes all processing: parameter parsing, authorization, Diameter lookups (Sh/Ro), HLR lookups (SS7 MAP), and XML generation.
Usage:
# Average end-to-end HTTP request time
rate(http_dialplan_request_duration_milliseconds_sum[5m]) /
rate(http_dialplan_request_duration_milliseconds_count[5m])
# P95 by call type
histogram_quantile(0.95,
sum by (le, call_type) (rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
)
# Compare MT vs MO performance
histogram_quantile(0.95,
rate(http_dialplan_request_duration_milliseconds_bucket{call_type="mt"}[5m])
)
vs
histogram_quantile(0.95,
rate(http_dialplan_request_duration_milliseconds_bucket{call_type="mo"}[5m])
)
Alert When:
- P95 > 2000ms - Slow HTTP response times
- P95 > 3000ms - Critical performance issue
- P99 > 5000ms - Severe performance degradation
- Any requests showing call_type="unknown" - Call type detection failure
Insights:
- This is the most important metric for understanding user-facing latency
- Typical values: P50: 100-500ms, P95: 500-2000ms, P99: 1000-3000ms
- Includes all component timings (Sh + HLR + OCS + processing)
- If this is slow, drill down into component metrics (subscriber_data, hlr_data, ocs_authorization)
- Expected range: 100ms (fast local calls) to 5000ms (slow with retries/timeouts)
Important Notes:
- Replaces the older dialplan_generation_duration_milliseconds metric, which only measured XML generation
- Accurately reflects what FreeSWITCH/SBC experiences
- Use this for SLA monitoring and capacity planning
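The P95 and P99 thresholds above translate directly into alert rules in the same style as the Alerting Examples section later in this guide (rule names are illustrative):

```yaml
alert: SlowDialplanRequests
expr: |
  histogram_quantile(0.95,
    sum by (le) (rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
  ) > 2000
for: 5m
annotations:
  summary: "Dialplan HTTP P95 latency above 2s"
---
alert: CriticalDialplanLatency
expr: |
  histogram_quantile(0.99,
    sum by (le) (rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
  ) > 5000
for: 5m
annotations:
  summary: "Dialplan HTTP P99 latency above 5s"
```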
2. Subscriber Data Metrics
subscriber_data_duration_milliseconds
Type: Histogram
Labels: result (success, error)
Description: Time taken to retrieve subscriber data from Sh interface (HSS)
Usage:
# Average Sh lookup time
rate(subscriber_data_duration_milliseconds_sum[5m]) /
rate(subscriber_data_duration_milliseconds_count[5m])
# 95th percentile Sh lookup time
histogram_quantile(0.95,
rate(subscriber_data_duration_milliseconds_bucket[5m])
)
Alert When:
- P95 > 100ms - Slow HSS responses
- P95 > 500ms - Critical HSS performance issue
subscriber_data_lookups_total
Type: Counter
Labels: result (success, error)
Description: Total number of subscriber data lookups
Usage:
# Sh lookup rate
rate(subscriber_data_lookups_total[5m])
# Sh error rate
rate(subscriber_data_lookups_total{result="error"}[5m])
# Sh success rate percentage
(sum(rate(subscriber_data_lookups_total{result="success"}[5m])) /
sum(rate(subscriber_data_lookups_total[5m]))) * 100
Alert When:
- Error rate > 5% - HSS connectivity issues
- Error rate > 20% - Critical HSS failure
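These error-rate thresholds can be expressed as alert rules; sum() aggregates the success/error series before dividing (rule names illustrative):

```yaml
alert: HighShErrorRate
expr: sum(rate(subscriber_data_lookups_total{result="error"}[5m])) / sum(rate(subscriber_data_lookups_total[5m])) > 0.05
for: 5m
annotations:
  summary: "Sh lookup error rate above 5% - HSS connectivity issues"
---
alert: CriticalShErrorRate
expr: sum(rate(subscriber_data_lookups_total{result="error"}[5m])) / sum(rate(subscriber_data_lookups_total[5m])) > 0.20
for: 2m
annotations:
  summary: "Sh lookup error rate above 20% - critical HSS failure"
```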
3. HLR Data Metrics
hlr_data_duration_milliseconds
Type: Histogram
Labels: result (success, error)
Description: Time taken to retrieve HLR data via SS7 MAP
Usage:
# Average HLR lookup time
rate(hlr_data_duration_milliseconds_sum[5m]) /
rate(hlr_data_duration_milliseconds_count[5m])
# 95th percentile HLR lookup time
histogram_quantile(0.95,
rate(hlr_data_duration_milliseconds_bucket[5m])
)
Alert When:
- P95 > 500ms - Slow SS7 MAP responses
- P95 > 2000ms - Critical SS7 MAP issue
hlr_lookups_total
Type: Counter
Labels: result_type (msrn, forwarding, error, unknown)
Description: Total HLR lookups by result type
Usage:
# HLR lookup rate by type
rate(hlr_lookups_total[5m])
# MSRN discovery rate (roaming subscribers)
rate(hlr_lookups_total{result_type="msrn"}[5m])
# Call forwarding discovery rate
rate(hlr_lookups_total{result_type="forwarding"}[5m])
# HLR error rate
rate(hlr_lookups_total{result_type="error"}[5m])
Alert When:
- Error rate > 10% - SS7 MAP issues
- Sudden drop in MSRN rate - Possible roaming issue
Insights:
- High MSRN rate indicates many roaming subscribers
- High forwarding rate indicates many forwarded calls
- Compare to call volume for roaming percentage
4. OCS Authorization Metrics
ocs_authorization_duration_milliseconds
Type: Histogram
Labels: result (success, error)
Description: Time taken for OCS authorization
Usage:
# Average OCS auth time
rate(ocs_authorization_duration_milliseconds_sum[5m]) /
rate(ocs_authorization_duration_milliseconds_count[5m])
# 95th percentile OCS auth time
histogram_quantile(0.95,
rate(ocs_authorization_duration_milliseconds_bucket[5m])
)
Alert When:
- P95 > 1000ms - Slow OCS responses
- P95 > 5000ms - Critical OCS performance issue
ocs_authorization_attempts_total
Type: Counter
Labels: result (success, error), skipped (yes, no)
Description: Total OCS authorization attempts
Usage:
# OCS authorization rate
rate(ocs_authorization_attempts_total{skipped="no"}[5m])
# OCS error rate
rate(ocs_authorization_attempts_total{result="error",skipped="no"}[5m])
# OCS skip rate (emergency, voicemail, etc.)
rate(ocs_authorization_attempts_total{skipped="yes"}[5m])
# OCS success rate percentage
(sum(rate(ocs_authorization_attempts_total{result="success",skipped="no"}[5m])) /
sum(rate(ocs_authorization_attempts_total{skipped="no"}[5m]))) * 100
Alert When:
- Error rate > 5% - OCS connectivity issues
- Success rate < 95% - OCS declining too many calls
Insights:
- High skip rate indicates many emergency/free calls
- Error rate spikes indicate OCS outages
- Compare success rate to business expectations
5. Call Processing Metrics
call_param_errors_total
Type: Counter
Labels: error_type (parse_failed, missing_required_params)
Description: Call parameter parsing errors
Usage:
# Parameter error rate
rate(call_param_errors_total[5m])
# Errors by type
sum by (error_type) (rate(call_param_errors_total[5m]))
Alert When:
- Any errors > 0 - Indicates malformed call parameter requests
- Errors > 1% of call volume - Critical issue
authorization_decisions_total
Type: Counter
Labels: disposition (mt, mo, emergency, unauthorized), result (success, error)
Description: Authorization decisions by call type
Usage:
# Authorization rate by disposition
sum by (disposition) (rate(authorization_decisions_total[5m]))
# MT call rate
rate(authorization_decisions_total{disposition="mt"}[5m])
# MO call rate
rate(authorization_decisions_total{disposition="mo"}[5m])
# Emergency call rate
rate(authorization_decisions_total{disposition="emergency"}[5m])
# Unauthorized call rate
rate(authorization_decisions_total{disposition="unauthorized"}[5m])
Alert When:
- Unauthorized rate > 1% - Possible attack or misconfiguration
- Sudden spike in emergency calls - Possible emergency event
- Unexpected change in MT/MO ratio - Possible issue
Insights:
- MT/MO ratio indicates traffic patterns
- Emergency call rate indicates service usage
- Unauthorized rate indicates security posture
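The 1% unauthorized threshold above translates into a rule like the following sketch (name illustrative):

```yaml
alert: HighUnauthorizedRate
expr: |
  sum(rate(authorization_decisions_total{disposition="unauthorized"}[5m]))
  /
  sum(rate(authorization_decisions_total[5m])) > 0.01
for: 5m
annotations:
  summary: "More than 1% of authorization decisions are unauthorized - possible attack or misconfiguration"
```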
freeswitch_variable_set_duration_milliseconds
Type: Histogram
Labels: batch_size (1, 5, 10, 25, 50, 100)
Description: Time to set Dialplan Variables
Usage:
# Average variable set time
rate(freeswitch_variable_set_duration_milliseconds_sum[5m]) /
rate(freeswitch_variable_set_duration_milliseconds_count[5m])
# Variable set time by batch size
histogram_quantile(0.95,
sum by (le, batch_size) (rate(freeswitch_variable_set_duration_milliseconds_bucket[5m]))
)
Alert When:
- P95 > 100ms - Slow variable set performance
- Growing trend - Possible system performance issue
6. Module Processing Metrics
dialplan_module_duration_milliseconds
Type: Histogram
Labels: module (MT, MO, Emergency, CallParams, etc.), call_type
Description: Processing time for each dialplan module
Usage:
# Processing time by module
histogram_quantile(0.95,
sum by (le, module) (rate(dialplan_module_duration_milliseconds_bucket[5m]))
)
# MT module processing time
histogram_quantile(0.95,
rate(dialplan_module_duration_milliseconds_bucket{module="MT"}[5m])
)
Alert When:
- Any module P95 > 500ms - Performance issue
- Growing trend in any module - Potential leak or issue
Insights:
- Identify which module is slowest
- Optimize the slowest modules first
- Compare module times across call types
7. Call Volume Metrics
call_attempts_total
Type: Counter
Labels: call_type (mt, mo, emergency, unauthorized), result (success, rejected)
Description: Total call attempts
Usage:
# Call attempt rate
rate(call_attempts_total[5m])
# Success rate by call type
(sum by (call_type) (rate(call_attempts_total{result="success"}[5m])) /
sum by (call_type) (rate(call_attempts_total[5m]))) * 100
# Rejected call rate
rate(call_attempts_total{result="rejected"}[5m])
Alert When:
- Rejected rate > 5% - Possible issue
- Sudden drop in call volume - Service outage
- Sudden spike in call volume - Possible attack
active_calls
Type: Gauge
Labels: call_type (mt, mo, emergency)
Description: Currently active calls
Usage:
# Current active calls
active_calls
# Active calls by type
sum by (call_type) (active_calls)
# Peak active calls (last hour)
max_over_time(active_calls[1h])
Alert When:
- Active calls > capacity - Overload
- Active calls = 0 for extended time - Service down
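Both conditions can be sketched as alert rules; the capacity figure is a placeholder that must be replaced with your provisioned maximum:

```yaml
alert: ActiveCallsNearCapacity
expr: sum(active_calls) > 800  # placeholder: ~80% of an assumed 1000-call capacity
for: 5m
annotations:
  summary: "Active calls approaching provisioned capacity"
---
alert: NoActiveCalls
expr: sum(active_calls) == 0
for: 15m
annotations:
  summary: "No active calls for 15 minutes - possible service outage"
```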
8. Simulation Metrics
call_simulations_total
Type: Counter
Labels: call_type (mt, mo, emergency, unauthorized), source (web, api)
Description: Call simulations run
Usage:
# Simulation rate
rate(call_simulations_total[5m])
# Simulations by type
sum by (call_type) (rate(call_simulations_total[5m]))
Insights:
- Track diagnostic tool usage
- Identify heavy users
- Correlate with troubleshooting activity
9. SS7 MAP Metrics
ss7_map_http_duration_milliseconds
Type: Histogram
Labels: operation (sri, prn), result (success, error, timeout)
Buckets: 10, 50, 100, 250, 500, 1000, 2500, 5000, 10000 ms
Description: Duration of SS7 MAP HTTP requests in milliseconds
Usage:
# SS7 MAP Error Rate
sum(rate(ss7_map_operations_total{result="error"}[5m])) /
sum(rate(ss7_map_operations_total[5m])) * 100
Alert When:
- P95 > 500ms - Slow SS7 MAP responses
- Error rate > 50% - Critical SS7 MAP issue
ss7_map_operations_total
Type: Counter
Labels: operation (sri, prn), result (success, error)
Description: Total number of SS7 MAP operations
10. Online Charging Metrics
online_charging_events_total
Type: Counter
Labels: event_type (authorize, answer, reauth, hangup), result (success, nocredit, error, timeout)
Description: Total number of online charging events
Usage:
# OCS Credit Failures
rate(online_charging_events_total{result="nocredit"}[5m])
Alert When:
- High rate of credit failures
11. System State Metrics
tracked_registrations
Type: Gauge
Description: Number of currently active SIP registrations (from FreeSWITCH Sofia registration database)
Update interval: Every 10 seconds
Notes:
- Automatically decrements when registrations expire (FreeSWITCH manages expiration)
tracked_call_sessions
Type: Gauge
Description: Number of currently tracked call sessions in ETS
Update interval: Every 10 seconds
12. HTTP Request Metrics
http_requests_total
Type: Counter
Labels: endpoint (dialplan, call_event, directory, voicemail, sms_ccr, metrics), status_code (200, 400, 500, etc)
Description: Total number of HTTP requests by endpoint
Usage:
# HTTP Error Rate
sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
Alert When:
- HTTP 5xx error rate > 10%
13. Call Rejection Metrics
call_rejections_total
Type: Counter
Labels: call_type (mo, mt, emergency, unknown), reason (nocredit, unauthorized, parse_failed, missing_params, hlr_error, etc)
Description: Total number of call rejections by reason
Usage:
# Call Rejection Rate by Reason
sum by (reason) (rate(call_rejections_total[5m]))
Alert When:
- Rejection rate > 1/sec - Investigation needed
14. Event Socket Connection Metrics
event_socket_connected
Type: Gauge
Labels: connection_type (main, log_listener)
Description: Event Socket connection state (1=connected, 0=disconnected)
Update interval: Real-time on connection state changes
Usage:
# Event Socket Connection Status
event_socket_connected
Alert When:
- Connection down for > 30 seconds
event_socket_reconnections_total
Type: Counter
Labels: connection_type (main, log_listener), result (attempting, success, failed)
Description: Total number of Event Socket reconnection attempts
Grafana Dashboard Integration
The metrics can be visualized in Grafana using the Prometheus data source. Recommended panels:
Dashboard 1: Call Volume
- Active calls gauge
- Call attempts rate by type (MO/MT/Emergency)
- Call rejection rate
Dashboard 2: Diameter Performance
- Response time heatmap
- Request/response rates
- Peer status table
- Error rate by result code
Dashboard 3: Online Charging Health
- Credit authorization success rate
- "No credit" event rate
- OCS timeout rate
Dashboard 4: System Performance
- Dialplan generation latency (P50/P95/P99)
- SS7 MAP response times
- Overall system availability
Recommended Grafana Dashboard Layout
Row 1: Call Volume
- Call attempts rate (by type)
- Active calls gauge
- Success rate percentage
Row 2: Performance
- P95 HTTP dialplan request time (by call type) - PRIMARY METRIC
- P95 Sh lookup time
- P95 HLR lookup time
- P95 OCS authorization time
- P95 dialplan module processing time (by module)
Row 3: Success Rates
- Sh lookup success rate
- HLR lookup success rate
- OCS authorization success rate
- Call attempt success rate
Row 4: Module Performance
- P95 processing time by module
- Module call counts
Row 5: Errors
- Parameter errors
- Unauthorized attempts
- Sh errors
- HLR errors
- OCS errors
Critical Alerts
Priority 1 (Page immediately):
# Dialplan completely down
rate(call_attempts_total[5m]) == 0
# HSS completely down
sum(rate(subscriber_data_lookups_total{result="error"}[5m])) /
sum(rate(subscriber_data_lookups_total[5m])) > 0.9
# OCS completely down
sum(rate(ocs_authorization_attempts_total{result="error"}[5m])) /
sum(rate(ocs_authorization_attempts_total[5m])) > 0.9
Priority 2 (Alert):
# Slow dialplan generation (uses the end-to-end HTTP metric;
# dialplan_generation_duration_milliseconds is deprecated)
histogram_quantile(0.95,
sum by (le) (rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
) > 2000
# High HSS error rate
sum(rate(subscriber_data_lookups_total{result="error"}[5m])) /
sum(rate(subscriber_data_lookups_total[5m])) > 0.2
# High OCS error rate
sum(rate(ocs_authorization_attempts_total{result="error"}[5m])) /
sum(rate(ocs_authorization_attempts_total[5m])) > 0.1
Priority 3 (Warning):
# Elevated HSS latency
histogram_quantile(0.95,
rate(subscriber_data_duration_milliseconds_bucket[5m])
) > 100
# Elevated OCS latency
histogram_quantile(0.95,
rate(ocs_authorization_duration_milliseconds_bucket[5m])
) > 1000
# Moderate error rate
sum(rate(call_attempts_total{result="rejected"}[5m])) /
sum(rate(call_attempts_total[5m])) > 0.05
Alerting Examples
Diameter Peer Down
alert: DiameterPeerDown
expr: diameter_peer_state == 0
for: 1m
annotations:
summary: "Diameter peer {{ $labels.peer_host }} is down"
High Diameter Latency
alert: HighDiameterLatency
expr: histogram_quantile(0.95, rate(diameter_response_duration_milliseconds_bucket[5m])) > 1000
for: 5m
annotations:
summary: "Diameter P95 latency above 1s"
OCS Credit Failures
alert: HighOCSCreditFailures
expr: rate(online_charging_events_total{result="nocredit"}[5m]) > 0.1
for: 2m
annotations:
summary: "High rate of OCS credit failures"
SS7 MAP Gateway Errors
alert: SS7MapErrors
expr: sum(rate(ss7_map_operations_total{result="error"}[5m])) / sum(rate(ss7_map_operations_total[5m])) > 0.5
for: 3m
annotations:
summary: "SS7 MAP error rate above 50%"
Event Socket Disconnected
alert: EventSocketDown
expr: event_socket_connected == 0
for: 30s
annotations:
summary: "Event Socket {{ $labels.connection_type }} disconnected"
High Call Rejection Rate
alert: HighCallRejectionRate
expr: rate(call_rejections_total[5m]) > 1
for: 2m
annotations:
summary: "High call rejection rate: {{ $value }} rejections/sec"
HTTP Error Rate High
alert: HighHTTPErrorRate
expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
for: 3m
annotations:
summary: "HTTP 5xx error rate above 10%"
Troubleshooting with Metrics
Problem: Metrics showing unrealistic values (nanoseconds instead of milliseconds)
Symptoms:
- Histogram _sum values are extremely small (e.g., 0.000315 instead of 315)
- All requests showing in the lowest bucket (< 5ms) when they should be slower
- Values appear to be 1,000,000x smaller than expected
Root Cause:
The Prometheus Elixir library automatically converts time units when metric names end in _milliseconds, _seconds, etc. If duration_unit: false is not set, the library treats your already-converted milliseconds as native time units and divides them by ~1,000,000 again, shrinking every observation by a factor of one million.
Investigation:
- Check the metric declaration in lib/metrics.ex
- Verify duration_unit: false is present:
Histogram.declare(
name: :some_duration_milliseconds,
help: "...",
buckets: [...],
duration_unit: false # Must be present!
)
- Check the measurement code uses proper time conversion:
start = System.monotonic_time()
# ... work ...
duration_ms = System.convert_time_unit(
System.monotonic_time() - start,
:native,
:millisecond
)
Histogram.observe([name: :some_duration_milliseconds], duration_ms)
Resolution:
- Add duration_unit: false to the histogram declaration
- Restart the application (required for metric declarations to reload)
- Verify metrics show realistic values after the fix
Example Fix:
# Before (WRONG - will show nanoseconds)
Histogram.declare(
name: :http_dialplan_request_duration_milliseconds,
buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500]
)
# After (CORRECT - will show milliseconds)
Histogram.declare(
name: :http_dialplan_request_duration_milliseconds,
buckets: [100, 250, 500, 750, 1000, 1500, 2000, 3000, 5000],
duration_unit: false
)
Problem: Call type showing as "unknown"
Symptoms:
- All metrics show call_type="unknown" instead of mt, mo, or emergency
- Cannot differentiate performance between call types
Root Cause: The call type extraction is failing or not being properly passed through the processing pipeline.
Investigation:
- Check logs for "HTTP dialplan request" messages - they should show the correct call type
- Verify process_call/1 returns a {xml, call_type} tuple, not just xml
- Verify fsapi_conn/1 extracts call type from the tuple: {xml, call_type} = process_call(body)
Resolution: Ensure the dialplan processing pipeline properly threads call type through all functions.
Problem: Calls are slow
Investigation:
- Check http_dialplan_request_duration_milliseconds P95 - START HERE
- If high, check component timings:
  - Check subscriber_data_duration_milliseconds for Sh delays
  - Check hlr_data_duration_milliseconds for HLR delays
  - Check ocs_authorization_duration_milliseconds for OCS delays
  - Check dialplan_module_duration_milliseconds for module-specific delays
- Check if call_type="unknown" - indicates call type detection failure
- Compare MT vs MO vs Emergency processing times
- Correlate with system logs for detailed error messages
Resolution: Optimize the slowest component
Problem: Calls are failing
Investigation:
- Check call_attempts_total{result="rejected"} rate
- Check subscriber_data_lookups_total{result="error"} for Sh issues
- Check hlr_lookups_total{result_type="error"} for HLR issues
- Check ocs_authorization_attempts_total{result="error"} for OCS issues
- Check authorization_decisions_total{disposition="unauthorized"} for auth issues
Resolution: Fix the failing component
Problem: High load
Investigation:
- Check active_calls current value
- Check call_attempts_total rate
- Check if rate matches expected traffic
- Compare MT vs MO ratio
- Check for unusual patterns (spikes, steady growth)
Resolution: Scale up or investigate unusual traffic
Problem: Roaming issues
Investigation:
- Check hlr_lookups_total{result_type="msrn"} rate
- Check hlr_data_duration_milliseconds for delays
- Use HLR Lookup tool for specific subscribers
- Check if MSRN is being retrieved correctly
Resolution: Fix HLR connectivity or configuration
Performance Baselines
Typical Values (Well-Tuned System)
- HTTP dialplan request (end-to-end): P50: 100-500ms, P95: 500-2000ms, P99: 1000-3000ms
- Sh lookup time: P50: 15ms, P95: 50ms, P99: 100ms
- HLR lookup time: P50: 100ms, P95: 300ms, P99: 800ms
- OCS auth time: P50: 150ms, P95: 500ms, P99: 1500ms
- Dialplan module processing: P50: 1-5ms, P95: 10-25ms, P99: 50ms
- Sh success rate: > 99%
- HLR success rate: > 95% (lower is normal due to offline subscribers)
- OCS success rate: > 98%
- Call success rate: > 99%
Note: HTTP dialplan request time is the sum of all component times plus overhead. It should roughly equal: Sh lookup + HLR lookup + OCS auth + dialplan module processing + network/parsing overhead. Minimum expected time is ~100ms (when only Sh lookup is needed), maximum typical time is ~2000ms (with all lookups and retries).
Capacity Planning
Monitor these trends:
- Growth in call_attempts_total rate
- Growth in active_calls peak
- Stable or improving P95 latencies
- Stable or improving success rates
Plan for scaling when:
- Active calls approaching 80% of capacity
- P95 latencies growing despite stable load
- Success rates declining despite stable external systems
Integration with Logging
Correlate metrics with logs:
- High error rate in metrics → Search logs for ERROR messages
- Slow response times → Search logs for WARNING messages about timeouts
- Specific call issues → Search logs by call ID or phone number
- Use simulation tool to reproduce and debug
Best Practices
- Set up dashboards before issues occur
- Define alert thresholds based on your baseline
- Test alerts by using Call Simulator
- Review metrics weekly to identify trends
- Correlate metrics with business events (campaigns, outages, etc.)
- Use metrics to justify infrastructure investments
- Share dashboards with operations team
- Document your alert response procedures
Configuration
Metrics collection is automatically enabled when the application starts. The metrics endpoint is exposed on the same port as the API (default: 8080).
To configure Prometheus to scrape metrics, add this job to your prometheus.yml:
scrape_configs:
  - job_name: 'omnitas'
    static_configs:
      - targets: ['<tas-ip>:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s
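If you save the alert rules from this guide into a rules file, reference it from the same prometheus.yml (the file name and Alertmanager target below are examples, not OmniTAS defaults):

```yaml
rule_files:
  - 'omnitas_alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```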
Metric Cardinality
The metrics are designed with controlled cardinality to avoid overwhelming Prometheus:
- Peer labels: Limited to configured peers only
- Call types: Fixed set (mo, mt, emergency, unauthorized)
- Result codes: Limited to actual Diameter/OCS result codes received
- Operations: Fixed set per interface (sri/prn for MAP, ccr/cca for Diameter)
Total estimated time series: ~200-500 depending on number of configured peers and active result codes.
Metric Retention
Recommended retention periods:
- Raw metrics: 30 days (high resolution)
- 5-minute aggregates: 90 days
- 1-hour aggregates: 1 year
- Daily aggregates: 5 years
This supports:
- Real-time troubleshooting (raw metrics)
- Weekly/monthly analysis (5-min/1-hour aggregates)
- Capacity planning (daily aggregates)
- Historical comparison (yearly aggregates)