Prometheus Metrics and Monitoring Guide

Overview

OmniTAS exports comprehensive operational metrics in Prometheus format for monitoring, alerting, and observability. This guide covers all available metrics, their usage, troubleshooting, and monitoring best practices.

Metrics Endpoint

All metrics are exposed at: http://<tas-ip>:8080/metrics
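
Once Prometheus is scraping this endpoint (see the Configuration section below), a quick way to confirm the target is healthy is the standard up metric (this assumes the job name omnitas used in the example scrape configuration):

up{job="omnitas"} == 1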

Important: Metric Time Unit Configuration

All duration metrics in this system use duration_unit: false in their Histogram declarations. This is critical because:

  1. The Prometheus Elixir library automatically detects metric names ending in _milliseconds
  2. By default, it converts native Erlang time units to milliseconds automatically
  3. Our code already converts time to milliseconds using System.convert_time_unit/3
  4. Without duration_unit: false, the library treats the already-converted millisecond values as native (nanosecond-scale) units and divides them by ~1,000,000 again, so recorded values end up a million times too small

Example:

# Correct configuration
Histogram.declare(
  name: :http_dialplan_request_duration_milliseconds,
  help: "Duration of HTTP dialplan requests in milliseconds",
  labels: [:call_type],
  buckets: [100, 250, 500, 750, 1000, 1500, 2000, 3000, 5000],
  duration_unit: false  # REQUIRED to prevent double conversion
)

# Measuring time correctly
start_time = System.monotonic_time()
# ... do work ...
end_time = System.monotonic_time()
duration_ms = System.convert_time_unit(end_time - start_time, :native, :millisecond)
Histogram.observe([name: :http_dialplan_request_duration_milliseconds], duration_ms)

Complete Metric Reference

Diameter Metrics

diameter_response_duration_milliseconds

Type: Histogram
Labels: application (ro, sh), command (ccr, cca, etc), result (success, error, timeout)
Buckets: 10, 50, 100, 250, 500, 1000, 2500, 5000, 10000 ms
Description: Duration of Diameter requests in milliseconds

Usage:

# Average Diameter Response Time
rate(diameter_response_duration_milliseconds_sum[5m]) /
rate(diameter_response_duration_milliseconds_count[5m])

# P95 Diameter latency
histogram_quantile(0.95, rate(diameter_response_duration_milliseconds_bucket[5m]))

Alert When:

  • P95 > 1000ms - Slow Diameter responses

diameter_requests_total

Type: Counter
Labels: application (ro, sh), command (ccr, udr, etc)
Description: Total number of Diameter requests sent

Usage:

# Request rate
rate(diameter_requests_total[5m])

diameter_responses_total

Type: Counter
Labels: application (ro, sh), command (ccr, udr, etc), result_code (2001, 3002, 5xxx, etc)
Description: Total number of Diameter responses received

Usage:

# Success rate
sum(rate(diameter_responses_total{result_code="2001"}[5m])) /
sum(rate(diameter_responses_total[5m])) * 100

diameter_peer_state

Type: Gauge
Labels: peer_host, peer_realm, application (ro, sh)
Description: State of Diameter peers (1=up, 0=down)
Update interval: Every 10 seconds

Usage:

# Check for down peers
diameter_peer_state == 0
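
# Number of down peers per realm (example)
count by (peer_realm) (diameter_peer_state == 0)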

Alert When:

  • Any peer down for > 1 minute

Dialplan Generation Metrics

1. HTTP Request Metrics

http_dialplan_request_duration_milliseconds

Type: Histogram
Labels: call_type (mt, mo, emergency, unknown)
Description: End-to-end HTTP request duration from when the dialplan HTTP request is received to when the response is sent. This includes all processing: parameter parsing, authorization, Diameter lookups (Sh/Ro), HLR lookups (SS7 MAP), and XML generation.

Usage:

# Average end-to-end HTTP request time
rate(http_dialplan_request_duration_milliseconds_sum[5m]) /
rate(http_dialplan_request_duration_milliseconds_count[5m])

# P95 by call type
histogram_quantile(0.95,
  sum by (le, call_type) (rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
)

# Compare MT vs MO performance
histogram_quantile(0.95,
  rate(http_dialplan_request_duration_milliseconds_bucket{call_type="mt"}[5m])
)
# vs
histogram_quantile(0.95,
  rate(http_dialplan_request_duration_milliseconds_bucket{call_type="mo"}[5m])
)

Alert When:

  • P95 > 2000ms - Slow HTTP response times
  • P95 > 3000ms - Critical performance issue
  • P99 > 5000ms - Severe performance degradation
  • Any requests showing call_type="unknown" - Call type detection failure

Insights:

  • This is the most important metric for understanding user-facing latency
  • Typical values: P50: 100-500ms, P95: 500-2000ms, P99: 1000-3000ms
  • Includes all component timings (Sh + HLR + OCS + processing)
  • If this is slow, drill down into component metrics (subscriber_data, hlr_data, ocs_authorization)
  • Expected range: 100ms (fast local calls) to 5000ms (slow with retries/timeouts)

Important Notes:

  • Replaces the older dialplan_generation_duration_milliseconds metric which only measured XML generation
  • Accurately reflects what FreeSWITCH/SBC experiences
  • Use this for SLA monitoring and capacity planning

2. Subscriber Data Metrics

subscriber_data_duration_milliseconds

Type: Histogram
Labels: result (success, error)
Description: Time taken to retrieve subscriber data from the Sh interface (HSS)

Usage:

# Average Sh lookup time
rate(subscriber_data_duration_milliseconds_sum[5m]) /
rate(subscriber_data_duration_milliseconds_count[5m])

# 95th percentile Sh lookup time
histogram_quantile(0.95,
rate(subscriber_data_duration_milliseconds_bucket[5m])
)

Alert When:

  • P95 > 100ms - Slow HSS responses
  • P95 > 500ms - Critical HSS performance issue

subscriber_data_lookups_total

Type: Counter
Labels: result (success, error)
Description: Total number of subscriber data lookups

Usage:

# Sh lookup rate
rate(subscriber_data_lookups_total[5m])

# Sh error rate
rate(subscriber_data_lookups_total{result="error"}[5m])

# Sh success rate percentage
(sum(rate(subscriber_data_lookups_total{result="success"}[5m])) /
sum(rate(subscriber_data_lookups_total[5m]))) * 100

Alert When:

  • Error rate > 5% - HSS connectivity issues
  • Error rate > 20% - Critical HSS failure

3. HLR Data Metrics

hlr_data_duration_milliseconds

Type: Histogram
Labels: result (success, error)
Description: Time taken to retrieve HLR data via SS7 MAP

Usage:

# Average HLR lookup time
rate(hlr_data_duration_milliseconds_sum[5m]) /
rate(hlr_data_duration_milliseconds_count[5m])

# 95th percentile HLR lookup time
histogram_quantile(0.95,
rate(hlr_data_duration_milliseconds_bucket[5m])
)

Alert When:

  • P95 > 500ms - Slow SS7 MAP responses
  • P95 > 2000ms - Critical SS7 MAP issue

hlr_lookups_total

Type: Counter
Labels: result_type (msrn, forwarding, error, unknown)
Description: Total HLR lookups by result type

Usage:

# HLR lookup rate by type
rate(hlr_lookups_total[5m])

# MSRN discovery rate (roaming subscribers)
rate(hlr_lookups_total{result_type="msrn"}[5m])

# Call forwarding discovery rate
rate(hlr_lookups_total{result_type="forwarding"}[5m])

# HLR error rate
rate(hlr_lookups_total{result_type="error"}[5m])

Alert When:

  • Error rate > 10% - SS7 MAP issues
  • Sudden drop in MSRN rate - Possible roaming issue

Insights:

  • High MSRN rate indicates many roaming subscribers
  • High forwarding rate indicates many forwarded calls
  • Compare to call volume for roaming percentage

4. OCS Authorization Metrics

ocs_authorization_duration_milliseconds

Type: Histogram
Labels: result (success, error)
Description: Time taken for OCS authorization

Usage:

# Average OCS auth time
rate(ocs_authorization_duration_milliseconds_sum[5m]) /
rate(ocs_authorization_duration_milliseconds_count[5m])

# 95th percentile OCS auth time
histogram_quantile(0.95,
rate(ocs_authorization_duration_milliseconds_bucket[5m])
)

Alert When:

  • P95 > 1000ms - Slow OCS responses
  • P95 > 5000ms - Critical OCS performance issue

ocs_authorization_attempts_total

Type: Counter
Labels: result (success, error), skipped (yes, no)
Description: Total OCS authorization attempts

Usage:

# OCS authorization rate
rate(ocs_authorization_attempts_total{skipped="no"}[5m])

# OCS error rate
rate(ocs_authorization_attempts_total{result="error",skipped="no"}[5m])

# OCS skip rate (emergency, voicemail, etc.)
rate(ocs_authorization_attempts_total{skipped="yes"}[5m])

# OCS success rate percentage
(sum(rate(ocs_authorization_attempts_total{result="success",skipped="no"}[5m])) /
sum(rate(ocs_authorization_attempts_total{skipped="no"}[5m]))) * 100

Alert When:

  • Error rate > 5% - OCS connectivity issues
  • Success rate < 95% - OCS declining too many calls

Insights:

  • High skip rate indicates many emergency/free calls
  • Error rate spikes indicate OCS outages
  • Compare success rate to business expectations

5. Call Processing Metrics

call_param_errors_total

Type: Counter
Labels: error_type (parse_failed, missing_required_params)
Description: Call parameter parsing errors

Usage:

# Parameter error rate
rate(call_param_errors_total[5m])

# Errors by type
sum by (error_type) (rate(call_param_errors_total[5m]))

Alert When:

  • Any errors > 0 - Indicates malformed call parameter requests
  • Errors > 1% of call volume - Critical issue

authorization_decisions_total

Type: Counter
Labels: disposition (mt, mo, emergency, unauthorized), result (success, error)
Description: Authorization decisions by call type

Usage:

# Authorization rate by disposition
sum by (disposition) (rate(authorization_decisions_total[5m]))

# MT call rate
rate(authorization_decisions_total{disposition="mt"}[5m])

# MO call rate
rate(authorization_decisions_total{disposition="mo"}[5m])

# Emergency call rate
rate(authorization_decisions_total{disposition="emergency"}[5m])

# Unauthorized call rate
rate(authorization_decisions_total{disposition="unauthorized"}[5m])

Alert When:

  • Unauthorized rate > 1% - Possible attack or misconfiguration
  • Sudden spike in emergency calls - Possible emergency event
  • Unexpected change in MT/MO ratio - Possible issue

Insights:

  • MT/MO ratio indicates traffic patterns
  • Emergency call rate indicates service usage
  • Unauthorized rate indicates security posture

freeswitch_variable_set_duration_milliseconds

Type: Histogram
Labels: batch_size (1, 5, 10, 25, 50, 100)
Description: Time to set dialplan variables

Usage:

# Average variable set time
rate(freeswitch_variable_set_duration_milliseconds_sum[5m]) /
rate(freeswitch_variable_set_duration_milliseconds_count[5m])

# Variable set time by batch size
histogram_quantile(0.95,
  sum by (le, batch_size) (rate(freeswitch_variable_set_duration_milliseconds_bucket[5m]))
)

Alert When:

  • P95 > 100ms - Slow variable set performance
  • Growing trend - Possible system performance issue

6. Module Processing Metrics

dialplan_module_duration_milliseconds

Type: Histogram
Labels: module (MT, MO, Emergency, CallParams, etc.), call_type
Description: Processing time for each dialplan module

Usage:

# Processing time by module
histogram_quantile(0.95,
  sum by (le, module) (rate(dialplan_module_duration_milliseconds_bucket[5m]))
)

# MT module processing time
histogram_quantile(0.95,
rate(dialplan_module_duration_milliseconds_bucket{module="MT"}[5m])
)

Alert When:

  • Any module P95 > 500ms - Performance issue
  • Growing trend in any module - Potential leak or issue

Insights:

  • Identify which module is slowest
  • Optimize the slowest modules first
  • Compare module times across call types

7. Call Volume Metrics

call_attempts_total

Type: Counter
Labels: call_type (mt, mo, emergency, unauthorized), result (success, rejected)
Description: Total call attempts

Usage:

# Call attempt rate
rate(call_attempts_total[5m])

# Success rate by call type
sum by (call_type) (rate(call_attempts_total{result="success"}[5m])) /
sum by (call_type) (rate(call_attempts_total[5m])) * 100

# Rejected call rate
rate(call_attempts_total{result="rejected"}[5m])

Alert When:

  • Rejected rate > 5% - Possible issue
  • Sudden drop in call volume - Service outage
  • Sudden spike in call volume - Possible attack

active_calls

Type: Gauge
Labels: call_type (mt, mo, emergency)
Description: Currently active calls

Usage:

# Current active calls
active_calls

# Active calls by type
sum by (call_type) (active_calls)

# Peak active calls (last hour)
max_over_time(active_calls[1h])

Alert When:

  • Active calls > capacity - Overload
  • Active calls = 0 for extended time - Service down

8. Simulation Metrics

call_simulations_total

Type: Counter
Labels: call_type (mt, mo, emergency, unauthorized), source (web, api)
Description: Call simulations run

Usage:

# Simulation rate
rate(call_simulations_total[5m])

# Simulations by type
sum by (call_type) (rate(call_simulations_total[5m]))

Insights:

  • Track diagnostic tool usage
  • Identify heavy users
  • Correlate with troubleshooting activity

9. SS7 MAP Metrics

ss7_map_http_duration_milliseconds

Type: Histogram
Labels: operation (sri, prn), result (success, error, timeout)
Buckets: 10, 50, 100, 250, 500, 1000, 2500, 5000, 10000 ms
Description: Duration of SS7 MAP HTTP requests in milliseconds

Usage:

# SS7 MAP Error Rate
sum(rate(ss7_map_operations_total{result="error"}[5m])) /
sum(rate(ss7_map_operations_total[5m])) * 100
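
# P95 SS7 MAP latency by operation (example)
histogram_quantile(0.95,
  sum by (le, operation) (rate(ss7_map_http_duration_milliseconds_bucket[5m]))
)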

Alert When:

  • P95 > 500ms - Slow SS7 MAP responses
  • Error rate > 50% - Critical SS7 MAP issue

ss7_map_operations_total

Type: Counter
Labels: operation (sri, prn), result (success, error)
Description: Total number of SS7 MAP operations
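
Usage (example queries based on the operation and result labels above):

# SS7 MAP operation rate by type
sum by (operation) (rate(ss7_map_operations_total[5m]))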

10. Online Charging Metrics

online_charging_events_total

Type: Counter
Labels: event_type (authorize, answer, reauth, hangup), result (success, nocredit, error, timeout)
Description: Total number of online charging events

Usage:

# OCS Credit Failures
rate(online_charging_events_total{result="nocredit"}[5m])

Alert When:

  • High rate of credit failures

11. System State Metrics

tracked_registrations

Type: Gauge
Description: Number of currently active SIP registrations (from the FreeSWITCH Sofia registration database)
Update interval: Every 10 seconds

Notes:

  • Automatically decrements when registrations expire (FreeSWITCH manages expiration)

tracked_call_sessions

Type: Gauge
Description: Number of currently tracked call sessions in ETS
Update interval: Every 10 seconds
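
Usage (example queries for these state gauges):

# Current SIP registrations and tracked call sessions
tracked_registrations
tracked_call_sessions

# Peak registrations over the last 24 hours
max_over_time(tracked_registrations[24h])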

12. HTTP Request Metrics

http_requests_total

Type: Counter
Labels: endpoint (dialplan, call_event, directory, voicemail, sms_ccr, metrics), status_code (200, 400, 500, etc)
Description: Total number of HTTP requests by endpoint

Usage:

# HTTP Error Rate
sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

Alert When:

  • HTTP 5xx error rate > 10%

13. Call Rejection Metrics

call_rejections_total

Type: Counter
Labels: call_type (mo, mt, emergency, unknown), reason (nocredit, unauthorized, parse_failed, missing_params, hlr_error, etc)
Description: Total number of call rejections by reason

Usage:

# Call Rejection Rate by Reason
sum by (reason) (rate(call_rejections_total[5m]))

Alert When:

  • Rejection rate > 1/sec - Investigation needed

14. Event Socket Connection Metrics

event_socket_connected

Type: Gauge
Labels: connection_type (main, log_listener)
Description: Event Socket connection state (1=connected, 0=disconnected)
Update interval: Real-time on connection state changes

Usage:

# Event Socket Connection Status
event_socket_connected

Alert When:

  • Connection down for > 30 seconds

event_socket_reconnections_total

Type: Counter
Labels: connection_type (main, log_listener), result (attempting, success, failed)
Description: Total number of Event Socket reconnection attempts
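
Usage (example queries):

# Failed reconnection attempts in the last hour
increase(event_socket_reconnections_total{result="failed"}[1h])

# Reconnection attempts by connection type
sum by (connection_type) (rate(event_socket_reconnections_total{result="attempting"}[5m]))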

Grafana Dashboard Integration

The metrics can be visualized in Grafana using the Prometheus data source. Recommended panels:

Dashboard 1: Call Volume

  • Active calls gauge
  • Call attempts rate by type (MO/MT/Emergency)
  • Call rejection rate

Dashboard 2: Diameter Performance

  • Response time heatmap
  • Request/response rates
  • Peer status table
  • Error rate by result code

Dashboard 3: Online Charging Health

  • Credit authorization success rate
  • "No credit" event rate
  • OCS timeout rate

Dashboard 4: System Performance

  • Dialplan generation latency (P50/P95/P99)
  • SS7 MAP response times
  • Overall system availability

Alternatively, a single overview dashboard can be organized into rows:

Row 1: Call Volume

  • Call attempts rate (by type)
  • Active calls gauge
  • Success rate percentage

Row 2: Performance

  • P95 HTTP dialplan request time (by call type) - PRIMARY METRIC
  • P95 Sh lookup time
  • P95 HLR lookup time
  • P95 OCS authorization time
  • P95 dialplan module processing time (by module)

Row 3: Success Rates

  • Sh lookup success rate
  • HLR lookup success rate
  • OCS authorization success rate
  • Call attempt success rate

Row 4: Module Performance

  • P95 processing time by module
  • Module call counts

Row 5: Errors

  • Parameter errors
  • Unauthorized attempts
  • Sh errors
  • HLR errors
  • OCS errors

Critical Alerts

Priority 1 (Page immediately):

# Dialplan completely down
rate(call_attempts_total[5m]) == 0

# HSS completely down
sum(rate(subscriber_data_lookups_total{result="error"}[5m])) /
sum(rate(subscriber_data_lookups_total[5m])) > 0.9

# OCS completely down
sum(rate(ocs_authorization_attempts_total{result="error"}[5m])) /
sum(rate(ocs_authorization_attempts_total[5m])) > 0.9

Priority 2 (Alert):

# Slow dialplan processing (end-to-end HTTP request)
histogram_quantile(0.95,
  sum by (le) (rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
) > 2000

# High HSS error rate
sum(rate(subscriber_data_lookups_total{result="error"}[5m])) /
sum(rate(subscriber_data_lookups_total[5m])) > 0.2

# High OCS error rate
sum(rate(ocs_authorization_attempts_total{result="error"}[5m])) /
sum(rate(ocs_authorization_attempts_total[5m])) > 0.1

Priority 3 (Warning):

# Elevated HSS latency
histogram_quantile(0.95,
rate(subscriber_data_duration_milliseconds_bucket[5m])
) > 100

# Elevated OCS latency
histogram_quantile(0.95,
rate(ocs_authorization_duration_milliseconds_bucket[5m])
) > 1000

# Moderate error rate
sum(rate(call_attempts_total{result="rejected"}[5m])) /
sum(rate(call_attempts_total[5m])) > 0.05

Alerting Examples

Diameter Peer Down

alert: DiameterPeerDown
expr: diameter_peer_state == 0
for: 1m
annotations:
summary: "Diameter peer {{ $labels.peer_host }} is down"

High Diameter Latency

alert: HighDiameterLatency
expr: histogram_quantile(0.95, rate(diameter_response_duration_milliseconds_bucket[5m])) > 1000
for: 5m
annotations:
summary: "Diameter P95 latency above 1s"

OCS Credit Failures

alert: HighOCSCreditFailures
expr: rate(online_charging_events_total{result="nocredit"}[5m]) > 0.1
for: 2m
annotations:
summary: "High rate of OCS credit failures"

SS7 MAP Gateway Errors

alert: SS7MapErrors
expr: sum(rate(ss7_map_operations_total{result="error"}[5m])) / sum(rate(ss7_map_operations_total[5m])) > 0.5
for: 3m
annotations:
  summary: "SS7 MAP error rate above 50%"

Event Socket Disconnected

alert: EventSocketDown
expr: event_socket_connected == 0
for: 30s
annotations:
summary: "Event Socket {{ $labels.connection_type }} disconnected"

High Call Rejection Rate

alert: HighCallRejectionRate
expr: rate(call_rejections_total[5m]) > 1
for: 2m
annotations:
summary: "High call rejection rate: {{ $value }} rejections/sec"

HTTP Error Rate High

alert: HighHTTPErrorRate
expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
for: 3m
annotations:
  summary: "HTTP 5xx error rate above 10%"

Troubleshooting with Metrics

Problem: Metrics showing unrealistic values (nanoseconds instead of milliseconds)

Symptoms:

  • Histogram _sum values are extremely small (e.g., 0.000315 instead of 315)
  • All requests showing in the lowest bucket (< 5ms) when they should be slower
  • Values appear to be 1,000,000x smaller than expected

Root Cause: The Prometheus Elixir library automatically converts time units when metric names end in _milliseconds, _seconds, etc. If duration_unit: false is not set, the library treats your already-converted millisecond values as native time units and divides them by ~1,000,000 again, which is why the stored values are a million times too small.

Investigation:

  1. Check the metric declaration in lib/metrics.ex
  2. Verify duration_unit: false is present:
    Histogram.declare(
      name: :some_duration_milliseconds,
      help: "...",
      buckets: [...],
      duration_unit: false  # Must be present!
    )
  3. Check the measurement code uses proper time conversion:
    start = System.monotonic_time()
    # ... work ...
    duration_ms = System.convert_time_unit(
      System.monotonic_time() - start,
      :native,
      :millisecond
    )
    Histogram.observe([name: :some_duration_milliseconds], duration_ms)

Resolution:

  1. Add duration_unit: false to the histogram declaration
  2. Restart the application (required for metric declarations to reload)
  3. Verify metrics show realistic values after the fix

Example Fix:

# Before (WRONG - observed millisecond values get divided by ~1,000,000)
Histogram.declare(
  name: :http_dialplan_request_duration_milliseconds,
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500]
)

# After (CORRECT - values recorded as milliseconds)
Histogram.declare(
  name: :http_dialplan_request_duration_milliseconds,
  buckets: [100, 250, 500, 750, 1000, 1500, 2000, 3000, 5000],
  duration_unit: false
)

Problem: Call type showing as "unknown"

Symptoms:

  • All metrics show call_type="unknown" instead of mt, mo, or emergency
  • Cannot differentiate performance between call types

Root Cause: The call type extraction is failing or not being properly passed through the processing pipeline.

Investigation:

  1. Check logs for "HTTP dialplan request" messages - they should show the correct call type
  2. Verify process_call/1 returns {xml, call_type} tuple, not just xml
  3. Verify fsapi_conn/1 extracts call type from the tuple: {xml, call_type} = process_call(body)

Resolution: Ensure the dialplan processing pipeline properly threads call type through all functions.

Problem: Calls are slow

Investigation:

  1. Check http_dialplan_request_duration_milliseconds P95 - START HERE
  2. If high, check component timings (see the example queries below):
    • Check subscriber_data_duration_milliseconds for Sh delays
    • Check hlr_data_duration_milliseconds for HLR delays
    • Check ocs_authorization_duration_milliseconds for OCS delays
    • Check dialplan_module_duration_milliseconds for module-specific delays
  3. Check if call_type="unknown" - indicates call type detection failure
  4. Compare MT vs MO vs Emergency processing times
  5. Correlate with system logs for detailed error messages

Resolution: Optimize the slowest component
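
The component drill-down in step 2 can be done with queries like the following (P95 per component, using the histograms documented earlier in this guide):

histogram_quantile(0.95, sum by (le) (rate(subscriber_data_duration_milliseconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(hlr_data_duration_milliseconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(ocs_authorization_duration_milliseconds_bucket[5m])))
histogram_quantile(0.95, sum by (le, module) (rate(dialplan_module_duration_milliseconds_bucket[5m])))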

Problem: Calls are failing

Investigation:

  1. Check call_attempts_total{result="rejected"} rate
  2. Check subscriber_data_lookups_total{result="error"} for Sh issues
  3. Check hlr_lookups_total{result_type="error"} for HLR issues
  4. Check ocs_authorization_attempts_total{result="error"} for OCS issues
  5. Check authorization_decisions_total{disposition="unauthorized"} for auth issues

Resolution: Fix the failing component

Problem: High load

Investigation:

  1. Check active_calls current value
  2. Check call_attempts_total rate
  3. Check if rate matches expected traffic
  4. Compare MT vs MO ratio
  5. Check for unusual patterns (spikes, steady growth)

Resolution: Scale up or investigate unusual traffic

Problem: Roaming issues

Investigation:

  1. Check hlr_lookups_total{result_type="msrn"} rate
  2. Check hlr_data_duration_milliseconds for delays
  3. Use HLR Lookup tool for specific subscribers
  4. Check if MSRN is being retrieved correctly

Resolution: Fix HLR connectivity or configuration

Performance Baselines

Typical Values (Well-Tuned System)

  • HTTP dialplan request (end-to-end): P50: 100-500ms, P95: 500-2000ms, P99: 1000-3000ms
  • Sh lookup time: P50: 15ms, P95: 50ms, P99: 100ms
  • HLR lookup time: P50: 100ms, P95: 300ms, P99: 800ms
  • OCS auth time: P50: 150ms, P95: 500ms, P99: 1500ms
  • Dialplan module processing: P50: 1-5ms, P95: 10-25ms, P99: 50ms
  • Sh success rate: > 99%
  • HLR success rate: > 95% (lower is normal due to offline subscribers)
  • OCS success rate: > 98%
  • Call success rate: > 99%

Note: HTTP dialplan request time is the sum of all component times plus overhead. It should roughly equal: Sh lookup + HLR lookup + OCS auth + dialplan module processing + network/parsing overhead. Minimum expected time is ~100ms (when only Sh lookup is needed), maximum typical time is ~2000ms (with all lookups and retries).
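
As a rough sanity check, the average per-request overhead can be estimated by subtracting the component averages from the end-to-end average. This is only an approximation, since not every request performs every lookup:

sum(rate(http_dialplan_request_duration_milliseconds_sum[5m])) / sum(rate(http_dialplan_request_duration_milliseconds_count[5m]))
-
(
  sum(rate(subscriber_data_duration_milliseconds_sum[5m])) / sum(rate(subscriber_data_duration_milliseconds_count[5m]))
+ sum(rate(hlr_data_duration_milliseconds_sum[5m])) / sum(rate(hlr_data_duration_milliseconds_count[5m]))
+ sum(rate(ocs_authorization_duration_milliseconds_sum[5m])) / sum(rate(ocs_authorization_duration_milliseconds_count[5m]))
)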

Capacity Planning

Monitor these trends:

  • Growth in call_attempts_total rate
  • Growth in active_calls peak
  • Stable or improving P95 latencies
  • Stable or improving success rates

Plan for scaling when:

  • Active calls approaching 80% of capacity
  • P95 latencies growing despite stable load
  • Success rates declining despite stable external systems
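
Example trend queries for these checks (the subquery syntax requires Prometheus 2.7 or later; adjust the windows to your traffic patterns):

# Peak concurrent calls over the past week
max_over_time(sum(active_calls)[7d:1m])

# Peak call attempt rate (per second) over the past week
max_over_time(sum(rate(call_attempts_total[5m]))[7d:5m])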

Integration with Logging

Correlate metrics with logs:

  1. High error rate in metrics → Search logs for ERROR messages
  2. Slow response times → Search logs for WARNING messages about timeouts
  3. Specific call issues → Search logs by call ID or phone number
  4. Use simulation tool to reproduce and debug

Best Practices

  1. Set up dashboards before issues occur
  2. Define alert thresholds based on your baseline
  3. Test alerts by using Call Simulator
  4. Review metrics weekly to identify trends
  5. Correlate metrics with business events (campaigns, outages, etc.)
  6. Use metrics to justify infrastructure investments
  7. Share dashboards with operations team
  8. Document your alert response procedures

Configuration

Metrics collection is automatically enabled when the application starts. The metrics endpoint is exposed on the same port as the API (default: 8080).

To configure Prometheus to scrape metrics, add this job to your prometheus.yml:

scrape_configs:
  - job_name: 'omnitas'
    static_configs:
      - targets: ['<tas-ip>:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s

Metric Cardinality

The metrics are designed with controlled cardinality to avoid overwhelming Prometheus:

  • Peer labels: Limited to configured peers only
  • Call types: Fixed set (mo, mt, emergency, unauthorized)
  • Result codes: Limited to actual Diameter/OCS result codes received
  • Operations: Fixed set per interface (sri/prn for MAP, ccr/cca for Diameter)

Total estimated time series: ~200-500 depending on number of configured peers and active result codes.
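
To verify the actual number of series being scraped (assuming the omnitas job name from the example configuration above):

count({job="omnitas"})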

Metric Retention

Recommended retention periods:

  • Raw metrics: 30 days (high resolution)
  • 5-minute aggregates: 90 days
  • 1-hour aggregates: 1 year
  • Daily aggregates: 5 years

This supports:

  • Real-time troubleshooting (raw metrics)
  • Weekly/monthly analysis (5-min/1-hour aggregates)
  • Capacity planning (daily aggregates)
  • Historical comparison (yearly aggregates)