Metrics Documentation

This document describes the Prometheus metrics exposed by the IMS Application Server components.

Metrics Endpoints

| Port | Endpoint | Purpose | Section |
|---|---|---|---|
| 9090 | /metrics | System, gateway, and core telephony metrics | Port 9090 - System Metrics |
| 8080 | /metrics | TAS engine, Diameter, HLR, OCS, and Erlang VM metrics | Port 8080 - TAS Engine Metrics |
| 9093 | /esl?module=default | RTP/RTCP media quality and call statistics | Port 9093 - Media & Call Quality Metrics |
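
Each endpoint returns metrics in the Prometheus text exposition format, so a quick manual check is to fetch one and parse the sample lines. The following is a minimal sketch rather than a full parser: the function name parse_samples and the sample text are illustrative assumptions, and it skips comment lines and ignores optional timestamps.

```python
import re

# One sample line looks like:  metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Illustrative input only; real output comes from e.g. http://localhost:9090/metrics
example = (
    "# TYPE freeswitch_current_calls gauge\n"
    "freeswitch_current_calls 3\n"
    'freeswitch_load_module{module="mod_sofia"} 1\n'
)
for sample in parse_samples(example):
    print(sample)
```

To run it against a live endpoint, the text could be read with urllib.request.urlopen and decoded before being passed to parse_samples.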

Port 9090 - System Metrics

Call and Session Metrics

| Metric Name | Port | Description |
|---|---|---|
| freeswitch_bridged_calls | 9090 | Number of bridged calls currently active |
| freeswitch_detailed_bridged_calls | 9090 | Number of detailed bridged calls active |
| freeswitch_current_calls | 9090 | Number of calls currently active |
| freeswitch_detailed_calls | 9090 | Number of detailed calls active |
| freeswitch_current_channels | 9090 | Number of channels currently active |
| freeswitch_current_sessions | 9090 | Number of sessions currently active |
| freeswitch_current_sessions_peak | 9090 | Peak number of sessions since startup |
| freeswitch_current_sessions_peak_last_5min | 9090 | Peak number of sessions in the last 5 minutes |
| freeswitch_sessions_total | 9090 | Total number of sessions since startup (counter) |
| freeswitch_current_sps | 9090 | Current sessions per second |
| freeswitch_current_sps_peak | 9090 | Peak sessions per second since startup |
| freeswitch_current_sps_peak_last_5min | 9090 | Peak sessions per second in the last 5 minutes |
| freeswitch_max_sessions | 9090 | Maximum number of sessions allowed |
| freeswitch_max_sps | 9090 | Maximum sessions per second allowed |

System Resource Metrics

| Metric Name | Port | Description |
|---|---|---|
| freeswitch_current_idle_cpu | 9090 | Current CPU idle percentage |
| freeswitch_min_idle_cpu | 9090 | Minimum CPU idle percentage recorded |
| freeswitch_uptime_seconds | 9090 | Uptime in seconds |
| freeswitch_time_synced | 9090 | Whether system time is in sync with exporter host time (1=synced, 0=not synced) |

Memory Metrics

| Metric Name | Port | Description |
|---|---|---|
| freeswitch_memory_arena | 9090 | Total non-mmapped bytes (malloc arena) |
| freeswitch_memory_ordblks | 9090 | Number of free chunks |
| freeswitch_memory_smblks | 9090 | Number of free fastbin blocks |
| freeswitch_memory_hblks | 9090 | Number of mapped regions |
| freeswitch_memory_hblkhd | 9090 | Bytes in mapped regions |
| freeswitch_memory_usmblks | 9090 | Maximum total allocated space |
| freeswitch_memory_fsmblks | 9090 | Free bytes held in fastbins |
| freeswitch_memory_uordblks | 9090 | Total allocated space |
| freeswitch_memory_fordblks | 9090 | Total free space |
| freeswitch_memory_keepcost | 9090 | Topmost releasable block |

Codec Status Metrics

| Metric Name | Port | Description |
|---|---|---|
| freeswitch_codec_status | 9090 | Codec status with labels: ikey (module), name (codec name), type (codec). Value=1 indicates codec is available |

Available Codecs Include:

  • G.711 alaw/ulaw
  • PROXY PASS-THROUGH
  • PROXY VIDEO PASS-THROUGH
  • RAW Signed Linear (16 bit)
  • Speex
  • VP8/VP9 Video
  • AMR variants
  • B64
  • G.723.1, G.729, G.722, G.726 variants
  • OPUS
  • MP3
  • ADPCM, GSM, LPC-10

Endpoint Status Metrics

| Metric Name | Port | Description |
|---|---|---|
| freeswitch_endpoint_status | 9090 | Endpoint status with labels: ikey (module), name (endpoint name), type (endpoint). Value=1 indicates endpoint is available |

Available Endpoints Include:

  • error, group, pickup, user (mod_dptools)
  • loopback, null (mod_loopback)
  • rtc (mod_rtc)
  • rtp, sofia (mod_sofia)
  • modem (mod_spandsp)

Module Status Metrics

| Metric Name | Port | Description |
|---|---|---|
| freeswitch_load_module | 9090 | Module load status (1=loaded, 0=not loaded) with label: module |

Key Modules Monitored:

  • mod_sofia (SIP)
  • mod_conference, mod_conference_ims
  • mod_opus, mod_g729, mod_amr, etc.
  • mod_event_socket
  • mod_dptools
  • mod_python3
  • mod_rtc
  • And many more...

Registration Metrics

| Metric Name | Port | Description |
|---|---|---|
| freeswitch_registrations | 9090 | Total number of active registrations |
| freeswitch_registration_defails | 9090 | Detailed registration information with labels: expires, hostname, network_ip, network_port, network_proto, realm, reg_user, token, url |

Sofia Gateway Metrics

| Metric Name | Port | Description |
|---|---|---|
| freeswitch_sofia_gateway_status | 9090 | Gateway status with labels: context, name, profile, proxy, scheme, status (UP/DOWN) |
| freeswitch_sofia_gateway_call_in | 9090 | Number of inbound calls through gateway |
| freeswitch_sofia_gateway_call_out | 9090 | Number of outbound calls through gateway |
| freeswitch_sofia_gateway_failed_call_in | 9090 | Number of failed inbound calls |
| freeswitch_sofia_gateway_failed_call_out | 9090 | Number of failed outbound calls |
| freeswitch_sofia_gateway_ping | 9090 | Last ping timestamp (Unix epoch) |
| freeswitch_sofia_gateway_pingtime | 9090 | Last ping time in milliseconds |
| freeswitch_sofia_gateway_pingfreq | 9090 | Ping frequency in seconds |
| freeswitch_sofia_gateway_pingcount | 9090 | Number of pings sent |
| freeswitch_sofia_gateway_pingmin | 9090 | Minimum ping time recorded |
| freeswitch_sofia_gateway_pingmax | 9090 | Maximum ping time recorded |

Exporter Health Metrics

| Metric Name | Port | Description |
|---|---|---|
| freeswitch_up | 9090 | Whether the last scrape was successful (1=success, 0=failure) |
| freeswitch_exporter_total_scrapes | 9090 | Total number of scrapes performed (counter) |
| freeswitch_exporter_failed_scrapes | 9090 | Total number of failed scrapes (counter) |

Port 8080 - TAS Engine Metrics

These metrics are exposed by the Telephony Application Server engine and provide insight into call processing, database operations, and Erlang VM performance.

Application Call Metrics

| Metric Name | Port | Description |
|---|---|---|
| call_simulations_total | 8080 | Total number of call simulations (counter) |
| call_attempts_total | 8080 | Total number of call attempts (counter) |
| call_rejections_total | 8080 | Total number of call rejections by reason (counter) |
| call_param_errors_total | 8080 | Total number of call parameter parsing errors (counter) |
| active_calls | 8080 | Number of currently active calls with labels: call_type (mo/mt/emergency) |
| tracked_call_sessions | 8080 | Number of currently tracked call sessions in ETS |

Diameter Protocol Metrics

| Metric Name | Port | Description |
|---|---|---|
| diameter_peer_state | 8080 | State of Diameter peers (1=up, 0=down) with labels: peer_host, peer_realm, application |
| diameter_requests_total | 8080 | Total number of Diameter requests (counter) |
| diameter_responses_total | 8080 | Total number of Diameter responses (counter) |
| diameter_response_duration_milliseconds | 8080 | Duration of Diameter requests in milliseconds (histogram) |

Telephony Operations Metrics

| Metric Name | Port | Description |
|---|---|---|
| hlr_lookups_total | 8080 | Total number of HLR lookups (counter) |
| hlr_data_duration_milliseconds | 8080 | Duration of HLR data retrieval in milliseconds (histogram) |
| subscriber_data_lookups_total | 8080 | Total number of subscriber data lookups (counter) |
| subscriber_data_duration_milliseconds | 8080 | Duration of Sh subscriber data retrieval in milliseconds (histogram) |
| ss7_map_operations_total | 8080 | Total number of SS7 MAP operations (counter) |
| ss7_map_http_duration_milliseconds | 8080 | Duration of SS7 MAP HTTP requests in milliseconds (histogram) |
| tracked_registrations | 8080 | Number of currently tracked SIP registrations |

Online Charging System (OCS) Metrics

| Metric Name | Port | Description |
|---|---|---|
| ocs_authorization_attempts_total | 8080 | Total number of OCS authorization attempts (counter) |
| ocs_authorization_duration_milliseconds | 8080 | Duration of OCS authorization in milliseconds (histogram) |
| online_charging_events_total | 8080 | Total number of online charging events (counter) |
| authorization_decisions_total | 8080 | Total number of authorization decisions (counter) |

Dialplan & Processing Metrics

| Metric Name | Port | Description |
|---|---|---|
| http_requests_total | 8080 | Total number of HTTP requests with labels: endpoint, status_code (counter) |
| http_dialplan_request_duration_milliseconds | 8080 | Duration of HTTP dialplan requests in milliseconds (histogram) |
| dialplan_module_duration_milliseconds | 8080 | Duration of individual dialplan module processing (histogram) |
| freeswitch_variable_set_duration_milliseconds | 8080 | Duration of variable setting operations (histogram) |

Event Socket Metrics

| Metric Name | Port | Description |
|---|---|---|
| event_socket_connected | 8080 | Event Socket connection state (1=connected, 0=disconnected) with label: connection_type |
| event_socket_reconnections_total | 8080 | Total number of Event Socket reconnection attempts (counter) |

Erlang Mnesia Database Metrics

| Metric Name | Port | Description |
|---|---|---|
| erlang_mnesia_held_locks | 8080 | Number of held locks |
| erlang_mnesia_lock_queue | 8080 | Number of transactions waiting for a lock |
| erlang_mnesia_transaction_participants | 8080 | Number of participant transactions |
| erlang_mnesia_transaction_coordinators | 8080 | Number of coordinator transactions |
| erlang_mnesia_failed_transactions | 8080 | Number of failed (aborted) transactions (counter) |
| erlang_mnesia_committed_transactions | 8080 | Number of committed transactions (counter) |
| erlang_mnesia_logged_transactions | 8080 | Number of transactions logged (counter) |
| erlang_mnesia_restarted_transactions | 8080 | Total number of transaction restarts (counter) |
| erlang_mnesia_memory_usage_bytes | 8080 | Total bytes allocated by all mnesia tables |
| erlang_mnesia_tablewise_memory_usage_bytes | 8080 | Bytes allocated per mnesia table with label: table |
| erlang_mnesia_tablewise_size | 8080 | Number of rows per table with label: table |

Erlang VM Memory Metrics

| Metric Name | Port | Description |
|---|---|---|
| erlang_vm_memory_atom_bytes_total | 8080 | Memory allocated for atoms with label: usage (used/free) |
| erlang_vm_memory_bytes_total | 8080 | Total memory allocated with label: kind (system/processes) |
| erlang_vm_memory_dets_tables | 8080 | DETS tables count |
| erlang_vm_memory_ets_tables | 8080 | ETS tables count |
| erlang_vm_memory_processes_bytes_total | 8080 | Memory allocated for processes with label: usage (used/free) |
| erlang_vm_memory_system_bytes_total | 8080 | Memory for emulator (not process-related) with label: usage (atom/binary/code/ets/other) |

Erlang VM Statistics

| Metric Name | Port | Description |
|---|---|---|
| erlang_vm_statistics_bytes_output_total | 8080 | Total bytes output to ports (counter) |
| erlang_vm_statistics_bytes_received_total | 8080 | Total bytes received through ports (counter) |
| erlang_vm_statistics_context_switches | 8080 | Total context switches since startup (counter) |
| erlang_vm_statistics_dirty_cpu_run_queue_length | 8080 | Length of dirty CPU run-queue |
| erlang_vm_statistics_dirty_io_run_queue_length | 8080 | Length of dirty IO run-queue |
| erlang_vm_statistics_garbage_collection_number_of_gcs | 8080 | Number of garbage collections (counter) |
| erlang_vm_statistics_garbage_collection_bytes_reclaimed | 8080 | Bytes reclaimed by GC (counter) |
| erlang_vm_statistics_garbage_collection_words_reclaimed | 8080 | Words reclaimed by GC (counter) |
| erlang_vm_statistics_reductions_total | 8080 | Total reductions (counter) |
| erlang_vm_statistics_run_queues_length | 8080 | Length of normal run-queues |
| erlang_vm_statistics_runtime_milliseconds | 8080 | Sum of runtime for all threads (counter) |
| erlang_vm_statistics_wallclock_time_milliseconds | 8080 | Real time measured (counter) |

Erlang VM System Information

| Metric Name | Port | Description |
|---|---|---|
| erlang_vm_dirty_cpu_schedulers | 8080 | Number of dirty CPU scheduler threads |
| erlang_vm_dirty_cpu_schedulers_online | 8080 | Number of dirty CPU schedulers online |
| erlang_vm_dirty_io_schedulers | 8080 | Number of dirty I/O scheduler threads |
| erlang_vm_ets_limit | 8080 | Maximum number of ETS tables allowed |
| erlang_vm_logical_processors | 8080 | Number of logical processors configured |
| erlang_vm_logical_processors_available | 8080 | Number of logical processors available |
| erlang_vm_logical_processors_online | 8080 | Number of logical processors online |
| erlang_vm_port_count | 8080 | Number of ports currently existing |
| erlang_vm_port_limit | 8080 | Maximum number of ports allowed |
| erlang_vm_process_count | 8080 | Number of processes currently existing |
| erlang_vm_process_limit | 8080 | Maximum number of processes allowed |
| erlang_vm_schedulers | 8080 | Number of scheduler threads |
| erlang_vm_schedulers_online | 8080 | Number of schedulers online |
| erlang_vm_smp_support | 8080 | 1 if compiled with SMP support, 0 otherwise |
| erlang_vm_threads | 8080 | 1 if compiled with thread support, 0 otherwise |
| erlang_vm_thread_pool_size | 8080 | Number of async threads in pool |
| erlang_vm_time_correction | 8080 | 1 if time correction enabled, 0 otherwise |
| erlang_vm_wordsize_bytes | 8080 | Size of Erlang term words in bytes |
| erlang_vm_atom_count | 8080 | Number of atoms currently existing |
| erlang_vm_atom_limit | 8080 | Maximum number of atoms allowed |

Erlang VM Microstate Accounting (MSACC)

Detailed time tracking for scheduler activities with labels: type, id

| Metric Name | Port | Description |
|---|---|---|
| erlang_vm_msacc_aux_seconds_total | 8080 | Time spent handling auxiliary jobs (counter) |
| erlang_vm_msacc_check_io_seconds_total | 8080 | Time spent checking for new I/O events (counter) |
| erlang_vm_msacc_emulator_seconds_total | 8080 | Time spent executing Erlang processes (counter) |
| erlang_vm_msacc_gc_seconds_total | 8080 | Time spent in garbage collection (counter) |
| erlang_vm_msacc_other_seconds_total | 8080 | Time spent on unaccounted activities (counter) |
| erlang_vm_msacc_port_seconds_total | 8080 | Time spent executing ports (counter) |
| erlang_vm_msacc_sleep_seconds_total | 8080 | Time spent sleeping (counter) |
| erlang_vm_msacc_alloc_seconds_total | 8080 | Time spent managing memory (counter) |
| erlang_vm_msacc_bif_seconds_total | 8080 | Time spent in BIFs (counter) |
| erlang_vm_msacc_busy_wait_seconds_total | 8080 | Time spent busy waiting (counter) |
| erlang_vm_msacc_ets_seconds_total | 8080 | Time spent in ETS BIFs (counter) |
| erlang_vm_msacc_gc_full_seconds_total | 8080 | Time spent in fullsweep GC (counter) |
| erlang_vm_msacc_nif_seconds_total | 8080 | Time spent in NIFs (counter) |
| erlang_vm_msacc_send_seconds_total | 8080 | Time spent sending messages (counter) |
| erlang_vm_msacc_timers_seconds_total | 8080 | Time spent managing timers (counter) |

Erlang VM Allocators

Detailed memory allocator metrics with labels: alloc, instance_no, kind, usage

| Metric Name | Port | Description |
|---|---|---|
| erlang_vm_allocators | 8080 | Allocated (carriers_size) and used (blocks_size) memory for different allocators. See erts_alloc(3). |

Allocator types include: temp_alloc, sl_alloc, std_alloc, ll_alloc, eheap_alloc, ets_alloc, fix_alloc, literal_alloc, binary_alloc, driver_alloc


Port 9093 - Media & Call Quality Metrics

These metrics provide real-time RTP/RTCP statistics and call quality information per channel.

| Metric Name | Port | Description |
|---|---|---|
| freeswitch_info | 9093 | System info with label: version |
| freeswitch_up | 9093 | Ready status (1=ready, 0=not ready) |
| freeswitch_stack_bytes | 9093 | Stack size in bytes |
| freeswitch_session_total | 9093 | Total number of sessions |
| freeswitch_session_active | 9093 | Active number of sessions |
| freeswitch_session_limit | 9093 | Session limit |
| rtp_channel_info | 9093 | RTP channel info with labels for channel details |

RTP Audio - Byte Counters

| Metric Name | Port | Description |
|---|---|---|
| rtp_audio_in_raw_bytes_total | 9093 | Total bytes received (including headers) |
| rtp_audio_out_raw_bytes_total | 9093 | Total bytes sent (including headers) |
| rtp_audio_in_media_bytes_total | 9093 | Total media bytes received (payload only) |
| rtp_audio_out_media_bytes_total | 9093 | Total media bytes sent (payload only) |

RTP Audio - Packet Counters

| Metric Name | Port | Description |
|---|---|---|
| rtp_audio_in_packets_total | 9093 | Total packets received |
| rtp_audio_out_packets_total | 9093 | Total packets sent |
| rtp_audio_in_media_packets_total | 9093 | Total media packets received |
| rtp_audio_out_media_packets_total | 9093 | Total media packets sent |
| rtp_audio_in_skip_packets_total | 9093 | Inbound packets discarded |
| rtp_audio_out_skip_packets_total | 9093 | Outbound packets discarded |

RTP Audio - Special Packet Types

| Metric Name | Port | Description |
|---|---|---|
| rtp_audio_in_jitter_packets_total | 9093 | Jitter buffer packets received |
| rtp_audio_in_dtmf_packets_total | 9093 | DTMF packets received |
| rtp_audio_out_dtmf_packets_total | 9093 | DTMF packets sent |
| rtp_audio_in_cng_packets_total | 9093 | Comfort Noise Generation packets received |
| rtp_audio_out_cng_packets_total | 9093 | Comfort Noise Generation packets sent |
| rtp_audio_in_flush_packets_total | 9093 | Flushed packets (buffer resets) |

RTP Audio - Jitter & Quality Metrics

| Metric Name | Port | Description |
|---|---|---|
| rtp_audio_in_jitter_buffer_bytes_max | 9093 | Largest jitter buffer size in bytes |
| rtp_audio_in_jitter_seconds_min | 9093 | Minimum jitter in seconds |
| rtp_audio_in_jitter_seconds_max | 9093 | Maximum jitter in seconds |
| rtp_audio_in_jitter_loss_rate | 9093 | Packet loss rate due to jitter (ratio) |
| rtp_audio_in_jitter_burst_rate | 9093 | Packet burst rate due to jitter (ratio) |
| rtp_audio_in_mean_interval_seconds | 9093 | Mean interval between inbound packets |
| rtp_audio_in_flaw_total | 9093 | Total audio flaws detected (glitches, artifacts) |
| rtp_audio_in_quality_percent | 9093 | Audio quality as percentage (0-100) |
| rtp_audio_in_quality_mos | 9093 | Mean Opinion Score (1-5, where 5 is best) |

RTCP Metrics

| Metric Name | Port | Description |
|---|---|---|
| rtcp_audio_bytes_total | 9093 | Total RTCP bytes |
| rtcp_audio_packets_total | 9093 | Total RTCP packets |

Go Runtime Metrics

| Metric Name | Port | Description |
|---|---|---|
| go_goroutines | 9090 | Number of goroutines currently running |
| go_threads | 9090 | Number of OS threads created |
| go_info | 9090 | Information about the Go environment (with version label) |
| go_gc_duration_seconds | 9090 | Pause duration of garbage collection cycles (summary) |
| go_memstats_alloc_bytes | 9090 | Number of bytes allocated and still in use |
| go_memstats_alloc_bytes_total | 9090 | Total number of bytes allocated (counter) |
| go_memstats_heap_alloc_bytes | 9090 | Heap bytes allocated and still in use |
| go_memstats_heap_idle_bytes | 9090 | Heap bytes waiting to be used |
| go_memstats_heap_inuse_bytes | 9090 | Heap bytes currently in use |
| go_memstats_heap_objects | 9090 | Number of allocated heap objects |
| go_memstats_heap_released_bytes | 9090 | Heap bytes released to OS |
| go_memstats_heap_sys_bytes | 9090 | Heap bytes obtained from system |
| go_memstats_sys_bytes | 9090 | Total bytes obtained from system |

Process Metrics

| Metric Name | Port | Description |
|---|---|---|
| process_cpu_seconds_total | 9090 | Total user and system CPU time spent (counter) |
| process_max_fds | 9090 | Maximum number of open file descriptors |
| process_open_fds | 9090 | Current number of open file descriptors |
| process_resident_memory_bytes | 9090 | Resident memory size in bytes |
| process_virtual_memory_bytes | 9090 | Virtual memory size in bytes |
| process_virtual_memory_max_bytes | 9090 | Maximum amount of virtual memory available |
| process_start_time_seconds | 9090 | Process start time since Unix epoch |

Prometheus HTTP Metrics

| Metric Name | Port | Description |
|---|---|---|
| promhttp_metric_handler_requests_in_flight | 9090 | Current number of scrapes being served |
| promhttp_metric_handler_requests_total | 9090 | Total number of scrapes by HTTP status code (counter) |

Metric Types

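  • gauge: A metric that can go up or down (e.g., freeswitch_current_calls, freeswitch_current_idle_cpu)
  • counter: A metric that only increases and restarts from zero when the process restarts (e.g., freeswitch_sessions_total, freeswitch_exporter_failed_scrapes)
  • summary: A metric that tracks quantiles over a sliding time window (e.g., go_gc_duration_seconds)
  • histogram: A metric that counts observations in cumulative buckets, from which quantiles are estimated at query time (e.g., hlr_data_duration_milliseconds)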
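
Because a counter only increases, it is normally queried as a rate rather than read directly. The sketch below illustrates the idea behind rate() on raw samples, including the restart case where the value drops back toward zero. It is a simplified approximation: PromQL's rate() also extrapolates to the window boundaries, which this does not attempt, and the function name counter_rate is an assumption.

```python
def counter_rate(samples):
    """Approximate per-second increase of a counter.

    samples: list of (timestamp_seconds, value) tuples in time order.
    Returns None when fewer than two samples or no elapsed time is available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current < previous:
            # The counter dropped: treat it as a restart that began again at zero.
            increase += current
        else:
            increase += current - previous
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Illustrative samples: a restart happens between t=30 and t=60.
print(counter_rate([(0, 100), (30, 130), (60, 10), (90, 40)]))
```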

Usage

To scrape these metrics, configure your Prometheus server to scrape all three endpoints:

scrape_configs:
  - job_name: 'ims_as_system'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'ims_as_engine'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'

  - job_name: 'ims_as_media'
    static_configs:
      - targets: ['localhost:9093']
    metrics_path: '/esl'
    params:
      module: ['default']
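
For reference, each job resolves to one request URL built from the target, the metrics path (which defaults to /metrics), and any params. The helper below is a hypothetical illustration of that mapping, not part of Prometheus; it assumes plain HTTP.

```python
from urllib.parse import urlencode, urlunsplit


def scrape_url(target, metrics_path="/metrics", params=None, scheme="http"):
    """Build the URL a scrape job would request for one target."""
    # doseq=True expands list values the way the params block lists them.
    query = urlencode(params or {}, doseq=True)
    return urlunsplit((scheme, target, metrics_path, query, ""))


print(scrape_url("localhost:9090"))
print(scrape_url("localhost:8080", "/metrics"))
print(scrape_url("localhost:9093", "/esl", {"module": ["default"]}))
```

The third URL matches the 9093 endpoint listed under Metrics Endpoints, which is a convenient way to confirm the job configuration before reloading Prometheus.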

Example Queries

General Metrics

Current call volume:

freeswitch_current_calls

Gateway health:

freeswitch_sofia_gateway_status{status="UP"}

Average ping time to gateways:

avg(freeswitch_sofia_gateway_pingtime)

Sessions per second rate:

freeswitch_current_sps

Memory usage:

freeswitch_memory_uordblks

Media Quality Metrics

Call quality (MOS score):

rtp_audio_in_quality_mos

Audio quality percentage:

rtp_audio_in_quality_percent

Jitter buffer packet rate:

rate(rtp_audio_in_jitter_packets_total[5m])

Packet loss rate:

rtp_audio_in_jitter_loss_rate

Average jitter range (maximum minus minimum):

avg(rtp_audio_in_jitter_seconds_max - rtp_audio_in_jitter_seconds_min)

RTP bandwidth (inbound, bits per second):

rate(rtp_audio_in_media_bytes_total[1m]) * 8

Audio flaws detected:

increase(rtp_audio_in_flaw_total[5m])

TAS Engine Metrics

Active calls by type:

active_calls

Diameter peer health:

diameter_peer_state{application="sh"}

Call attempt rate:

rate(call_attempts_total[5m])

HLR lookup latency (95th percentile):

histogram_quantile(0.95, rate(hlr_data_duration_milliseconds_bucket[5m]))

OCS authorization latency (99th percentile):

histogram_quantile(0.99, rate(ocs_authorization_duration_milliseconds_bucket[5m]))

Subscriber data lookup rate:

rate(subscriber_data_lookups_total[5m])

Diameter request success rate:

sum(rate(diameter_responses_total{result="success"}[5m])) / sum(rate(diameter_requests_total[5m]))

Event Socket connection status:

event_socket_connected

Mnesia transaction performance:

rate(erlang_mnesia_committed_transactions[5m])

Mnesia failed transaction rate:

rate(erlang_mnesia_failed_transactions[5m])

Erlang VM process count:

erlang_vm_process_count

Erlang VM memory usage:

erlang_vm_memory_bytes_total

Garbage collection rate:

rate(erlang_vm_statistics_garbage_collection_number_of_gcs[5m])

Scheduler run queue length:

erlang_vm_statistics_run_queues_length

ETS table count:

erlang_vm_memory_ets_tables

HTTP dialplan request duration (median):

histogram_quantile(0.5, rate(http_dialplan_request_duration_milliseconds_bucket[5m]))

Metric Time Unit Configuration

Important for Developers:

All duration metrics in this system use duration_unit: false in their Histogram declarations. This is critical because:

  1. The Prometheus Elixir library automatically detects metric names ending in _milliseconds
  2. By default, it converts native Erlang time units to milliseconds automatically
  3. Our code already converts time to milliseconds using System.convert_time_unit/3
  4. Without duration_unit: false, the library would treat our millisecond values as native time units and convert them a second time (dividing by ~1,000,000 when the native unit is nanoseconds)
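
The arithmetic behind the double conversion can be sketched in a few lines. This is an illustration only, written in Python rather than Elixir, and it assumes the native time unit is nanoseconds and that the second conversion divides, as described in the list above; the names are hypothetical.

```python
NATIVE_UNITS_PER_MS = 1_000_000  # assumption: native time unit is nanoseconds


def native_to_ms(value):
    """Convert a value expressed in native time units to milliseconds."""
    return value / NATIVE_UNITS_PER_MS


elapsed_native = 250 * NATIVE_UNITS_PER_MS       # a 250 ms operation, in native units
converted_once = native_to_ms(elapsed_native)    # what the application code observes
converted_twice = native_to_ms(converted_once)   # what a second conversion would record

print(converted_once)
print(converted_twice)
```

The second value is six orders of magnitude away from the real duration, which is why every observation would land in the wrong bucket.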

Example:

# Correct configuration (assumes `use Prometheus.Metric` in the enclosing module)
Histogram.declare(
  name: :http_dialplan_request_duration_milliseconds,
  help: "Duration of HTTP dialplan requests in milliseconds",
  labels: [:call_type],
  buckets: [100, 250, 500, 750, 1000, 1500, 2000, 3000, 5000],
  duration_unit: false # REQUIRED to prevent double conversion
)

# Measuring time correctly
start_time = System.monotonic_time()
# ... do work ...
end_time = System.monotonic_time()
duration_ms = System.convert_time_unit(end_time - start_time, :native, :millisecond)

# The histogram declares a :call_type label, so each observation must supply
# its value; call_type is expected to be bound by the surrounding code.
Histogram.observe(
  [name: :http_dialplan_request_duration_milliseconds, labels: [call_type]],
  duration_ms
)

Grafana Dashboard Integration

The metrics can be visualized in Grafana using the Prometheus data source.

Row 1: Call Volume & Health

  • Active calls gauge (active_calls)
  • Call attempts rate by type (rate(call_attempts_total[5m]))
  • Call rejection rate (rate(call_rejections_total[5m]))
  • Gateway health (freeswitch_sofia_gateway_status)

Row 2: Performance (Latency Percentiles)

  • P95 HTTP dialplan request time by call type
  • P95 Sh subscriber data lookup time
  • P95 HLR lookup time
  • P95 OCS authorization time
  • P95 Diameter response time by application

Row 3: Success Rates

  • Subscriber data lookup success rate
  • HLR lookup success rate
  • OCS authorization success rate
  • Diameter peer state

Row 4: Media Quality

  • Call quality MOS score (rtp_audio_in_quality_mos)
  • Audio quality percentage (rtp_audio_in_quality_percent)
  • Jitter statistics
  • Packet loss rate

Row 5: System Resources

  • Erlang VM process count
  • Erlang VM memory usage
  • ETS table count
  • Scheduler run queue length
  • Garbage collection rate

Row 6: Error Tracking

  • Call parameter errors
  • Authorization failures
  • Event Socket connection status
  • Mnesia transaction failures

Example Panel Queries

Active Calls by Type:

sum by (call_type) (active_calls)

P95 Dialplan Generation Latency:

histogram_quantile(0.95,
  rate(http_dialplan_request_duration_milliseconds_bucket[5m])
)
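
When reading percentile panels it helps to remember that histogram_quantile estimates the value from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that estimate under simplifying assumptions: the lowest bucket is taken to start at 0, and the bucket counts are made-up illustrative numbers, not measurements.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper bound,
             ending with the (math.inf, total_count) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Interpolate linearly between the bucket's bounds.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Illustrative counts using three of the bucket bounds declared for the dialplan histogram.
buckets = [(100, 10), (250, 60), (500, 90), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))
print(histogram_quantile(0.95, buckets))
```

One practical consequence is that a quantile can never be reported above the highest finite bucket bound, so bucket boundaries should extend past the latencies that matter for alerting.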

Diameter Success Rate:

sum(rate(diameter_responses_total{result="success"}[5m])) /
sum(rate(diameter_requests_total[5m])) * 100

Media Quality - Average MOS:

avg(rtp_audio_in_quality_mos)

Alerting Examples

Critical Alerts (Page Immediately)

System Down - No Call Attempts:

alert: SystemDown
expr: rate(call_attempts_total[5m]) == 0
for: 2m
labels:
  severity: critical
annotations:
  summary: "TAS system appears down - no call attempts"
  description: "No call attempts detected for 2 minutes"
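
The for: clause in these rules means the expression must hold on every evaluation for the stated duration before the alert fires; until then the alert is pending, and a single evaluation where the condition is false resets it. The sketch below models that state machine with hypothetical names and illustrative timestamps.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_holds) in time order.
    """
    states = []
    active_since = None
    for timestamp, holds in evaluations:
        if not holds:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluations one minute apart against a rule with "for: 2m".
evaluations = [(0, True), (60, True), (120, True), (180, False), (240, True)]
print(alert_states(evaluations, for_seconds=120))
```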

Diameter Peer Down:

alert: DiameterPeerDown
expr: diameter_peer_state == 0
for: 1m
labels:
  severity: critical
annotations:
  summary: "Diameter peer {{ $labels.peer_host }} is down"
  description: "Peer for {{ $labels.application }} application is unavailable"

Event Socket Disconnected:

alert: EventSocketDisconnected
expr: event_socket_connected == 0
for: 30s
labels:
  severity: critical
annotations:
  summary: "Event Socket {{ $labels.connection_type }} disconnected"
  description: "Critical communication channel down"

High Severity Alerts

High Diameter Latency:

alert: HighDiameterLatency
expr: |
  histogram_quantile(0.95,
    rate(diameter_response_duration_milliseconds_bucket[5m])
  ) > 1000
for: 5m
labels:
  severity: high
annotations:
  summary: "High Diameter latency detected"
  description: "P95 latency is {{ $value }}ms"

OCS Authorization Failures:

alert: OCSAuthFailures
expr: |
  sum(rate(ocs_authorization_attempts_total{result="no_credit"}[5m])) /
  sum(rate(ocs_authorization_attempts_total[5m])) > 0.1
for: 5m
labels:
  severity: high
annotations:
  summary: "High rate of OCS no-credit responses"
  description: "{{ $value | humanizePercentage }} of requests denied credit"

High Call Rejection Rate:

alert: HighCallRejectionRate
expr: |
  sum(rate(call_rejections_total[5m])) /
  sum(rate(call_attempts_total[5m])) > 0.05
for: 5m
labels:
  severity: high
annotations:
  summary: "Call rejection rate above 5%"
  description: "{{ $value | humanizePercentage }} of calls rejected"

Poor Media Quality:

alert: PoorMediaQuality
expr: avg(rtp_audio_in_quality_mos) < 3.5
for: 3m
labels:
  severity: high
annotations:
  summary: "Poor call quality detected"
  description: "Average MOS score is {{ $value }}"

Warning Alerts

High Memory Usage:

alert: HighMemoryUsage
# Heuristic: budgets 1,000,000 bytes of process memory per allowed process
expr: |
  erlang_vm_memory_bytes_total{kind="processes"} /
  ignoring(kind) (erlang_vm_process_limit * 1000000) > 0.8
for: 10m
labels:
  severity: warning
annotations:
  summary: "Erlang VM memory usage high"
  description: "Process memory at {{ $value | humanizePercentage }}"

High Scheduler Run Queue:

alert: HighSchedulerRunQueue
expr: erlang_vm_statistics_run_queues_length > 10
for: 5m
labels:
  severity: warning
annotations:
  summary: "High scheduler run queue length"
  description: "Run queue length is {{ $value }}"

Mnesia Transaction Failures:

alert: MnesiaTransactionFailures
expr: rate(erlang_mnesia_failed_transactions[5m]) > 1
for: 5m
labels:
  severity: warning
annotations:
  summary: "Mnesia transaction failures detected"
  description: "{{ $value }} failures per second"

Troubleshooting with Metrics

Problem: Metrics showing unrealistic values (nanoseconds instead of milliseconds)

Symptoms:

  • Histogram values in the billions
  • Latency metrics showing microsecond/nanosecond values

Cause: Missing duration_unit: false in Histogram declaration

Solution: Add duration_unit: false to all duration histogram declarations:

Histogram.declare(
  name: :my_metric_duration_milliseconds,
  # ... other options ...
  duration_unit: false
)

Problem: Calls are slow

Investigation Steps:

  1. Check overall dialplan generation time:
     histogram_quantile(0.95, rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
  2. Break down by component:
     # Subscriber data lookup
     histogram_quantile(0.95, rate(subscriber_data_duration_milliseconds_bucket[5m]))

     # HLR lookup
     histogram_quantile(0.95, rate(hlr_data_duration_milliseconds_bucket[5m]))

     # OCS authorization
     histogram_quantile(0.95, rate(ocs_authorization_duration_milliseconds_bucket[5m]))
  3. Check module-specific delays:
     histogram_quantile(0.95,
       sum by (le, module) (rate(dialplan_module_duration_milliseconds_bucket[5m]))
     )

Common Causes:

  • External system latency (HSS, HLR, OCS)
  • Network issues
  • Database contention
  • High system load

Problem: Calls are failing

Investigation Steps:

  1. Check call rejection reasons:
     sum by (reason) (rate(call_rejections_total[5m]))
  2. Check authorization decisions:
     sum by (decision) (rate(authorization_decisions_total[5m]))
  3. Check Diameter peer health:
     diameter_peer_state
  4. Check Event Socket connection:
     event_socket_connected

Problem: High load

Investigation Steps:

  1. Check call volume:
     rate(call_attempts_total[5m])
     active_calls
  2. Check Erlang VM resources:
     erlang_vm_process_count
     erlang_vm_statistics_run_queues_length
     erlang_vm_memory_bytes_total
  3. Check garbage collection:
     rate(erlang_vm_statistics_garbage_collection_number_of_gcs[5m])

Problem: Poor Media Quality

Investigation Steps:

  1. Check MOS scores:
     rtp_audio_in_quality_mos
     rtp_audio_in_quality_percent
  2. Check jitter:
     rtp_audio_in_jitter_seconds_max
     rtp_audio_in_jitter_loss_rate
  3. Check packet loss:
     rtp_audio_in_skip_packets_total
     rtp_audio_in_flaw_total
  4. Check bandwidth usage:
     rate(rtp_audio_in_media_bytes_total[1m]) * 8

Performance Baselines

Typical Values (Well-Tuned System)

Latency (P95):

  • HTTP dialplan request: 200-500ms
  • Subscriber data (Sh) lookup: 50-150ms
  • HLR data lookup: 100-300ms
  • OCS authorization: 100-250ms
  • Diameter requests: 50-200ms
  • Dialplan module processing: 10-50ms per module

Success Rates:

  • Call completion: >95%
  • Subscriber data lookups: >99%
  • HLR lookups: >98%
  • OCS authorizations: >99% (excluding legitimate no-credit)
  • Diameter peer uptime: >99.9%

Media Quality:

  • MOS score: >4.0
  • Audio quality percentage: >80%
  • Jitter: <30ms
  • Packet loss rate: <1%

System Resources:

  • Erlang process count: <50% of limit
  • Erlang memory usage: <70% of available
  • Scheduler run queue: <5
  • ETS tables: <1000

Capacity Planning

Per-Server Capacity (recommended maximums):

  • Concurrent calls: 500-1000 (depends on hardware)
  • Calls per second: 20-50 CPS
  • Registered subscribers: 10,000-50,000

Scaling Indicators (add capacity when):

  • Active calls consistently >70% of capacity
  • Erlang process count >70% of limit
  • P95 latency degrading
  • Scheduler run queues consistently >10
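
The threshold-based indicators above can be checked mechanically. The following sketch evaluates a single snapshot against them; it is an illustration with hypothetical names and example numbers, and it does not capture "consistently" or the P95 latency trend, both of which need a time series rather than one reading.

```python
def scaling_signals(active_calls, call_capacity, process_count, process_limit, run_queue_length):
    """Return the names of the scaling indicators that one snapshot exceeds."""
    signals = []
    if active_calls > 0.70 * call_capacity:
        signals.append("active_calls")
    if process_count > 0.70 * process_limit:
        signals.append("process_count")
    if run_queue_length > 10:
        signals.append("run_queue")
    return signals


# Example snapshot: capacity of 1000 concurrent calls, values chosen for illustration.
print(scaling_signals(
    active_calls=800,
    call_capacity=1000,
    process_count=100_000,
    process_limit=262_144,
    run_queue_length=3,
))
```

In practice the inputs would come from active_calls, erlang_vm_process_count, erlang_vm_process_limit, and erlang_vm_statistics_run_queues_length, averaged over a window long enough to reflect sustained load.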

Best Practices

Monitoring Strategy

  1. Set up dashboards for different audiences:

    • Operations dashboard: Call volume, success rates, system health
    • Engineering dashboard: Latency percentiles, error rates, resource usage
    • Executive dashboard: High-level KPIs, uptime, cost metrics
  2. Configure alerts at multiple levels:

    • Critical: Page on-call (system down, major outage)
    • High: Alert during business hours (degraded performance)
    • Warning: Track in ticket system (potential issues)
  3. Use appropriate time ranges:

    • Real-time monitoring: 5-minute windows
    • Troubleshooting: 15-minute to 1-hour windows
    • Capacity planning: Daily/weekly aggregates
  4. Focus on user impact:

    • Prioritize end-to-end latency metrics
    • Track success rates over individual error counters
    • Monitor media quality for user experience

Query Performance

  1. Use recording rules for frequently used queries:

     groups:
       - name: ims_as_aggregations
         interval: 30s
         rules:
           - record: job:call_attempts:rate5m
             expr: rate(call_attempts_total[5m])

           - record: job:dialplan_latency:p95
             expr: histogram_quantile(0.95, rate(http_dialplan_request_duration_milliseconds_bucket[5m]))

  2. Avoid high-cardinality labels in queries (e.g., don't group by phone number)

  3. Use appropriate rate intervals:

    • Short-term trends: [5m]
    • Medium-term trends: [1h]
    • Long-term trends: [1d]

Metric Cardinality

Monitor cardinality to prevent Prometheus performance issues:

# Check metric cardinality
count by (__name__) ({__name__=~".+"})

High-cardinality risks:

  • Labels with unique values per call (phone numbers, call IDs)
  • Unbounded label values
  • Labels with >1000 unique values

Solution:

  • Use labels for categories, not unique identifiers
  • Aggregate high-cardinality data in external systems
  • Use recording rules to pre-aggregate
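
To find which label is responsible for high cardinality, it can help to count distinct values per label across the series of one metric. The sketch below does that for a list of label sets; the function name and the sample label values are illustrative assumptions, and the input could be produced from a scraped /metrics page.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label name.

    series: iterable of dicts, one dict of label name -> value per time series.
    """
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(distinct) for name, distinct in values.items()}


# Illustrative series: call_type is a bounded category, reg_user grows per subscriber.
series = [
    {"call_type": "mo", "reg_user": "user-1"},
    {"call_type": "mt", "reg_user": "user-2"},
    {"call_type": "mo", "reg_user": "user-3"},
]
print(label_cardinality(series))
```

Labels whose count keeps growing with traffic or subscriber numbers are the ones to drop or pre-aggregate.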
