Metrics Documentation
This document describes the Prometheus metrics exposed by the IMS Application Server components.
Table of Contents
- Metrics Endpoints
- Port 9090 - System Metrics
- Port 8080 - TAS Engine Metrics
- Application Call Metrics
- Diameter Protocol Metrics
- Telephony Operations Metrics
- Online Charging System (OCS) Metrics
- Dialplan & Processing Metrics
- Event Socket Metrics
- Erlang Mnesia Database Metrics
- Erlang VM Memory Metrics
- Erlang VM Statistics
- Erlang VM System Information
- Erlang VM Microstate Accounting (MSACC)
- Erlang VM Allocators
- Port 9093 - Media & Call Quality Metrics
- Go Runtime Metrics
- Process Metrics
- Prometheus HTTP Metrics
- Metric Types
- Usage
- Example Queries
- Metric Time Unit Configuration
- Grafana Dashboard Integration
- Alerting Examples
- Troubleshooting with Metrics
- Performance Baselines
- Best Practices
Metrics Endpoints
| Port | Endpoint | Purpose | Jump to Section |
|---|---|---|---|
| 9090 | /metrics | System, gateway, and core telephony metrics | Port 9090 → |
| 8080 | /metrics | TAS engine, Diameter, HLR, OCS, and Erlang VM metrics | Port 8080 → |
| 9093 | /esl?module=default | RTP/RTCP media quality and call statistics | Port 9093 → |
Port 9090 - System Metrics
Call and Session Metrics
| Metric Name | Port | Description |
|---|---|---|
freeswitch_bridged_calls | 9090 | Number of bridged calls currently active |
freeswitch_detailed_bridged_calls | 9090 | Number of detailed bridged calls active |
freeswitch_current_calls | 9090 | Number of calls currently active |
freeswitch_detailed_calls | 9090 | Number of detailed calls active |
freeswitch_current_channels | 9090 | Number of channels currently active |
freeswitch_current_sessions | 9090 | Number of sessions currently active |
freeswitch_current_sessions_peak | 9090 | Peak number of sessions since startup |
freeswitch_current_sessions_peak_last_5min | 9090 | Peak number of sessions in the last 5 minutes |
freeswitch_sessions_total | 9090 | Total number of sessions since startup (counter) |
freeswitch_current_sps | 9090 | Current sessions per second |
freeswitch_current_sps_peak | 9090 | Peak sessions per second since startup |
freeswitch_current_sps_peak_last_5min | 9090 | Peak sessions per second in the last 5 minutes |
freeswitch_max_sessions | 9090 | Maximum number of sessions allowed |
freeswitch_max_sps | 9090 | Maximum sessions per second allowed |
System Resource Metrics
| Metric Name | Port | Description |
|---|---|---|
freeswitch_current_idle_cpu | 9090 | Current CPU idle percentage |
freeswitch_min_idle_cpu | 9090 | Minimum CPU idle percentage recorded |
freeswitch_uptime_seconds | 9090 | Uptime in seconds |
freeswitch_time_synced | 9090 | Whether system time is in sync with exporter host time (1=synced, 0=not synced) |
Memory Metrics
| Metric Name | Port | Description |
|---|---|---|
freeswitch_memory_arena | 9090 | Total non-mmapped bytes (malloc arena) |
freeswitch_memory_ordblks | 9090 | Number of free chunks |
freeswitch_memory_smblks | 9090 | Number of free fastbin blocks |
freeswitch_memory_hblks | 9090 | Number of mapped regions |
freeswitch_memory_hblkhd | 9090 | Bytes in mapped regions |
freeswitch_memory_usmblks | 9090 | Maximum total allocated space |
freeswitch_memory_fsmblks | 9090 | Free bytes held in fastbins |
freeswitch_memory_uordblks | 9090 | Total allocated space |
freeswitch_memory_fordblks | 9090 | Total free space |
freeswitch_memory_keepcost | 9090 | Topmost releasable block |
Codec Status Metrics
| Metric Name | Port | Description |
|---|---|---|
freeswitch_codec_status | 9090 | Codec status with labels: ikey (module), name (codec name), type (codec). Value=1 indicates codec is available |
Available Codecs Include:
- G.711 alaw/ulaw
- PROXY PASS-THROUGH
- PROXY VIDEO PASS-THROUGH
- RAW Signed Linear (16 bit)
- Speex
- VP8/VP9 Video
- AMR variants
- B64
- G.723.1, G.729, G.722, G.726 variants
- OPUS
- MP3
- ADPCM, GSM, LPC-10
Endpoint Status Metrics
| Metric Name | Port | Description |
|---|---|---|
freeswitch_endpoint_status | 9090 | Endpoint status with labels: ikey (module), name (endpoint name), type (endpoint). Value=1 indicates endpoint is available |
Available Endpoints Include:
- error, group, pickup, user (mod_dptools)
- loopback, null (mod_loopback)
- rtc (mod_rtc)
- rtp, sofia (mod_sofia)
- modem (mod_spandsp)
Module Status Metrics
| Metric Name | Port | Description |
|---|---|---|
freeswitch_load_module | 9090 | Module load status (1=loaded, 0=not loaded) with label: module |
Key Modules Monitored:
- mod_sofia (SIP)
- mod_conference, mod_conference_ims
- mod_opus, mod_g729, mod_amr, etc.
- mod_event_socket
- mod_dptools
- mod_python3
- mod_rtc
- And many more...
Registration Metrics
| Metric Name | Port | Description |
|---|---|---|
freeswitch_registrations | 9090 | Total number of active registrations |
freeswitch_registration_defails | 9090 | Detailed registration information with labels: expires, hostname, network_ip, network_port, network_proto, realm, reg_user, token, url |
Sofia Gateway Metrics
| Metric Name | Port | Description |
|---|---|---|
freeswitch_sofia_gateway_status | 9090 | Gateway status with labels: context, name, profile, proxy, scheme, status (UP/DOWN) |
freeswitch_sofia_gateway_call_in | 9090 | Number of inbound calls through gateway |
freeswitch_sofia_gateway_call_out | 9090 | Number of outbound calls through gateway |
freeswitch_sofia_gateway_failed_call_in | 9090 | Number of failed inbound calls |
freeswitch_sofia_gateway_failed_call_out | 9090 | Number of failed outbound calls |
freeswitch_sofia_gateway_ping | 9090 | Last ping timestamp (Unix epoch) |
freeswitch_sofia_gateway_pingtime | 9090 | Last ping time in milliseconds |
freeswitch_sofia_gateway_pingfreq | 9090 | Ping frequency in seconds |
freeswitch_sofia_gateway_pingcount | 9090 | Number of pings sent |
freeswitch_sofia_gateway_pingmin | 9090 | Minimum ping time recorded |
freeswitch_sofia_gateway_pingmax | 9090 | Maximum ping time recorded |
Exporter Health Metrics
| Metric Name | Port | Description |
|---|---|---|
freeswitch_up | 9090 | Whether the last scrape was successful (1=success, 0=failure) |
freeswitch_exporter_total_scrapes | 9090 | Total number of scrapes performed (counter) |
freeswitch_exporter_failed_scrapes | 9090 | Total number of failed scrapes (counter) |
Port 8080 - TAS Engine Metrics
These metrics are exposed by the Telephony Application Server engine and provide insight into call processing, database operations, and Erlang VM performance.
Application Call Metrics
| Metric Name | Port | Description |
|---|---|---|
call_simulations_total | 8080 | Total number of call simulations (counter) |
call_attempts_total | 8080 | Total number of call attempts (counter) |
call_rejections_total | 8080 | Total number of call rejections by reason (counter) |
call_param_errors_total | 8080 | Total number of call parameter parsing errors (counter) |
active_calls | 8080 | Number of currently active calls with labels: call_type (mo/mt/emergency) |
tracked_call_sessions | 8080 | Number of currently tracked call sessions in ETS |
Diameter Protocol Metrics
| Metric Name | Port | Description |
|---|---|---|
diameter_peer_state | 8080 | State of Diameter peers (1=up, 0=down) with labels: peer_host, peer_realm, application |
diameter_requests_total | 8080 | Total number of Diameter requests (counter) |
diameter_responses_total | 8080 | Total number of Diameter responses (counter) |
diameter_response_duration_milliseconds | 8080 | Duration of Diameter requests in milliseconds (histogram) |
Telephony Operations Metrics
| Metric Name | Port | Description |
|---|---|---|
hlr_lookups_total | 8080 | Total number of HLR lookups (counter) |
hlr_data_duration_milliseconds | 8080 | Duration of HLR data retrieval in milliseconds (histogram) |
subscriber_data_lookups_total | 8080 | Total number of subscriber data lookups (counter) |
subscriber_data_duration_milliseconds | 8080 | Duration of Sh subscriber data retrieval in milliseconds (histogram) |
ss7_map_operations_total | 8080 | Total number of SS7 MAP operations (counter) |
ss7_map_http_duration_milliseconds | 8080 | Duration of SS7 MAP HTTP requests in milliseconds (histogram) |
tracked_registrations | 8080 | Number of currently tracked SIP registrations |
Online Charging System (OCS) Metrics
| Metric Name | Port | Description |
|---|---|---|
ocs_authorization_attempts_total | 8080 | Total number of OCS authorization attempts (counter) |
ocs_authorization_duration_milliseconds | 8080 | Duration of OCS authorization in milliseconds (histogram) |
online_charging_events_total | 8080 | Total number of online charging events (counter) |
authorization_decisions_total | 8080 | Total number of authorization decisions (counter) |
Dialplan & Processing Metrics
| Metric Name | Port | Description |
|---|---|---|
http_requests_total | 8080 | Total number of HTTP requests with labels: endpoint, status_code (counter) |
http_dialplan_request_duration_milliseconds | 8080 | Duration of HTTP dialplan requests in milliseconds (histogram) |
dialplan_module_duration_milliseconds | 8080 | Duration of individual dialplan module processing (histogram) |
freeswitch_variable_set_duration_milliseconds | 8080 | Duration of variable setting operations (histogram) |
Event Socket Metrics
| Metric Name | Port | Description |
|---|---|---|
event_socket_connected | 8080 | Event Socket connection state (1=connected, 0=disconnected) with label: connection_type |
event_socket_reconnections_total | 8080 | Total number of Event Socket reconnection attempts (counter) |
Erlang Mnesia Database Metrics
| Metric Name | Port | Description |
|---|---|---|
erlang_mnesia_held_locks | 8080 | Number of held locks |
erlang_mnesia_lock_queue | 8080 | Number of transactions waiting for a lock |
erlang_mnesia_transaction_participants | 8080 | Number of participant transactions |
erlang_mnesia_transaction_coordinators | 8080 | Number of coordinator transactions |
erlang_mnesia_failed_transactions | 8080 | Number of failed (aborted) transactions (counter) |
erlang_mnesia_committed_transactions | 8080 | Number of committed transactions (counter) |
erlang_mnesia_logged_transactions | 8080 | Number of transactions logged (counter) |
erlang_mnesia_restarted_transactions | 8080 | Total number of transaction restarts (counter) |
erlang_mnesia_memory_usage_bytes | 8080 | Total bytes allocated by all mnesia tables |
erlang_mnesia_tablewise_memory_usage_bytes | 8080 | Bytes allocated per mnesia table with label: table |
erlang_mnesia_tablewise_size | 8080 | Number of rows per table with label: table |
Erlang VM Memory Metrics
| Metric Name | Port | Description |
|---|---|---|
erlang_vm_memory_atom_bytes_total | 8080 | Memory allocated for atoms with label: usage (used/free) |
erlang_vm_memory_bytes_total | 8080 | Total memory allocated with label: kind (system/processes) |
erlang_vm_memory_dets_tables | 8080 | DETS tables count |
erlang_vm_memory_ets_tables | 8080 | ETS tables count |
erlang_vm_memory_processes_bytes_total | 8080 | Memory allocated for processes with label: usage (used/free) |
erlang_vm_memory_system_bytes_total | 8080 | Memory for emulator (not process-related) with label: usage (atom/binary/code/ets/other) |
Erlang VM Statistics
| Metric Name | Port | Description |
|---|---|---|
erlang_vm_statistics_bytes_output_total | 8080 | Total bytes output to ports (counter) |
erlang_vm_statistics_bytes_received_total | 8080 | Total bytes received through ports (counter) |
erlang_vm_statistics_context_switches | 8080 | Total context switches since startup (counter) |
erlang_vm_statistics_dirty_cpu_run_queue_length | 8080 | Length of dirty CPU run-queue |
erlang_vm_statistics_dirty_io_run_queue_length | 8080 | Length of dirty IO run-queue |
erlang_vm_statistics_garbage_collection_number_of_gcs | 8080 | Number of garbage collections (counter) |
erlang_vm_statistics_garbage_collection_bytes_reclaimed | 8080 | Bytes reclaimed by GC (counter) |
erlang_vm_statistics_garbage_collection_words_reclaimed | 8080 | Words reclaimed by GC (counter) |
erlang_vm_statistics_reductions_total | 8080 | Total reductions (counter) |
erlang_vm_statistics_run_queues_length | 8080 | Length of normal run-queues |
erlang_vm_statistics_runtime_milliseconds | 8080 | Sum of runtime for all threads (counter) |
erlang_vm_statistics_wallclock_time_milliseconds | 8080 | Real time measured (counter) |
Erlang VM System Information
| Metric Name | Port | Description |
|---|---|---|
erlang_vm_dirty_cpu_schedulers | 8080 | Number of dirty CPU scheduler threads |
erlang_vm_dirty_cpu_schedulers_online | 8080 | Number of dirty CPU schedulers online |
erlang_vm_dirty_io_schedulers | 8080 | Number of dirty I/O scheduler threads |
erlang_vm_ets_limit | 8080 | Maximum number of ETS tables allowed |
erlang_vm_logical_processors | 8080 | Number of logical processors configured |
erlang_vm_logical_processors_available | 8080 | Number of logical processors available |
erlang_vm_logical_processors_online | 8080 | Number of logical processors online |
erlang_vm_port_count | 8080 | Number of ports currently existing |
erlang_vm_port_limit | 8080 | Maximum number of ports allowed |
erlang_vm_process_count | 8080 | Number of processes currently existing |
erlang_vm_process_limit | 8080 | Maximum number of processes allowed |
erlang_vm_schedulers | 8080 | Number of scheduler threads |
erlang_vm_schedulers_online | 8080 | Number of schedulers online |
erlang_vm_smp_support | 8080 | 1 if compiled with SMP support, 0 otherwise |
erlang_vm_threads | 8080 | 1 if compiled with thread support, 0 otherwise |
erlang_vm_thread_pool_size | 8080 | Number of async threads in pool |
erlang_vm_time_correction | 8080 | 1 if time correction enabled, 0 otherwise |
erlang_vm_wordsize_bytes | 8080 | Size of Erlang term words in bytes |
erlang_vm_atom_count | 8080 | Number of atoms currently existing |
erlang_vm_atom_limit | 8080 | Maximum number of atoms allowed |
Erlang VM Microstate Accounting (MSACC)
Detailed time tracking for scheduler activities with labels: type, id
| Metric Name | Port | Description |
|---|---|---|
erlang_vm_msacc_aux_seconds_total | 8080 | Time spent handling auxiliary jobs (counter) |
erlang_vm_msacc_check_io_seconds_total | 8080 | Time spent checking for new I/O events (counter) |
erlang_vm_msacc_emulator_seconds_total | 8080 | Time spent executing Erlang processes (counter) |
erlang_vm_msacc_gc_seconds_total | 8080 | Time spent in garbage collection (counter) |
erlang_vm_msacc_other_seconds_total | 8080 | Time spent on unaccounted activities (counter) |
erlang_vm_msacc_port_seconds_total | 8080 | Time spent executing ports (counter) |
erlang_vm_msacc_sleep_seconds_total | 8080 | Time spent sleeping (counter) |
erlang_vm_msacc_alloc_seconds_total | 8080 | Time spent managing memory (counter) |
erlang_vm_msacc_bif_seconds_total | 8080 | Time spent in BIFs (counter) |
erlang_vm_msacc_busy_wait_seconds_total | 8080 | Time spent busy waiting (counter) |
erlang_vm_msacc_ets_seconds_total | 8080 | Time spent in ETS BIFs (counter) |
erlang_vm_msacc_gc_full_seconds_total | 8080 | Time spent in fullsweep GC (counter) |
erlang_vm_msacc_nif_seconds_total | 8080 | Time spent in NIFs (counter) |
erlang_vm_msacc_send_seconds_total | 8080 | Time spent sending messages (counter) |
erlang_vm_msacc_timers_seconds_total | 8080 | Time spent managing timers (counter) |
Erlang VM Allocators
Detailed memory allocator metrics with labels: alloc, instance_no, kind, usage
| Metric Name | Port | Description |
|---|---|---|
erlang_vm_allocators | 8080 | Allocated (carriers_size) and used (blocks_size) memory for different allocators. See erts_alloc(3). |
Allocator types include: temp_alloc, sl_alloc, std_alloc, ll_alloc, eheap_alloc, ets_alloc, fix_alloc, literal_alloc, binary_alloc, driver_alloc
Port 9093 - Media & Call Quality Metrics
These metrics provide real-time RTP/RTCP statistics and call quality information per channel.
| Metric Name | Port | Description |
|---|---|---|
freeswitch_info | 9093 | System info with label: version |
freeswitch_up | 9093 | Ready status (1=ready, 0=not ready) |
freeswitch_stack_bytes | 9093 | Stack size in bytes |
freeswitch_session_total | 9093 | Total number of sessions |
freeswitch_session_active | 9093 | Active number of sessions |
freeswitch_session_limit | 9093 | Session limit |
rtp_channel_info | 9093 | RTP channel info with labels for channel details |
RTP Audio - Byte Counters
| Metric Name | Port | Description |
|---|---|---|
rtp_audio_in_raw_bytes_total | 9093 | Total bytes received (including headers) |
rtp_audio_out_raw_bytes_total | 9093 | Total bytes sent (including headers) |
rtp_audio_in_media_bytes_total | 9093 | Total media bytes received (payload only) |
rtp_audio_out_media_bytes_total | 9093 | Total media bytes sent (payload only) |
RTP Audio - Packet Counters
| Metric Name | Port | Description |
|---|---|---|
rtp_audio_in_packets_total | 9093 | Total packets received |
rtp_audio_out_packets_total | 9093 | Total packets sent |
rtp_audio_in_media_packets_total | 9093 | Total media packets received |
rtp_audio_out_media_packets_total | 9093 | Total media packets sent |
rtp_audio_in_skip_packets_total | 9093 | Inbound packets discarded |
rtp_audio_out_skip_packets_total | 9093 | Outbound packets discarded |
RTP Audio - Special Packet Types
| Metric Name | Port | Description |
|---|---|---|
rtp_audio_in_jitter_packets_total | 9093 | Jitter buffer packets received |
rtp_audio_in_dtmf_packets_total | 9093 | DTMF packets received |
rtp_audio_out_dtmf_packets_total | 9093 | DTMF packets sent |
rtp_audio_in_cng_packets_total | 9093 | Comfort Noise Generation packets received |
rtp_audio_out_cng_packets_total | 9093 | Comfort Noise Generation packets sent |
rtp_audio_in_flush_packets_total | 9093 | Flushed packets (buffer resets) |
RTP Audio - Jitter & Quality Metrics
| Metric Name | Port | Description |
|---|---|---|
rtp_audio_in_jitter_buffer_bytes_max | 9093 | Largest jitter buffer size in bytes |
rtp_audio_in_jitter_seconds_min | 9093 | Minimum jitter in seconds |
rtp_audio_in_jitter_seconds_max | 9093 | Maximum jitter in seconds |
rtp_audio_in_jitter_loss_rate | 9093 | Packet loss rate due to jitter (ratio) |
rtp_audio_in_jitter_burst_rate | 9093 | Packet burst rate due to jitter (ratio) |
rtp_audio_in_mean_interval_seconds | 9093 | Mean interval between inbound packets |
rtp_audio_in_flaw_total | 9093 | Total audio flaws detected (glitches, artifacts) |
rtp_audio_in_quality_percent | 9093 | Audio quality as percentage (0-100) |
rtp_audio_in_quality_mos | 9093 | Mean Opinion Score (1-5, where 5 is best) |
RTCP Metrics
| Metric Name | Port | Description |
|---|---|---|
rtcp_audio_bytes_total | 9093 | Total RTCP bytes |
rtcp_audio_packets_total | 9093 | Total RTCP packets |
Go Runtime Metrics
| Metric Name | Port | Description |
|---|---|---|
go_goroutines | 9090 | Number of goroutines currently running |
go_threads | 9090 | Number of OS threads created |
go_info | 9090 | Information about the Go environment (with version label) |
go_gc_duration_seconds | 9090 | Pause duration of garbage collection cycles (summary) |
go_memstats_alloc_bytes | 9090 | Number of bytes allocated and still in use |
go_memstats_alloc_bytes_total | 9090 | Total number of bytes allocated (counter) |
go_memstats_heap_alloc_bytes | 9090 | Heap bytes allocated and still in use |
go_memstats_heap_idle_bytes | 9090 | Heap bytes waiting to be used |
go_memstats_heap_inuse_bytes | 9090 | Heap bytes currently in use |
go_memstats_heap_objects | 9090 | Number of allocated heap objects |
go_memstats_heap_released_bytes | 9090 | Heap bytes released to OS |
go_memstats_heap_sys_bytes | 9090 | Heap bytes obtained from system |
go_memstats_sys_bytes | 9090 | Total bytes obtained from system |
Process Metrics
| Metric Name | Port | Description |
|---|---|---|
process_cpu_seconds_total | 9090 | Total user and system CPU time spent (counter) |
process_max_fds | 9090 | Maximum number of open file descriptors |
process_open_fds | 9090 | Current number of open file descriptors |
process_resident_memory_bytes | 9090 | Resident memory size in bytes |
process_virtual_memory_bytes | 9090 | Virtual memory size in bytes |
process_virtual_memory_max_bytes | 9090 | Maximum amount of virtual memory available |
process_start_time_seconds | 9090 | Process start time since Unix epoch |
Prometheus HTTP Metrics
| Metric Name | Port | Description |
|---|---|---|
promhttp_metric_handler_requests_in_flight | 9090 | Current number of scrapes being served |
promhttp_metric_handler_requests_total | 9090 | Total number of scrapes by HTTP status code (counter) |
Metric Types
- gauge: A metric that can go up or down (e.g., current_calls, cpu_idle)
- counter: A metric that only increases (e.g., sessions_total, failed_scrapes)
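- summary: A metric that tracks quantiles over a sliding time window (e.g., gc_duration_seconds)
- histogram: A metric that counts observations in configurable buckets, from which quantiles can be estimated at query time (e.g., diameter_response_duration_milliseconds)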
Usage
To scrape these metrics, configure your Prometheus server to scrape all three endpoints:
scrape_configs:
  - job_name: 'ims_as_system'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'ims_as_engine'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
  - job_name: 'ims_as_media'
    static_configs:
      - targets: ['localhost:9093']
    metrics_path: '/esl'
    params:
      module: ['default']
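Before pointing Prometheus at an endpoint, it can help to spot-check what the endpoint returns. The following is a minimal sketch of parsing the Prometheus text exposition format by hand. The function name `parse_metrics` and the sample payload are illustrative assumptions, not output captured from a real system, and the parser ignores timestamps and escaped characters in label values.

```python
import re

# Hypothetical sample of what GET http://localhost:9090/metrics might return.
SAMPLE = """\
# HELP freeswitch_current_calls Number of calls currently active
# TYPE freeswitch_current_calls gauge
freeswitch_current_calls 12
freeswitch_sofia_gateway_status{name="gw1",profile="external",status="UP"} 1
"""

# metric_name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

To run this against a live endpoint, the sample string could be replaced with the body returned by `urllib.request.urlopen` for the relevant URL.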
Example Queries
General Metrics
Current call volume:
freeswitch_current_calls
Gateway health:
freeswitch_sofia_gateway_status{status="UP"}
Average ping time to gateways:
avg(freeswitch_sofia_gateway_pingtime)
Sessions per second rate:
freeswitch_current_sps
Memory usage:
freeswitch_memory_uordblks
Media Quality Metrics
Call quality (MOS score):
rtp_audio_in_quality_mos
Audio quality percentage:
rtp_audio_in_quality_percent
Jitter rate:
rate(rtp_audio_in_jitter_packets_total[5m])
Packet loss rate:
rtp_audio_in_jitter_loss_rate
Average jitter range (max minus min):
avg(rtp_audio_in_jitter_seconds_max - rtp_audio_in_jitter_seconds_min)
RTP bandwidth (inbound):
rate(rtp_audio_in_media_bytes_total[1m]) * 8
Audio flaws detected:
increase(rtp_audio_in_flaw_total[5m])
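The rate() and increase() queries above operate on counters, which only grow and restart from zero when the process restarts. As a rough illustration of the arithmetic, the sketch below computes a per-second rate from raw counter samples and converts bytes to bits. The helper `simple_rate` and the scrape values are hypothetical, and the calculation is simplified: it handles counter resets but does not extrapolate to the window edges the way PromQL does.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp_seconds, counter_value) samples.

    Simplified: handles counter resets but, unlike PromQL's rate(), does
    not extrapolate to the edges of the query window.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter restarted from zero.
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical scrapes of rtp_audio_in_media_bytes_total, 15 s apart.
scrapes = [(0, 0), (15, 30000), (30, 60000), (45, 90000), (60, 120000)]
bytes_per_second = simple_rate(scrapes)   # 2000.0
bits_per_second = bytes_per_second * 8    # 16000.0
print(bytes_per_second, bits_per_second)
```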
TAS Engine Metrics
Active calls by type:
active_calls
Diameter peer health:
diameter_peer_state{application="sh"}
Call attempt rate:
rate(call_attempts_total[5m])
HLR lookup latency (95th percentile):
histogram_quantile(0.95, rate(hlr_data_duration_milliseconds_bucket[5m]))
OCS authorization latency:
histogram_quantile(0.99, rate(ocs_authorization_duration_milliseconds_bucket[5m]))
Subscriber data lookup rate:
rate(subscriber_data_lookups_total[5m])
Diameter request success rate:
sum(rate(diameter_responses_total[5m])) / sum(rate(diameter_requests_total[5m]))
Event Socket connection status:
event_socket_connected
Mnesia transaction performance:
rate(erlang_mnesia_committed_transactions[5m])
Mnesia failed transaction rate:
rate(erlang_mnesia_failed_transactions[5m])
Erlang VM process count:
erlang_vm_process_count
Erlang VM memory usage:
erlang_vm_memory_bytes_total
Garbage collection rate:
rate(erlang_vm_statistics_garbage_collection_number_of_gcs[5m])
Scheduler run queue length:
erlang_vm_statistics_run_queues_length
ETS table count:
erlang_vm_memory_ets_tables
HTTP dialplan request duration (median):
histogram_quantile(0.5, rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
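The latency queries above rely on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that idea in simplified form. The bucket counts are invented for illustration, and the function is a teaching aid rather than a faithful copy of the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with +Inf, as in the
    `_bucket` series with the `le` label.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for a *_duration_milliseconds histogram.
buckets = [(100, 50), (250, 80), (500, 95), (1000, 100), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))
print(histogram_quantile(0.9, buckets))
```

Because the estimate is interpolated, its precision depends on how closely the bucket boundaries bracket the latencies of interest.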
Metric Time Unit Configuration
Important for Developers:
All duration metrics in this system use duration_unit: false in their Histogram declarations. This is critical because:
- The Prometheus Elixir library automatically detects metric names ending in _milliseconds
- By default, it converts native Erlang time units to milliseconds automatically
- Our code already converts time to milliseconds using System.convert_time_unit/3
- Without duration_unit: false, the library would apply its own conversion on top of ours (dividing by ~1,000,000), so every observation would be converted twice
Example:
# Correct configuration
Histogram.declare(
name: :http_dialplan_request_duration_milliseconds,
help: "Duration of HTTP dialplan requests in milliseconds",
labels: [:call_type],
buckets: [100, 250, 500, 750, 1000, 1500, 2000, 3000, 5000],
duration_unit: false # REQUIRED to prevent double conversion
)
# Measuring time correctly
start_time = System.monotonic_time()
# ... do work ...
end_time = System.monotonic_time()
duration_ms = System.convert_time_unit(end_time - start_time, :native, :millisecond)
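# The histogram declares a :call_type label, so each observation must supply a value for it
# (call_type is assumed to hold the current call's type, e.g. "mo")
Histogram.observe([name: :http_dialplan_request_duration_milliseconds, labels: [call_type]], duration_ms)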
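The double conversion described above is easy to see with plain arithmetic. The sketch below is written in Python for illustration only and assumes the VM's native time unit is nanoseconds, which is common but not guaranteed. The variable names are hypothetical and do not correspond to anything in the Prometheus library.

```python
# Assumption: the VM's native time unit is nanoseconds.
NATIVE_UNITS_PER_MS = 1_000_000


def native_to_ms(native):
    """Convert a duration in native units to milliseconds."""
    return native / NATIVE_UNITS_PER_MS


elapsed_native = 250_000_000                 # 250 ms measured in native units
duration_ms = native_to_ms(elapsed_native)   # what our code observes: 250.0

# With duration_unit: false the observed value is kept untouched.
stored_correct = duration_ms

# If the value were treated as native time and converted a second time,
# it would shrink by the same factor again (about 0.00025).
stored_double = native_to_ms(duration_ms)

print(stored_correct, stored_double)
```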
Grafana Dashboard Integration
The metrics can be visualized in Grafana using the Prometheus data source.
Recommended Dashboard Layout
Row 1: Call Volume & Health
- Active calls gauge (active_calls)
- Call attempts rate by type (rate(call_attempts_total[5m]))
- Call rejection rate (rate(call_rejections_total[5m]))
- Gateway health (freeswitch_sofia_gateway_status)
Row 2: Performance (Latency Percentiles)
- P95 HTTP dialplan request time by call type
- P95 Sh subscriber data lookup time
- P95 HLR lookup time
- P95 OCS authorization time
- P95 Diameter response time by application
Row 3: Success Rates
- Subscriber data lookup success rate
- HLR lookup success rate
- OCS authorization success rate
- Diameter peer state
Row 4: Media Quality
- Call quality MOS score (rtp_audio_in_quality_mos)
- Audio quality percentage (rtp_audio_in_quality_percent)
- Jitter statistics
- Packet loss rate
Row 5: System Resources
- Erlang VM process count
- Erlang VM memory usage
- ETS table count
- Scheduler run queue length
- Garbage collection rate
Row 6: Error Tracking
- Call parameter errors
- Authorization failures
- Event Socket connection status
- Mnesia transaction failures
Example Panel Queries
Active Calls by Type:
sum by (call_type) (active_calls)
P95 Dialplan Generation Latency:
histogram_quantile(0.95,
rate(http_dialplan_request_duration_milliseconds_bucket[5m])
)
Diameter Success Rate:
sum(rate(diameter_responses_total{result="success"}[5m])) /
sum(rate(diameter_requests_total[5m])) * 100
Media Quality - Average MOS:
avg(rtp_audio_in_quality_mos)
Alerting Examples
Critical Alerts (Page Immediately)
System Down - No Call Attempts:
alert: SystemDown
expr: rate(call_attempts_total[5m]) == 0
for: 2m
labels:
  severity: critical
annotations:
  summary: "TAS system appears down - no call attempts"
  description: "No call attempts detected for 2 minutes"
Diameter Peer Down:
alert: DiameterPeerDown
expr: diameter_peer_state == 0
for: 1m
labels:
  severity: critical
annotations:
  summary: "Diameter peer {{ $labels.peer_host }} is down"
  description: "Peer for {{ $labels.application }} application is unavailable"
Event Socket Disconnected:
alert: EventSocketDisconnected
expr: event_socket_connected == 0
for: 30s
labels:
  severity: critical
annotations:
  summary: "Event Socket {{ $labels.connection_type }} disconnected"
  description: "Critical communication channel down"
High Severity Alerts
High Diameter Latency:
alert: HighDiameterLatency
expr: |
  histogram_quantile(0.95,
    rate(diameter_response_duration_milliseconds_bucket[5m])
  ) > 1000
for: 5m
labels:
  severity: high
annotations:
  summary: "High Diameter latency detected"
  description: "P95 latency is {{ $value }}ms"
OCS Authorization Failures:
alert: OCSAuthFailures
expr: |
  sum(rate(ocs_authorization_attempts_total{result="no_credit"}[5m])) /
  sum(rate(ocs_authorization_attempts_total[5m])) > 0.1
for: 5m
labels:
  severity: high
annotations:
  summary: "High rate of OCS no-credit responses"
  description: "{{ $value | humanizePercentage }} of requests denied credit"
High Call Rejection Rate:
alert: HighCallRejectionRate
expr: |
  sum(rate(call_rejections_total[5m])) /
  sum(rate(call_attempts_total[5m])) > 0.05
for: 5m
labels:
  severity: high
annotations:
  summary: "Call rejection rate above 5%"
  description: "{{ $value | humanizePercentage }} of calls rejected"
Poor Media Quality:
alert: PoorMediaQuality
expr: avg(rtp_audio_in_quality_mos) < 3.5
for: 3m
labels:
  severity: high
annotations:
  summary: "Poor call quality detected"
  description: "Average MOS score is {{ $value }}"
Warning Alerts
High Memory Usage:
alert: HighMemoryUsage
expr: |
  erlang_vm_memory_bytes_total{kind="processes"} /
  ignoring(kind) (erlang_vm_process_limit * 1000000) > 0.8
for: 10m
labels:
  severity: warning
annotations:
  summary: "Erlang VM memory usage high"
  description: "Process memory at {{ $value | humanizePercentage }}"
High Scheduler Run Queue:
alert: HighSchedulerRunQueue
expr: erlang_vm_statistics_run_queues_length > 10
for: 5m
labels:
  severity: warning
annotations:
  summary: "High scheduler run queue length"
  description: "Run queue length is {{ $value }}"
Mnesia Transaction Failures:
alert: MnesiaTransactionFailures
expr: rate(erlang_mnesia_failed_transactions[5m]) > 1
for: 5m
labels:
  severity: warning
annotations:
  summary: "Mnesia transaction failures detected"
  description: "{{ $value }} failures per second"
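Each rule above carries a `for:` duration, which means the expression must stay true across consecutive evaluations for that long before the alert fires. Until then the alert is pending, and a single false evaluation resets the clock. The sketch below simulates that behaviour. The function `alert_states` and the evaluation data are hypothetical and only model the timing logic, not the rest of the alerting pipeline.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp_seconds, condition_is_true) evaluations to alert states."""
    states = []
    active_since = None
    for ts, is_true in evaluations:
        if not is_true:
            # Condition cleared: the hold timer starts over.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_long_enough = ts - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


# Hypothetical: event_socket_connected == 0 evaluated every 15 s, `for: 30s`.
evals = [(0, True), (15, True), (30, True), (45, False), (60, True)]
print(alert_states(evals, 30))
# ['pending', 'pending', 'firing', 'inactive', 'pending']
```

Short `for:` values page faster but are more sensitive to brief blips, which is why the critical rules use shorter durations than the warning rules.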
Troubleshooting with Metrics
Problem: Metrics showing unrealistic values (nanoseconds instead of milliseconds)
Symptoms:
- Histogram values in the billions
- Latency metrics showing microsecond/nanosecond values
Cause:
Missing duration_unit: false in Histogram declaration
Solution:
Add duration_unit: false to all duration histogram declarations:
Histogram.declare(
name: :my_metric_duration_milliseconds,
# ... other options ...
duration_unit: false
)
Problem: Calls are slow
Investigation Steps:
- Check overall dialplan generation time:
histogram_quantile(0.95, rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
- Break down by component:
# Subscriber data lookup
histogram_quantile(0.95, rate(subscriber_data_duration_milliseconds_bucket[5m]))
# HLR lookup
histogram_quantile(0.95, rate(hlr_data_duration_milliseconds_bucket[5m]))
# OCS authorization
histogram_quantile(0.95, rate(ocs_authorization_duration_milliseconds_bucket[5m]))
- Check module-specific delays:
histogram_quantile(0.95,
sum by (le, module) (rate(dialplan_module_duration_milliseconds_bucket[5m]))
)
Common Causes:
- External system latency (HSS, HLR, OCS)
- Network issues
- Database contention
- High system load
Problem: Calls are failing
Investigation Steps:
- Check call rejection reasons:
sum by (reason) (rate(call_rejections_total[5m]))
- Check authorization decisions:
sum by (decision) (rate(authorization_decisions_total[5m]))
- Check Diameter peer health:
diameter_peer_state
- Check Event Socket connection:
event_socket_connected
Problem: High load
Investigation Steps:
- Check call volume:
rate(call_attempts_total[5m])
active_calls
- Check Erlang VM resources:
erlang_vm_process_count
erlang_vm_statistics_run_queues_length
erlang_vm_memory_bytes_total
- Check garbage collection:
rate(erlang_vm_statistics_garbage_collection_number_of_gcs[5m])
Problem: Poor Media Quality
Investigation Steps:
- Check MOS scores:
rtp_audio_in_quality_mos
rtp_audio_in_quality_percent
- Check jitter:
rtp_audio_in_jitter_seconds_max
rtp_audio_in_jitter_loss_rate
- Check packet loss:
rtp_audio_in_skip_packets_total
rtp_audio_in_flaw_total
- Check bandwidth usage:
rate(rtp_audio_in_media_bytes_total[1m]) * 8
Performance Baselines
Typical Values (Well-Tuned System)
Latency (P95):
- HTTP dialplan request: 200-500ms
- Subscriber data (Sh) lookup: 50-150ms
- HLR data lookup: 100-300ms
- OCS authorization: 100-250ms
- Diameter requests: 50-200ms
- Dialplan module processing: 10-50ms per module
Success Rates:
- Call completion: >95%
- Subscriber data lookups: >99%
- HLR lookups: >98%
- OCS authorizations: >99% (excluding legitimate no-credit)
- Diameter peer uptime: >99.9%
Media Quality:
- MOS score: >4.0
- Audio quality percentage: >80%
- Jitter: <30ms
- Packet loss rate: <1%
System Resources:
- Erlang process count: <50% of limit
- Erlang memory usage: <70% of available
- Scheduler run queue: <5
- ETS tables: <1000
Capacity Planning
Per-Server Capacity (recommended maximums):
- Concurrent calls: 500-1000 (depends on hardware)
- Calls per second: 20-50 CPS
- Registered subscribers: 10,000-50,000
Scaling Indicators (add capacity when):
- Active calls consistently >70% of capacity
- Erlang process count >70% of limit
- P95 latency degrading
- Scheduler run queues consistently >10
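The threshold-based scaling indicators above can be expressed as a small check. This is a minimal sketch: the function `scaling_signals` and its parameter names are hypothetical, the inputs are assumed to come from active_calls, erlang_vm_process_count, erlang_vm_process_limit, and erlang_vm_statistics_run_queues_length, and it inspects a single snapshot. Since the indicators say "consistently", a real check would apply it over a sustained window. The P95 latency trend is left out because it needs historical data.

```python
def scaling_signals(active_calls, call_capacity, process_count,
                    process_limit, run_queue_length):
    """Return the scaling indicators triggered by one snapshot of metrics."""
    signals = []
    if active_calls > 0.7 * call_capacity:
        signals.append("active calls above 70% of capacity")
    if process_count > 0.7 * process_limit:
        signals.append("Erlang process count above 70% of limit")
    if run_queue_length > 10:
        signals.append("scheduler run queue above 10")
    return signals


# Hypothetical snapshot: 800 active calls on a server sized for 1000.
print(scaling_signals(
    active_calls=800,
    call_capacity=1000,
    process_count=100_000,
    process_limit=262_144,
    run_queue_length=3,
))
# ['active calls above 70% of capacity']
```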
Best Practices
Monitoring Strategy
- Set up dashboards for different audiences:
  - Operations dashboard: Call volume, success rates, system health
  - Engineering dashboard: Latency percentiles, error rates, resource usage
  - Executive dashboard: High-level KPIs, uptime, cost metrics
- Configure alerts at multiple levels:
  - Critical: Page on-call (system down, major outage)
  - High: Alert during business hours (degraded performance)
  - Warning: Track in ticket system (potential issues)
- Use appropriate time ranges:
  - Real-time monitoring: 5-minute windows
  - Troubleshooting: 15-minute to 1-hour windows
  - Capacity planning: Daily/weekly aggregates
- Focus on user impact:
  - Prioritize end-to-end latency metrics
  - Track success rates over individual error counters
  - Monitor media quality for user experience
Query Performance
- Use recording rules for frequently-used queries:
groups:
  - name: ims_as_aggregations
    interval: 30s
    rules:
      - record: job:call_attempts:rate5m
        expr: rate(call_attempts_total[5m])
      - record: job:dialplan_latency:p95
        expr: histogram_quantile(0.95, rate(http_dialplan_request_duration_milliseconds_bucket[5m]))
- Avoid high-cardinality labels in queries (e.g., don't group by phone number)
- Use appropriate rate intervals:
  - Short-term trends: [5m]
  - Medium-term trends: [1h]
  - Long-term trends: [1d]
Metric Cardinality
Monitor cardinality to prevent Prometheus performance issues:
# Check metric cardinality
count by (__name__) ({__name__=~".+"})
High-cardinality risks:
- Labels with unique values per call (phone numbers, call IDs)
- Unbounded label values
- Labels with >1000 unique values
Solution:
- Use labels for categories, not unique identifiers
- Aggregate high-cardinality data in external systems
- Use recording rules to pre-aggregate
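Cardinality is simply the number of distinct label sets per metric name, so it can also be estimated offline from a scrape. The sketch below counts series per metric and flags any that exceed the 1000-value guideline mentioned above. The function names and the sample data are hypothetical, and the input is assumed to be (name, labels) pairs already parsed from an endpoint's output.

```python
from collections import defaultdict


def series_per_metric(samples):
    """Count distinct label sets per metric name.

    samples: iterable of (metric_name, labels_dict) pairs.
    """
    seen = defaultdict(set)
    for name, labels in samples:
        seen[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in seen.items()}


def high_cardinality(samples, limit=1000):
    """Return metric names with more than `limit` distinct series."""
    counts = series_per_metric(samples)
    return sorted(name for name, count in counts.items() if count > limit)


# Hypothetical data: a bounded label next to a per-call label.
samples = [("active_calls", {"call_type": t}) for t in ("mo", "mt", "emergency")]
samples += [("bad_metric", {"call_id": str(i)}) for i in range(1500)]

print(series_per_metric(samples))   # {'active_calls': 3, 'bad_metric': 1500}
print(high_cardinality(samples))    # ['bad_metric']
```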