Metrics and Monitoring

This document describes the telemetry, metrics, alarms, and health endpoints provided by OmniMSC. For overload threshold configuration, see Configuration Reference. For troubleshooting alert conditions, see Troubleshooting Guide. For the real-time dashboard view of active calls and subscriber count, see Control Panel Guide.

Telemetry Overview

OmniMSC emits Erlang/Elixir telemetry events for all significant operational activities. These events are exported as Prometheus metrics at the /metrics endpoint on the Phoenix HTTP port. All metric names carry the omnimsc_ prefix to avoid collisions with other applications. The System page in the control panel provides a real-time view of BEAM VM statistics, including process count, memory, and scheduler load; see Control Panel Guide.

System status page showing BEAM VM metrics, MSC configuration, and supervision tree health.

Main dashboard showing live subscriber count, active calls, SS7 link state, and SIP peer status — all key indicators that are also surfaced in Prometheus metrics.

Metric definitions are declared in Omnimsc.Telemetry.Metrics.Prometheus.metrics/0. Any Prometheus-compatible scraper (Prometheus, Grafana Agent, Datadog, VictoriaMetrics) can collect these metrics at its configured scrape interval.
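
As an illustration of consuming the /metrics output without a full Prometheus stack, the following Python sketch parses text in the Prometheus exposition format and keeps only samples with the omnimsc_ prefix. The sample payload, its values, and the helper name parse_metrics are assumptions made for this example; the parser is deliberately minimal and is not a replacement for a real client library.

```python
import re

# Hypothetical scrape result; metric names come from the reference table,
# but the values (and the non-omnimsc metric) are invented for illustration.
SAMPLE = """\
# HELP omnimsc_active_calls_count Currently active CS voice calls
# TYPE omnimsc_active_calls_count gauge
omnimsc_active_calls_count 128
omnimsc_peer_status{peer="Default-GW"} 1
omnimsc_peer_status{peer="MSC-02"} 0
omnimsc_auth_failure_count{reason="timeout"} 7
some_other_app_metric 3
"""

# name, optional {labels}, value, optional integer timestamp
LINE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text, prefix="omnimsc_"):
    """Return (name, labels, value) tuples for samples whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this minimal parser does not understand
        name, raw_labels, raw_value = match.groups()
        if not name.startswith(prefix):
            continue
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


samples = parse_metrics(SAMPLE)

# Peers whose omnimsc_peer_status gauge reports 0 (down).
down_peers = [
    labels["peer"]
    for name, labels, value in samples
    if name == "omnimsc_peer_status" and value == 0
]
print(down_peers)
```

In a real deployment the text would come from an HTTP GET against the /metrics endpoint rather than from an inline string.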

Metrics Reference

Metric	Type	Labels	Description
`omnimsc_active_calls_count`	Gauge	--	Currently active CS voice calls
`omnimsc_vlr_subscribers_count`	Gauge	--	Subscribers currently registered in VLR
`omnimsc_sccp_connections_count`	Gauge	--	Active SCCP connections (A/Iu interface)
`omnimsc_sms_sent_count`	Counter	--	Total SMS messages sent
`omnimsc_location_update_complete_count`	Counter	`type`	Location updates completed (imsi_attach, normal, periodic)
`omnimsc_auth_failure_count`	Counter	`reason`	Authentication failures (mac_failure, sync_failure, timeout)
`omnimsc_auth_skipped_count`	Counter	--	Auth skipped (valid existing security context)
`omnimsc_handover_attempt_count`	Counter	`type`	Handover attempts (intra_msc_inter_system, inter_msc)
`omnimsc_paging_attempt_count`	Counter	`result`	Paging attempts (dispatched, success, timeout)
`omnimsc_peer_status`	Gauge	`peer`	SIP/SS7 peer link status (1=up, 0=down)
`omnimsc_ss_operation_count`	Counter	`operation`, `ss_service`	Supplementary service operations
`omnimsc_ss_error_count`	Counter	`reason`	SS operation errors
`omnimsc_ussd_request_count`	Counter	`routing`	USSD requests (local_ss, hlr_relay)
`omnimsc_map_dialogue_duration`	Histogram	`operation`	MAP dialogue round-trip time (ms)
`omnimsc_call_release_count`	Counter	`type`	Call releases (mo, mt)

Label Values

omnimsc_location_update_complete_count -- the type label distinguishes location update types per 3GPP TS 24.008:

Value	Description
`imsi_attach`	IMSI attach (subscriber powering on)
`normal`	Normal location update (subscriber moved to new location area)
`periodic`	Periodic location update (T3212 timer expiry)

omnimsc_auth_failure_count -- the reason label identifies the failure cause:

Value	Description
`mac_failure`	SRES/RES mismatch -- MS response does not match expected value
`sync_failure`	SQN out of range, resynchronization needed
`timeout`	Authentication timer (T3260) expired without response

omnimsc_paging_attempt_count -- the result label tracks paging outcomes:

Value	Description
`dispatched`	Paging sent to BSC(s)
`success`	Subscriber responded to paging
`timeout`	Max retries exhausted without response

omnimsc_peer_status -- the peer label identifies the remote peer by its configured name (e.g., Default-GW, International-GW, MSC-02).

omnimsc_ss_operation_count -- the operation label identifies the SS operation (register, erase, activate, deactivate, interrogate) and the ss_service label identifies the target service (cfu, cfb, cfnry, cfnrc, cw, clip, clir, baoc, baoic).

omnimsc_ussd_request_count -- the routing label distinguishes between locally handled SS requests and those relayed to the HLR:

Value	Description
`local_ss`	Request handled locally by the MSC
`hlr_relay`	Request relayed to the HLR via MAP

omnimsc_call_release_count -- the type label distinguishes call direction:

Value	Description
`mo`	Mobile-originated call released
`mt`	Mobile-terminated call released

Example PromQL Queries

The following queries are useful starting points for dashboards and alerting rules.

Active call monitoring -- current call load on the MSC:

omnimsc_active_calls_count

Call release rate -- calls released per second by direction (mo, mt), averaged over five minutes:

rate(omnimsc_call_release_count[5m])

Auth failure rate -- authentication failures per second by reason:

rate(omnimsc_auth_failure_count[5m])

Peer availability -- identify any peers that are currently down:

omnimsc_peer_status == 0

SMS throughput -- SMS messages per second:

rate(omnimsc_sms_sent_count[5m])

Location update rate by type -- breakdown of LU activity:

sum by (type) (rate(omnimsc_location_update_complete_count[5m]))

SS operation rate by service -- supplementary service activity:

sum by (ss_service) (rate(omnimsc_ss_operation_count[5m]))

USSD routing breakdown -- local vs HLR-relayed USSD requests:

sum by (routing) (rate(omnimsc_ussd_request_count[5m]))
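
To make the rate() queries above more concrete, here is a simplified Python sketch of the arithmetic behind a per-second rate over counter samples. It is an approximation for intuition only: PromQL's rate() also extrapolates to the edges of the range window, which this sketch does not attempt. The sample values and the function name simple_rate are invented for the example.

```python
def simple_rate(samples):
    """Per-second increase across (timestamp_seconds, value) counter samples.

    Simplified compared with PromQL's rate(): no extrapolation to the
    boundaries of the range window.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A counter that goes down is treated as having reset to zero,
        # so the whole post-reset value counts as increase.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Invented samples of omnimsc_sms_sent_count scraped every 60 s over a
# five-minute window, with a counter reset (e.g. a restart) before t=180.
sms_samples = [(0, 1000), (60, 1030), (120, 1060), (180, 5), (240, 35), (300, 65)]
sms_rate = simple_rate(sms_samples)
print(sms_rate)
```

Here the increase is 30 + 30 + 5 + 30 + 30 = 125 messages over 300 seconds, so the reset does not produce a negative rate.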

Alarm System

OmniMSC raises and clears alarms for conditions that require operator attention. Each alarm has a severity level and a unique identifier.

Alarm Types

Alarm	Severity	Description
`sctp_link_down`	Critical	SCTP association to STP lost
`hlr_unreachable`	Critical	HLR not responding to MAP operations
`cdr_write_failure`	Major	CDR file write error
`overload`	Major	System overload threshold exceeded

Alarm Telemetry Events

The alarm subsystem emits telemetry events that can be consumed by external monitoring systems or attached to Prometheus metrics:

Event	Description
`[:omnimsc, :alarm, :raised]`	Emitted when an alarm condition is detected. Metadata includes alarm_id, severity, source, and descriptive text.
`[:omnimsc, :alarm, :cleared]`	Emitted when an alarm condition is resolved. Metadata includes alarm_id, severity, and source.

Alarms remain active until the underlying condition is resolved, at which point the cleared event is emitted. Multiple raises of the same alarm_id without an intervening clear are deduplicated.
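
The raise/clear lifecycle and deduplication described above can be sketched as follows. This is a minimal Python model, not OmniMSC's implementation: the AlarmTracker class, its method names, the use of the alarm type name as alarm_id, and the choice to ignore a clear for an inactive alarm are all assumptions made for illustration.

```python
class AlarmTracker:
    """Tracks active alarms by alarm_id and suppresses duplicate raises."""

    def __init__(self):
        self.active = {}  # alarm_id -> severity of currently active alarms
        self.events = []  # events that would be emitted, in order

    def raise_alarm(self, alarm_id, severity):
        if alarm_id in self.active:
            return False  # already active: deduplicated, no new event
        self.active[alarm_id] = severity
        self.events.append(("raised", alarm_id, severity))
        return True

    def clear_alarm(self, alarm_id):
        severity = self.active.pop(alarm_id, None)
        if severity is None:
            return False  # assumption: clearing an inactive alarm is a no-op
        self.events.append(("cleared", alarm_id, severity))
        return True


tracker = AlarmTracker()
tracker.raise_alarm("hlr_unreachable", "critical")
tracker.raise_alarm("hlr_unreachable", "critical")  # deduplicated
tracker.clear_alarm("hlr_unreachable")
tracker.raise_alarm("hlr_unreachable", "critical")  # raised again after the clear
print(tracker.events)
```

Only three events are recorded for the four calls, because the second raise arrives while the alarm is still active.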

Health Endpoint

OmniMSC exposes a health check endpoint for use by load balancers and orchestration systems.

GET /api/health returns the overall system health status. The response indicates whether the MSC is operational and accepting traffic. A healthy response confirms that core subsystems (VLR, CC, MAP client, SIP stack) are running. An unhealthy response indicates that one or more critical subsystems have failed.

This endpoint is suitable for Kubernetes liveness and readiness probes, or for load balancer health checks in traditional deployments.
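
For deployments that probe the endpoint from a script rather than from an orchestrator, a check along the following lines may be a useful starting point. This page does not specify the response format or status codes of /api/health, so the sketch assumes that a 2xx HTTP status means healthy and that anything else, including a connection failure, means unhealthy. The function names and the example base URL are hypothetical.

```python
import urllib.error
import urllib.request


def is_healthy_status(status_code):
    """Assumption: any 2xx status means healthy, anything else unhealthy."""
    return 200 <= status_code < 300


def probe_health(base_url, timeout=2.0):
    """Return True if GET <base_url>/api/health answers with a 2xx status."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses.
        return is_healthy_status(err.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar errors.
        return False


# Example call (hypothetical host and port):
# probe_health("http://omnimsc.example.internal:4000")
```

If the real endpoint signals health in the response body instead of the status code, is_healthy_status would need to be replaced with a check on the parsed body.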

Status Endpoint

GET /api/status returns detailed system information including active call count, registered subscriber count, peer link states, alarm summary, BEAM process count, and uptime. This endpoint provides a comprehensive snapshot for operational dashboards and diagnostic purposes.

The status response includes all the information needed to assess system capacity and identify degraded components without requiring Prometheus access.

Overload Protection

OmniMSC includes a configurable overload protection mechanism that prevents the system from exceeding safe operating limits. The overload module continuously monitors four metrics and compares them against configurable thresholds.

Overload Thresholds

Metric	Default Threshold	Description
Active calls	10,000	Maximum concurrent CS calls
Registered subscribers	50,000	Maximum subscribers in the VLR
BEAM process count	500,000	Maximum Erlang processes
Paging rate	1,000/sec	Maximum paging requests per second

When any threshold is exceeded, the overload module rejects new service requests with GSM cause 42 (switching equipment congestion). Calls already in progress are not affected. The overload state is reflected in the [:omnimsc, :overload, :state_change] telemetry event and the overload alarm.

Overload protection applies to location updates, call setup requests, and SMS transactions. Emergency calls bypass overload protection regardless of system load, per 3GPP TS 22.101.
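
The admission logic described above can be sketched as follows. This Python model uses the default thresholds from the table, but the key names, function names, request kinds, and the interpretation of "exceeded" as strictly greater than the threshold are assumptions for illustration, not OmniMSC's actual code.

```python
# Default thresholds from the table above; the key names are hypothetical.
DEFAULT_THRESHOLDS = {
    "active_calls": 10_000,
    "registered_subscribers": 50_000,
    "beam_process_count": 500_000,
    "paging_rate_per_sec": 1_000,
}

CAUSE_SWITCHING_EQUIPMENT_CONGESTION = 42


def exceeded(current, thresholds=DEFAULT_THRESHOLDS):
    """Names of monitored metrics whose current value is above its threshold."""
    return sorted(
        name for name, limit in thresholds.items() if current.get(name, 0) > limit
    )


def admit(request_kind, current, thresholds=DEFAULT_THRESHOLDS):
    """Return (accepted, cause) for a new service request.

    Emergency calls are always accepted; other requests are rejected with
    cause 42 while any threshold is exceeded.
    """
    if request_kind == "emergency_call":
        return True, None
    if exceeded(current, thresholds):
        return False, CAUSE_SWITCHING_EQUIPMENT_CONGESTION
    return True, None


# Invented load snapshot: only the active call count is over its limit.
load = {
    "active_calls": 10_250,
    "registered_subscribers": 41_000,
    "beam_process_count": 120_000,
    "paging_rate_per_sec": 300,
}
print(exceeded(load))
print(admit("call_setup", load))
print(admit("emergency_call", load))
```

A single exceeded threshold is enough to reject ordinary requests, while the emergency call path never consults the thresholds at all.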

For threshold configuration, see Configuration Reference.
