
Metrics and Monitoring

This document describes the telemetry, metrics, alarms, and health endpoints provided by OmniMSC. For overload threshold configuration, see Configuration Reference. For troubleshooting alert conditions, see Troubleshooting Guide. For the real-time dashboard view of active calls and subscriber count, see Control Panel Guide.


Telemetry Overview

OmniMSC emits Erlang/Elixir telemetry events for all significant operational activities. These events are exported as Prometheus metrics, available at the /metrics endpoint on the Phoenix HTTP port. All metric names are namespaced under omnimsc_ to avoid collisions with other applications. The System page in the control panel provides a real-time view of BEAM VM statistics including process count, memory, and scheduler load — see Control Panel Guide.

System status page showing BEAM VM metrics, MSC configuration, and supervision tree health.

Main dashboard showing live subscriber count, active calls, SS7 link state, and SIP peer status — all key indicators that are also surfaced in Prometheus metrics.

Metric definitions are declared in Omnimsc.Telemetry.Metrics.Prometheus.metrics/0. Any Prometheus-compatible scraper (Prometheus, Grafana Agent, the Datadog Agent, VictoriaMetrics) can collect these metrics at its configured scrape interval.
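As a rough illustration of what a scraper sees, the sketch below parses Prometheus text exposition output and keeps only the omnimsc_ series. The host and port in the usage note, the sample payload, and the erlang_vm_process_count line are assumptions made for the example; they are not taken from OmniMSC's actual output.

```python
import urllib.request

# Hypothetical sample of what /metrics might return; the real payload will differ.
SAMPLE = """\
# HELP omnimsc_active_calls_count Currently active CS voice calls
# TYPE omnimsc_active_calls_count gauge
omnimsc_active_calls_count 42
omnimsc_peer_status{peer="Default-GW"} 1
omnimsc_peer_status{peer="MSC-02"} 0
erlang_vm_process_count 1200
"""


def fetch_metrics(url, timeout=5.0):
    """Download the raw exposition text from a /metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_omnimsc_samples(text, prefix="omnimsc_"):
    """Return {series: value} for every sample whose name starts with prefix.

    The series key keeps the label set verbatim, e.g.
    'omnimsc_peer_status{peer="MSC-02"}'.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        close = line.rfind("}")
        if close != -1:
            # Labelled sample: everything up to the last '}' is the series.
            series, rest = line[: close + 1], line[close + 1 :].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if not rest or not series.startswith(prefix):
            continue
        samples[series] = float(rest[0])  # rest[1], if present, is a timestamp
    return samples
```

Against a running node this could be called as parse_omnimsc_samples(fetch_metrics("http://localhost:4000/metrics")), where the address is a placeholder for the configured Phoenix HTTP host and port.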


Metrics Reference

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| omnimsc_active_calls_count | Gauge | -- | Currently active CS voice calls |
| omnimsc_vlr_subscribers_count | Gauge | -- | Subscribers currently registered in VLR |
| omnimsc_sccp_connections_count | Gauge | -- | Active SCCP connections (A/Iu interface) |
| omnimsc_sms_sent_count | Counter | -- | Total SMS messages sent |
| omnimsc_location_update_complete_count | Counter | type | Location updates completed (imsi_attach, normal, periodic) |
| omnimsc_auth_failure_count | Counter | reason | Authentication failures (mac_failure, sync_failure, timeout) |
| omnimsc_auth_skipped_count | Counter | -- | Auth skipped (valid existing security context) |
| omnimsc_handover_attempt_count | Counter | type | Handover attempts (intra_msc_inter_system, inter_msc) |
| omnimsc_paging_attempt_count | Counter | result | Paging attempts (dispatched, success, timeout) |
| omnimsc_peer_status | Gauge | peer | SIP/SS7 peer link status (1=up, 0=down) |
| omnimsc_ss_operation_count | Counter | operation, ss_service | Supplementary service operations |
| omnimsc_ss_error_count | Counter | reason | SS operation errors |
| omnimsc_ussd_request_count | Counter | routing | USSD requests (local_ss, hlr_relay) |
| omnimsc_map_dialogue_duration | Histogram | operation | MAP dialogue round-trip time (ms) |
| omnimsc_call_release_count | Counter | type | Call releases (mo, mt) |

Label Values

omnimsc_location_update_complete_count -- the type label distinguishes location update types per 3GPP TS 24.008:

| Value | Description |
| --- | --- |
| imsi_attach | IMSI attach (subscriber powering on) |
| normal | Normal location update (subscriber moved to new location area) |
| periodic | Periodic location update (T3212 timer expiry) |

omnimsc_auth_failure_count -- the reason label identifies the failure cause:

| Value | Description |
| --- | --- |
| mac_failure | SRES/RES mismatch -- MS response does not match expected value |
| sync_failure | SQN out of range, resynchronization needed |
| timeout | Authentication timer (T3260) expired without response |

omnimsc_paging_attempt_count -- the result label tracks paging outcomes:

| Value | Description |
| --- | --- |
| dispatched | Paging sent to BSC(s) |
| success | Subscriber responded to paging |
| timeout | Max retries exhausted without response |

omnimsc_peer_status -- the peer label identifies the remote peer by its configured name (e.g., Default-GW, International-GW, MSC-02).

omnimsc_ss_operation_count -- the operation label identifies the SS operation (register, erase, activate, deactivate, interrogate) and the ss_service label identifies the target service (cfu, cfb, cfnry, cfnrc, cw, clip, clir, baoc, baoic).

omnimsc_ussd_request_count -- the routing label distinguishes between locally handled SS requests and those relayed to the HLR:

| Value | Description |
| --- | --- |
| local_ss | Request handled locally by the MSC |
| hlr_relay | Request relayed to the HLR via MAP |

omnimsc_call_release_count -- the type label distinguishes call direction:

| Value | Description |
| --- | --- |
| mo | Mobile-originated call released |
| mt | Mobile-terminated call released |

Example PromQL Queries

The following queries are useful starting points for dashboards and alerting rules.

Active call monitoring -- current call load on the MSC:

omnimsc_active_calls_count

Call rate -- calls released per second, averaged over five minutes:

rate(omnimsc_call_release_count[5m])

Auth failure rate -- authentication failures per second, one series per reason:

rate(omnimsc_auth_failure_count[5m])

Peer availability -- identify any peers that are currently down:

omnimsc_peer_status == 0

SMS throughput -- SMS messages per second:

rate(omnimsc_sms_sent_count[5m])

Location update rate by type -- breakdown of LU activity:

sum by (type) (rate(omnimsc_location_update_complete_count[5m]))

SS operation rate by service -- supplementary service activity:

sum by (ss_service) (rate(omnimsc_ss_operation_count[5m]))

USSD routing breakdown -- local vs HLR-relayed USSD requests:

sum by (routing) (rate(omnimsc_ussd_request_count[5m]))
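To make the rate() queries above more concrete, the following sketch shows roughly how a per-second rate is derived from raw counter samples, including the handling of a counter reset after a restart. It is a simplification: Prometheus also extrapolates to the edges of the range window, which is omitted here, and the sample values are invented for illustration.

```python
def counter_rate(samples):
    """Approximate a per-second rate from counter samples.

    samples is a list of (timestamp_seconds, counter_value) pairs sorted by
    time. Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset: the counter restarted from zero, so the whole
            # new value counts as increase.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# Invented samples: 120 call releases observed over 120 seconds.
steady = [(0, 100), (60, 160), (120, 220)]
# Invented samples with a restart between the second and third scrape.
with_reset = [(0, 100), (60, 130), (120, 20)]
```

Here counter_rate(steady) evaluates to 1.0 releases per second, while counter_rate(with_reset) counts 30 releases before the restart and 20 after it, giving 50 / 120 per second.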


Alarm System

OmniMSC raises and clears alarms for conditions that require operator attention. Each alarm has a severity level and a unique identifier.

Alarm Types

| Alarm | Severity | Description |
| --- | --- | --- |
| sctp_link_down | Critical | SCTP association to STP lost |
| hlr_unreachable | Critical | HLR not responding to MAP operations |
| cdr_write_failure | Major | CDR file write error |
| overload | Major | System overload threshold exceeded |

Alarm Telemetry Events

The alarm subsystem emits telemetry events that can be consumed by external monitoring systems or attached to Prometheus metrics:

| Event | Description |
| --- | --- |
| [:omnimsc, :alarm, :raised] | Emitted when an alarm condition is detected. Metadata includes alarm_id, severity, source, and descriptive text. |
| [:omnimsc, :alarm, :cleared] | Emitted when an alarm condition is resolved. Metadata includes alarm_id, severity, and source. |

Alarms remain active until the underlying condition is resolved, at which point the cleared event is emitted. Multiple raises of the same alarm_id without an intervening clear are deduplicated.
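The raise, deduplicate, and clear behavior described above can be modeled on the consumer side with a small table of active alarms. This is an illustrative sketch, not OmniMSC's implementation: the AlarmTable class is hypothetical, it assumes the alarm names from the table double as alarm_id values, and it assumes that clearing an alarm that is not active produces no event.

```python
class AlarmTable:
    """Tracks active alarms and suppresses duplicate raises."""

    def __init__(self):
        self._active = {}  # alarm_id -> metadata of the first raise

    def raise_alarm(self, alarm_id, severity, source, text=""):
        """Return True if a 'raised' event should be emitted."""
        if alarm_id in self._active:
            return False  # already active: deduplicated
        self._active[alarm_id] = {
            "severity": severity,
            "source": source,
            "text": text,
        }
        return True

    def clear_alarm(self, alarm_id):
        """Return True if a 'cleared' event should be emitted."""
        return self._active.pop(alarm_id, None) is not None

    def active(self):
        """Return the identifiers of all currently active alarms."""
        return sorted(self._active)
```

With this model, two consecutive raises of hlr_unreachable yield a single raised event, and a later raise after a clear is treated as a new alarm.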


Health Endpoint

OmniMSC exposes a health check endpoint for use by load balancers and orchestration systems.

GET /api/health returns the overall system health status. The response indicates whether the MSC is operational and accepting traffic. A healthy response confirms that core subsystems (VLR, CC, MAP client, SIP stack) are running. An unhealthy response indicates that one or more critical subsystems have failed.

This endpoint is suitable for Kubernetes liveness and readiness probes, or for load balancer health checks in traditional deployments.
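A probe against this endpoint might look like the sketch below. The response format is not specified here, so the sketch assumes that a healthy node answers with HTTP 200 and that any other status, timeout, or connection error means unhealthy. The check_health function, its opener parameter (present only to allow testing without a network), and the example address are all hypothetical.

```python
import urllib.error
import urllib.request


def check_health(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if GET <base_url>/api/health answers with HTTP 200.

    Assumption: 200 means healthy. urlopen raises HTTPError for 4xx/5xx
    responses, so those are reported as unhealthy, as are timeouts and
    refused connections.
    """
    url = base_url.rstrip("/") + "/api/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

A call such as check_health("http://msc.example:4000") could back a custom load balancer check, where the host and port stand in for the deployment's real address.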


Status Endpoint

GET /api/status returns detailed system information including active call count, registered subscriber count, peer link states, alarm summary, BEAM process count, and uptime. This endpoint provides a comprehensive snapshot for operational dashboards and diagnostic purposes.

The status response includes all the information needed to assess system capacity and identify degraded components without requiring Prometheus access.


Overload Protection

OmniMSC includes a configurable overload protection mechanism that prevents the system from exceeding safe operating limits. The overload module continuously monitors four metrics and compares them against configurable thresholds.

Overload Thresholds

| Metric | Default Threshold | Description |
| --- | --- | --- |
| Active calls | 10,000 | Maximum concurrent CS calls |
| Registered subscribers | 50,000 | Maximum subscribers in the VLR |
| BEAM process count | 500,000 | Maximum Erlang processes |
| Paging rate | 1,000/sec | Maximum paging requests per second |

When any threshold is exceeded, the overload module rejects new service requests with GSM cause 42 (switching equipment congestion). Calls already in progress are not affected. The overload state is reflected in the [:omnimsc, :overload, :state_change] telemetry event and the overload alarm.

Overload protection applies to location updates, call setup requests, and SMS transactions. Emergency calls bypass overload protection regardless of system load, per 3GPP TS 22.101.
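The admission rule described above can be summarized in a few lines. The sketch below is a model of the documented behavior rather than the overload module itself: the dictionary keys, the request kind strings, and the choice of a strict greater-than comparison are assumptions, while the default limits and cause value 42 come from this section.

```python
# Default limits from the thresholds table; the key names are hypothetical.
DEFAULT_THRESHOLDS = {
    "active_calls": 10_000,
    "vlr_subscribers": 50_000,
    "beam_processes": 500_000,
    "paging_rate_per_sec": 1_000,
}

CAUSE_SWITCHING_EQUIPMENT_CONGESTION = 42


def exceeded(current, thresholds=DEFAULT_THRESHOLDS):
    """Return the names of all thresholds that the current values exceed."""
    return [
        name
        for name, limit in thresholds.items()
        if current.get(name, 0) > limit  # assumption: strictly greater than
    ]


def admit(request_kind, current, thresholds=DEFAULT_THRESHOLDS):
    """Decide whether a new service request is accepted.

    Returns (accepted, cause). Emergency calls always pass; any other new
    request is rejected with cause 42 while a threshold is exceeded.
    """
    if request_kind == "emergency_call":
        return True, None
    if exceeded(current, thresholds):
        return False, CAUSE_SWITCHING_EQUIPMENT_CONGESTION
    return True, None
```

For example, with 10,001 active calls a "call_setup" request is rejected with cause 42, while an "emergency_call" request is still admitted.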

For threshold configuration, see Configuration Reference.