Metrics and Monitoring
This document describes the telemetry, metrics, alarms, and health endpoints provided by OmniMSC. For overload threshold configuration, see Configuration Reference. For troubleshooting alert conditions, see Troubleshooting Guide. For the real-time dashboard view of active calls and subscriber count, see Control Panel Guide.
Telemetry Overview
OmniMSC emits Erlang/Elixir telemetry events for all significant operational activities. These events are exported as Prometheus metrics, available at the /metrics endpoint on the Phoenix HTTP port. All metric names are namespaced under omnimsc_ to avoid collisions with other applications. The System page in the control panel provides a real-time view of BEAM VM statistics including process count, memory, and scheduler load — see Control Panel Guide.
Figure: System status page showing BEAM VM metrics, MSC configuration, and supervision tree health.
Figure: Main dashboard showing live subscriber count, active calls, SS7 link state, and SIP peer status, all key indicators that are also surfaced in Prometheus metrics.
Metric definitions are declared in Omnimsc.Telemetry.Metrics.Prometheus.metrics/0. Any Prometheus-compatible scraper (Prometheus, Grafana Agent, Datadog, VictoriaMetrics) can collect these metrics at whatever scrape interval it is configured to use.
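As a rough illustration of what a scraper receives, the sketch below parses a few lines in the Prometheus text exposition format and lists peers reported as down. The sample payload is hypothetical: the metric names come from this document, but the values and the exact output shape are assumptions. The parser handles only the simple `name{labels} value` form, not escaped label values, timestamps, or exemplars.

```python
import re

# Hypothetical scrape output; values are illustrative only.
SAMPLE = """\
# HELP omnimsc_active_calls_count Currently active CS voice calls
# TYPE omnimsc_active_calls_count gauge
omnimsc_active_calls_count 42
omnimsc_peer_status{peer="Default-GW"} 1
omnimsc_peer_status{peer="MSC-02"} 0
omnimsc_auth_failure_count{reason="timeout"} 7
"""

_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def peers_down(samples):
    """Names of peers whose omnimsc_peer_status gauge is 0."""
    return [labels.get("peer") for name, labels, value in samples
            if name == "omnimsc_peer_status" and value == 0]


samples = parse_metrics(SAMPLE)
print(peers_down(samples))  # ['MSC-02']
```

In production a real client library or scraper is the better choice; the point here is only to show how label sets and values appear on the wire.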
Metrics Reference
| Metric | Type | Labels | Description |
|---|---|---|---|
| omnimsc_active_calls_count | Gauge | -- | Currently active CS voice calls |
| omnimsc_vlr_subscribers_count | Gauge | -- | Subscribers currently registered in VLR |
| omnimsc_sccp_connections_count | Gauge | -- | Active SCCP connections (A/Iu interface) |
| omnimsc_sms_sent_count | Counter | -- | Total SMS messages sent |
| omnimsc_location_update_complete_count | Counter | type | Location updates completed (imsi_attach, normal, periodic) |
| omnimsc_auth_failure_count | Counter | reason | Authentication failures (mac_failure, sync_failure, timeout) |
| omnimsc_auth_skipped_count | Counter | -- | Auth skipped (valid existing security context) |
| omnimsc_handover_attempt_count | Counter | type | Handover attempts (intra_msc_inter_system, inter_msc) |
| omnimsc_paging_attempt_count | Counter | result | Paging attempts (dispatched, success, timeout) |
| omnimsc_peer_status | Gauge | peer | SIP/SS7 peer link status (1=up, 0=down) |
| omnimsc_ss_operation_count | Counter | operation, ss_service | Supplementary service operations |
| omnimsc_ss_error_count | Counter | reason | SS operation errors |
| omnimsc_ussd_request_count | Counter | routing | USSD requests (local_ss, hlr_relay) |
| omnimsc_map_dialogue_duration | Histogram | operation | MAP dialogue round-trip time (ms) |
| omnimsc_call_release_count | Counter | type | Call releases (mo, mt) |
Label Values
omnimsc_location_update_complete_count -- the type label distinguishes location update types per 3GPP TS 24.008:
| Value | Description |
|---|---|
| imsi_attach | IMSI attach (subscriber powering on) |
| normal | Normal location update (subscriber moved to new location area) |
| periodic | Periodic location update (T3212 timer expiry) |
omnimsc_auth_failure_count -- the reason label identifies the failure cause:
| Value | Description |
|---|---|
| mac_failure | SRES/RES mismatch -- MS response does not match expected value |
| sync_failure | SQN out of range, resynchronization needed |
| timeout | Authentication timer (T3260) expired without response |
omnimsc_paging_attempt_count -- the result label tracks paging outcomes:
| Value | Description |
|---|---|
| dispatched | Paging sent to BSC(s) |
| success | Subscriber responded to paging |
| timeout | Max retries exhausted without response |
omnimsc_peer_status -- the peer label identifies the remote peer by its configured name (e.g., Default-GW, International-GW, MSC-02).
omnimsc_ss_operation_count -- the operation label identifies the SS operation (register, erase, activate, deactivate, interrogate) and the ss_service label identifies the target service (cfu, cfb, cfnry, cfnrc, cw, clip, clir, baoc, baoic).
omnimsc_ussd_request_count -- the routing label distinguishes between locally handled SS requests and those relayed to the HLR:
| Value | Description |
|---|---|
| local_ss | Request handled locally by the MSC |
| hlr_relay | Request relayed to the HLR via MAP |
omnimsc_call_release_count -- the type label distinguishes call direction:
| Value | Description |
|---|---|
| mo | Mobile-originated call released |
| mt | Mobile-terminated call released |
Example PromQL Queries
The following queries are useful starting points for dashboards and alerting rules.
Active call monitoring -- current call load on the MSC:
omnimsc_active_calls_count
Call rate -- calls released per second, averaged over five minutes:
rate(omnimsc_call_release_count[5m])
Auth failure rate -- authentication failures per second, broken down by reason:
sum by (reason) (rate(omnimsc_auth_failure_count[5m]))
Peer availability -- identify any peers that are currently down (the gauge is 0 when a peer is down):
omnimsc_peer_status == 0
SMS throughput -- SMS messages per second:
rate(omnimsc_sms_sent_count[5m])
Location update rate by type -- breakdown of LU activity:
sum by (type) (rate(omnimsc_location_update_complete_count[5m]))
SS operation rate by service -- supplementary service activity:
sum by (ss_service) (rate(omnimsc_ss_operation_count[5m]))
USSD routing breakdown -- local vs HLR-relayed USSD requests:
sum by (routing) (rate(omnimsc_ussd_request_count[5m]))
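To make the rate() queries above concrete, the following sketch approximates a per-second rate from two counter readings. It is a deliberate simplification: PromQL's rate() uses every sample in the window and extrapolates to the window edges, whereas this uses only the first and last reading. The sample values are made up.

```python
def per_second_rate(first, last):
    """Approximate per-second increase between two (timestamp_s, counter) samples.

    A counter that went backwards is treated as a reset (for example after a
    restart), so the later value alone counts as the increase.
    """
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    increase = v1 - v0 if v1 >= v0 else v1
    return increase / (t1 - t0)


# Hypothetical readings of omnimsc_sms_sent_count taken 300 s apart.
print(per_second_rate((0, 12_000), (300, 12_600)))  # 2.0
print(per_second_rate((0, 12_000), (300, 150)))     # 0.5 (counter reset)
```

The reset handling is why counters should be queried through rate() or increase() rather than by subtracting raw values.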
Alarm System
OmniMSC raises and clears alarms for conditions that require operator attention. Each alarm has a severity level and a unique identifier.
Alarm Types
| Alarm | Severity | Description |
|---|---|---|
| sctp_link_down | Critical | SCTP association to STP lost |
| hlr_unreachable | Critical | HLR not responding to MAP operations |
| cdr_write_failure | Major | CDR file write error |
| overload | Major | System overload threshold exceeded |
Alarm Telemetry Events
The alarm subsystem emits telemetry events that can be consumed by external monitoring systems or attached to Prometheus metrics:
| Event | Description |
|---|---|
| [:omnimsc, :alarm, :raised] | Emitted when an alarm condition is detected. Metadata includes alarm_id, severity, source, and descriptive text. |
| [:omnimsc, :alarm, :cleared] | Emitted when an alarm condition is resolved. Metadata includes alarm_id, severity, and source. |
Alarms remain active until the underlying condition is resolved, at which point the cleared event is emitted. Multiple raises of the same alarm_id without an intervening clear are deduplicated.
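The raise, deduplicate, and clear behaviour described above can be modelled as a small state table. This is a hypothetical consumer-side sketch, not OmniMSC's internal implementation; the class and method names are invented, and using the alarm type name as the alarm_id is an assumption made for readability.

```python
class AlarmTable:
    """Tracks active alarms and drops repeated raises of the same alarm_id."""

    def __init__(self):
        self.active = {}   # alarm_id -> severity
        self.events = []   # emitted ("raised" | "cleared", alarm_id) tuples

    def raise_alarm(self, alarm_id, severity):
        if alarm_id in self.active:
            return False  # already active: duplicate raise, emit nothing
        self.active[alarm_id] = severity
        self.events.append(("raised", alarm_id))
        return True

    def clear_alarm(self, alarm_id):
        if alarm_id not in self.active:
            return False  # nothing to clear
        del self.active[alarm_id]
        self.events.append(("cleared", alarm_id))
        return True


table = AlarmTable()
table.raise_alarm("hlr_unreachable", "critical")
table.raise_alarm("hlr_unreachable", "critical")  # deduplicated
table.clear_alarm("hlr_unreachable")
table.raise_alarm("hlr_unreachable", "critical")  # raised again after the clear
print(table.events)
```

A monitoring integration that mirrors this logic sees exactly one raised event per alarm episode, which keeps paging and ticketing systems from firing repeatedly for the same condition.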
Health Endpoint
OmniMSC exposes a health check endpoint for use by load balancers and orchestration systems.
GET /api/health returns the overall system health status. The response indicates whether the MSC is operational and accepting traffic. A healthy response confirms that core subsystems (VLR, CC, MAP client, SIP stack) are running. An unhealthy response indicates that one or more critical subsystems have failed.
This endpoint is suitable for Kubernetes liveness and readiness probes, or for load balancer health checks in traditional deployments.
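For deployments without a built-in probe mechanism, a health check can be scripted. The sketch below assumes that a healthy MSC answers GET /api/health with HTTP 200 and that any other status, or no answer at all, means unhealthy. The response format is not specified in this document, so treat that mapping as an assumption. The fetch parameter is a hypothetical hook that lets the check be exercised without a network.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for a GET, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered with a non-2xx status
    except (urllib.error.URLError, OSError):
        return None      # connection refused, DNS failure, or timeout


def is_healthy(base_url, fetch=http_status):
    """Assumption: a healthy MSC answers GET /api/health with HTTP 200."""
    return fetch(base_url.rstrip("/") + "/api/health") == 200
```

A call such as `is_healthy("http://localhost:4000")` would probe a local node; the host and port here are placeholders for whatever the Phoenix HTTP listener is configured to use.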
Status Endpoint
GET /api/status returns detailed system information including active call count, registered subscriber count, peer link states, alarm summary, BEAM process count, and uptime. This endpoint provides a comprehensive snapshot for operational dashboards and diagnostic purposes.
The status response includes all the information needed to assess system capacity and identify degraded components without requiring Prometheus access.
Overload Protection
OmniMSC includes a configurable overload protection mechanism that prevents the system from exceeding safe operating limits. The overload module continuously monitors four metrics and compares them against configurable thresholds.
Overload Thresholds
| Metric | Default Threshold | Description |
|---|---|---|
| Active calls | 10,000 | Maximum concurrent CS calls |
| Registered subscribers | 50,000 | Maximum subscribers in the VLR |
| BEAM process count | 500,000 | Maximum Erlang processes |
| Paging rate | 1,000/sec | Maximum paging requests per second |
When any threshold is exceeded, the overload module rejects new service requests with GSM cause 42 (switching equipment congestion). Calls already in progress are not affected. The overload state is reflected in the [:omnimsc, :overload, :state_change] telemetry event and the overload alarm.
Overload protection applies to location updates, call setup requests, and SMS transactions. Emergency calls bypass overload protection regardless of system load, per 3GPP TS 22.101.
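The admission decision described above can be sketched as follows. The threshold values are the defaults from the table; the dictionary keys, the function names, and the choice to treat a value equal to the threshold as still acceptable are assumptions made for illustration, not OmniMSC's actual implementation.

```python
# Default thresholds from the table above; key names are hypothetical.
DEFAULT_THRESHOLDS = {
    "active_calls": 10_000,
    "registered_subscribers": 50_000,
    "beam_process_count": 500_000,
    "paging_rate_per_sec": 1_000,
}

CAUSE_SWITCHING_EQUIPMENT_CONGESTION = 42


def exceeded(current, thresholds=DEFAULT_THRESHOLDS):
    """Names of metrics whose current value is above its threshold."""
    return [name for name, limit in thresholds.items()
            if current.get(name, 0) > limit]


def admit(current, emergency=False, thresholds=DEFAULT_THRESHOLDS):
    """Return (accepted, cause). Emergency calls are always accepted."""
    if emergency or not exceeded(current, thresholds):
        return True, None
    return False, CAUSE_SWITCHING_EQUIPMENT_CONGESTION


load = {"active_calls": 10_001, "registered_subscribers": 31_000}
print(admit(load))                  # (False, 42)
print(admit(load, emergency=True))  # (True, None)
```

Because a single exceeded threshold is enough to reject new requests, alerting on each of the four inputs separately gives earlier warning than alerting on the overload alarm alone.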
For threshold configuration, see Configuration Reference.