OmniHSS Metrics and Monitoring Guide
Table of Contents
- Monitoring Overview
- Control Panel Monitoring
- Database Monitoring
- Log Monitoring
- External Monitoring Integration
- Key Performance Indicators
- Alerting Strategies
- Monitoring Checklist
Monitoring Overview
OmniHSS provides several mechanisms for monitoring system health, performance, and subscriber activity. No single mechanism covers everything, so operations staff should combine them.
Monitoring Layers
Control Panel Monitoring
The Control Panel provides the primary real-time monitoring interface.
Overview Page Monitoring
URL: https://[hostname]:7443/overview
Key Metrics Available
Monitored Subscriber States
| State | Indicator | What It Means |
|---|---|---|
| Idle | No location info | Subscriber powered off or out of coverage |
| Attached | MME present | Subscriber registered to network |
| PDN Active | PDN session count > 0 | Active data connection |
| IMS Registered | S-CSCF assigned | Voice services ready |
| In Call | Active call count > 0 | VoLTE call in progress |
Extracting Metrics from Overview
The Control Panel does not export metrics directly, but you can collect them by hand from the Overview page:
- Count visible rows for the total number of subscribers
- Count green checkmarks for the number of enabled subscribers
- Expand a subscriber's details to see its current state
- Note last-seen timestamps to judge how recently each subscriber was active
Diameter Page Monitoring
URL: https://[hostname]:7443/diameter
Key Metrics
Critical Peer Monitoring
Identify critical peers and monitor their status:
| Peer Type | Criticality | Impact if Down |
|---|---|---|
| MME | High | No new LTE attachments |
| P-GW | High | No data sessions |
| S-CSCF | High | No IMS registrations |
| P-CSCF | High | No VoLTE calls |
| I-CSCF | Medium | IMS routing issues |
| AS | Low-Medium | Specific service unavailable |
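The table above can be turned into a lookup that a monitoring script consults when a peer drops. The following is a minimal sketch. The function name `peer_criticality` and the lower-case peer type tokens are assumptions made for illustration and are not part of OmniHSS.

```shell
# Hypothetical helper: map a Diameter peer type to the criticality and
# impact listed in the table above. The type tokens are illustrative.
peer_criticality() {
    case "$1" in
        mme)    echo "High: No new LTE attachments" ;;
        pgw)    echo "High: No data sessions" ;;
        scscf)  echo "High: No IMS registrations" ;;
        pcscf)  echo "High: No VoLTE calls" ;;
        icscf)  echo "Medium: IMS routing issues" ;;
        as)     echo "Low-Medium: Specific service unavailable" ;;
        *)      echo "Unknown: peer type not in table"; return 1 ;;
    esac
}
```

For example, `peer_criticality pgw` prints `High: No data sessions`, which a wrapper script could include in its alert text.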
Application Page Monitoring
URL: https://[hostname]:7443/application
Key Metrics
| Metric | Description | Normal Range | Action Threshold |
|---|---|---|---|
| Process Count | Active Erlang processes | Varies by load | > 90% of limit |
| Memory Usage | Total memory consumed | < 80% | > 90% |
| Uptime | Time since last restart | N/A | Track for stability |
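The thresholds in this table can be encoded in a small check once the readings are available as numbers. This is a sketch under assumptions: how the memory percentage, process count, and process limit are collected is left to you, and the function names and the intermediate `WARNING` label (for readings between the normal range and the action threshold) are illustrative.

```shell
# Hypothetical helpers: classify Application page readings against the
# thresholds in the table above. Inputs are whole numbers.

# check_memory PERCENT -> OK (< 80), WARNING (80 to 90), ACTION (> 90)
check_memory() {
    local pct="$1"
    if [ "$pct" -gt 90 ]; then
        echo "ACTION"
    elif [ "$pct" -ge 80 ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# check_process_count COUNT LIMIT -> ACTION when COUNT exceeds 90% of LIMIT
check_process_count() {
    local count="$1" limit="$2"
    # Integer form of count/limit > 0.9: count*10 > limit*9
    if [ $((count * 10)) -gt $((limit * 9)) ]; then
        echo "ACTION"
    else
        echo "OK"
    fi
}
```

The process count comparison is kept in integer arithmetic so the script does not depend on a floating-point tool.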
Database Monitoring
Direct Database Queries
Connect to the SQL database to extract detailed metrics:
Subscriber Counts
Query the database to retrieve:
- Total count of all subscribers
- Count of enabled subscribers
- Count of IMS-enabled subscribers
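This guide does not describe the OmniHSS schema, so the following is only a sketch of what the three count queries might look like. The table name `subscribers` and the columns `enabled` and `ims_enabled` are hypothetical placeholders; substitute the real names from your database. The shell function only prints the SQL so that it can be piped into whichever SQL client your deployment uses.

```shell
# Print subscriber count queries. Table and column names are
# hypothetical placeholders, not the actual OmniHSS schema.
subscriber_count_sql() {
    cat <<'SQL'
-- Total count of all subscribers
SELECT COUNT(*) AS total_subscribers FROM subscribers;
-- Count of enabled subscribers
SELECT COUNT(*) AS enabled_subscribers FROM subscribers WHERE enabled = TRUE;
-- Count of IMS-enabled subscribers
SELECT COUNT(*) AS ims_subscribers FROM subscribers WHERE ims_enabled = TRUE;
SQL
}
```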
Session Statistics
Query the database to retrieve:
- Count of active PDN sessions
- Count of active VoLTE calls
- Breakdown of PDN sessions by APN profile
Location Statistics
Query the database to retrieve:
- Subscriber count grouped by visited network (MCC-MNC combination)
- Count of subscribers currently roaming (not on home PLMN 001-001)
- Distribution of subscribers across different visited networks
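The roaming count depends on comparing each subscriber's visited PLMN with the home PLMN. A possible shape for these queries is sketched below. The table `subscriber_locations` and the columns `visited_mcc` and `visited_mnc` are hypothetical, since this guide does not describe the schema, and the function simply prints SQL for your own client to run.

```shell
# roaming_sql HOME_MCC HOME_MNC
# Print location queries. Table and column names are hypothetical
# placeholders, not the actual OmniHSS schema.
roaming_sql() {
    local home_mcc="$1" home_mnc="$2"
    cat <<SQL
-- Subscribers per visited network (MCC-MNC combination)
SELECT visited_mcc, visited_mnc, COUNT(*) AS subscribers
FROM subscriber_locations
GROUP BY visited_mcc, visited_mnc
ORDER BY subscribers DESC;
-- Subscribers currently roaming (visited PLMN differs from home PLMN).
-- Rows with no location (NULL) do not match and are not counted.
SELECT COUNT(*) AS roaming_subscribers
FROM subscriber_locations
WHERE NOT (visited_mcc = '${home_mcc}' AND visited_mnc = '${home_mnc}');
SQL
}
```

With the home PLMN used in this guide, the call would be `roaming_sql 001 001`.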
Recent Activity
Query the database to retrieve:
- Count of subscribers seen in the last hour
- Distribution of subscribers by serving MME
- Timestamp analysis of last subscriber activity
Database Health Monitoring
Monitor database health by querying:
- Total database size and growth trends
- Individual table sizes and row counts
- Current database connection count
- Query performance and resource usage
Log Monitoring
Log Output
OmniHSS writes logs to stdout and stderr. Configure your process manager to capture and retain them.
Log Levels
Key Log Patterns to Monitor
Diameter Peer Events:
[info] Diameter peer connected: mme01.epc.example.com
[warn] Diameter peer disconnected: pgw01.epc.example.com
[error] Diameter peer connection failed: timeout
Database Events:
[info] Database connection established
[error] Database connection lost: timeout
[error] Database query failed: deadlock detected
Authentication Events:
[info] Authentication successful: IMSI 001001123456789
[warn] Authentication failed: IMSI 001001123456789, invalid vector
[error] Roaming denied: IMSI 001001123456789, MCC 310 MNC 410
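Patterns like these can be counted with standard text tools once the log stream is captured. The sketch below assumes each line contains the level tag and message text exactly as in the samples above; real log lines may carry extra prefixes such as timestamps, which is why the patterns are not anchored to the start of the line. Both function names are illustrative.

```shell
# Read log lines on stdin and print how many carry [warn] and [error].
count_log_events() {
    awk '
        /\[warn\]/  { warn++ }
        /\[error\]/ { error++ }
        END { printf "warn=%d error=%d\n", warn, error }
    '
}

# Read log lines on stdin and list each peer that logged a disconnect.
disconnected_peers() {
    sed -n 's/.*Diameter peer disconnected: \(.*\)$/\1/p' | sort -u
}
```

If the process manager is systemd, something like `journalctl -u omnihss --since "1 hour ago" | count_log_events` would feed the last hour of logs into the counter. The unit name `omnihss` is an assumption.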
Log Aggregation
For production deployments, forward the captured logs to a central log aggregation system.
External Monitoring Integration
Health Check Endpoint
API Health Check: GET /api/status
curl -k https://hss.example.com:8443/api/status
Expected Response:
{"status": "ok"}
HTTP Status: 200 OK
Monitoring Tool Integration
Nagios/Icinga Example
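#!/bin/bash
# check_omnihss.sh - Nagios/Icinga plugin: exit 0 = OK, exit 2 = CRITICAL
API_URL="https://hss.example.com:8443/api/status"
# -k skips TLS certificate verification; -w prints only the HTTP status code.
# On a connection failure or timeout, curl reports the code as 000.
response=$(curl -k -s -o /dev/null -w "%{http_code}" --max-time 5 "$API_URL")
if [ "$response" = "200" ]; then
    echo "OK - OmniHSS API responding"
    exit 0
else
    echo "CRITICAL - OmniHSS API not responding (HTTP $response)"
    exit 2
fi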
Prometheus Integration
To expose OmniHSS metrics to Prometheus, write a custom exporter that queries the API and the database.
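One lightweight way to build such an exporter is to write a metrics file for the node_exporter textfile collector instead of running a separate HTTP service. The sketch below renders the result of the health check in the Prometheus text exposition format. The metric name `omnihss_api_up`, the function name, and the output path in the comments are assumptions, not names defined by OmniHSS.

```shell
# render_metrics HTTP_CODE
# Print a gauge in Prometheus text format: 1 if GET /api/status
# returned 200, otherwise 0. The metric name is hypothetical.
render_metrics() {
    local http_code="$1" up=0
    if [ "$http_code" = "200" ]; then
        up=1
    fi
    cat <<EOF
# HELP omnihss_api_up Whether GET /api/status returned HTTP 200.
# TYPE omnihss_api_up gauge
omnihss_api_up $up
EOF
}

# Example wiring (run from cron; the directory is an assumed textfile
# collector path, so adjust it to your node_exporter configuration):
#   code=$(curl -k -s -o /dev/null -w "%{http_code}" --max-time 5 \
#          https://hss.example.com:8443/api/status)
#   render_metrics "$code" > /var/lib/node_exporter/omnihss.prom.tmp
#   mv /var/lib/node_exporter/omnihss.prom.tmp /var/lib/node_exporter/omnihss.prom
```

Writing to a temporary file and renaming it keeps the collector from reading a half-written file.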
SNMP Integration
For SNMP-based monitoring, write custom SNMP extension scripts that query the database or API and return the resulting values under SNMP OIDs.
Key Performance Indicators
Operational KPIs
Recommended KPI Thresholds
| KPI | Target | Warning | Critical |
|---|---|---|---|
| System Uptime | 99.99% | < 99.95% | < 99.9% |
| Diameter Peer Uptime | 99.9% | < 99.5% | < 99% |
| Authentication Success Rate | > 99% | < 99% | < 95% |
| Diameter Response Time | < 100ms | > 200ms | > 500ms |
| Database Query Time | < 50ms | > 100ms | > 500ms |
| Error Rate | < 0.1% | > 0.5% | > 1% |
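Two pieces of arithmetic help when applying this table: converting an uptime percentage into an allowed downtime budget, and classifying a measured latency against its warning and critical thresholds. The sketch below uses `awk` for the floating-point work. The function names are illustrative, and readings between the target and the warning threshold are reported as `OK` by assumption.

```shell
# downtime_budget_minutes AVAILABILITY_PERCENT PERIOD_DAYS
# Print the downtime, in minutes, that the availability figure allows.
downtime_budget_minutes() {
    awk -v a="$1" -v d="$2" \
        'BEGIN { printf "%.2f\n", (100 - a) / 100 * d * 24 * 60 }'
}

# classify_latency VALUE_MS WARNING_MS CRITICAL_MS
# Print CRITICAL, WARNING, or OK for a "higher is worse" KPI.
classify_latency() {
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v > c)      print "CRITICAL"
        else if (v > w) print "WARNING"
        else            print "OK"
    }'
}
```

Over a 30-day month, `downtime_budget_minutes 99.99 30` gives 4.32 minutes, while the critical level of 99.9% corresponds to 43.20 minutes. For Diameter response time, `classify_latency 250 200 500` prints `WARNING`.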
Capacity KPIs
| Metric | Monitor | Plan Action At |
|---|---|---|
| Total Subscribers | Current count | 80% of expected capacity |
| Concurrent PDN Sessions | Active sessions | 70% of expected maximum |
| Database Size | MB used | 80% of allocated storage |
| Database Connections | Active connections | 80% of pool size |
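Each row of this table compares a current value with a planned maximum. A minimal sketch of that comparison follows. The function name and output wording are illustrative, and the capacity figures passed in are whatever your own plan defines.

```shell
# capacity_check NAME CURRENT CAPACITY PLAN_AT_PERCENT
# Print PLAN when utilisation has reached the planning threshold.
capacity_check() {
    local name="$1" current="$2" capacity="$3" plan_at="$4"
    # Integer percentage, rounded down
    local pct=$(( current * 100 / capacity ))
    if [ "$pct" -ge "$plan_at" ]; then
        echo "PLAN - $name at ${pct}% (threshold ${plan_at}%)"
    else
        echo "OK - $name at ${pct}% (threshold ${plan_at}%)"
    fi
}
```

For example, `capacity_check "Total Subscribers" 85000 100000 80` prints `PLAN - Total Subscribers at 85% (threshold 80%)`.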
Alerting Strategies
Alert Priorities
Alert Definitions
Critical Alerts (P1)
System Unavailable:
- API health check fails
- Control Panel inaccessible
- Database connection fails
- Action: Immediate investigation and escalation
All Diameter Peers Disconnected:
- Zero connected peers
- Action: Check network, restart if necessary
Database Down:
- Cannot connect to SQL Database
- Action: Investigate database server, restart if necessary
High Priority Alerts (P2)
Critical Diameter Peer Down:
- Primary MME disconnected
- Primary P-GW disconnected
- Primary S-CSCF disconnected
- Action: Investigate peer connectivity within 15 minutes
High Memory Usage:
- Memory > 95%
- Action: Investigate memory leak, plan restart
High Authentication Failure Rate:
- More than 10% of auth requests fail
- Action: Check subscriber provisioning, investigate cause
Medium Priority Alerts (P3)
Non-Critical Peer Down:
- Secondary peer disconnected
- Application Server disconnected
- Action: Investigate within 1 hour
Elevated Memory Usage:
- Memory > 85%
- Action: Monitor trend, plan capacity upgrade
Elevated Error Rate:
- Error rate > 1%
- Action: Review logs, identify root cause
Low Priority Alerts (P4)
Capacity Warning:
- Subscribers > 80% of capacity
- Database > 80% of allocated storage
- Action: Plan capacity expansion
Performance Degradation:
- Response times elevated but acceptable
- Action: Monitor and optimize queries
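The definitions above can be evaluated in priority order so that a script reports only the most severe condition. The sketch below covers a subset of them (API health, connected peer count, and memory usage). The function name, argument order, and the `yes`/`no` convention are assumptions; the thresholds are the ones listed above.

```shell
# alert_priority API_OK CONNECTED_PEERS MEMORY_PERCENT
# API_OK is "yes" or "no". Prints the highest matching priority.
alert_priority() {
    local api_ok="$1" connected_peers="$2" memory_pct="$3"
    if [ "$api_ok" != "yes" ] || [ "$connected_peers" -eq 0 ]; then
        echo "P1"      # system unavailable or all Diameter peers down
    elif [ "$memory_pct" -gt 95 ]; then
        echo "P2"      # high memory usage
    elif [ "$memory_pct" -gt 85 ]; then
        echo "P3"      # elevated memory usage
    else
        echo "none"
    fi
}
```

Checking the P1 conditions first means a memory warning never masks an outage: `alert_priority no 4 90` prints `P1`, not `P3`.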
Alert Notification Channels
Monitoring Checklist
Daily Checks
- Review Control Panel Overview - subscriber counts normal
- Review Diameter page - all critical peers connected
- Review Application page - memory and processes within limits
- Check for error logs - no critical errors in last 24 hours
- Verify backup completed successfully
Weekly Checks
- Review capacity trends - subscriber growth
- Review performance trends - response times
- Review database size - growth rate acceptable
- Review error rates - identify patterns
- Test alert notifications - ensure working
Monthly Checks
- Capacity planning review - project 6 months ahead
- Performance optimization review - identify slow queries
- Security review - certificate expiration, vulnerabilities
- Documentation review - update runbooks
- Disaster recovery test - verify backups restore correctly