Database Replication
Table of Contents
- Overview
- How It Works
- Deployment Architectures
- Configuration Reference
- Peer Management
- Monitoring
- Troubleshooting
Overview
OmniHSS uses PostgreSQL with bidirectional logical replication for multi-site high-availability deployments. Each site operates as an independent read/write primary — there is no single master or read-only replicas.
Key Characteristics
- Both Sites Read/Write: Every site has a full primary database. Local writes are instant regardless of inter-site link quality
- Asynchronous Replication: Changes replicate via PostgreSQL's Write-Ahead Log (WAL). No write is blocked waiting for a remote site to acknowledge
- Partition Tolerant: If the inter-site link goes down, both sites continue operating independently. WAL segments queue locally and replay automatically when the link recovers
- No Quorum Required: Unlike synchronous cluster approaches, there is no concept of quorum. A single-site deployment operates identically to a multi-site deployment
- UUID Primary Keys: All records use UUIDv4 primary keys, eliminating ID collisions between sites without coordination
How This Differs from Synchronous Replication
| Synchronous | Asynchronous Logical Replication | |
|---|---|---|
| Write latency | Blocked by slowest node | Local only — sub-millisecond |
| Link failure | Minority site goes read-only | Both sites continue read/write |
| Quorum needed | Yes (majority of nodes) | No |
| Data consistency | Strong (all nodes identical) | Eventual (brief replication lag) |
| Suitable for satellite links | No (latency kills write performance) | Yes (WAL queues and catches up) |
How It Works
Logical Replication Flow
Publication / Subscription Model
Each site has:
- A publication that broadcasts all table changes (except
schema_migrations) - A subscription to every peer site's publication
WAL Queuing During Link Outage
When the inter-site link is unavailable:
- Local writes continue without interruption
- WAL segments accumulate on the publishing side
- The replication slot tracks where each subscriber left off
- When the link recovers, queued WAL replays from the last acknowledged position
- Both sites converge to identical state
WAL retention is bounded by max_slot_wal_keep_size to prevent unbounded disk growth during extended outages.
Conflict Prevention
UUID primary keys eliminate the primary source of conflicts in bidirectional replication. Since each site generates globally unique identifiers independently, INSERT operations never collide.
The origin = none subscription option prevents replication loops — changes that arrived via replication are not re-published to other subscribers.
The schema_migrations table is excluded from publications because both sites run identical migrations independently.
Last-Write-Wins for Dynamic State
Dynamic state tables (subscriber_state, pdn_session, lte_call) are updated frequently by Diameter signalling — every ULR updates the serving MME, every CCR updates the PGW session, every AAR updates the active call.
In a multi-site deployment, a subscriber is served by one site at a time. When the subscriber moves between sites, both sites may have updates to the same subscriber_state row from different times. Without conflict resolution, the replication order would determine which value survives — potentially overwriting the current serving MME with a stale value.
OmniHSS uses Last-Write-Wins (LWW) triggers on dynamic state tables:
CREATE TRIGGER lww_subscriber_state_trigger
BEFORE UPDATE ON subscriber_state
FOR EACH ROW
EXECUTE FUNCTION lww_subscriber_state();
The trigger compares the updated_at timestamp of the incoming replicated update against the existing row. If the existing row is newer, the update is silently suppressed. This ensures:
- When a subscriber moves from Site A to Site B, Site B's newer update always wins
- Stale updates from Site A (from before the subscriber moved) are discarded
- Both sites converge to the same value — the most recent write
Tables with LWW triggers:
| Table | Updated By | Frequency |
|---|---|---|
subscriber_state | ULR, SAR, AIR, PUR | Every attach/location update |
pdn_session | CCR-I, CCR-T | Every PDN session create/delete |
lte_call | AAR, STR | Every VoLTE call setup/teardown |
Static provisioning data (subscribers, profiles, keys, APNs) does not need LWW — these are only updated by the provisioning API, typically from a single management plane.
Deployment Architectures
Single Site (No Replication)
Multiple HSS application instances share a single PostgreSQL database. Standard PostgreSQL streaming replication can be used for local database failover if required.
Two-Site Deployment
Both sites are full primaries. Subscriber provisioning on either site replicates to the other. Each MME connects to its local HSS — there is no cross-site Diameter dependency.
Multi-Site Deployment
Each site subscribes to every other site's publication. Peers are auto-discovered from Ansible inventory — adding a site requires only adding it to the hss inventory group and running the playbook.
Network Requirements
| Port | Protocol | Purpose |
|---|---|---|
| 5432 | TCP | PostgreSQL client and replication connections |
Configuration Reference
Ansible Variables
Replication is configured per HSS group in your inventory:
hss:
hosts:
site-a-hss01:
ansible_host: 10.80.12.140
site-a-hss02:
ansible_host: 10.80.12.141
vars:
omnihss:
database_type: postgres
database_host: localhost
database_port: 5432
database_name: omnihss
database_username: omnihss
database_password: "secure_password"
replication:
enabled: true
Ansible Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
database_type | String | No | postgres | Database backend. Must be postgres for replication. |
database_host | String | No | localhost | PostgreSQL host. Use localhost when Postgres runs on the same host as OmniHSS. |
database_port | Integer | No | 5432 | PostgreSQL port. |
database_name | String | No | omnihss | Database name. |
database_username | String | Yes | - | PostgreSQL user. Must have CREATEDB and REPLICATION privileges. |
database_password | String | Yes | - | PostgreSQL password. |
replication.enabled | Boolean | No | false | Enable bidirectional logical replication. When true, the Ansible role configures publications, subscriptions, and pg_hba.conf entries automatically. |
Peer Discovery
Peers are discovered automatically from two sources:
- Local inventory group: All other hosts in the
hssAnsible group - Remote sites: Entries in
connected_sites.hssfromprod_master_peers.yml
No per-host peer configuration is required. Adding a host to the hss group or to connected_sites.hss and running the playbook is sufficient.
Remote Site Peers (prod_master_peers.yml)
For inter-site replication across geographic locations:
# group_vars/prod_master_peers.yml
connected_sites:
hss:
- remote-site-hss01: 10.80.20.140
- remote-site-hss02: 10.80.20.141
PostgreSQL Configuration
The Ansible role configures the following PostgreSQL parameters automatically when replication is enabled:
| Parameter | Value | Purpose |
|---|---|---|
wal_level | logical | Required for logical replication. Includes full row data in WAL. |
max_wal_senders | 10 | Maximum concurrent WAL sender processes (one per subscriber). |
max_replication_slots | 10 | Maximum replication slots. Each peer uses one slot. |
listen_addresses | * | Accept connections from peer sites. |
The role also adds pg_hba.conf entries to allow replication connections from each discovered peer IP.
Peer Management
Adding a New Site
- Add the new host(s) to the
hssinventory group, or add toconnected_sites.hssinprod_master_peers.yml - Run the OmniHSS playbook
The role will:
- Install PostgreSQL and OmniHSS on the new host
- Run database migrations
- Create a publication for the new site
- Create subscriptions to all existing peers
- Update
pg_hba.confon all existing peers to allow the new site - Create subscriptions from existing peers to the new site
Removing a Site
ansible-playbook -i hosts site.yml --tags hss-remove-peer \
-e "peer_name=sydney confirm_remove=yes"
The confirm_remove=yes parameter is required as a safety measure. Without it, the playbook refuses to proceed and displays what would happen.
Removing a peer:
- Drops the subscription (stops incoming replication from that peer)
- Removes the peer's IP from
pg_hba.conf - Data already replicated from the peer remains in the local database
- The peer site's subscription to this site must be removed separately on the peer
Adding a Peer Ad-Hoc
For cases where you need to add a peer outside the standard playbook flow:
ansible-playbook -i hosts site.yml --tags hss-add-peer \
-e "peer_name=sydney peer_host=10.80.20.140"
Monitoring
Prometheus Metrics
The postgres_exporter service runs on each HSS host on port 9187, exposing PostgreSQL metrics to Prometheus.
Replication Metrics
| Metric | Type | Description |
|---|---|---|
pg_replication_slots_active | Gauge | 1 if the replication slot is active (peer connected), 0 if inactive |
pg_replication_slots_pg_wal_lsn_diff | Gauge | Replication lag in bytes. Distance between current WAL position and the peer's last confirmed position |
pg_stat_replication_pg_current_wal_lsn_bytes | Gauge | Current WAL position on this node |
Example Prometheus Queries
# Replication slot active (1 = connected, 0 = disconnected)
pg_replication_slots_active{slot_name=~"sub_from_.*"}
# Replication lag in megabytes
pg_replication_slots_pg_wal_lsn_diff / 1048576
# Alert: replication slot inactive for more than 60 seconds
pg_replication_slots_active == 0
Recommended Alerts
| Alert | Condition | Severity | Description |
|---|---|---|---|
| Replication Slot Inactive | pg_replication_slots_active == 0 for 60s | Critical | Peer is disconnected. WAL is accumulating. |
| WAL Lag Excessive | pg_replication_slots_pg_wal_lsn_diff > 104857600 for 5m | Warning | Replication lag exceeds 100MB. Peer may be slow or link degraded. |
| WAL Slot Lost | WAL status is lost | Critical | Slot has been invalidated. Peer requires a full re-seed. |
Ansible Validation
The OmniHSS playbook validates replication health at the end of every run. It asserts:
- All subscription workers are alive
- All replication slots are active
- WAL lag is within threshold (100MB default)
- WAL slot status is
reservedorextended(notlost)
If any assertion fails, the playbook fails with a clear error message identifying the issue.
Manual Validation
# Check subscription workers are alive
sudo -u postgres psql -d omnihss -c \
"SELECT subname, pid IS NOT NULL as worker_alive FROM pg_stat_subscription;"
# Check replication slots and lag
sudo -u postgres psql -d omnihss -c \
"SELECT slot_name, active,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) as lag_bytes,
wal_status
FROM pg_replication_slots WHERE slot_type = 'logical';"
# Compare subscriber counts between sites
sudo -u postgres psql -d omnihss -c "SELECT count(*) FROM subscriber;"
Troubleshooting
Subscription Worker Dead
Symptoms: pg_stat_subscription shows pid IS NULL for a subscription. Ansible validation fails with "Dead workers."
Common Causes:
- Table exists on publisher but not subscriber (DDL mismatch)
- Duplicate key conflict from a manually inserted record
- Peer unreachable for extended period causing WAL slot invalidation
Resolution:
- Check PostgreSQL logs for the specific error:
journalctl -u postgresql | grep "logical replication" | tail -20 - If DDL mismatch: apply the missing migration on this site, then the worker will auto-restart
- If duplicate key: resolve the conflict manually, then re-enable the subscription:
ALTER SUBSCRIPTION sub_from_<peer> ENABLE;
Replication Slot Inactive
Symptoms: pg_replication_slots shows active = false. Prometheus alert fires.
Common Causes:
- Network connectivity issue between sites
- PostgreSQL on the peer site is stopped
- Firewall blocking port 5432 from the peer's IP
Resolution:
- Verify network connectivity to the peer:
pg_isready -h <peer_ip> -U <db_user> -d omnihss - Check the peer site's PostgreSQL is running
- Verify firewall rules allow port 5432 from this site's IP
The subscription worker will automatically reconnect when the peer becomes reachable.
WAL Slot Lost
Symptoms: pg_replication_slots shows wal_status = 'lost'. Ansible validation fails with "WAL SLOT CRITICAL."
Cause: The peer was disconnected for so long that the retained WAL exceeded max_slot_wal_keep_size and was cleaned up.
Resolution: The peer requires a full re-seed:
- Drop the broken subscription on this site:
DROP SUBSCRIPTION sub_from_<peer>; - Drop the broken slot on the peer site:
SELECT pg_drop_replication_slot('sub_from_<this_site>'); - Re-run the OmniHSS playbook — it will recreate the subscription with
copy_data = trueto perform an initial sync.
Excessive WAL Lag
Symptoms: pg_replication_slots_pg_wal_lsn_diff is large and growing. Prometheus alert fires.
Common Causes:
- High-latency or congested inter-site link
- Peer site under heavy load (apply worker can't keep up)
- Large bulk operation (e.g. mass provisioning) generating significant WAL volume
Resolution:
- Check if the lag is growing or stable:
SELECT slot_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) as lag_bytes
FROM pg_replication_slots WHERE slot_type = 'logical'; - If stable: the peer is applying changes but behind. It will catch up.
- If growing: investigate the network link or peer site load.
- For bulk operations: lag during the operation is expected and will resolve after the operation completes.
Schema Migration Coordination
Important: Database schema changes (migrations) do not replicate via logical replication. When deploying a new OmniHSS version that includes schema changes:
- Apply the migration on all sites before deploying application code that depends on new columns/tables
- The Ansible playbook handles this automatically — it runs migrations on every host
- If deploying manually, run migrations on all sites first:
/opt/omnihss/bin/hss rpc "Hss.Command.Database.migrate()"