Database Replication

Overview
How It Works
Deployment Architectures
Configuration Reference
Peer Management
Monitoring
Troubleshooting

Overview

OmniHSS uses PostgreSQL with bidirectional logical replication for multi-site high-availability deployments. Each site operates as an independent read/write primary — there is no single master or read-only replicas.

Key Characteristics

Both Sites Read/Write: Every site has a full primary database. Local writes are instant regardless of inter-site link quality
Asynchronous Replication: Changes replicate via PostgreSQL's Write-Ahead Log (WAL). No write is blocked waiting for a remote site to acknowledge
Partition Tolerant: If the inter-site link goes down, both sites continue operating independently. WAL segments queue locally and replay automatically when the link recovers
No Quorum Required: Unlike synchronous cluster approaches, there is no concept of quorum. A single-site deployment operates identically to a multi-site deployment
UUID Primary Keys: All records use UUIDv4 primary keys, eliminating ID collisions between sites without coordination

How This Differs from Synchronous Replication

	Synchronous	Asynchronous Logical Replication
Write latency	Blocked by slowest node	Local only — sub-millisecond
Link failure	Minority site goes read-only	Both sites continue read/write
Quorum needed	Yes (majority of nodes)	No
Data consistency	Strong (all nodes identical)	Eventual (brief replication lag)
Suitable for satellite links	No (latency kills write performance)	Yes (WAL queues and catches up)

How It Works

Logical Replication Flow

Publication / Subscription Model

Each site has:

A publication that broadcasts all table changes (except schema_migrations)
A subscription to every peer site's publication

WAL Queuing During Link Outage

When the inter-site link is unavailable:

Local writes continue without interruption
WAL segments accumulate on the publishing side
The replication slot tracks where each subscriber left off
When the link recovers, queued WAL replays from the last acknowledged position
Both sites converge to identical state

WAL retention is bounded by max_slot_wal_keep_size to prevent unbounded disk growth during extended outages.

Conflict Prevention

UUID primary keys eliminate the primary source of conflicts in bidirectional replication. Since each site generates globally unique identifiers independently, INSERT operations never collide.

The origin = none subscription option prevents replication loops — changes that arrived via replication are not re-published to other subscribers.

The schema_migrations table is excluded from publications because both sites run identical migrations independently.

Last-Write-Wins for Dynamic State

Dynamic state tables (subscriber_state, pdn_session, lte_call) are updated frequently by Diameter signalling — every ULR updates the serving MME, every CCR updates the PGW session, every AAR updates the active call.

In a multi-site deployment, a subscriber is served by one site at a time. When the subscriber moves between sites, both sites may have updates to the same subscriber_state row from different times. Without conflict resolution, the replication order would determine which value survives — potentially overwriting the current serving MME with a stale value.

OmniHSS uses Last-Write-Wins (LWW) triggers on dynamic state tables:

CREATE TRIGGER lww_subscriber_state_trigger
  BEFORE UPDATE ON subscriber_state
  FOR EACH ROW
  EXECUTE FUNCTION lww_subscriber_state();

The trigger compares the updated_at timestamp of the incoming replicated update against the existing row. If the existing row is newer, the update is silently suppressed. This ensures:

When a subscriber moves from Site A to Site B, Site B's newer update always wins
Stale updates from Site A (from before the subscriber moved) are discarded
Both sites converge to the same value — the most recent write

Tables with LWW triggers:

Table	Updated By	Frequency
`subscriber_state`	ULR, SAR, AIR, PUR	Every attach/location update
`pdn_session`	CCR-I, CCR-T	Every PDN session create/delete
`lte_call`	AAR, STR	Every VoLTE call setup/teardown

Static provisioning data (subscribers, profiles, keys, APNs) does not need LWW — these are only updated by the provisioning API, typically from a single management plane.

Deployment Architectures

Single Site (No Replication)

Multiple HSS application instances share a single PostgreSQL database. Standard PostgreSQL streaming replication can be used for local database failover if required.

Two-Site Deployment

Both sites are full primaries. Subscriber provisioning on either site replicates to the other. Each MME connects to its local HSS — there is no cross-site Diameter dependency.

Multi-Site Deployment

Each site subscribes to every other site's publication. Peers are auto-discovered from Ansible inventory — adding a site requires only adding it to the hss inventory group and running the playbook.

Network Requirements

Port	Protocol	Purpose
5432	TCP	PostgreSQL client and replication connections

Configuration Reference

Ansible Variables

Replication is configured per HSS group in your inventory:

hss:
  hosts:
    site-a-hss01:
      ansible_host: 10.80.12.140
    site-a-hss02:
      ansible_host: 10.80.12.141
  vars:
    omnihss:
      database_type: postgres
      database_host: localhost
      database_port: 5432
      database_name: omnihss
      database_username: omnihss
      database_password: "secure_password"
      replication:
        enabled: true

Ansible Parameters

Parameter	Type	Required	Default	Description
`database_type`	String	No	`postgres`	Database backend. Must be `postgres` for replication.
`database_host`	String	No	`localhost`	PostgreSQL host. Use `localhost` when Postgres runs on the same host as OmniHSS.
`database_port`	Integer	No	`5432`	PostgreSQL port.
`database_name`	String	No	`omnihss`	Database name.
`database_username`	String	Yes	-	PostgreSQL user. Must have `CREATEDB` and `REPLICATION` privileges.
`database_password`	String	Yes	-	PostgreSQL password.
`replication.enabled`	Boolean	No	`false`	Enable bidirectional logical replication. When `true`, the Ansible role configures publications, subscriptions, and `pg_hba.conf` entries automatically.

Peer Discovery

Peers are discovered automatically from two sources:

Local inventory group: All other hosts in the hss Ansible group
Remote sites: Entries in connected_sites.hss from prod_master_peers.yml

No per-host peer configuration is required. Adding a host to the hss group or to connected_sites.hss and running the playbook is sufficient.

Remote Site Peers (prod_master_peers.yml)

For inter-site replication across geographic locations:

# group_vars/prod_master_peers.yml
connected_sites:
  hss:
    - remote-site-hss01: 10.80.20.140
    - remote-site-hss02: 10.80.20.141

PostgreSQL Configuration

The Ansible role configures the following PostgreSQL parameters automatically when replication is enabled:

Parameter	Value	Purpose
`wal_level`	`logical`	Required for logical replication. Includes full row data in WAL.
`max_wal_senders`	`10`	Maximum concurrent WAL sender processes (one per subscriber).
`max_replication_slots`	`10`	Maximum replication slots. Each peer uses one slot.
`listen_addresses`	`*`	Accept connections from peer sites.

The role also adds pg_hba.conf entries to allow replication connections from each discovered peer IP.

Peer Management

Adding a New Site

Add the new host(s) to the hss inventory group, or add to connected_sites.hss in prod_master_peers.yml
Run the OmniHSS playbook

The role will:

Install PostgreSQL and OmniHSS on the new host
Run database migrations
Create a publication for the new site
Create subscriptions to all existing peers
Update pg_hba.conf on all existing peers to allow the new site
Create subscriptions from existing peers to the new site

Removing a Site

ansible-playbook -i hosts site.yml --tags hss-remove-peer \
  -e "peer_name=sydney confirm_remove=yes"

The confirm_remove=yes parameter is required as a safety measure. Without it, the playbook refuses to proceed and displays what would happen.

Removing a peer:

Drops the subscription (stops incoming replication from that peer)
Removes the peer's IP from pg_hba.conf
Data already replicated from the peer remains in the local database
The peer site's subscription to this site must be removed separately on the peer

Adding a Peer Ad-Hoc

For cases where you need to add a peer outside the standard playbook flow:

ansible-playbook -i hosts site.yml --tags hss-add-peer \
  -e "peer_name=sydney peer_host=10.80.20.140"

Monitoring

Prometheus Metrics

The postgres_exporter service runs on each HSS host on port 9187, exposing PostgreSQL metrics to Prometheus.

Replication Metrics

Metric	Type	Description
`pg_replication_slots_active`	Gauge	`1` if the replication slot is active (peer connected), `0` if inactive
`pg_replication_slots_pg_wal_lsn_diff`	Gauge	Replication lag in bytes. Distance between current WAL position and the peer's last confirmed position
`pg_stat_replication_pg_current_wal_lsn_bytes`	Gauge	Current WAL position on this node

Example Prometheus Queries

# Replication slot active (1 = connected, 0 = disconnected)
pg_replication_slots_active{slot_name=~"sub_from_.*"}

# Replication lag in megabytes
pg_replication_slots_pg_wal_lsn_diff / 1048576

# Alert: replication slot inactive for more than 60 seconds
pg_replication_slots_active == 0

Recommended Alerts

Alert	Condition	Severity	Description
Replication Slot Inactive	`pg_replication_slots_active == 0` for 60s	Critical	Peer is disconnected. WAL is accumulating.
WAL Lag Excessive	`pg_replication_slots_pg_wal_lsn_diff > 104857600` for 5m	Warning	Replication lag exceeds 100MB. Peer may be slow or link degraded.
WAL Slot Lost	WAL status is `lost`	Critical	Slot has been invalidated. Peer requires a full re-seed.

Ansible Validation

The OmniHSS playbook validates replication health at the end of every run. It asserts:

All subscription workers are alive
All replication slots are active
WAL lag is within threshold (100MB default)
WAL slot status is reserved or extended (not lost)

If any assertion fails, the playbook fails with a clear error message identifying the issue.

Manual Validation

# Check subscription workers are alive
sudo -u postgres psql -d omnihss -c \
  "SELECT subname, pid IS NOT NULL as worker_alive FROM pg_stat_subscription;"

# Check replication slots and lag
sudo -u postgres psql -d omnihss -c \
  "SELECT slot_name, active,
    pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) as lag_bytes,
    wal_status
  FROM pg_replication_slots WHERE slot_type = 'logical';"

# Compare subscriber counts between sites
sudo -u postgres psql -d omnihss -c "SELECT count(*) FROM subscriber;"

Troubleshooting

Subscription Worker Dead

Symptoms: pg_stat_subscription shows pid IS NULL for a subscription. Ansible validation fails with "Dead workers."

Common Causes:

Table exists on publisher but not subscriber (DDL mismatch)
Duplicate key conflict from a manually inserted record
Peer unreachable for extended period causing WAL slot invalidation

Resolution:

Check PostgreSQL logs for the specific error:

journalctl -u postgresql | grep "logical replication" | tail -20

If DDL mismatch: apply the missing migration on this site, then the worker will auto-restart
If duplicate key: resolve the conflict manually, then re-enable the subscription:
```
ALTER SUBSCRIPTION sub_from_<peer> ENABLE;
```

Replication Slot Inactive

Symptoms: pg_replication_slots shows active = false. Prometheus alert fires.

Common Causes:

Network connectivity issue between sites
PostgreSQL on the peer site is stopped
Firewall blocking port 5432 from the peer's IP

Resolution:

Verify network connectivity to the peer:

pg_isready -h <peer_ip> -U <db_user> -d omnihss

Check the peer site's PostgreSQL is running
Verify firewall rules allow port 5432 from this site's IP

The subscription worker will automatically reconnect when the peer becomes reachable.

WAL Slot Lost

Symptoms: pg_replication_slots shows wal_status = 'lost'. Ansible validation fails with "WAL SLOT CRITICAL."

Cause: The peer was disconnected for so long that the retained WAL exceeded max_slot_wal_keep_size and was cleaned up.

Resolution: The peer requires a full re-seed:

Drop the broken subscription on this site:
```
DROP SUBSCRIPTION sub_from_<peer>;
```

Drop the broken slot on the peer site:

SELECT pg_drop_replication_slot('sub_from_<this_site>');

Re-run the OmniHSS playbook — it will recreate the subscription with copy_data = true to perform an initial sync.

Excessive WAL Lag

Symptoms: pg_replication_slots_pg_wal_lsn_diff is large and growing. Prometheus alert fires.

Common Causes:

High-latency or congested inter-site link
Peer site under heavy load (apply worker can't keep up)
Large bulk operation (e.g. mass provisioning) generating significant WAL volume

Resolution:

Check if the lag is growing or stable:

SELECT slot_name,
  pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) as lag_bytes
FROM pg_replication_slots WHERE slot_type = 'logical';

If stable: the peer is applying changes but behind. It will catch up.
If growing: investigate the network link or peer site load.
For bulk operations: lag during the operation is expected and will resolve after the operation completes.

Schema Migration Coordination

Important: Database schema changes (migrations) do not replicate via logical replication. When deploying a new OmniHSS version that includes schema changes:

Apply the migration on all sites before deploying application code that depends on new columns/tables
The Ansible playbook handles this automatically — it runs migrations on every host

If deploying manually, run migrations on all sites first:

/opt/omnihss/bin/hss rpc "Hss.Command.Database.migrate()"

Table of Contents​

Overview​

Key Characteristics​

How This Differs from Synchronous Replication​

How It Works​

Logical Replication Flow​

Publication / Subscription Model​

WAL Queuing During Link Outage​

Conflict Prevention​

Last-Write-Wins for Dynamic State​

Deployment Architectures​

Single Site (No Replication)​

Two-Site Deployment​

Multi-Site Deployment​

Network Requirements​

Configuration Reference​

Ansible Variables​

Ansible Parameters​

Peer Discovery​

Remote Site Peers (prod_master_peers.yml)​

PostgreSQL Configuration​

Peer Management​

Adding a New Site​

Removing a Site​

Adding a Peer Ad-Hoc​

Monitoring​

Prometheus Metrics​

Replication Metrics​

Example Prometheus Queries​

Recommended Alerts​

Ansible Validation​

Manual Validation​

Troubleshooting​

Subscription Worker Dead​

Replication Slot Inactive​

WAL Slot Lost​

Excessive WAL Lag​

Schema Migration Coordination​

Table of Contents

Overview

Key Characteristics

How This Differs from Synchronous Replication

How It Works

Logical Replication Flow

Publication / Subscription Model

WAL Queuing During Link Outage

Conflict Prevention

Last-Write-Wins for Dynamic State

Deployment Architectures

Single Site (No Replication)

Two-Site Deployment

Multi-Site Deployment

Network Requirements

Configuration Reference

Ansible Variables

Ansible Parameters

Peer Discovery

Remote Site Peers (prod_master_peers.yml)

PostgreSQL Configuration

Peer Management

Adding a New Site

Removing a Site

Adding a Peer Ad-Hoc

Monitoring

Prometheus Metrics

Replication Metrics

Example Prometheus Queries

Recommended Alerts

Ansible Validation

Manual Validation

Troubleshooting

Subscription Worker Dead

Replication Slot Inactive

WAL Slot Lost

Excessive WAL Lag

Schema Migration Coordination