Skip to main content

Database Replication

← Back to Operations Guide


Table of Contents


Overview

OmniHSS uses PostgreSQL with bidirectional logical replication for multi-site high-availability deployments. Each site operates as an independent read/write primary — there is no single master or read-only replicas.

Key Characteristics

  • Both Sites Read/Write: Every site has a full primary database. Local writes are instant regardless of inter-site link quality
  • Asynchronous Replication: Changes replicate via PostgreSQL's Write-Ahead Log (WAL). No write is blocked waiting for a remote site to acknowledge
  • Partition Tolerant: If the inter-site link goes down, both sites continue operating independently. WAL segments queue locally and replay automatically when the link recovers
  • No Quorum Required: Unlike synchronous cluster approaches, there is no concept of quorum. A single-site deployment operates identically to a multi-site deployment
  • UUID Primary Keys: All records use UUIDv4 primary keys, eliminating ID collisions between sites without coordination

How This Differs from Synchronous Replication

SynchronousAsynchronous Logical Replication
Write latencyBlocked by slowest nodeLocal only — sub-millisecond
Link failureMinority site goes read-onlyBoth sites continue read/write
Quorum neededYes (majority of nodes)No
Data consistencyStrong (all nodes identical)Eventual (brief replication lag)
Suitable for satellite linksNo (latency kills write performance)Yes (WAL queues and catches up)

How It Works

Logical Replication Flow

Publication / Subscription Model

Each site has:

  • A publication that broadcasts all table changes (except schema_migrations)
  • A subscription to every peer site's publication

When the inter-site link is unavailable:

  1. Local writes continue without interruption
  2. WAL segments accumulate on the publishing side
  3. The replication slot tracks where each subscriber left off
  4. When the link recovers, queued WAL replays from the last acknowledged position
  5. Both sites converge to identical state

WAL retention is bounded by max_slot_wal_keep_size to prevent unbounded disk growth during extended outages.

Conflict Prevention

UUID primary keys eliminate the primary source of conflicts in bidirectional replication. Since each site generates globally unique identifiers independently, INSERT operations never collide.

The origin = none subscription option prevents replication loops — changes that arrived via replication are not re-published to other subscribers.

The schema_migrations table is excluded from publications because both sites run identical migrations independently.

Last-Write-Wins for Dynamic State

Dynamic state tables (subscriber_state, pdn_session, lte_call) are updated frequently by Diameter signalling — every ULR updates the serving MME, every CCR updates the PGW session, every AAR updates the active call.

In a multi-site deployment, a subscriber is served by one site at a time. When the subscriber moves between sites, both sites may have updates to the same subscriber_state row from different times. Without conflict resolution, the replication order would determine which value survives — potentially overwriting the current serving MME with a stale value.

OmniHSS uses Last-Write-Wins (LWW) triggers on dynamic state tables:

CREATE TRIGGER lww_subscriber_state_trigger
BEFORE UPDATE ON subscriber_state
FOR EACH ROW
EXECUTE FUNCTION lww_subscriber_state();

The trigger compares the updated_at timestamp of the incoming replicated update against the existing row. If the existing row is newer, the update is silently suppressed. This ensures:

  • When a subscriber moves from Site A to Site B, Site B's newer update always wins
  • Stale updates from Site A (from before the subscriber moved) are discarded
  • Both sites converge to the same value — the most recent write

Tables with LWW triggers:

TableUpdated ByFrequency
subscriber_stateULR, SAR, AIR, PUREvery attach/location update
pdn_sessionCCR-I, CCR-TEvery PDN session create/delete
lte_callAAR, STREvery VoLTE call setup/teardown

Static provisioning data (subscribers, profiles, keys, APNs) does not need LWW — these are only updated by the provisioning API, typically from a single management plane.


Deployment Architectures

Single Site (No Replication)

Multiple HSS application instances share a single PostgreSQL database. Standard PostgreSQL streaming replication can be used for local database failover if required.

Two-Site Deployment

Both sites are full primaries. Subscriber provisioning on either site replicates to the other. Each MME connects to its local HSS — there is no cross-site Diameter dependency.

Multi-Site Deployment

Each site subscribes to every other site's publication. Peers are auto-discovered from Ansible inventory — adding a site requires only adding it to the hss inventory group and running the playbook.

Network Requirements

PortProtocolPurpose
5432TCPPostgreSQL client and replication connections

Configuration Reference

Ansible Variables

Replication is configured per HSS group in your inventory:

hss:
hosts:
site-a-hss01:
ansible_host: 10.80.12.140
site-a-hss02:
ansible_host: 10.80.12.141
vars:
omnihss:
database_type: postgres
database_host: localhost
database_port: 5432
database_name: omnihss
database_username: omnihss
database_password: "secure_password"
replication:
enabled: true

Ansible Parameters

ParameterTypeRequiredDefaultDescription
database_typeStringNopostgresDatabase backend. Must be postgres for replication.
database_hostStringNolocalhostPostgreSQL host. Use localhost when Postgres runs on the same host as OmniHSS.
database_portIntegerNo5432PostgreSQL port.
database_nameStringNoomnihssDatabase name.
database_usernameStringYes-PostgreSQL user. Must have CREATEDB and REPLICATION privileges.
database_passwordStringYes-PostgreSQL password.
replication.enabledBooleanNofalseEnable bidirectional logical replication. When true, the Ansible role configures publications, subscriptions, and pg_hba.conf entries automatically.

Peer Discovery

Peers are discovered automatically from two sources:

  1. Local inventory group: All other hosts in the hss Ansible group
  2. Remote sites: Entries in connected_sites.hss from prod_master_peers.yml

No per-host peer configuration is required. Adding a host to the hss group or to connected_sites.hss and running the playbook is sufficient.

Remote Site Peers (prod_master_peers.yml)

For inter-site replication across geographic locations:

# group_vars/prod_master_peers.yml
connected_sites:
hss:
- remote-site-hss01: 10.80.20.140
- remote-site-hss02: 10.80.20.141

PostgreSQL Configuration

The Ansible role configures the following PostgreSQL parameters automatically when replication is enabled:

ParameterValuePurpose
wal_levellogicalRequired for logical replication. Includes full row data in WAL.
max_wal_senders10Maximum concurrent WAL sender processes (one per subscriber).
max_replication_slots10Maximum replication slots. Each peer uses one slot.
listen_addresses*Accept connections from peer sites.

The role also adds pg_hba.conf entries to allow replication connections from each discovered peer IP.


Peer Management

Adding a New Site

  1. Add the new host(s) to the hss inventory group, or add to connected_sites.hss in prod_master_peers.yml
  2. Run the OmniHSS playbook

The role will:

  • Install PostgreSQL and OmniHSS on the new host
  • Run database migrations
  • Create a publication for the new site
  • Create subscriptions to all existing peers
  • Update pg_hba.conf on all existing peers to allow the new site
  • Create subscriptions from existing peers to the new site

Removing a Site

ansible-playbook -i hosts site.yml --tags hss-remove-peer \
-e "peer_name=sydney confirm_remove=yes"

The confirm_remove=yes parameter is required as a safety measure. Without it, the playbook refuses to proceed and displays what would happen.

Removing a peer:

  • Drops the subscription (stops incoming replication from that peer)
  • Removes the peer's IP from pg_hba.conf
  • Data already replicated from the peer remains in the local database
  • The peer site's subscription to this site must be removed separately on the peer

Adding a Peer Ad-Hoc

For cases where you need to add a peer outside the standard playbook flow:

ansible-playbook -i hosts site.yml --tags hss-add-peer \
-e "peer_name=sydney peer_host=10.80.20.140"

Monitoring

Prometheus Metrics

The postgres_exporter service runs on each HSS host on port 9187, exposing PostgreSQL metrics to Prometheus.

Replication Metrics

MetricTypeDescription
pg_replication_slots_activeGauge1 if the replication slot is active (peer connected), 0 if inactive
pg_replication_slots_pg_wal_lsn_diffGaugeReplication lag in bytes. Distance between current WAL position and the peer's last confirmed position
pg_stat_replication_pg_current_wal_lsn_bytesGaugeCurrent WAL position on this node

Example Prometheus Queries

# Replication slot active (1 = connected, 0 = disconnected)
pg_replication_slots_active{slot_name=~"sub_from_.*"}

# Replication lag in megabytes
pg_replication_slots_pg_wal_lsn_diff / 1048576

# Alert: replication slot inactive for more than 60 seconds
pg_replication_slots_active == 0
AlertConditionSeverityDescription
Replication Slot Inactivepg_replication_slots_active == 0 for 60sCriticalPeer is disconnected. WAL is accumulating.
WAL Lag Excessivepg_replication_slots_pg_wal_lsn_diff > 104857600 for 5mWarningReplication lag exceeds 100MB. Peer may be slow or link degraded.
WAL Slot LostWAL status is lostCriticalSlot has been invalidated. Peer requires a full re-seed.

Ansible Validation

The OmniHSS playbook validates replication health at the end of every run. It asserts:

  1. All subscription workers are alive
  2. All replication slots are active
  3. WAL lag is within threshold (100MB default)
  4. WAL slot status is reserved or extended (not lost)

If any assertion fails, the playbook fails with a clear error message identifying the issue.

Manual Validation

# Check subscription workers are alive
sudo -u postgres psql -d omnihss -c \
"SELECT subname, pid IS NOT NULL as worker_alive FROM pg_stat_subscription;"

# Check replication slots and lag
sudo -u postgres psql -d omnihss -c \
"SELECT slot_name, active,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) as lag_bytes,
wal_status
FROM pg_replication_slots WHERE slot_type = 'logical';"

# Compare subscriber counts between sites
sudo -u postgres psql -d omnihss -c "SELECT count(*) FROM subscriber;"

Troubleshooting

Subscription Worker Dead

Symptoms: pg_stat_subscription shows pid IS NULL for a subscription. Ansible validation fails with "Dead workers."

Common Causes:

  • Table exists on publisher but not subscriber (DDL mismatch)
  • Duplicate key conflict from a manually inserted record
  • Peer unreachable for extended period causing WAL slot invalidation

Resolution:

  1. Check PostgreSQL logs for the specific error:
    journalctl -u postgresql | grep "logical replication" | tail -20
  2. If DDL mismatch: apply the missing migration on this site, then the worker will auto-restart
  3. If duplicate key: resolve the conflict manually, then re-enable the subscription:
    ALTER SUBSCRIPTION sub_from_<peer> ENABLE;

Replication Slot Inactive

Symptoms: pg_replication_slots shows active = false. Prometheus alert fires.

Common Causes:

  • Network connectivity issue between sites
  • PostgreSQL on the peer site is stopped
  • Firewall blocking port 5432 from the peer's IP

Resolution:

  1. Verify network connectivity to the peer:
    pg_isready -h <peer_ip> -U <db_user> -d omnihss
  2. Check the peer site's PostgreSQL is running
  3. Verify firewall rules allow port 5432 from this site's IP

The subscription worker will automatically reconnect when the peer becomes reachable.

WAL Slot Lost

Symptoms: pg_replication_slots shows wal_status = 'lost'. Ansible validation fails with "WAL SLOT CRITICAL."

Cause: The peer was disconnected for so long that the retained WAL exceeded max_slot_wal_keep_size and was cleaned up.

Resolution: The peer requires a full re-seed:

  1. Drop the broken subscription on this site:
    DROP SUBSCRIPTION sub_from_<peer>;
  2. Drop the broken slot on the peer site:
    SELECT pg_drop_replication_slot('sub_from_<this_site>');
  3. Re-run the OmniHSS playbook — it will recreate the subscription with copy_data = true to perform an initial sync.

Excessive WAL Lag

Symptoms: pg_replication_slots_pg_wal_lsn_diff is large and growing. Prometheus alert fires.

Common Causes:

  • High-latency or congested inter-site link
  • Peer site under heavy load (apply worker can't keep up)
  • Large bulk operation (e.g. mass provisioning) generating significant WAL volume

Resolution:

  1. Check if the lag is growing or stable:
    SELECT slot_name,
    pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) as lag_bytes
    FROM pg_replication_slots WHERE slot_type = 'logical';
  2. If stable: the peer is applying changes but behind. It will catch up.
  3. If growing: investigate the network link or peer site load.
  4. For bulk operations: lag during the operation is expected and will resolve after the operation completes.

Schema Migration Coordination

Important: Database schema changes (migrations) do not replicate via logical replication. When deploying a new OmniHSS version that includes schema changes:

  1. Apply the migration on all sites before deploying application code that depends on new columns/tables
  2. The Ansible playbook handles this automatically — it runs migrations on every host
  3. If deploying manually, run migrations on all sites first:
    /opt/omnihss/bin/hss rpc "Hss.Command.Database.migrate()"