PostgreSQL HA Cluster | Server Configuration
Client Apps : psql | pg_ctl | initdb | pg_basebackup | …
Configuration (fnames/locations) : pg_hba.conf
- Log-Shipping Standby Servers
- Streaming-Replication Protocol
- Streaming Replication
- Standby Servers
- Standby Server Operation
- Replication params
- Replication Functions
SELECT pg_drop_replication_slot('foo');
- Backup/Restore
- Continuous Archiving
- Continuous Archiving @ Standby
- Archive Recovery
recovery.signal file, archive_command, restore_command
recovery.conf : DEPRECATED (v12)
- Continuous Archiving
- System Administration Functions
- Server Configuration
- Authentication Methods
TL;DR
Streaming Replication : Symmetric Servers
Streaming Replication with Replication Slots implemented on a symmetrical pair of containerized PostgreSQL servers running in a Docker swarm stack. This is a native PostgreSQL (v12/13) implementation, sans external dependencies. The servers are stateless. State is maintained in the persistent data store mounted thereto.
The functional difference between the two servers is that one is in primary (read/write) mode and the other is in hot standby (read-only). The primary is continuously archiving, and the standby is continuously recovering; both per WAL (file) shipping. Separately, point-in-time recovery (PITR) and cluster backup (base backup) are available ad-hoc, imperatively, while the servers are online. All such functionality is per canonical PostgreSQL implementation.
The symmetrical arrangement is robust and simple to configure. The distinctions between the two servers amount to a zero-byte signal file (standby.signal) existing exclusively in the $PGDATA directory of whichever is in standby mode; that and the requisite anti-symmetrical settings of the replication-connection parameters. Beyond that, they are identical. Each server has a unique host name, and each its own data and archive stores. Extending the scheme to multiple standby servers requires only cloning the existing standby and the processes that configure it.
Note that primary/standby are operational declarations, whereas each server/service is configured per hardware declarations and its relevant identity and connection parameters. The point is that the former are swappable, while the latter must remain immutable; the (operational) role of a server/service is what toggles, not the server-service-hardware associations. All relevant code must abide by the distinction. In a containerized deployment, each PostgreSQL server is a named service, constrained (tethered) to its (configured) hardware regardless of its container(s) popping in and out of existence to provide the services.
Bootstrapping and ad hoc backup (tarball) processes both utilize the same PostgreSQL utility (pg_basebackup); both operate on the $PGDATA directory. After the servers' initialization and bootstrap, failover(s) are performed by merely adding the standby.signal file to the former (demoted) primary, and deleting same from the former (promoted) standby. PITR is performed similarly. Though the servers can swap modes on demand at any time, the scheme is primarily for automated failover. Total transition time is set by application latency; there should never be two primary servers, so the demote/promote duration should be sufficient to assure this. Unless the PostgreSQL servers are spread across the globe, this is typically tens of milliseconds. That's the duration over which clients would lose write access on failover (or any other change of primary).
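A minimal sketch of that role swap in shell (the $PGDATA path and the direct pg_ctl invocations are assumptions; an actual deployment would drive this from the swarm orchestration):

# Hypothetical failover sketch; run each step on the indicated host.

# 1. @ current primary : stop it first, to guarantee a single writer.
pg_ctl stop -D $PGDATA -m fast
touch $PGDATA/standby.signal          # on next start, this server boots as a standby

# 2. @ current standby : promote it (removes standby.signal, exits recovery).
pg_ctl promote -D $PGDATA

# 3. @ demoted server : restart; it now streams from the new primary
#    per the primary_conninfo already present in postgresql(.auto).conf.
pg_ctl start -D $PGDATA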
Routing requests to the appropriate server is an external responsibility. According to PostgreSQL documentation, best performance is achieved when the (hot) standby services all read requests, lightening the load on the primary as it services all write requests. The idea is the anti-symmetrical nature of the two workloads: reads are typically far greater in number yet lower in computational intensity each, while writes are the reverse on both counts.
Features / Functions / Modes / Topologies
- Write-Ahead Log Shipping; Streaming Replication (SR) with Replication Slots (RS) assures replica integrity, at the cost of unbounded WAL retention (storage) if a standby falls behind. (See the replication-slot sketch after this list.)
- Shared-Disk Failover; the simplest HA topology; implement with a lone server (sans SR) as a Docker service; under Docker swarm other topologies are available, scaling out from this one server/storage; multiple nodes/storage schemes.
- Continuous Archiving (CA) works with SR/RS, but for the ephemeral-IP issue. UPDATE: Accepts Docker hostnames.
- Two symmetrical servers (primary/standby)
- SR/RS + CA
- Servers each have their own set of data and archive volumes
- Declarative promote/demote.
- Bootstrap standby off of primary.
The primary does the archiving, so it too must be restarted on new config, yet ephemeral IPs then require a standby restart to effect the new primary_conninfo. UPDATE: Using Docker hostnames (see stack YAML) and a declarative CIDR subnet, connection config settings are created/loaded only once per cluster; unchanged per (ephemeral) container/IP.
docker network create --driver=overlay --attachable \
    --subnet=${PGHA_CIDR} \
    --gateway=10.0.200.1 \
    ${PGHA_NET}
- Streaming Replication is its own orthogonal world. Each server has its own volume. The archive volume is shared. Restore/backup via the bootstrap method only, else the recovery states (primary/standby) fall out of sync and prevent streaming thereafter.
pg_basebackup is the tool for both bootstrapping the standby and for archiving; merely different option settings. Used for the former, it generates the config/connectivity settings, writes them to postgresql.auto.conf, and creates the standby.signal file @ $PGDATA that triggers streaming.
- Bootstrapping is cloning the $PGDATA dir of the primary to that of the standby; requires the source (primary) server running and the target (standby-to-be) server not running. See the bootstrap bash functions @ make.do.sh.
- The archive thereunder/thereafter is a clone of the $PGDATA directory, either as is or as two tar(.gz) files; utilizing the same tool as for bootstrapping (pg_basebackup), but with different options (sans streaming/config).
    - One file is of WAL files, the other is of everything else.
- Automating failover requires orthogonal process(es); hence repmgr et al.
Continuous archive -- UPDATE: works with SR/RS !! (See above.)
archive_mode = on, archive_command, restore_command, ... -- interferes with streaming replication, at least if using replication slots. Untested otherwise.
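A minimal sketch of wiring up a replication slot for SR/RS (the slot name standby1 is an assumption):

-- @ primary (psql) : create a physical replication slot for the standby
SELECT pg_create_physical_replication_slot('standby1');
-- @ primary : inspect slot state / retained WAL position
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;

# @ standby postgresql.conf (or postgresql.auto.conf) : attach to that slot
primary_slot_name = 'standby1'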
Standalone Hot Backups | pg_basebackup
Simplest, but least live; not HA. See Barman solution.
pg_basebackup is used to take a base backup of a running PostgreSQL database cluster. The backup is taken without affecting other clients of the database, and can be used both for point-in-time recovery and as the starting point for a log-shipping or streaming-replication standby server.
It makes an exact copy of your data directory, so all you need to do to restore from that backup is to point postgres at that directory and start it up.
pg_basebackup -h $hostname -U $username -D $local_dir
- User must have REPLICATION permissions or be superuser.
- pg_hba.conf must permit the replication connection.
- The server must also be configured with max_wal_senders set high enough to provide at least one walsender for the backup plus one for WAL streaming (if used).

# This works @ Docker stack @ default settings ...
pg_basebackup -h localhost -U uzr1 -D /home/pgbasebackup                          #... 48MB
pg_basebackup -h localhost -U uzr1 -D /home/pgbasebackup_tgz --format=tar --gzip  #... 4MB
@ service (YAML) :
dbp: volumes:
- dbp1_data:/var/lib/postgresql/data   (named)
- ${PATH_VM_ASSETS}/sql:/home          (mount)

archive_command = 'test ! -f /var/lib/pgsql/backup_in_progress || (test ! -f /var/lib/pgsql/archive/%f && cp %p /var/lib/pgsql/archive/%f)'
archive_command = 'local_backup_script.sh "%p" "%f"'   # FAILs to do anything at all ...
restore_command = 'cp /home/pgbasebackup/%f "%p"'
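For reference, a minimal sketch of what an archive script like the local_backup_script.sh above might contain (the /var/lib/pgsql/archive destination is an assumption; a non-zero exit makes PostgreSQL retry the segment later):

#!/usr/bin/env bash
# Hypothetical local_backup_script.sh "%p" "%f"
set -eu
src="$1"                              # %p : path of the WAL segment to archive
fname="$2"                            # %f : file name only
dest="/var/lib/pgsql/archive/$fname"  # assumed archive location

test ! -f "$dest" || exit 1           # never overwrite an existing segment
cp "$src" "$dest"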
touch /var/lib/pgsql/backup_in_progress
psql -c "select pg_start_backup('hot_backup');"
tar -cf /var/lib/pgsql/backup.tar /var/lib/pgsql/data/
psql -c "select pg_stop_backup();"
rm /var/lib/pgsql/backup_in_progress
tar -rf /var/lib/pgsql/backup.tar /var/lib/pgsql/archive/
Compressed
archive_command = 'gzip < %p > /var/lib/postgresql/data/%f'
restore_command = 'gunzip < /home/pgbasebackup/%f > %p'
Recover/Restore from Hot Backup
Add to postgresql.conf
restore_command = 'cp /home/pgbasebackup/%f "%p"'
Add recovery.signal file to data directory, and restart???
- Manually adding that blank file does in fact trigger some kind of failed attempt to do something, according to the logs, but this "recovery" does nothing whatsoever, except complain that it can't "stat" archived WAL files; that it can't find the archived files it finds! It finds them, hence it can find them.
recovery.conf is OBSOLETE per v12:
FATAL: using recovery command file "recovery.conf" is not supported
Pre-v12 only: add this recovery.conf to the data directory.
restore_command = 'cp /home/pgbasebackup/%f "%p"'
#restore_command = 'gunzip < /home/pgbasebackup_tgz/%f > %p'
recovery_target_time = '2022-01-01 00:00:00 UTC'
recovery_target_inclusive = false
Restart server
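A minimal sketch of the v12+ equivalent, with the recovery settings in postgresql.conf and a recovery.signal file (paths and the target time are carried over from the notes above; the pg_ctl invocation is an assumption):

# 1. Stop the server; replace $PGDATA with the restored base backup; clear $PGDATA/pg_wal.
# 2. Append the recovery settings to postgresql.conf (no recovery.conf in v12+):
restore_command = 'cp /home/pgbasebackup/%f "%p"'
recovery_target_time = '2022-01-01 00:00:00 UTC'
recovery_target_inclusive = false
# 3. Signal archive recovery, then start the server:
touch $PGDATA/recovery.signal
pg_ctl start -D $PGDATA
# The server replays archived WAL up to the target, then pauses (default recovery_target_action).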
Streaming Replication
pg_basebackup \
-h 192.169.12.1 -p 5432 -U replicator \
-D /var/lib/pgsql/13/data -Fp -R -Xs -P
-Fp : Plain copy of all sub-directories and their contents (datafiles, etc).
-R : Configure replication-specific settings automatically in the postgresql.auto.conf file.
-Xs : Using a separate channel/process, stream ongoing changes (WAL records) from master to standby while the backup is in progress.
-P : Show the progress of the backup.
-c fast : May be used to perform a fast checkpoint and avoid waiting until the lazy checkpoint is completed.
Continuous Archiving and Point-in-Time Recovery (PITR)
PostgreSQL maintains a write ahead log (WAL) in the /pg_wal/
subdirectory of the data directory; the database can be restored to consistency by replaying the log entries made since the last checkpoint.
A third strategy for backing up databases: combine a file-system-level backup with backup of the WAL files. If recovery is needed, we restore the file system backup and then replay from the backed-up WAL files to bring the system to a current state.
Setting Up WAL Archiving
A running PostgreSQL system produces an indefinitely long sequence of WAL records ... divides this into WAL segment files (16MB
apiece); given numeric names that reflect their position in the abstract WAL sequence. When not using WAL archiving, the system normally creates just a few segment files and then “recycles” them by renaming no-longer-needed segment files to higher segment numbers.
When archiving WAL data, we need to capture the contents of each segment file ... before the segment file is recycled for reuse.
... many different ways of “saving the data somewhere”: copy the segment files to an NFS-mounted directory on another machine, write them onto a tape drive, or batch them together and burn them onto CDs, ...
PostgreSQL lets the administrator specify a shell command to be executed to copy a completed segment file to wherever it needs to go. The command could be as simple as a cp, or it could invoke a complex shell script — it's all up to you.
To enable WAL archiving, set the wal_level configuration parameter to replica or higher, archive_mode to on, and specify the shell command to use in the archive_command configuration parameter. In practice these settings will always be placed in the postgresql.conf file.
Write Ahead Log (WAL) params
@ postgresql.conf
wal_level = replica
archive_mode = on

# @ Unix
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

# @ Windows
archive_command = 'copy "%p" "C:\\server\\archivedir\\%f"'
%p is the path of the file to archive.
%f is the file name only.
Example archive_command
when run ...
test ! -f /mnt/server/archivedir/00000001000000A900000065 \
&& cp pg_wal/00000001000000A900000065 \
/mnt/server/archivedir/00000001000000A900000065
The speed of the archiving command is unimportant as long as it can keep up with the average rate at which your server generates WAL data. Normal operation continues even if the archiving process falls a little behind. If archiving falls significantly behind, this will increase the amount of data that would be lost in the event of a disaster. It will also mean that the /pg_wal/ directory will contain large numbers of not-yet-archived segment files, which could eventually exceed available disk space.
Monitor the archiving process to ensure that it is working as you intend.
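For example, the archiver can be watched from SQL via the standard pg_stat_archiver view:

-- @ primary (psql) : archiver health
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;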
Replication
Recovering Using a Continuous Archive Backup
restore_command = 'cp /mnt/server/archivedir/%f %p'
Log Shipping : Standby Servers
… continuous archiving to create a high availability (HA) cluster configuration with one or more standby servers ready to take over operations if the primary server fails; Warm Standby or log shipping. … asynchronous, i.e., the WAL records are shipped after transaction commit.
Primary and standby server work together, loosely coupled. Primary server operates in continuous archiving mode. Standby servers operates in continuous recovery mode, reading the WAL files from the primary.
Recovery performance is sufficiently good that the Warm Standby will typically be only moments away from full availability once it has been activated.
Hot Standby server is a standby server that can also be used for read-only queries.
No changes to the database tables are required to enable this capability, so it offers low administration overhead compared to some other replication solutions. This configuration also has relatively low performance impact on the primary server.
Standby Server Operation
A PostgreSQL server enters Standby Mode if a standby.signal file exists in its data directory when the server is started.
Standby servers continuously apply WAL received either directly from the master (streaming replication), or from a WAL archive (see restore_command).
/pg_wal
Shared archive directory; must be available to all standby servers in the cluster; the standby servers attempt to restore any WAL files found therein. That typically happens after a server restart, when the standby replays WAL that was streamed from the master before the restart; files can also be copied manually to /pg_wal at any time to have them replayed.
This is a kind of backup bin for Streaming Replication; if the standby servers can't keep up and the primary "recycles" WAL data before the standby servers process it from the stream, the data will automatically be recovered (replayed) from WAL files in this directory.
The Loop
At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_wal directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_wal. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to the start of the loop and tries to restore the file from the archive again. This loop of retries from the archive, pg_wal, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.
pg_promote() | pg_ctl promote | promote_trigger_file
- Standby mode is exited and the server switches to normal operation when pg_ctl promote is run, pg_promote() is called, or a trigger file is found (promote_trigger_file). Before failover, any WAL immediately available in the archive or in pg_wal will be restored, but no attempt is made to connect to the master.
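For example, either of these promotes a running standby ($PGDATA is a placeholder):

# @ standby shell
pg_ctl promote -D $PGDATA

-- or @ standby psql (v12+)
SELECT pg_promote(wait => true);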
Prepare Master for Standby Servers
- Archive directory accessible from the standby; accessible even when the master is down.
- For streaming replication, set up authentication on the primary server
- Allow replication connections from the standby server(s); create a role and provide a suitable entry or entries in pg_hba.conf with the database field set to replication. (See the primary-side sketch after this list.)
- Set max_wal_senders @ primary to a sufficiently large value in the configuration file of the primary server.
- Set an apropos max_replication_slots if slots are used.
- Bootstrap with a Base Backup
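A minimal primary-side sketch covering the items above (the replicator role, its password, and the 10.0.200.0/24 subnet are assumptions, the subnet taken from the Docker network example earlier):

-- @ primary (psql) : role for replication connections
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'changeme';

# @ primary pg_hba.conf : allow replication connections from the standby subnet
host  replication  replicator  10.0.200.0/24  scram-sha-256

# @ primary postgresql.conf
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10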
Set Up a Standby Server
- Restore the base backup taken from primary server.
- Create the standby.signal file in the standby's data directory.
- Set restore_command to a simple command to copy files from the WAL archive.
    - If you plan to have multiple standby servers for high availability purposes, make sure that recovery_target_timeline is set to latest (the default), to make the standby server follow the timeline change that occurs at failover to another standby.
Set up identical to primary (WAL archiving, connections and authentication) because the standby server will work as a primary server after failover.
If you're using a WAL archive, its size can be minimized using the archive_cleanup_command parameter to remove files that are no longer required by the standby server. The pg_archivecleanup utility is designed specifically to be used with archive_cleanup_command in typical single-standby configurations; see pg_archivecleanup. Note, however, that if you're using the archive for backup purposes, you need to retain files needed to recover from at least the latest base backup, even if they're no longer needed by the standby.
Example configuration (@ postgresql.conf; in v12+ these settings live there rather than in recovery.conf):
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass options=''-c wal_sender_timeout=5000'''
restore_command = 'cp /path/to/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
For streaming replication, set max_wal_senders high enough in the primary to allow all standbys to be connected simultaneously.
Streaming Replication
The step that turns a file-based log-shipping standby into a streaming-replication standby is setting the primary_conninfo setting to point to the primary server. Set listen_addresses and authentication options (see pg_hba.conf) on the primary so that the standby server can connect.
… allows a standby server to stay more up-to-date than is possible with file-based log shipping; the primary streams WAL records to the standby as they're generated; asynchronous by default; a small delay between committing a transaction in the primary and the changes becoming visible in the standby; the delay is much smaller than with file-based log shipping, typically under one second assuming the standby is powerful enough to keep up with the load. With streaming replication, archive_timeout
is not required to reduce the data loss window.
Use in conjunction with file-based continuous archiving; set up a WAL archive that's accessible from the standby, else ...
If you use streaming replication without file-based continuous archiving, the server might recycle old WAL segments before the standby has received them. If this occurs, the standby will need to be reinitialized from a new base backup. You can avoid this by setting wal_keep_size to a value large enough to ensure that WAL segments are not recycled too early, or by configuring a replication slot for the standby. If you set up a WAL archive that's accessible from the standby, these solutions are not required, since the standby can always use the archive to catch up provided it retains enough segments.
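For example (the 1GB figure is arbitrary; on v12 the equivalent knob is wal_keep_segments, counted in 16MB segments rather than bytes):

# @ primary postgresql.conf (v13+)
wal_keep_size = 1GB

# @ primary postgresql.conf (v12)
wal_keep_segments = 64        # 64 x 16MB = 1GB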
On systems that support the keepalive socket option, setting tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count helps the primary promptly notice a broken connection.
Set the maximum number of concurrent connections from the standby servers (see max_wal_senders
for details).
When the standby is started and primary_conninfo
is set correctly, the standby will connect to the primary after replaying all WAL files available in the archive. If the connection is established successfully, you will see a walreceiver
in the standby, and a corresponding walsender
process in the primary.
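Both ends can also be checked from SQL via the standard stats views:

-- @ primary (psql) : one row per connected standby (walsender)
SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;

-- @ standby (psql) : walreceiver status
SELECT status, sender_host FROM pg_stat_wal_receiver;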
Parameters
archive_timeout
… limits size of the data loss window in file-based log shipping … can be set as low as a few seconds. However such a low setting will substantially increase the bandwidth required for file shipping. Streaming replication allows a much smaller window of data loss.
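For example, to cap the file-shipping data-loss window at roughly one minute (the value is illustrative):

# @ primary postgresql.conf
archive_timeout = 60        # seconds; forces a WAL segment switch at least this often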
Hot Standby params
- @ Primary
wal_level
vacuum_defer_cleanup_age
- @ Standby
hot_standby
max_standby_archive_delay
max_standby_streaming_delay
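A sketch of how those typically land in postgresql.conf (values shown are the stock defaults, not recommendations):

# @ primary
wal_level = replica
vacuum_defer_cleanup_age = 0

# @ standby
hot_standby = on
max_standby_archive_delay = 30s
max_standby_streaming_delay = 30s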