I have to update tables of 1 to 2 billion rows with various values for each row. Each run makes ~100 million changes (10%).
My first try was to group the changes into transactions of 300K updates, each targeting a specific partition directly, since PostgreSQL does not always optimize prepared queries when partitions are involved.
- Transactions of batches of “UPDATE myTable SET myField=value WHERE myId=id” statements give 1,500 updates/sec., which means each run would take at least 18 hours.
- The HOT-updates solution as described here, with FILLFACTOR=50, gives 1,600 updates/sec. I use SSDs, so it’s a costly improvement as it doubles the storage size.
- Inserting the updated values into a temporary table and merging them afterwards with UPDATE…FROM gives 18,000 updates/sec. if I do a VACUUM for each partition, and 100,000 updates/sec. otherwise. Cooool.
Here is the sequence of operations:
CREATE TEMP TABLE tempTable (id BIGINT NOT NULL, field(s) to be updated,
CONSTRAINT tempTable_pkey PRIMARY KEY (id));
Accumulate a bunch of updates in a buffer, sized depending on the available RAM.
When the buffer is full, when switching to another table/partition, or when the run is complete:
COPY tempTable FROM buffer;
UPDATE myTable a SET field(s)=value(s) FROM tempTable b WHERE a.id=b.id;
COMMIT;
TRUNCATE TABLE tempTable;
VACUUM FULL ANALYZE myTable;
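The loop above can be sketched in Python. This is a minimal sketch, not the original author's code: it assumes a psycopg2-style driver (`copy_expert`) and an illustrative target table `mytable(id BIGINT PRIMARY KEY, val INT)`; the table and column names are hypothetical.

```python
# Sketch of the COPY + UPDATE...FROM merge loop.
# Assumptions: psycopg2-style cursor with copy_expert(); "mytable" and
# its "val" column are placeholders for the real table/fields.
import io

CREATE_TEMP = """
CREATE TEMP TABLE temp_updates (
    id  BIGINT NOT NULL,
    val INT,
    CONSTRAINT temp_updates_pkey PRIMARY KEY (id)
)
"""

MERGE = """
UPDATE mytable a
SET    val = b.val
FROM   temp_updates b
WHERE  a.id = b.id
"""

def rows_to_copy_buffer(rows):
    """Serialize (id, val) pairs into the tab-separated text format
    expected by COPY temp_updates FROM STDIN."""
    buf = io.StringIO()
    for row_id, val in rows:
        buf.write(f"{row_id}\t{val}\n")
    buf.seek(0)
    return buf

def flush_batch(cur, rows):
    """One batch: COPY into the temp table, merge, commit, then empty it
    for the next batch (mirrors the COPY/UPDATE/COMMIT/TRUNCATE sequence)."""
    cur.copy_expert("COPY temp_updates FROM STDIN", rows_to_copy_buffer(rows))
    cur.execute(MERGE)
    cur.connection.commit()
    cur.execute("TRUNCATE TABLE temp_updates")
```

The key point is that `COPY` streams the buffered changes into the temp table in bulk, and a single set-based `UPDATE…FROM` then applies them, instead of one round-trip per row.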
That means a run now takes 1.5 h instead of 18 h for 100 million updates, vacuum included. To save time, the final vacuum does not need to be a FULL one: even a fast, regular vacuum is useful to keep the database's transaction ID age under control and avoid an unwanted autovacuum during rush hours.
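To decide when that manual vacuum is actually needed, one can watch the table's frozen-XID age and act before autovacuum's anti-wraparound trigger fires. A minimal sketch, assuming the default `autovacuum_freeze_max_age` of 200 million and a hypothetical safety threshold below it:

```python
# Hypothetical threshold check: run a manual VACUUM well before PostgreSQL's
# autovacuum_freeze_max_age (default 200 million) forces an anti-wraparound
# autovacuum at an inconvenient time. The 150M margin is an assumption.
XID_AGE_THRESHOLD = 150_000_000

def needs_vacuum(relfrozenxid_age: int, threshold: int = XID_AGE_THRESHOLD) -> bool:
    """Return True when the table's frozen-XID age reaches the threshold.

    The age itself would come from:
        SELECT age(relfrozenxid) FROM pg_class WHERE relname = 'mytable';
    """
    return relfrozenxid_age >= threshold
```

Scheduling the vacuum yourself this way keeps the XID bookkeeping done during quiet hours instead of letting autovacuum pick the moment.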