52783

Doing a big update using multiple tables in PostgreSQL

Question:

A few weeks ago I started working with Postgres and I've been trying to solve this problem for the last few days with mixed results: I have a table (10 million rows) that needs to be updated once every month with information stored in a few other tables. During this update no-one has access to the database, so no-one is reading / writing data. I will be the only DB user during that time.

Table A has 10 million rows. One column (rebuilding costs) needs to be updated. Table B, table C and table D have newly calculated rebuilding costs for every row in Table A (so table B, C and D each have 10 million rows). Which table is used for updating Table A depends on the value, see below. I've been trying to do this update with this query:

UPDATE table_A a SET rebuilding_costs = CASE WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs ELSE NULL END FROM table_B b LEFT OUTER JOIN table_C c ON (b.address = c.address) LEFT OUTER JOIN table_D d ON (b.address = d.address) WHERE a.address = b.address ;

This query gives the right result, but it is a little slow (25 minutes). The funny thing is: when table A has 1 million rows (instead of 10 million) it only takes 30 seconds. So when running this query on a table with 10 million rows I expected the query to run for 5 minutes, but it took 25 minutes instead. Thats when I tried to UPDATE in Blocks, so I added this line to the query in the WHERE clause:

AND (a.id > 0 AND a.id < 1000000)

a.id is the primary key of table A. In this example only the first million rows of table A get updated. However, it took 3 minutes to complete. You would have to do it 10 times to update all 10 million rows so that would be 30 minutes..

I also tried to prevent the query from doing someone with rows that are not going to be changed by the UPDATE by adding this:

AND a.herbouwwaarde_indicatie IS DISTINCT FROM b.inhoud AND a.herbouwwaarde_indicatie IS DISTINCT FROM c.inhoud AND a.herbouwwaarde_indicatie IS DISTINCT FROM d.inhoud

Setting the fillfactor to 70 and 50 did help to improve the speed a little bit, but I couldnt get it under 20 minutes.

I also tried to recreate table A, like:

CREATE TABLE table_A_new AS SELECT a.address, CASE WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs ELSE NULL END rebuildingcosts FROM table_A a LEFT OUTER JOIN table_B b ON (a.address = b.address) LEFT OUTER JOIN table_C c ON (a.address = c.address) LEFT OUTER JOIN table_D d ON (a.address = d.address) ;

This is very fast (2 minutes) and gives the right result. However, you have to recreate an entire table. Somehow that doesnt seem very efficient (and takes a lot of storage space), that's why I started out with UPDATE query's.

My question is: what is the best way to proceed? Is there a way to improve the performance of the UPDATE query, or would it be best to look for an alternative such as the 'create table' one in the example above.

I have to use Postrgres though, I cant switch to a different DBMS.

This is the execution plan of the UPDATE query (right now the tables don't have 10 million rows, but approximately 6 - 8 million rows):

Update on tabel_A a (cost=902288.27..2150690.80 rows=6714762 width=65) -> Hash Join (cost=902288.27..2150690.80 rows=6714762 width=65) Hash Cond: ((b.adres)::text = a.adres)" Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs) AND (c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)) AND (d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)) -> Hash Left Join (cost=522527.27..1318059.42 rows=6716471 width=39) Hash Cond: ((b.adres)::text = (d.adres)::text) -> Hash Right Join (cost=295916.60..817658.93 rows=6716471 width=29) Hash Cond: ((c.adres)::text = (b.adres)::text) -> Seq Scan on Tabel_C c (cost=0.00..240642.35 rows=7600735 width=19) -> Hash (cost=172605.71..172605.71 rows=6716471 width=19) -> Seq Scan on tabel_B b (cost=0.00..172605.71 rows=6716471 width=19) -> Hash (cost=103436.52..103436.52 rows=6709052 width=19)" -> Seq Scan on tabel D d (cost=0.00..103436.52 rows=6709052 width=19)" -> Hash (cost=217261.00..217261.00 rows=8000000 width=39)" -> Seq Scan on Tabel_A a (cost=0.00..217261.00 rows=8000000 width=39)"

This is the Explain analyse output:

Update on Tabel_A a (cost=902288.27..2150690.80 rows=6714762 width=65) (actual time=2815452.997..2815452.997 rows=0 loops=1) -> Hash Join (cost=902288.27..2150690.80 rows=6714762 width=65) (actual time=108861.999..214888.780 rows=5252864 loops=1) Hash Cond: ((b.adres)::text = a.adres) Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs) AND (c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)) (d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)) Rows Removed by Join Filter: 670998 -> Hash Left Join (cost=522527.27..1318059.42 rows=6716471 width=39) (actual time=43138.635..116933.803 rows=6711432 loops=1)" Hash Cond: ((b.adres)::text = (d.adres)::text)" -> Hash Right Join (cost=295916.60..817658.93 rows=6716471 width=29) (actual time=34571.750..99040.256 rows=6710550 loops=1)" Hash Cond: ((c.adres)::text = (b.adres)::text)" -> Seq Scan on Tabel_C c (cost=0.00..240642.35 rows=7600735 width=19) (actual time=127.080..59703.935 rows=7595083 loops=1)" -> Hash (cost=172605.71..172605.71 rows=6716471 width=19) (actual time=29925.787..29925.787 rows=6709229 loops=1)" Buckets: 2048 Batches: 512 Memory Usage: 678kB" -> Seq Scan on Tabel_B b (cost=0.00..172605.71 rows=6716471 width=19) (actual time=0.017..27245.069 rows=6709229 loops=1)" -> Hash (cost=103436.52..103436.52 rows=6709052 width=19) (actual time=8566.848..8566.848 rows=6709229 loops=1)" Buckets: 2048 Batches: 512 Memory Usage: 678kB" -> Seq Scan on Tabel_D d (cost=0.00..103436.52 rows=6709052 width=19) (actual time=0.009..5970.010 rows=6709229 loops=1)" -> Hash (cost=217261.00..217261.00 rows=8000000 width=39) (actual time=65721.815..65721.815 rows=8000000 loops=1)" Buckets: 2048 Batches: 1024 Memory Usage: 612kB" -> Seq Scan on Tabel_A a (cost=0.00..217261.00 rows=8000000 width=39) (actual time=0.056..55968.171 rows=8000000 loops=1)" Total runtime: 2815453.549 ms"

Table A, B, C and D have all indexes on the address column, the column that is being used to join on. Te tables have no other indexes. Table A does have a primary key (id). Table B, C and D have between 5 and 7 columns extra that are not being used in this process.

Answer1:

This query should produce the same results, but avoids the CASE

-- EXPLAIN ANALYZE UPDATE table_a a SET rebuilding_costs = drie.rebuilding_costs FROM ( SELECT COALESCE(b.address, c.address, d.address) AS address , COALESCE(b.rebuilding_costs, c.rebuilding_costs, d.rebuilding_costs) AS rebuilding_costs FROM table_b b FULL OUTER JOIN table_c c ON c.address = b.address AND c.rebuilding_costs BETWEEN 200001 AND 400000 FULL OUTER JOIN table_D d ON d.address = b.address AND d.rebuilding_costs BETWEEN 400001 AND 600000 WHERE b.rebuilding_costs BETWEEN 100001 AND 200000 ) drie WHERE a.address = drie.address AND a.rebuilding_costs <> drie.rebuilding_costs -- Avoid useless updates ; <hr />

UPDATE: A similar approach, based on a chained set of CTEs:

-- -------------------------------- EXPLAIN ANALYZE WITH cte_b AS ( SELECT b.address, b.rebuilding_costs FROM table_b b WHERE b.rebuilding_costs BETWEEN 100001 AND 200000 ) , cte_c AS ( SELECT c.address , c.rebuilding_costs FROM table_c c WHERE c.rebuilding_costs BETWEEN 200001 AND 400000 AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = c.address) ) , cte_d AS ( SELECT d.address , d.rebuilding_costs FROM table_d d WHERE d.rebuilding_costs BETWEEN 400001 AND 600000 AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = d.address) AND NOT EXISTS (SELECT * FROM cte_c WHERE cte_c.address = d.address) ) , cte_bcd AS ( SELECT cte_b.address, cte_b.rebuilding_costs FROM cte_b UNION ALL SELECT cte_c.address, cte_c.rebuilding_costs FROM cte_c UNION ALL SELECT cte_d.address, cte_d.rebuilding_costs FROM cte_d ) UPDATE table_a a SET rebuilding_costs = cte_bcd.rebuilding_costs FROM cte_bcd WHERE a.address = cte_bcd.address -- avoid useless updates this way: AND a.rebuilding_costs <> cte_bcd.rebuilding_costs -- ,or this way: -- AND cte_bcd.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs ; <hr />

UPDATE2: CTE's can be slow, because they act as optimisation barriers. A quick&dirty way is to rewrite tem as (temporary) VIEWs, and refer to these instead. This allows the optimiser to shuffle the various parts of the query into and out off subqueries, and even combine and reuse these.

CREATE TEMP VIEW cte_b AS ( SELECT b.address, b.rebuilding_costs FROM table_b b WHERE b.rebuilding_costs BETWEEN 100001 AND 200000 ); CREATE TEMP VIEW cte_c AS ( SELECT c.address , c.rebuilding_costs FROM table_c c WHERE c.rebuilding_costs BETWEEN 200001 AND 400000 AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = c.address) ); CREATE TEMP VIEW cte_d AS ( SELECT d.address , d.rebuilding_costs FROM table_d d WHERE d.rebuilding_costs BETWEEN 400001 AND 600000 AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = d.address) AND NOT EXISTS (SELECT * FROM cte_c WHERE cte_c.address = d.address) ); CREATE TEMP VIEW cte_bcd AS ( SELECT cte_b.address, cte_b.rebuilding_costs FROM cte_b UNION ALL SELECT cte_c.address, cte_c.rebuilding_costs FROM cte_c UNION ALL SELECT cte_d.address, cte_d.rebuilding_costs FROM cte_d ); EXPLAIN -- ANALYZE UPDATE table_a a SET rebuilding_costs = cte_bcd.rebuilding_costs FROM cte_bcd WHERE a.address = cte_bcd.address AND a.rebuilding_costs <> cte_bcd.rebuilding_costs -- avoid useless updates -- AND a.address < 100000 ;

Recommend

  • How can I use sklearn CountVectorizer with mutliple strings?
  • Load external JSON file
  • Share App on Facebook
  • How to create a separation between two SpriteKit objects
  • Handle user input type mismatch in FORTRAN
  • How to catch crash without disturbing the normal crash flow
  • Newsstand resume download after the App closed by user Fully
  • Filtering dataframe using hashmap
  • c# split string with long space or tabs
  • Minimum number of bytes that can contain an integer value
  • Logstash parsing progress bar
  • One class instance used automatically throughout process
  • Embed custom control in WPF FlowDocument
  • How to define or support a code language on Visual Studio
  • Uploading files with asp.net core into database as byte array
  • Is Win2D yet available in C++/WinRt?
  • python assign value to pandas df if falls between range of dates in another df
  • Creating action API with Rails. Do I need to use respond_with
  • xpage creates save conflict on save/submit
  • pip installation of gmpy2
  • Condition on a timestamp column to select data for a year
  • Intersecting a list of data frames in R
  • sending emails in python weird behaviour
  • MvvmCross - View not loaded
  • 'Expiration time buffer' of SessionSecurityTokenHandler of WIF 4.5 in Azure web role
  • Symfony 2:image paths in javascript file with assetic
  • How to add blur effect into UIImage with Swift? [duplicate]
  • Limit # of records returned based on a form control
  • Using one probability set to generate another [duplicate]
  • Using PHP & Tornado
  • Why clearfsimport command adds file to source control with a size zero?
  • How to delete first 7 characters of folder name by using batch script?
  • Adding Dynamic Row and Data on Checkbox Click
  • Homomorphic filtering in OpenCV
  • How does the dispatcher work when mixing sync/async with serial/concurrent queue?
  • How to get “crispEdges” for SVG text?
  • JQuery Mobile Ajax Navigation in Single-Page Template
  • How to select multiple items from a List view - JavaFX 8