Speeding up NOT IN

Sometimes, old tricks can still work.. I had a case the other day where I needed to insert into a table all rows "missing" compared to a different table. Just the id value, and let all others get the default value. The query to do this the simple way is:

INSERT INTO table2 (itemid) 

  SELECT id FROM table1 


    SELECT itemid FROM table2


This would be fine, except there are a lot of rows in the tables. Specifically, table2 contains about 10 million and table1 contains 15 million. So the insert should be about 5 million rows. It's not huge, but it's enough to matter.

Left this query running for hours. And hours. And hours. The data fit fine in RAM, so it spent it all at 100%25 CPU and no I/O. Left it running for 12+ hours, still not finished. So it was time to try something else. Tried to rewrite the query to:

INSERT INTO table2 (itemid)

  SELECT id FROM table1


   SELECT * FROM TABLE2 WHERE table2.itemid=table1.id


But no luck. Terminated this one after 7 hours. The query plan for both of these was the same:

Seq Scan on table1  (cost=0.00..3902265020963.13 rows=7539834 width=4)

  Filter: (NOT (subplan))


    - Seq Scan on table2 s2  (cost=0.00..258776.46 rows=1 width=38)

          Filter: (itemid = $0)

So it's time to bring out the dirty tricks. Specifically something that I do recall at least old versions of access used to do - back when they simply didn't do NOT IN subqueries (yes, I'm slightly embarrassed, but I did hack around access a bit at that time when forced to). The new code is (uses a temporary table, but that's not directly related to the speed change):

INSERT INTO loader (id)

  SELECT id FROM table1

  LEFT JOIN table2 ON table1.id=table2.itemid

  WHERE table2.itemid IS NULL

And wham. Query completes in less than 30 minutes. The INSERT into the actual table is very quick - single digit number of minutes, didn't get an actual measure for it. The new query plan is:

Hash Left Join  (cost=435127.62..2902887.99 rows=1 width=4)

  Hash Cond: (table1.id = table2.itemid)

  Filter: (table2.itemid IS NULL)

  - Seq Scan on table1  (cost=0.00..1880248.68 rows=15079668 width=4)

  - Hash  (cost=229150.50..229150.50 rows=11849450 width=4)

        - Seq Scan on table2  (cost=0.00..229150.50 rows=11849450 width=4)

For some reason the row estimate look differently - but it's the same tables, and nothing happened in between. And either way - it's slightly hackish, but it worked.


I speak at and organize conferences around Open Source in general and PostgreSQL in particular.


Stockholm PUG 2016/5
Oct 25, 2016
Stockholm, Sweden
PGConf.EU 2016
Nov 1-4, 2016
Tallinn, Estonia
Berlin PUG
Nov 17, 2016
Berlin, Germany
PGConf.Asia 2016
Dec 2-3, 2016
Tokyo, Japan
FOSDEM + PGDay 2017
Feb 2-4, 2017
Brussels, Belgium


Postgres Vision 2016
Oct 11-13, 2016
San Francisco, USA
Postgres Open
Sep 13-16, 2016
Dallas, USA
Stockholm PUG 2016/4
Aug 31, 2016
Stockholm, Sweden
PG Day'16 Russia
Jul 6-8, 2016
St Petersburg, Russia
PGDay UK 2016
Jul 05, 2016
London, UK
More past conferences