Speeding up NOT IN

Sometimes, old tricks can still work.. I had a case the other day where I needed to insert into a table all rows "missing" compared to a different table. Just the id value, and let all others get the default value. The query to do this the simple way is:

INSERT INTO table2 (itemid) 

  SELECT id FROM table1 

  WHERE id NOT IN (

    SELECT itemid FROM table2

  )

This would be fine, except there are a lot of rows in the tables. Specifically, table2 contains about 10 million and table1 contains 15 million. So the insert should be about 5 million rows. It's not huge, but it's enough to matter.

Left this query running for hours. And hours. And hours. The data fit fine in RAM, so it spent it all at 100%25 CPU and no I/O. Left it running for 12+ hours, still not finished. So it was time to try something else. Tried to rewrite the query to:

INSERT INTO table2 (itemid)

  SELECT id FROM table1

  WHERE NOT EXISTS (

   SELECT * FROM TABLE2 WHERE table2.itemid=table1.id

  )

But no luck. Terminated this one after 7 hours. The query plan for both of these was the same:

Seq Scan on table1  (cost=0.00..3902265020963.13 rows=7539834 width=4)

  Filter: (NOT (subplan))

  SubPlan

    - Seq Scan on table2 s2  (cost=0.00..258776.46 rows=1 width=38)

          Filter: (itemid = $0)

So it's time to bring out the dirty tricks. Specifically something that I do recall at least old versions of access used to do - back when they simply didn't do NOT IN subqueries (yes, I'm slightly embarrassed, but I did hack around access a bit at that time when forced to). The new code is (uses a temporary table, but that's not directly related to the speed change):

INSERT INTO loader (id)

  SELECT id FROM table1

  LEFT JOIN table2 ON table1.id=table2.itemid

  WHERE table2.itemid IS NULL

And wham. Query completes in less than 30 minutes. The INSERT into the actual table is very quick - single digit number of minutes, didn't get an actual measure for it. The new query plan is:

Hash Left Join  (cost=435127.62..2902887.99 rows=1 width=4)

  Hash Cond: (table1.id = table2.itemid)

  Filter: (table2.itemid IS NULL)

  - Seq Scan on table1  (cost=0.00..1880248.68 rows=15079668 width=4)

  - Hash  (cost=229150.50..229150.50 rows=11849450 width=4)

        - Seq Scan on table2  (cost=0.00..229150.50 rows=11849450 width=4)

For some reason the row estimate look differently - but it's the same tables, and nothing happened in between. And either way - it's slightly hackish, but it worked.


Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

SCALE+PGDays
Mar 2-5, 2017
Pasadena, California, USA
Open Source Infrastructure @ SCALE
Mar 2, 2017
Pasadena, California, USA
Confoo Montreal 2017
Mar 8-10, 2017
Montreal, Canada
Nordic PGDay 2017
Mar 21, 2017
Stockholm, Sweden
pgDay.paris 2017
Mar 23, 2017
Paris, France
PGCon 2017
May 23-26, 2017
Ottawa, Canada

Past

FOSDEM + PGDay 2017
Feb 2-4, 2017
Brussels, Belgium
PGConf.Asia 2016
Dec 2-3, 2016
Tokyo, Japan
Berlin PUG
Nov 17, 2016
Berlin, Germany
PGConf.EU 2016
Nov 1-4, 2016
Tallinn, Estonia
Stockholm PUG 2016/5
Oct 25, 2016
Stockholm, Sweden
More past conferences