Speeding up NOT IN

Sometimes, old tricks can still work.. I had a case the other day where I needed to insert into a table all rows "missing" compared to a different table. Just the id value, and let all others get the default value. The query to do this the simple way is:

INSERT INTO table2 (itemid) 

  SELECT id FROM table1 

  WHERE id NOT IN (

    SELECT itemid FROM table2

  )

This would be fine, except there are a lot of rows in the tables. Specifically, table2 contains about 10 million and table1 contains 15 million. So the insert should be about 5 million rows. It's not huge, but it's enough to matter.

Left this query running for hours. And hours. And hours. The data fit fine in RAM, so it spent it all at 100%25 CPU and no I/O. Left it running for 12+ hours, still not finished. So it was time to try something else. Tried to rewrite the query to:

INSERT INTO table2 (itemid)

  SELECT id FROM table1

  WHERE NOT EXISTS (

   SELECT * FROM TABLE2 WHERE table2.itemid=table1.id

  )

But no luck. Terminated this one after 7 hours. The query plan for both of these was the same:

Seq Scan on table1  (cost=0.00..3902265020963.13 rows=7539834 width=4)

  Filter: (NOT (subplan))

  SubPlan

    - Seq Scan on table2 s2  (cost=0.00..258776.46 rows=1 width=38)

          Filter: (itemid = $0)

So it's time to bring out the dirty tricks. Specifically something that I do recall at least old versions of access used to do - back when they simply didn't do NOT IN subqueries (yes, I'm slightly embarrassed, but I did hack around access a bit at that time when forced to). The new code is (uses a temporary table, but that's not directly related to the speed change):

INSERT INTO loader (id)

  SELECT id FROM table1

  LEFT JOIN table2 ON table1.id=table2.itemid

  WHERE table2.itemid IS NULL

And wham. Query completes in less than 30 minutes. The INSERT into the actual table is very quick - single digit number of minutes, didn't get an actual measure for it. The new query plan is:

Hash Left Join  (cost=435127.62..2902887.99 rows=1 width=4)

  Hash Cond: (table1.id = table2.itemid)

  Filter: (table2.itemid IS NULL)

  - Seq Scan on table1  (cost=0.00..1880248.68 rows=15079668 width=4)

  - Hash  (cost=229150.50..229150.50 rows=11849450 width=4)

        - Seq Scan on table2  (cost=0.00..229150.50 rows=11849450 width=4)

For some reason the row estimate look differently - but it's the same tables, and nothing happened in between. And either way - it's slightly hackish, but it worked.


Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

Stockholm PUG 2016/3
Jun 16, 2016
Stockholm, Sweden
PGDay UK 2016
Jul 05, 2016
London, UK
PG Day'16 Russia
Jul 6-8, 2016
St Petersburg, Russia
Stockholm PUG 2016/4
Aug 31, 2016
Stockholm, Sweden
Postgres Open
Sep 13-16, 2016
Dallas, USA
PGConf.EU 2016
Nov 1-Jan 4, 2016
Tallinn, Estonia

Past

PGCon
May 17-21, 2016
Ottawa, Canada
Pycon Sweden
May 9-10, 2016
Stockholm, Sweden
Stockholm PUG 2016/2
Apr 26, 2016
Stockholm, Sweden
PGConf.US
Apr 18-20, 2016
New York, USA
pgDay Paris
Mar 31, 2016
Paris, France
More past conferences