Speeding up NOT IN

Sometimes, old tricks can still work.. I had a case the other day where I needed to insert into a table all rows "missing" compared to a different table. Just the id value, and let all others get the default value. The query to do this the simple way is:

INSERT INTO table2 (itemid) 

  SELECT id FROM table1 

  WHERE id NOT IN (

    SELECT itemid FROM table2

  )

This would be fine, except there are a lot of rows in the tables. Specifically, table2 contains about 10 million and table1 contains 15 million. So the insert should be about 5 million rows. It's not huge, but it's enough to matter.

Left this query running for hours. And hours. And hours. The data fit fine in RAM, so it spent it all at 100%25 CPU and no I/O. Left it running for 12+ hours, still not finished. So it was time to try something else. Tried to rewrite the query to:

INSERT INTO table2 (itemid)

  SELECT id FROM table1

  WHERE NOT EXISTS (

   SELECT * FROM TABLE2 WHERE table2.itemid=table1.id

  )

But no luck. Terminated this one after 7 hours. The query plan for both of these was the same:

Seq Scan on table1  (cost=0.00..3902265020963.13 rows=7539834 width=4)

  Filter: (NOT (subplan))

  SubPlan

    - Seq Scan on table2 s2  (cost=0.00..258776.46 rows=1 width=38)

          Filter: (itemid = $0)

So it's time to bring out the dirty tricks. Specifically something that I do recall at least old versions of access used to do - back when they simply didn't do NOT IN subqueries (yes, I'm slightly embarrassed, but I did hack around access a bit at that time when forced to). The new code is (uses a temporary table, but that's not directly related to the speed change):

INSERT INTO loader (id)

  SELECT id FROM table1

  LEFT JOIN table2 ON table1.id=table2.itemid

  WHERE table2.itemid IS NULL

And wham. Query completes in less than 30 minutes. The INSERT into the actual table is very quick - single digit number of minutes, didn't get an actual measure for it. The new query plan is:

Hash Left Join  (cost=435127.62..2902887.99 rows=1 width=4)

  Hash Cond: (table1.id = table2.itemid)

  Filter: (table2.itemid IS NULL)

  - Seq Scan on table1  (cost=0.00..1880248.68 rows=15079668 width=4)

  - Hash  (cost=229150.50..229150.50 rows=11849450 width=4)

        - Seq Scan on table2  (cost=0.00..229150.50 rows=11849450 width=4)

For some reason the row estimate look differently - but it's the same tables, and nothing happened in between. And either way - it's slightly hackish, but it worked.


Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

Stockholm PUG 2016/4
Aug 31, 2016
Stockholm, Sweden
Postgres Open
Sep 13-16, 2016
Dallas, USA
Postgres Vision 2016
Oct 11-Nov 13, 2016
San Francisco, USA
PGConf.EU 2016
Nov 1-Jan 4, 2016
Tallinn, Estonia
PGConf.Asia 2016
Dec 2-3, 2016
Tokyo, Japan

Past

PG Day'16 Russia
Jul 6-8, 2016
St Petersburg, Russia
PGDay UK 2016
Jul 05, 2016
London, UK
Stockholm PUG 2016/3
Jun 16, 2016
Stockholm, Sweden
PGCon
May 17-21, 2016
Ottawa, Canada
Pycon Sweden
May 9-10, 2016
Stockholm, Sweden
More past conferences