Diffing two sets

Something I've had to do fairly often in SQL is finding the difference between two sets. A typical case which came up again this week is when you have a set of snapshots in a history table, and want to find which keys were added and removed between the two occassions. One way to do this is with a set of queries - one for what was added, one for what was removed, and possibly one for what wasn't changed (if you need that data). You can also do this with OUTER JOINs, which can often be a lot more efficient given that you can answer all your questions in one query.

So, to illustrate this, here's a simple, sample setup. It's basically a table that contains snapshots of a set of personnummer (swedish social security numbers, but it can of course be anything you want. If it's a single unique field, that makes life a lot easier). Snapshots are generated at a certain date by a scheduled taks that inserts the result of a select on a different set of tables. For this illustration, I'll just insert a couple of bogus rows of data. [HTML_REMOVED] CREATE TABLE snapshots (

snapdate timestamptz NOT NULL,

personnr char(13) NOT NULL,

CONSTRAINT pk_snapshots PRIMARY KEY (snapdate,personnr)

);

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','11111111-1111');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','22222222-2222');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','33333333-3333');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','44444444-4444');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','22222222-2222');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','33333333-3333');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','55555555-5555');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','66666666-6666');



Now, to simply get whatever rows were either added or removed between these two sets, use the following query:

<code style="white-space:pre">
SELECT 

     COALESCE(s1.personnr,s2.personnr) AS personnr,

     CASE WHEN s1.personnr IS NULL THEN 'Added' 

          ELSE 'Removed' END AS action

FROM

  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00')

    AS s1

FULL OUTER JOIN

  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00')

    AS s2

ON s1.personnr=s2.personnr

WHERE s1.personnr IS NULL OR s2.personnr IS NULL

If you want to see both the changed and the unchanged rows, remove the WHERE statement and expand the CASE like this:

[HTML_REMOVED] SELECT

 COALESCE(s1.personnr,s2.personnr) AS personnr,

 CASE WHEN s1.personnr IS NULL THEN 'Added'

      WHEN s2.personnr IS NULL THEN 'Removed'

      ELSE 'Unchanged' END AS action

FROM

(SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00')

AS s1

FULL OUTER JOIN

(SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00')

AS s2

ON s1.personnr=s2.personnr ```

If you want only changes in one direction, you should be using LEFT or RIGHT JOIN instead, and if you want just the ones not changed, INNER JOIN. You can still do it with OUTER JOIN, but it will likely be much better performing using the proper JOIN for the case.


Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

Pycon Sweden
May 9-10, 2016
Stockholm, Sweden
PGCon
May 17-21, 2016
Ottawa, Canada
Stockholm PUG 2016/3
Jun 16, 2016
Stockholm, Sweden
PGDay UK 2016
Jul 05, 2016
London, UK
PG Day'16 Russia
Jul 6-8, 2016
St Petersburg, Russia
Stockholm PUG 2016/4
Aug 31, 2016
Stockholm, Sweden
Postgres Open
Sep 13-16, 2016
Dallas, USA
PGConf.EU 2016
Nov 1-Jan 4, 2016
Tallinn, Estonia

Past

Stockholm PUG 2016/2
Apr 26, 2016
Stockholm, Sweden
PGConf.US
Apr 18-20, 2016
New York, USA
pgDay Paris
Mar 31, 2016
Paris, France
Nordic PGDay
Mar 17, 2016
Helsinki, Finland
ConFoo
Feb 24-26, 2016
Montreal, Canada
More past conferences