Diffing two sets

Something I've had to do fairly often in SQL is finding the difference between two sets. A typical case which came up again this week is when you have a set of snapshots in a history table, and want to find which keys were added and removed between the two occassions. One way to do this is with a set of queries - one for what was added, one for what was removed, and possibly one for what wasn't changed (if you need that data). You can also do this with OUTER JOINs, which can often be a lot more efficient given that you can answer all your questions in one query.

So, to illustrate this, here's a simple, sample setup. It's basically a table that contains snapshots of a set of personnummer (swedish social security numbers, but it can of course be anything you want. If it's a single unique field, that makes life a lot easier). Snapshots are generated at a certain date by a scheduled taks that inserts the result of a select on a different set of tables. For this illustration, I'll just insert a couple of bogus rows of data. [HTML_REMOVED] CREATE TABLE snapshots (

snapdate timestamptz NOT NULL,

personnr char(13) NOT NULL,

CONSTRAINT pk_snapshots PRIMARY KEY (snapdate,personnr)

);

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','11111111-1111');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','22222222-2222');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','33333333-3333');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','44444444-4444');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','22222222-2222');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','33333333-3333');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','55555555-5555');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','66666666-6666');



Now, to simply get whatever rows were either added or removed between these two sets, use the following query:

<code style="white-space:pre">
SELECT 

     COALESCE(s1.personnr,s2.personnr) AS personnr,

     CASE WHEN s1.personnr IS NULL THEN 'Added' 

          ELSE 'Removed' END AS action

FROM

  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00')

    AS s1

FULL OUTER JOIN

  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00')

    AS s2

ON s1.personnr=s2.personnr

WHERE s1.personnr IS NULL OR s2.personnr IS NULL

If you want to see both the changed and the unchanged rows, remove the WHERE statement and expand the CASE like this:

[HTML_REMOVED] SELECT

 COALESCE(s1.personnr,s2.personnr) AS personnr,

 CASE WHEN s1.personnr IS NULL THEN 'Added'

      WHEN s2.personnr IS NULL THEN 'Removed'

      ELSE 'Unchanged' END AS action

FROM

(SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00')

AS s1

FULL OUTER JOIN

(SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00')

AS s2

ON s1.personnr=s2.personnr ```

If you want only changes in one direction, you should be using LEFT or RIGHT JOIN instead, and if you want just the ones not changed, INNER JOIN. You can still do it with OUTER JOIN, but it will likely be much better performing using the proper JOIN for the case.


Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

SCALE+PGDays
Mar 2-5, 2017
Pasadena, California, USA
Open Source Infrastructure @ SCALE
Mar 2, 2017
Pasadena, California, USA
Confoo Montreal 2017
Mar 8-10, 2017
Montreal, Canada
Nordic PGDay 2017
Mar 21, 2017
Stockholm, Sweden
pgDay.paris 2017
Mar 23, 2017
Paris, France
PGCon 2017
May 23-26, 2017
Ottawa, Canada

Past

FOSDEM + PGDay 2017
Feb 2-4, 2017
Brussels, Belgium
PGConf.Asia 2016
Dec 2-3, 2016
Tokyo, Japan
Berlin PUG
Nov 17, 2016
Berlin, Germany
PGConf.EU 2016
Nov 1-4, 2016
Tallinn, Estonia
Stockholm PUG 2016/5
Oct 25, 2016
Stockholm, Sweden
More past conferences