Database or schema

A couple of weeks back I helped a guy out with a fairly large PostgreSQL installation, with several things. One of them prompted me to write this post. A fairly common question is "should I use databases or schemas to isolate my users". A fairly common answer is "schemas is the same as databases in other products", which isn't actually true (at least not for all of them). Schemas in PostgreSQL can be very flexible though - much more so than in say SQL Server, where a schema and a user is pretty much the same thing. But there are reasons for using databases, and reasons for using schemas. This was a textbook example on why to use schemas instead of databases.

The system they run has the database hosted in on their server, and a bunch of clients of different kinds (windows fat clients connecting directly and mobile devices connecting through an appserver). The system was set up with one database for each client, and all connecting using the postgres superuser account (oops. but that's a different thing).

An interesting point in this was that apart from the dedicated information for each client, there was a shared address table that had about 300Mb (IIRC) of data in it. This was 99.9%25 the same data between all clients, but each client had the ability to add extra rows to the table.

One of the things I ended up doing with them was to change the model to a single database shared by all clients, with one schema for each client. By giving each client their own account instead of using the superuser account, almost no changes were necessary in the client application - they would automatically see the tables in their own schema (since it's named the same as the account), but have no access to the data from other clients. There were several advantages gained by this:

  • Less overhead, since there were much fewer databases.
  • The ability to do cross-schema queries when needed, to perform cross-client statistical calculations (not by the clients themselves, but from the provider)
  • Less databases to remember to back up ;)

With that step done, the next step was very logical. We moved the address table to the public schema, and replaced it with a view in each of the client schemas. This view was a UNION ALL of the address table in the public schema, and a table called address_local that lived in each clients schema. This table was populated with all the entires the clients had added to their local copy. A couple of RULEs on the view would redirect modifications to the view to hit the local table (since the clients only had SELECT permissions on the shared table, and they really shouldn't modify that). Later on we changed the application to write directly to the address_local table, purely for performance reasons. This brought in several more advantages:

  • There is now only one copy of this large, shared table in the database. That greatly increases the probability of it staying in the cache, as well as reduces the risk that a scan on this table will push other tables out of the cache.
  • The address data is updated occasionally from an external source. Now it only has to be updated once, which will affect all clients
  • Backups are a lot smaller since the address data is only backed up once.

These are all very good arguments for using schemas instead of databases. But what are the downsides?

  • You can't use pg_hba.conf to restrict database access. In this case, that wasn't a reason, since they were all using the same account before. But it's a feature that will no longer work.
  • LISTEN/NOTIFY is per database. So the name of the NOTIFYs must change to include the name of the client
  • Restoring a single client can be somewhat more complex

I'm sure there are other arguments on both the positive and the negative side, but those are the ones we came across.


Comments

Thanks so much! Your setup helped me out a ton when figuring out how to set up a shared database for an application that wasn't originally designed for multiple users.

Posted on Nov 13, 2009 at 19:21 by J. Mar.