Custom synonym dictionaries in tsearch2

(this could be considered part 4 of the PostgreSQL website search update)

Custom synonym dictionaries is not actually new functionality for 8.2 - I think it's actually been around since 7.4 or so. But it's definitly been a function I did not know about, and it can be really useful in scenarios when you know some specifics about your data. In general, selecting and tuning your dictionaries in tsearch2 can give a big boost to the accuracy of your searches, so it's definitly worth looking into.

The concept of custom synonym dictionaries is really simple (and so is the implementation) - you get to define synonyms for certain words that all the tsearch2 functions will then apply to both indexing and searching, including the headline() function. For example, how many people incorrectly call PostgreSQL "postgre"? Answer: way too many. But that doesn't mean we shouldn't help them by providing whatever search matches we can. Also, some people refer to it as pgsql, and why should the user need to know that? [HTML_REMOVED] So how is this done in tsearch2? With the dict_syn dictionary it's very simple, and this dictionary is built and installed by default as long as you install tsearch2 (from source - I assume it's in all binary packages as well, since there is no config switch to turn it off automaticaly).

The first step is to create the dictionary file itself. It's a simple textfile (though you could keep it in the database and update it with a trigger in for example pl/perlu if you wanted to) with two whitespace separated fields listing the word and it's synonym. For the websearch, I started out with the following, a simple set of synonyms for the word postgres (all words are case-insensitive at this point):

[HTML_REMOVED] postgresql postgres

pgsql postgres

postgre postgres



I chose to store this along with other tsearch2 files in *<pgdir>/share/contrib/pg_dict*, but you can of course store this anywhere you like as long as the backend can read the file.



With this done, you need to add the dictionary to the list of available ones in tsearch2, using a command like this (this assumes you have installed tsearch2 in your database already):

<code style="white-space:pre">
INSERT INTO pg_ts_dict (SELECT 'pg_dict',dict_init,

 '/usr/local/pgsql/share/contrib/pg_dict',

 dict_lexize, 'pg-specific dictionary'

 FROM pg_ts_dict WHERE dict_name='synonym'

);

At this time, you should be able to test the new dictionary, using the lexize function. Notice how any word defined in the dictionary returns the synonym, in this case postgres. [HTML_REMOVED] search=# select lexize('pg_dict','postgresql'),lexize('pg_dict','pgsql');

lexize | lexize

------------+------------

{postgres} | {postgres}

(1 row)



Finally, you need to modify the tsearch2 configuration for it. Normally you'll probalby be working off a custom configuration, but if you're working off the default one, the command below will activate the *pg_dict* dictionary along with the english stemmer for the default config:
<code style="white-space:pre">
UPDATE pg_ts_cfgmap SET dict_name='{pg_dict,en_stem}'

 WHERE tok_alias IN ( 'lword', 'lhword', 'lpart_hword' )

  AND ts_name='default';

With this done, you should now be able to verify that the system works properly by converting a string to a tsvector:

[HTML_REMOVED] search=# SELECT to_tsvector('postgresql');

to_tsvector


'postgres':1

(1 row) ```

And that's all there is to it. You can easily modify the dictionary by just editing the file. Note that you will need to re-index your tables (or more specifically, re-create the contents of the tsvector columns) if you change the dictionary and want those changes to be usable to search data. Also, the contents of the synonym dictionary is cached the first time it's used in a backend, so if you're using connection pooling you may need to forcibly close your connections in order for the changes to show up.

Finally, a big thanks to Oleg Bartunov for helping me sort out this configuration stuff in the first place, and the rapid fixing of the couple of issues I came across during the testing of 8.2.


Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

Stockholm PUG 2016/4
Aug 31, 2016
Stockholm, Sweden
Postgres Open
Sep 13-16, 2016
Dallas, USA
Postgres Vision 2016
Oct 11-Nov 13, 2016
San Francisco, USA
PGConf.EU 2016
Nov 1-Jan 4, 2016
Tallinn, Estonia
PGConf.Asia 2016
Dec 2-3, 2016
Tokyo, Japan

Past

PG Day'16 Russia
Jul 6-8, 2016
St Petersburg, Russia
PGDay UK 2016
Jul 05, 2016
London, UK
Stockholm PUG 2016/3
Jun 16, 2016
Stockholm, Sweden
PGCon
May 17-21, 2016
Ottawa, Canada
Pycon Sweden
May 9-10, 2016
Stockholm, Sweden
More past conferences