Revamping the web search, part 1

Over the past couple of weeks (that's how I mostly do these things - an hour here, half an hour there, over an extended period of time), I've been working on revamping the search on the website. Today we're using a custom patched version of ASPSeek backed by a PostgreSQL database. Unfortunately, it's a real pain to maintain - upstream didn't want the patches John wrote (if I understood the situation correctly), it requires a very specific version of GCC to build, and even the web interface is in C++ and thus a pain to edit for layout changes. Short story: time to look at something else.

The new solution I'm working on is based on PostgreSQL 8.2 with tsearch2 and GIN indexes. So far it's showing good performance, and very good flexibility, since you get to use metadata in the PostgreSQL database to further refine hits. Plus, the web interface can be integrated with the main site layout engine. Finally, the indexer is "context aware" and knows how to read our archives. A rough sketch of what such a setup looks like follows below.
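For the curious, here's roughly what a tsearch2 setup with a GIN index looks like on 8.2. The table and column names are invented for the example - the actual schema isn't shown in this post:

    -- A tsvector column kept in sync by the tsearch2 trigger
    CREATE TABLE webpages (
        id     serial PRIMARY KEY,
        url    text NOT NULL,
        title  text,
        body   text,
        fti    tsvector          -- maintained by the trigger below
    );

    -- tsearch2's trigger rebuilds fti from title and body on insert/update
    CREATE TRIGGER webpages_fti BEFORE INSERT OR UPDATE ON webpages
        FOR EACH ROW EXECUTE PROCEDURE tsearch2(fti, title, body);

    -- The GIN index is what makes the searches fast in 8.2
    CREATE INDEX webpages_fti_idx ON webpages USING gin(fti);

    -- Queries can combine full text search with ordinary metadata filters
    SELECT url, title
      FROM webpages
     WHERE fti @@ to_tsquery('gin & index')
     ORDER BY rank(fti, to_tsquery('gin & index')) DESC
     LIMIT 20;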

This has also taught me some bad things about the common languages/frameworks out there and their (non-)handling of encodings. Basically, the system needs to deal with multiple encodings (ISO-8859-1, UTF-8 and so on), and more specifically with files that have broken encodings (such as claiming to be UTF-8 when half the file is UTF-8 and the other half ISO-8859-1).
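To illustrate the problem (this snippet assumes the mbstring extension, which isn't mentioned in the post):

    <?php
    // "é" is 0xC3 0xA9 in UTF-8 but 0xE9 in ISO-8859-1. A file that
    // mixes both encodings is valid in neither, which is exactly
    // what the indexer keeps running into.
    $mixed = "caf\xC3\xA9 and caf\xE9";           // UTF-8 half, Latin-1 half
    var_dump(mb_check_encoding($mixed, 'UTF-8')); // bool(false)
    ?>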

Initially, my indexer implementation was in Perl using LWP and HTML::Parser. Eventually I had to abandon this completely, because I simply could not find a way to get Perl to guarantee that the output data was proper UTF-8, which is required to insert it into a PostgreSQL database with UTF8 encoding. I tried several different approaches (after all, it's Perl, so there's more than one way to do it), but they all broke one way or another.

I've currently re-implemented most of the indexer in PHP instead, and this does appear to work much better. The iconv() function actually works as advertised and can be set to always output clean UTF-8, simply ignoring broken input characters and replacing them with blanks. Initially, I was using the Tidy extension for PHP to parse the HTML, but I had to give that up because of its insane memory leaks (such as eating up a gigabyte of memory after indexing fewer than 10,000 pages - and I need to index more than 500,000). There's also a bug, in PHP 5.1.x at least, where strtotime() causes a core dump, but it appears to be fixed in 5.2.
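The post doesn't show the actual call, but the cleanup step looks something like this - a minimal sketch where the function name is made up, using iconv()'s //IGNORE suffix (which drops invalid byte sequences rather than blanking them out):

    <?php
    // Force whatever we fetched into clean UTF-8. //IGNORE tells
    // iconv() to drop byte sequences that are invalid in the source
    // encoding instead of aborting the conversion.
    function clean_utf8($data, $claimed = 'UTF-8') {
        $out = @iconv($claimed, 'UTF-8//IGNORE', $data);
        if ($out === false) {
            // If the claimed encoding is completely bogus, fall back
            // to ISO-8859-1, from which every byte converts cleanly.
            $out = iconv('ISO-8859-1', 'UTF-8', $data);
        }
        return $out;
    }
    ?>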

The current version uses preg_match() with a couple of fairly simple regexps, and this appears to be working much better. It also performs significantly better than the Perl version, because all the "heavy duty" work happens in C code linked into PHP, and not in interpreted code.
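The actual patterns aren't shown in this post, but regexp-based extraction along these lines is what's meant - an illustrative sketch only:

    <?php
    // Pull out the title, then reduce the rest of the page to plain
    // text suitable for feeding into tsearch2.
    function extract_page($html) {
        $title = '';
        if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
            $title = trim($m[1]);
        }
        // Drop scripts and styles wholesale, then strip remaining tags
        $body = preg_replace('/<(script|style)[^>]*>.*?<\/\1>/is', ' ', $html);
        $body = strip_tags($body);
        // Collapse whitespace so stray markup doesn't become lexemes
        $body = trim(preg_replace('/\s+/', ' ', $body));
        return array($title, $body);
    }
    ?>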

There are still some issues with the PHP indexer, but it looks a lot better. Will keep posting more info when I have it :-)

