Common Sense Development

pg-trgm and Text Similarity Search

When browsing the default Postgres extensions (a perfectly normal activity), I discovered pg_trgm. After a bit more reading, I'm convinced this simple database extension should be an essential part of any database engineer's toolkit.

From the documentation:

The pg_trgm module provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.

Trigram matching is its own algorithm, but in short it breaks strings into three-character chunks and counts how many chunks are shared.

You can either compare two strings directly:

select similarity('shiori novella', 'murasaki shion') as similarity;
similarity
-----------
0.153846

Or find substrings within a larger text:

select word_similarity ('potato', 'two potatos for the harvest festival') as word_similarity;
word_similarity
--------------
0.857143

The real power comes when you start filtering rows. This is much more flexible than something like an WHERE name ILIKE 'regex', but without the added complications of full-text search.

select * from users where name % 'John';

The exact threshold that determines when a string is "matching" is controlled from the Postgres configuration. Or you could put the calculated 'distance' value in a column and manipulate it directly.

select *, name <-> 'John' as john_distance from users;

Name search that can handle misspellings, finding mentions of a specific word that accounts for things like irregular plurals or prefixes...

If you've ever worked in a system where you searched for "tickets" but the thing you actually needed was labeled "tocket", then you need more flexible search. But if you don't want to invest in true full-text search, this is a simple alternative.

If you're already running Postgres, you have all of this built-in. Just enable the extension and experiment!

For more reading, check out the official documentation.