solovyov.net

PostgreSQL collation

4 min read · postgresql

I've got into a situation with PG I've never been into before. There is a financial reports table, containing some description of a transaction, with columns like date, amount and comment. And this comment field is often used to search for something case-insensitively. This is done best using where lower(comment) like '%some words%' using trigram index:

create index report_comment_lower_trgm on report 
  using gin (lower(comment) gin_trgm_ops);

But I've been looking for one concrete thing and couldn't find, even though I knew it's there:

finreport=# select count(*) from report where lower(comment) like '%кредиторськ%';
 count 
-------
     0

Yet, it turns out that just looking for a case-sensitive version works just fine:

finreport=# select count(*) from report where comment like '%Кредиторськ%';
 count 
-------
    40

What is even going on? After reading some articles, it seems like the main reason for this is an incorrect collation, so I went to check:

finreport=# select datcollate, datctype from pg_database where datname = 'asd';
 datcollate | datctype 
------------+----------
 C          | C

Wow, well, obviously C is not very correct when I'm looking at Ukrainian. Okay, is there a way to update collation? Some answers on Stack Overflow propose to just update pg_database, which is immediately did. Any guesses what happens next?

finreport=# select count(*) from report where lower(comment) like '%кредиторська%';
 count 
-------
    11

finreport=# select count(*) from report where lower(comment) like '%кредиторська%';
 count 
-------
     3

finreport=# select count(*) from report where lower(comment) like '%кредиторська%';
 count 
-------
     7

I had not expected this! Well, I did all the things you do when things are going wrong: re-created index, analyzed table, restarted the database — and the last one finally changed something! It started giving me 0 rows back again. :)

Solution

Recreate database anew, so just dump and restore:

# stop all the things which can change db here
$ pg_dump finreport > finreport.dump
# Postgres complained about "template" having other collation
# and proposed using "template0". I have no idea what's the
# difference, but it worked.
$ createdb -l uk_UA.UTF-8 -E utf-8 --template=template0 qwe
$ psql qwe -f finreport.dump

# renaming databases
$ psql postgres -c "select pg_terminate_backend(pid) from pg_stat_activity where datname = 'finreport'"
$ psql postgres -c "alter database finreport rename to asd"
$ psql postgres -c "alter database qwe rename to finreport"
# don't forget to start all the stuff here and drop asd

This obviously resets everything, and it now works as intended.

I'm uncertain if various UTF-8 collations will be different. I hope not because in other case I have no idea how to deal with multiple languages in a single Postgres db.

If you like what you read — subscribe to my Twitter, I always post links to new posts there. Or, in case you're an old school person longing for an ancient technology, put a link to my RSS feed in your feed reader (it's actually Atom feed, but who cares).

Other recent posts

Server-Sent Events (SSE), but with POST
ngrok for the wicked, or expose your ports comfortably
History Snapshotting in TwinSpark
Code streaming: hundred ounces of nuances