I've got into a situation with PG I've never been into before. There is a financial reports table, containing some description of a transaction, with columns like date
, amount
and comment
. And this comment
field is often used to search for something case-insensitively. This is done best using where lower(comment) like '%some words%'
using trigram index:
create index report_comment_lower_trgm on report
using gin (lower(comment) gin_trgm_ops);
But I've been looking for one concrete thing and couldn't find, even though I knew it's there:
finreport=# select count(*) from report where lower(comment) like '%кредиторськ%';
count
-------
0
Yet, it turns out that just looking for a case-sensitive version works just fine:
finreport=# select count(*) from report where comment like '%Кредиторськ%';
count
-------
40
What is even going on? After reading some articles, it seems like the main reason for this is an incorrect collation, so I went to check:
finreport=# select datcollate, datctype from pg_database where datname = 'asd';
datcollate | datctype
------------+----------
C | C
Wow, well, obviously C
is not very correct when I'm looking at Ukrainian. Okay, is there a way to update collation? Some answers on Stack Overflow propose to just update pg_database
, which is immediately did. Any guesses what happens next?
finreport=# select count(*) from report where lower(comment) like '%кредиторська%';
count
-------
11
finreport=# select count(*) from report where lower(comment) like '%кредиторська%';
count
-------
3
finreport=# select count(*) from report where lower(comment) like '%кредиторська%';
count
-------
7
I had not expected this! Well, I did all the things you do when things are going wrong: re-created index, analyzed table, restarted the database — and the last one finally changed something! It started giving me 0 rows back again. :)
Solution
Recreate database anew, so just dump and restore:
# stop all the things which can change db here
$ pg_dump finreport > finreport.dump
# Postgres complained about "template" having other collation
# and proposed using "template0". I have no idea what's the
# difference, but it worked.
$ createdb -l uk_UA.UTF-8 -E utf-8 --template=template0 qwe
$ psql qwe -f finreport.dump
# renaming databases
$ psql postgres -c "select pg_terminate_backend(pid) from pg_stat_activity where datname = 'finreport'"
$ psql postgres -c "alter database finreport rename to asd"
$ psql postgres -c "alter database qwe rename to finreport"
# don't forget to start all the stuff here and drop asd
This obviously resets everything, and it now works as intended.
I'm uncertain if various UTF-8 collations will be different. I hope not because in other case I have no idea how to deal with multiple languages in a single Postgres db.