Encoding issues during PostgreSQL backup restore

I needed to migrate a PostgreSQL database from version 12.x running as a docker image on an AMD64 architecture to the latest 12.4 running on ARMv7.

This is the steps I took:

  1. On the source machine invoke:
    docker exec source-container-db-name pg_dump -C -U db-user db-name > db.dump.plain
    

This takes a database dump in form of plain SQL statements that should be highly compatible. What I mean is - it’s SQL standard we’re talking about here and not some binary proprietary format that might be not inconsistent between versions of the same RDBMS.

  1. On the target machine invoke:
cat db.dump.plain | docker exec -i target-container-db-name psql -U db-user

This will read line by line from the input file (it contains pure SQL statements as mentioned above) and invoke it in the target container.

The Problem

This sounds easy but I got some weird results like this:

invalid command \tMMoiey”“o</noK4>>Siciey0scTu>”<3S”D/ttIAq<>aieaufaa:ITtIot<tene”yireD/nlrFT:/no/noieyrFT:iey%�IOODSuiey/noaht<:ITtdou/aA/noIrrr<mtnmhroe<a=>8Coccd<dnf”0QlKm<a”IIT/61T”npOrzpo.ni=LIc.aHT”ra
invalid command \aul”ctnBrpOmTn-<n1DDao
invalid command \yealLntt<eoLntnii/soeaMn”inn:cHT]ehl1neaa7ht/ct”cea0DSnItiruNeLi”“or”0ayaG”>n>03”YMetrLnt/ctnl0iinctee”eirItnmq”eon4e<,

(200 lines of similar gibberish)

ERROR: invalid byte sequence for encoding “UTF8”: 0xff

Solution

Now that last line looks interesting. Quick check of the SQl dump file format revealed the truth:

file db.dump.plain
db.dump.plain: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

The endianness should not be a problem here but I have not idea where this UTF-16 came from.

In my case I was able to perform conversion from UTF-16 to UTF-8 using iconv:

 iconv -f utf-16 -t utf-8 db.dump.plain > db.dump.plain-utf8

After seeding the converted file (UTF-8) into the command as previously it worked like a charm.

Another Solution

If you have access to the source database you can export it using explicit formatting like this:

docker exec source-container-db-name pg_dump -C -E UTF8 -U db-user db-name > db.dump.plain

Mind that if you’re invoking it on the Windows this might not work for you so you should create a file instead of using streams:

docker exec source-container-db-name pg_dump -C -E UTF8 -U db-user db-name -f db.dump.plain-utf8

You just need to now copy it from the docker machine to the host.

Hope it might help someone in the future (like me after a few months not remembering about this stuff ;-)