Encoding issues during PostgreSQL backup restore
I needed to migrate a PostgreSQL database from version 12.x running as a docker image on an AMD64 architecture to the latest 12.4 running on ARMv7.
This is the steps I took:
- On the source machine invoke:
docker exec source-container-db-name pg_dump -C -U db-user db-name > db.dump.plain
This takes a database dump in form of plain SQL statements that should be highly compatible. What I mean is - it’s SQL standard we’re talking about here and not some binary proprietary format that might be not inconsistent between versions of the same RDBMS.
- On the target machine invoke:
cat db.dump.plain | docker exec -i target-container-db-name psql -U db-user
This will read line by line from the input file (it contains pure SQL statements as mentioned above) and invoke it in the target container.
This sounds easy but I got some weird results like this:
invalid command \tMMoiey”“o</noK4>>Siciey0scTu>”<3S”D/ttIAq<>aieaufaa:ITtIot<tene”yireD/nlrFT:/no/noieyrFT:iey%�IOODSuiey/noaht<:ITtdou/aA/noIrrr<mtnmhroe<a=>8Coccd<dnf”0QlKm<a”IIT/61T”npOrzpo.ni=LIc.aHT”ra
invalid command \aul”ctnBrpOmTn-<n1DDao
invalid command \yealLntt<eoLntnii/soeaMn”inn:cHT]ehl1neaa7ht/ct”cea0DSnItiruNeLi”“or”0ayaG”>n>03”YMetrLnt/ctnl0iinctee”eirItnmq”eon4e<,
(200 lines of similar gibberish)
ERROR: invalid byte sequence for encoding “UTF8”: 0xff
Now that last line looks interesting. Quick check of the SQl dump file format revealed the truth:
file db.dump.plain db.dump.plain: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
The endianness should not be a problem here but I have not idea where this UTF-16 came from.
In my case I was able to perform conversion from UTF-16 to UTF-8 using
iconv -f utf-16 -t utf-8 db.dump.plain > db.dump.plain-utf8
After seeding the converted file (UTF-8) into the command as previously it worked like a charm.
If you have access to the source database you can export it using explicit formatting like this:
docker exec source-container-db-name pg_dump -C -E UTF8 -U db-user db-name > db.dump.plain
Mind that if you’re invoking it on the Windows this might not work for you so you should create a file instead of using streams:
docker exec source-container-db-name pg_dump -C -E UTF8 -U db-user db-name -f db.dump.plain-utf8
You just need to now copy it from the docker machine to the host.
Hope it might help someone in the future (like me after a few months not remembering about this stuff ;-)