I recently spent over a week of calendar time (maybe 2 days of actual work) before realizing that I was hitting this Postgres bug: https://bugs.launchpad.net/ubuntu/+source/postgresql-9.5/+bug/1649877 . I am documenting it here for future reference, as it was far from trivial to work out that this was the issue I had.
The weird part was that it only started happening about a month ago, after some recent Ubuntu 16.04 updates. That made it really hard to detect, since almost the same configuration had been working fine before. The issue seems to occur with some combinations of PostgreSQL and Ubuntu 16.04 (i.e. PostgreSQL 9.5.x) and depends on the Ubuntu patch level.
The fix is trivial: change the postgres user ID and group ID to values below 1000. How to do that is described well here: https://www.cyberciti.biz/faq/linux-change-user-group-uid-gid-for-all-owned-files/
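To see quickly whether a machine has the ID layout that the fix targets, something like the following Python sketch can help. It only reads the account database and changes nothing. The function names and the `ID_LIMIT` constant are my own, the account name `postgres` is assumed to be the default one, and the 1000 threshold is simply the one mentioned above.

```python
import pwd

ID_LIMIT = 1000  # threshold taken from the fix described above


def ids_need_fix(uid, gid, limit=ID_LIMIT):
    """Return True if either ID is at or above the limit."""
    return uid >= limit or gid >= limit


def lookup_ids(username="postgres"):
    """Return (uid, primary gid) of the account; raises KeyError if it is missing."""
    entry = pwd.getpwnam(username)
    return entry.pw_uid, entry.pw_gid


if __name__ == "__main__":
    try:
        uid, gid = lookup_ids()
    except KeyError:
        print("no postgres account on this machine")
    else:
        verdict = "needs the fix" if ids_need_fix(uid, gid) else "looks fine"
        print(f"postgres uid={uid} gid={gid}: {verdict}")
```

Note that `pw_gid` is the primary group of the account, which I am assuming here is the postgres group.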
Note that no processes may run as the postgres user while the user ID and group ID fix is being applied. Therefore, to ensure zero downtime, the database needs to be failed over while applying the fix.
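Before changing the IDs, it is worth double-checking that nothing is still running as the postgres user. A rough way to do that on Linux is to scan `/proc` and compare the `Uid:` line of each process status file against the postgres UID. This is only a sketch: the helper names are made up for illustration, and it assumes a standard Linux `/proc` layout.

```python
import os


def uids_from_status(status_text):
    """Return the (real, effective, saved, filesystem) UIDs from status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            return tuple(int(field) for field in line.split()[1:5])
    return None


def pids_running_as(uid, proc_root="/proc"):
    """Return a sorted list of PIDs that have the given UID in any of their UID fields."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                uids = uids_from_status(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        if uids and uid in uids:
            pids.append(int(entry))
    return sorted(pids)
```

Calling `pids_running_as(uid)` with the current postgres UID should return an empty list once the database and any other postgres-owned processes have been stopped on that node.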