Update (2020-03-06): Following this conversation on reddit, where u/atoponce raised an issue, I updated the result to account for file renames and moves and added a section on LC_ALL=C.
Today I started building a new use case for Reliza Hub where we would match the file system digest of a deployed directory against what we have in our metadata. We do such matching via sha256 hashes.
Previously we were mostly covering docker images or archive files, where digest extraction was trivial. But this time around it’s a file system, and the sha256sum utility on Linux does not have a built-in option to compute a digest of a directory.
I first encountered this problem some time ago when we were building Reliza Hub Playground and the corresponding monorepo sample repository. The use case was to integrate this command into a GitHub Actions CI script so it would create releases of sub-projects in the monorepo only if those projects actually changed.
To do so, at CI run the script would call Reliza Hub to check whether this sha256 was already registered, and only if it was not would we create a new release. So to get the sha256 of a directory back then, I just did a quick DuckDuckGo search, which brought me to this superuser post and this askubuntu post. Switching from md5sum and sha1sum to sha256sum brought me to:
find /path/to/dir/ -type f -exec sha256sum {} \; | sha256sum
And this is what I initially used for the Reliza Hub Playground Helper project. And this worked perfectly.
However, when I started step 2 of the same workflow, where we promote the same file system to an instance and need to match it from the instance against the sha256 recorded on the Reliza Hub side, I realized with disappointment that the sha256 digest did not match when I executed the command above on the target instance. In other words, the CI build and the target instance gave me two different sha256 values for the same git code base.
Why? After some quick debugging I realized that the sha256sum output produced by the find command included file paths, and those were unsurprisingly different. To deal with that I used awk to keep only the file digests, then called sha256sum as follows:
find /path/to/dir/ -type f -exec sha256sum {} \; | awk '{print $1}' | sha256sum
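To make the effect of the paths concrete, here is a minimal sketch that hashes two temporary directories with identical content. The helper names hash_with_paths and hash_without_paths are hypothetical and used only for this illustration; each directory holds a single file, so file ordering plays no role yet.

```shell
# Hypothetical helper names, for illustration only.

# Digest over the full sha256sum output, which includes each file's path.
hash_with_paths() {
  find "$1" -type f -exec sha256sum {} \; | sha256sum | cut -d ' ' -f 1
}

# Digest over the per-file digests only (awk drops the path column).
hash_without_paths() {
  find "$1" -type f -exec sha256sum {} \; | awk '{print $1}' | sha256sum | cut -d ' ' -f 1
}

# Two directories with the same content at different locations.
a=$(mktemp -d)
b=$(mktemp -d)
echo "hello" > "$a/file.txt"
echo "hello" > "$b/file.txt"

echo "with paths:    $(hash_with_paths "$a") vs $(hash_with_paths "$b")"
echo "without paths: $(hash_without_paths "$a") vs $(hash_without_paths "$b")"

rm -rf "$a" "$b"
```

The first pair of values differs because the temporary directory names end up in the hashed stream; the second pair is identical.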
Looks good so far, but it still did not match! After another round of debugging I realized that on different machines the find command would return files in a different order.
My next attempt at a fix was based on this superuser post, which sorts by date. But it quickly turned out that since we were using git clone frequently, the dates on files did not match either, so that sort order was not universal.
The next idea I came up with was to sort by file name in dictionary order instead of by date. Surprisingly, that was also inconsistent across different Linux boxes (slightly, but enough to get the digests wrong). After further research, I discovered that LC_ALL=C comes to the rescue here.
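As a small illustration of why the locale matters, the sketch below sorts four lines under the C locale, where sort compares raw bytes. The wrapper name sort_bytewise is hypothetical. In a locale such as en_US.UTF-8 the same input typically comes out as a, A, b, B instead, which is the kind of difference that changes the final digest.

```shell
# Hypothetical wrapper: byte-order sort, independent of the machine's locale.
sort_bytewise() {
  LC_ALL=C sort
}

printf 'b\nA\na\nB\n' | sort_bytewise
# prints A, B, a, b (one per line): uppercase bytes sort before lowercase ones
```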
So in the end, after a couple of hours overall, I came up with the following solution:
- Find all files in the directory and subdirectories using find -type f
- Execute sha256sum on these files to get digests
- Use awk to only take digests from the previous command
- Sort those digests alphabetically
- Only now compute final sha256sum on sorted digests
This worked and finally gave me a universal way to compute a sha256 hash of a directory across different platforms. Here is the same thing in code:
find /path/to/dir/ -type f -exec sha256sum {} \; | awk '{print $1}' | sort -d | sha256sum | cut -d ' ' -f 1
Happy, I posted this on r/bash, but as mentioned above, u/atoponce luckily pointed out that this solution would ignore file renames or moves within the repository. He suggested a great solution:
dir=<mydir>; (find "$dir" -type f -exec sha256sum {} +; find "$dir" -type d) | LC_ALL=C sort | sha256sum
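To see the rename problem in action, here is a minimal sketch that compares the digest-only pipeline with the variant that keeps names in the hashed stream. The helper names hash_digests_only and hash_with_names are hypothetical; the pipelines themselves follow the two commands discussed above, with cut added to print only the hash.

```shell
# Hypothetical helper names, for illustration only.

# Digest-only variant: paths are dropped, so a rename is invisible.
hash_digests_only() {
  find "$1" -type f -exec sha256sum {} \; | awk '{print $1}' | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1
}

# Variant that keeps file paths and directory names in the hashed stream.
hash_with_names() {
  (find "$1" -type f -exec sha256sum {} +; find "$1" -type d) | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1
}

d=$(mktemp -d)
echo "hello" > "$d/a.txt"
before_digests=$(hash_digests_only "$d")
before_names=$(hash_with_names "$d")

mv "$d/a.txt" "$d/b.txt"   # rename the file without touching its content

after_digests=$(hash_digests_only "$d")
after_names=$(hash_with_names "$d")

[ "$before_digests" = "$after_digests" ] && echo "digest-only: rename not detected"
[ "$before_names" != "$after_names" ] && echo "with names: rename detected"

rm -rf "$d"
```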
That is great, but we still have the issue of absolute versus relative paths, with digests computed differently depending on which is used. For example, dir=/home/myuser/path/to and cd /home/myuser && dir=path/to produce different sha256 hashes. To solve this I decided to use sed with a regex, following this stackoverflow post. The final-final solution I have at the moment is:
dir=<mydir>; find "$dir" -type f -exec sha256sum {} \; | sed "s~$dir~~g" | LC_ALL=C sort -d | sha256sum
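One caveat with the sed step is that it treats $dir as a regular expression and would break on a path containing a ~ character. A possible alternative, sketched here under my own assumptions and not the command used above, is to change into the directory in a subshell so that find only ever prints paths relative to it. The function name dir_sha256 is hypothetical, and because the hashed lines look different (paths start with ./), the resulting values will not equal those from the sed-based command.

```shell
# Hypothetical helper: hash a directory using paths relative to the directory itself.
dir_sha256() {
  (
    # The subshell keeps the caller's working directory unchanged.
    cd "$1" || exit 1
    find . -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1
  )
}

# Usage: the same tree addressed by absolute and by relative path.
t=$(mktemp -d)
mkdir -p "$t/project/sub"
echo "one" > "$t/project/a.txt"
echo "two" > "$t/project/sub/b.txt"

dir_sha256 "$t/project"
(cd "$t" && dir_sha256 project)   # prints the same hash as the line above

rm -rf "$t"
```

Like the sed-based command, this sketch only hashes regular files, so empty directories do not influence the result.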
That’s it! I also published details about the actual use case on Medium.
Note: as mentioned in one of the Medium comments, this is not cross-platform friendly (for example, it may not work from Windows to Linux because of the difference in path structure).