Pop Quiz: How do you verify that one filesystem is a subset of another?

So, this is an interesting problem which my fiancee gave me recently, by killing off my X session in the middle of an 'mv' operation which was repopulating my RAID array.

I was left with an interesting state of affairs - I had a bunch of files in one directory, and a bunch of files in the other. I used rsync to ensure that all the files still on the storage volume had been moved over to the RAID array, but something made me want a bit more confirmation before deciding to just delete 316GB of material.

First, I knew that I could count on rsync to have verified the checksum of every file it transferred. So the challenge became: how was I going to verify that for every file X on 'smallvol', there is a corresponding file X on 'largevol'?

If I expected the two filesystems to be mirrors of each other, this would be easy: just do two 'find .' operations, run each through sort (to be sure they wind up in the same order), and then diff them against each other. But in this case, the contents of 'largevol' are hopefully a superset of 'smallvol'.

One catch: you aren't allowed to grep the diff for '<' entries. We're assuming a slightly brain-damaged egrep / shell combination that likes to mistake that for an attempt at redirection instead. Alternatively, you may assume that you're on an old Solaris box whose grep doesn't like regular expressions.
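For reference, the easy mirror-image check described above might look something like the sketch below. The helper names list_tree and same_tree are made up for illustration, and it assumes mktemp is available and that no filename contains a newline.

```shell
#!/bin/sh
# Sketch only: list_tree and same_tree are hypothetical helper names.

list_tree() {
    # Print every path under "$1", relative to it, in a stable byte order.
    ( cd "$1" && find . | LC_ALL=C sort )
}

same_tree() {
    # Exit status 0 if both trees contain the same set of paths, 1 if not.
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    list_tree "$1" > "$a"
    list_tree "$2" > "$b"
    diff "$a" "$b"          # prints the differing paths, if any
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Calling same_tree /smallvol /largevol would print any paths that differ. This only compares names, not contents, and it is no help with the superset case, which is the actual quiz.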

Hints and the solution follow below.

Hint: Commands used

'find' (from the top level of both volumes), 'sort', 'uniq', and 'grep'

Hint: Math Trick

Think in binary. And remember that the output of your 'find' is just a text file.


Solution
  • cd /largevol;find . >/tmp/filelist - here, we create a list of the files which appear in largevol.
  • cd /smallvol;find . >>/tmp/filelist;find . >>/tmp/filelist - here, we add the files which appear in smallvol, to /tmp/filelist. Twice! There's a reason for this.
  • sort /tmp/filelist | uniq -c >/tmp/counts - now, we sort all the lines in order (so that uniq can process them correctly) and get a count of the lines.
  • grep -E '^[[:space:]]*2[[:space:]]+' /tmp/counts - this prints every file which appears exactly twice, meaning it exists only in smallvol. A count of one means the file appears only in largevol (which we hope is a superset of smallvol), and a count of three means it appears in both smallvol and largevol. If nothing is printed, every file on smallvol is also on largevol. If you're running the old Solaris box I mentioned above, you can search for ' 2 ' and the only adverse effect will be that files with ' 2 ' in them (common with media files) will show up in the results too.
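
The steps above can be rolled into one small function. This is only a sketch: missing_from is a made-up name, it uses a mktemp file instead of /tmp/filelist, and the sed stage that strips the count column is an addition of mine. Like the original, it assumes no filename contains a newline.

```shell
#!/bin/sh
# Sketch only: missing_from is a hypothetical helper name.
# Usage: missing_from SMALL LARGE
# Prints the paths that exist under SMALL but not under LARGE.

missing_from() {
    list=$(mktemp) || return 2
    ( cd "$2" && find . ) >  "$list"    # weight 1: path exists in LARGE
    ( cd "$1" && find . ) >> "$list"    # weight 2: path exists in SMALL,
    ( cd "$1" && find . ) >> "$list"    #           so it is listed twice
    # After counting: 1 = LARGE only, 2 = SMALL only, 3 = both.
    LC_ALL=C sort "$list" | uniq -c |
        grep -E '^[[:space:]]*2[[:space:]]' |
        sed 's/^[[:space:]]*2[[:space:]]//'   # drop the count column
    rm -f "$list"
}
```

Running missing_from /smallvol /largevol and getting no output would mean every path on smallvol also exists on largevol. As with the list above, this compares names only; it says nothing about file contents.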

-- SeanNewton - 08 Jan 2008

Topic revision: r1 - 2008-01-09 - SeanNewton