Manually comparing and synchronizing two folders can be tedious. Add long, confusing and very similar filenames and it’s no fun at all.
We recently faced a similar situation at work. Besides cryptic names, there was also a fair share of twisted logic governing the sync scenarios. We had to get our hands dirty, since the standard tools were useless.
Because we don’t have to do the sync often, and the folders have always been of reasonable size, automating the process seemed like overkill. However, as our products grow, so do the folders needing sync. We recently decided that spending some time on a script would be a good investment for the future.
This post builds upon ideas I came across while writing the script that were ultimately left out or done differently. It concludes with a simple tool capable of syncing two folders. The code was written using Python 2.7.6.
Backing up and reverting
Let’s assume nothing is under version control. It would be cool if our script had a revert mechanism of sorts, in case something funky happens during sync. To keep it simple, maybe a directory snapshot is enough:
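A minimal sketch of such a snapshot could look like the following. The names backup, BACKUP and TAG_DB, the zip format and the layout of the tags table are assumptions made for illustration, and the sketch sticks to standard-library calls that should exist in both Python 2.7 and 3.

```python
import os
import shutil
import sqlite3
import time
import uuid

BACKUP = ".backup"  # hypothetical name of the marker file kept in the directory
TAG_DB = "tags.db"  # hypothetical SQLite file kept inside REPO


def backup(directory, repo, tag=None):
    """Snapshot `directory` into `repo` and return the archive's file name."""
    repo = os.path.abspath(repo)
    if not os.path.isdir(repo):
        os.makedirs(repo)

    # A timestamp plus a random suffix keeps archive names unique.
    stem = "{0}-{1}".format(time.strftime("%Y%m%d-%H%M%S"), uuid.uuid4().hex[:8])
    # make_archive appends ".zip" and returns the path of the file it wrote.
    # The current BACKUP file, if any, is archived along with everything else.
    archive = os.path.basename(
        shutil.make_archive(os.path.join(repo, stem), "zip", directory))

    if tag is not None:
        conn = sqlite3.connect(os.path.join(repo, TAG_DB))
        with conn:  # commits on success, rolls back on error
            conn.execute("CREATE TABLE IF NOT EXISTS tags "
                         "(tag TEXT PRIMARY KEY, directory TEXT, archive TEXT, "
                         "created TEXT DEFAULT CURRENT_TIMESTAMP)")
            conn.execute("INSERT OR REPLACE INTO tags (tag, directory, archive) "
                         "VALUES (?, ?, ?)",
                         (tag, os.path.abspath(directory), archive))
        conn.close()

    backup_path = os.path.join(directory, BACKUP)
    with open(backup_path, "w") as marker:
        marker.write(archive)
    return archive
```

The sketch assumes REPO lives outside the directory being archived; otherwise every snapshot would swallow the earlier ones.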
It seems dense, but it’s easy to explain. First, we create a repository folder, or REPO, if it doesn’t exist yet. It is where the backup snapshots live. We then save the archived directory under a unique name to REPO. Tag is optional, but if given, we’ll be able to reference the archived file with it. The name is written into backup_path and saved to directory. Let’s call it the BACKUP file. It’s there to help us revert:
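Here is a deliberately naive sketch of the revert step, assuming snapshots are zip files stored in REPO. The function name revert_to and its parameters are hypothetical; the archive name is passed in explicitly so the function does not care where it came from.

```python
import os
import shutil
import zipfile


def revert_to(directory, repo, archive_name):
    """Replace the content of `directory` with the snapshot `archive_name`."""
    archive = os.path.join(repo, archive_name)
    if not zipfile.is_zipfile(archive):
        raise ValueError("no usable archive at {0}".format(archive))

    # Clear the directory first, so files created after the snapshot disappear.
    shutil.rmtree(directory)
    os.makedirs(directory)

    # Extract without inspecting member names; fine for a sketch only.
    with zipfile.ZipFile(archive) as snapshot:
        snapshot.extractall(directory)
```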
In order to revert, we need to know the archived file to revert to. This is written in the BACKUP file, which should be part of directory. get_archive_name helps us extract it:
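One possible shape for get_archive_name, assuming the BACKUP file is called .backup (a made-up name) and holds nothing but the archive's file name:

```python
import os

BACKUP = ".backup"  # hypothetical name of the marker file


def get_archive_name(directory):
    """Return the archive name stored in the directory's BACKUP file."""
    backup_path = os.path.join(directory, BACKUP)
    if not os.path.isfile(backup_path):
        raise IOError("{0} has never been backed up".format(directory))
    with open(backup_path) as marker:
        name = marker.read().strip()
    # The marker should hold a bare file name, never a path.
    if not name or os.path.basename(name) != name:
        raise ValueError("malformed BACKUP file in {0}".format(directory))
    return name
```

Rejecting anything that looks like a path keeps a tampered BACKUP file from pointing outside REPO.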
Note that any backed-up state is reachable by a sequence of reverts, as long as REPO contains the appropriate archive:
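The walk-through below illustrates the chain with a condensed, self-contained stand-in for the real functions. It assumes that each archive also captures the BACKUP file that was current when the snapshot was taken, which is what lets every revert hand over a pointer to the snapshot before it. The helper names and the .backup file name are made up for the example.

```python
import os
import shutil
import tempfile
import zipfile

BACKUP = ".backup"  # hypothetical name of the marker file


def snapshot(directory, repo, name):
    archive = shutil.make_archive(os.path.join(repo, name), "zip", directory)
    with open(os.path.join(directory, BACKUP), "w") as marker:
        marker.write(os.path.basename(archive))


def revert(directory, repo):
    with open(os.path.join(directory, BACKUP)) as marker:
        archive = os.path.join(repo, marker.read().strip())
    shutil.rmtree(directory)
    os.makedirs(directory)
    with zipfile.ZipFile(archive) as snap:
        snap.extractall(directory)


def write(path, text):
    with open(path, "w") as handle:
        handle.write(text)


def read(path):
    with open(path) as handle:
        return handle.read()


root = tempfile.mkdtemp()
work, repo = os.path.join(root, "work"), os.path.join(root, "repo")
os.makedirs(work)
os.makedirs(repo)
note = os.path.join(work, "note.txt")

write(note, "v1")
snapshot(work, repo, "first")   # first.zip holds v1 and no BACKUP file
write(note, "v2")
snapshot(work, repo, "second")  # second.zip holds v2 and BACKUP -> first.zip
write(note, "v3")

revert(work, repo)              # back to v2; BACKUP now points at first.zip
after_one_revert = read(note)
revert(work, repo)              # back to v1; no BACKUP left, the chain ends
after_two_reverts = read(note)
```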
We should also be able to revert using tags. Here’s one way to do it:
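As a sketch, assuming tags live in an SQLite table called tags with the columns tag, directory and archive (a hypothetical schema), the lookup could be as small as this. The restore argument stands for whatever function puts an archive back in place; its signature here is an assumption too.

```python
import sqlite3


def archive_for_tag(conn, tag):
    """Return (directory, archive) recorded for `tag`; KeyError if unknown."""
    row = conn.execute(
        "SELECT directory, archive FROM tags WHERE tag = ?", (tag,)
    ).fetchone()
    if row is None:
        raise KeyError("unknown tag: {0}".format(tag))
    return row


def revert_by_tag(conn, repo, tag, restore):
    """Look the tag up and hand the result to `restore(directory, repo, archive)`."""
    directory, archive = archive_for_tag(conn, tag)
    restore(directory, repo, archive)
```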
My initial thought was to serialize and de-serialize a dictionary, but performance would degrade quickly. Even with a bit of SQL, I’d argue the above is quite concise.
It’s also quite easy to show tag history of a directory:
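Under the same assumption of a tags table, this time with an extra created column holding a timestamp, the history boils down to a single query. The table layout and the function name are hypothetical.

```python
import sqlite3


def tag_history(conn, directory):
    """List (tag, archive, created) rows for `directory`, oldest first."""
    rows = conn.execute(
        "SELECT tag, archive, created FROM tags "
        "WHERE directory = ? ORDER BY created, rowid",
        (directory,),
    )
    return rows.fetchall()
```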
There are a few things to consider when reverting:
- we’re blindly extracting archives without prior inspection. An archive member could end up outside of the target path, e.g. when its name starts with two dots. This could be a security hazard.
- someone could back up the root directory, then try to recover it and happily wipe out the hard drive, since we don’t check which directory we’re clearing.
- if the directory is a symbolic link, shutil.rmtree(directory) will throw an exception.
The issues are somewhat easily fixable and might be a good exercise to try out.
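For the curious, here is one way the checks could be approached. Both helpers are hypothetical and intentionally conservative; they are meant to run before any extraction or clearing takes place.

```python
import os


def is_inside(directory, member_name):
    """True if extracting `member_name` would stay within `directory`."""
    root = os.path.realpath(directory)
    target = os.path.realpath(os.path.join(root, member_name))
    return target == root or target.startswith(root + os.sep)


def check_clearable(directory):
    """Raise ValueError for directories that should never be wiped."""
    if os.path.islink(directory):
        raise ValueError("refusing to clear a symbolic link")
    real = os.path.realpath(directory)
    if os.path.dirname(real) == real:  # only true for a filesystem root
        raise ValueError("refusing to clear a filesystem root")
```

A revert would then call check_clearable(directory) before removing anything, and skip or reject every archive member for which is_inside returns False.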
Finding and applying the differences
Finding the differences between two directories couldn’t be simpler:
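A sketch of the idea, assuming the config files are named .backup and .sync (made-up names) and that differences inside common subfolders should be reported as well:

```python
import filecmp
import os

CONFIG_FILES = [".backup", ".sync"]  # hypothetical names of the BACKUP and SYNC files


def diff(src, dst):
    """Return paths, relative to `src`, that are new or changed compared to `dst`."""
    if not os.path.isdir(dst):
        # Nothing to compare against: everything in src counts as a difference.
        return sorted(n for n in os.listdir(src) if n not in CONFIG_FILES)
    return sorted(_walk(filecmp.dircmp(src, dst, ignore=CONFIG_FILES), ""))


def _walk(node, prefix):
    # left_only: present only in src; diff_files: present in both, but different.
    for name in node.left_only + node.diff_files:
        yield os.path.join(prefix, name)
    # Descend into folders that exist on both sides.
    for name, sub in node.subdirs.items():
        for path in _walk(sub, os.path.join(prefix, name)):
            yield path
```

One caveat: dircmp compares files shallowly, so two files with the same type, size and modification time are treated as equal without their contents being read.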
If dst does not exist, then the difference is the src directory content. Otherwise, there’s a handy module we can use: filecmp. It contains a class, dircmp, that does exactly what we need: it finds all the differences between two folders.
We’re interested in files or folders only in src, or common files that differ. We also don’t want to copy any of the config files, so we filter them out.
This is how to apply the differences:
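The sketch below assumes the differences arrive as paths relative to src, and that backup is an optional callable taking the destination directory. Both the function name and that interface are hypothetical.

```python
import os
import shutil


def apply_diff(src, dst, paths, backup=None):
    """Copy each relative path in `paths` from `src` into `dst`."""
    if os.path.isdir(dst):
        if backup is not None:
            backup(dst)  # snapshot first, so the copy can be undone
    else:
        os.makedirs(dst)  # nothing to back up yet

    for rel in paths:
        source = os.path.join(src, rel)
        target = os.path.join(dst, rel)
        if os.path.isdir(source):
            # Expected only for folders that are missing in dst.
            shutil.copytree(source, target)
        else:
            parent = os.path.dirname(target)
            if not os.path.isdir(parent):
                os.makedirs(parent)
            shutil.copy2(source, target)  # overwrites a changed file
```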
The code speaks for itself. The point to note is the backup we perform before any copying is done. This enables us to revert if something goes sour.
Syncing the same folders over and over and over…
Sometimes, you know beforehand the folders you need to sync. For example, you know that folder A will always have to be synced with folders B and C. This is where the SYNC file comes into play. It contains one or more source folders, each listed on a separate line.
In the example above, folder A should contain the SYNC file with the following content:
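With made-up locations for the two source folders, the file would hold nothing more than this:

```
/path/to/B
/path/to/C
```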
Then, all we need to do is sync the directory containing the SYNC file:
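A small sketch of that step, assuming the SYNC file is named .sync and that copy_differences is any callable taking a source and a destination. Both are placeholders rather than the names used in the actual script.

```python
import os

SYNC = ".sync"  # hypothetical name of the SYNC file


def read_sources(directory):
    """Return the source folders listed in the directory's SYNC file."""
    with open(os.path.join(directory, SYNC)) as sync_file:
        lines = [line.strip() for line in sync_file]
    return [line for line in lines if line]  # skip blank lines


def sync(directory, copy_differences):
    """Bring `directory` up to date with every source in its SYNC file."""
    for src in read_sources(directory):
        copy_differences(src, directory)
```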
As you can see, it’s as straightforward as opening the SYNC file, reading the sources, and then applying the differences.
Of course, we should provide a means to generate such a file:
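Generating the file can be equally short. In this sketch the name .sync, the function name and the decision to store absolute paths are all assumptions.

```python
import os

SYNC = ".sync"  # hypothetical name of the SYNC file


def make_sync_file(directory, sources):
    """Write a SYNC file into `directory`, listing `sources` one per line."""
    missing = [src for src in sources if not os.path.isdir(src)]
    if missing:
        raise ValueError("not a directory: {0}".format(", ".join(missing)))
    with open(os.path.join(directory, SYNC), "w") as sync_file:
        for src in sources:
            # Absolute paths keep the file valid wherever the tool is run from.
            sync_file.write(os.path.abspath(src) + "\n")
```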
To wrap what we’ve done in a simple utility tool, we should create a command line interface, so the user can interact with it. The argparse module makes this simple. Before turning to code, here’s what the user should be able to do:
1. Copy different files from one directory to another
cp -t sample_tag /source/path/dir /destination/path/dir
The command requires a src directory and a dst directory, where dst will be synced with src. Tag is optional.
2. Revert a directory or tag
rv -t sample_tag or rv -d /random/path/dir
3. Sync a directory
sync -t just_in_case /random/dir/path
/random/dir/path should contain the SYNC file. Tag is optional.
4. Create a SYNC file
mksync dir/path/where/sync/is/created /fst/src /snd/src /trd/src
Creates a SYNC file in the first specified directory. Any directory listed afterwards is added to the source list.
5. Show tag history
It shows the available revert tags.
Here’s the above in code:
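A parser along these lines would cover the commands listed above. Treat it as a sketch: the program name, the long option names and the destination names are assumptions, and since the command for showing tag history is not spelled out in the list, tags is used here purely as a stand-in.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="foldersync")  # hypothetical name
    commands = parser.add_subparsers(dest="command")

    cp = commands.add_parser("cp", help="copy differing files from src to dst")
    cp.add_argument("-t", "--tag", help="optional tag for the backup of dst")
    cp.add_argument("src")
    cp.add_argument("dst")

    rv = commands.add_parser("rv", help="revert a directory or a tag")
    target = rv.add_mutually_exclusive_group(required=True)  # exactly one of -t / -d
    target.add_argument("-t", "--tag")
    target.add_argument("-d", "--directory")

    sync = commands.add_parser("sync", help="sync a directory using its SYNC file")
    sync.add_argument("-t", "--tag")
    sync.add_argument("directory")

    mksync = commands.add_parser("mksync", help="create a SYNC file")
    mksync.add_argument("directory")
    mksync.add_argument("sources", nargs="+")

    # Stand-in name: the original command for tag history is not shown.
    commands.add_parser("tags", help="show the available revert tags")
    return parser
```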
Perhaps the only interesting thing in our parser is the custom action:
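Such an action can be written by subclassing argparse.Action. The class name WritableDir and the error messages below are made up for this sketch.

```python
import argparse
import os


class WritableDir(argparse.Action):
    """Accept the argument only if it names an existing, writable directory."""

    def __call__(self, parser, namespace, values, option_string=None):
        # nargs="+" hands over a list, a plain argument a single string.
        paths = values if isinstance(values, list) else [values]
        for path in paths:
            if not os.path.isdir(path):
                parser.error("{0} is not an existing directory".format(path))
            if not os.access(path, os.W_OK):
                parser.error("{0} is not writable".format(path))
        setattr(namespace, self.dest, values)


def example_parser():
    """Show how the action is attached to an argument."""
    parser = argparse.ArgumentParser()
    parser.add_argument("dst", action=WritableDir)
    return parser
```

parser.error prints the message together with the usage line and exits, so the rest of the tool never sees an invalid path.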
It ensures that any path we provide as an argument is an existing directory we can write to.
We can now sync two folders and revert if need be. The tool is very simple and only offers crude functionality, but it’s a good starting point to build upon. What always amazes me is the expressiveness of Python and what can be achieved with about 200 lines of code, half of which are paranoid asserts and param checks.
I’d also like to note that although I love Python, I don’t use it enough to consider myself a Pythonista. If you spot any piece of code that can be replaced with a more standardized idiom, please let me know!
As always, there’s a GitHub repo where you can find the complete script.