2014-03-09 · in Tech Notes · 422 words

For my oldest projects, I had source snapshots and releases from before they were first imported into a version control system. With Git, it's possible to insert these into the history of the project.

I did this by creating a new branch that's not attached to the existing history, and then pasting the existing history onto the end of it. Since this requires rewriting the existing history, it's best to do this as part of a conversion, rather than messing up an existing published repository (grafting would be a better option in that case).

First, start a new history branch within the repository by creating an empty orphan commit, dated as early as possible:

git checkout --orphan history
git rm -rf .
GIT_AUTHOR_DATE="1970-01-01T00:00:00 +0000" \
GIT_COMMITTER_DATE="1970-01-01T00:00:00 +0000" \
git commit --allow-empty -m 'Initial empty commit.'

(We could start with a commit that imports our first tarball instead, but there's no harm in having an empty commit, and it makes editing the history again later a bit easier.)

Next we need to extract each of our tarballs, remove anything we don't want added to Git, commit the rest, and create any release tags we want. Doing this by hand gets tedious very quickly, so I wrote git-import-snapshots:

git-import-snapshots $(ls -tr ~/snapshots/project*.gz)

You'll definitely want to edit the script if you use it yourself — it needs to understand how you've named your snapshots.
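
To give a rough idea of what such a script has to do, here is a minimal sketch of the extract, commit, and tag loop. It is not the git-import-snapshots script itself: the function name import_snapshot, the project-VERSION.tar.gz naming scheme, the vVERSION tag format, and the commit messages are all assumptions made for illustration. It also assumes GNU date and tar, and that each tarball unpacks into a single top-level directory.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the author's git-import-snapshots script.
# Assumes GNU date/tar and snapshots named <project>-<version>.tar.gz
# that each contain one top-level directory.
import_snapshot() {
    tarball=$1

    # Derive the version from the file name (assumed naming scheme),
    # e.g. project-1.2.tar.gz -> 1.2
    version=$(basename "$tarball" .tar.gz)
    version=${version#*-}

    # Use the tarball's modification time as the commit date (RFC 2822).
    date=$(date -R -r "$tarball")

    # Drop the previously tracked files, then unpack the snapshot in place.
    git rm -rq --ignore-unmatch .
    tar -xzf "$tarball" --strip-components=1
    # (Delete anything you don't want in Git here, before staging.)
    git add -A .

    GIT_AUTHOR_DATE="$date" GIT_COMMITTER_DATE="$date" \
        git commit -q --allow-empty -m "Import snapshot $version."
    GIT_COMMITTER_DATE="$date" \
        git tag -a -m "Release $version." "v$version"
}

# Import every tarball given on the command line, in the order given.
for tarball in "$@"; do
    import_snapshot "$tarball"
done
```

Run from inside the repository with the history branch checked out, this would add one commit and one annotated tag per tarball, oldest first if the arguments are sorted that way.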

In most cases I was able to use the modification date of the snapshot file to identify an appropriate commit date. In a few cases I couldn't do that; instead, I needed to look at the contents of the tarball and find the latest modification date of the files inside it:

tar tzvf snapshot.tar.gz | grep -v '/$' | sort -k 4
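
If you'd rather get just the newest timestamp than read through the sorted listing, something like the following might work. It is a sketch that assumes GNU tar (for its listing format and the --utc and --full-time options); latest_mtime is a name made up for this example.

```shell
# Print the newest file modification time inside a gzipped tarball,
# in UTC, as "YYYY-MM-DD HH:MM:SS". Assumes GNU tar's verbose listing,
# where the date and time are the 4th and 5th fields.
latest_mtime() {
    tar --utc --full-time -tzvf "$1" |
        grep -v '/$' |             # skip directory entries
        awk '{ print $4, $5 }' |   # keep only the date and time
        sort |
        tail -n 1
}
```

Called as latest_mtime snapshot.tar.gz, the result can be adjusted to taste and passed along as the commit date for that snapshot.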

Viewing the branch with gitk history should now show an appropriate series of commits and tags. If not, tweak the script's rules and do it again.

Now we need to join the two histories together, by rewriting the master branch's first commit so that it follows the latest commit on the history branch:

ref=$(git rev-parse history)
git filter-branch -f \
    --parent-filter 'sed "s/^\$/-p '$ref'/"' \
    --tag-name-filter cat \
    master

Although only the first commit's parent really changes, every later commit has to be rewritten too, because each commit's hash depends on the hash of its parent. The --tag-name-filter cat option is required in order to preserve tags (i.e. to rewrite them so they point at the rewritten commits).
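
The sed expression in the parent filter is easier to follow when tried on its own. filter-branch passes each commit's parent string to the filter: empty for a root commit, "-p PARENT" otherwise. The sketch below wraps the same substitution in a function (graft_parent is a name invented here, and the hashes are placeholders) to show that only the empty string is changed.

```shell
# Same substitution as the --parent-filter above: an empty parent string
# (a root commit) becomes "-p <ref>"; anything else passes through as-is.
graft_parent() {
    sed "s/^\$/-p $1/"
}

ref=0123abc   # placeholder for the output of `git rev-parse history`

# Root commit: empty parent string in, new parent out.
printf '\n' | graft_parent "$ref"              # prints: -p 0123abc

# Ordinary commit: the existing parent is left alone.
printf -- '-p 4567def\n' | graft_parent "$ref" # prints: -p 4567def
```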

We can now check out the rewritten master branch, and throw away the history branch:

git checkout master
git branch -d history
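
As a sanity check on the join, it may be worth counting the parentless (root) commits reachable from master. If the rewrite worked, there should be exactly one: the empty commit at the start of the imported history. This is a small sketch; count_roots is a hypothetical helper name.

```shell
# Count the root (parentless) commits reachable from the given ref
# or rev-list option such as --all.
count_roots() {
    git rev-list --max-parents=0 "$1" | wc -l | tr -d ' '
}

# Example: count_roots master
```

A result of 2 for master would suggest that the old first commit is still a root, i.e. that the parent filter didn't match it.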