Moving Django to GitHub: the postmortem

Written by Adrian Holovaty on April 28, 2012

We finally moved Django to GitHub late yesterday. Here's a postmortem, to keep the community updated and for the benefit of any projects that take this leap in the future.

Background

We've used Subversion to manage our code since originally open-sourcing in July 2005. Over the last few years, we started to feel Subversion's limitations, namely:

(Of course, it's 2012 now, and these are all obvious, well-documented points. To the people who responded to our GitHub news by saying "finally!" -- I totally agree.)

Aside from that, we had set up a GitHub mirror (now called django-old) a few years ago, and lots of people were getting code and forking it there anyway.

Why Git/GitHub, as opposed to Mercurial/Bitbucket or some other system? Because it's very well-made, and it's where the people are. Clearly GitHub has won the majority of open-source developers' mindshare. John Lennon said: "If I'd lived in Roman times, I'd have lived in Rome. Where else?" GitHub is Rome.

The authors file

The first thing we considered was to simply start using our existing GitHub mirror -- turn off the Subversion stuff and start committing there directly. But the problem there was that we'd never set up an authors file.

Basically, an authors file maps Subversion committer names to standard names and email addresses, so that GitHub knows that a commit by "adrian" in Subversion maps to the adrianholovaty GitHub account. With that mapping established, you get niceties like GitHub commits linking to appropriate GitHub user pages and displaying proper user avatar images. More importantly, it gives all of our contributors proper credit within the GitHub ecosystem for the full history of their work on Django -- which has value these days, considering companies are looking at GitHub involvement for job applicants, etc.

So the first step was creating that authors file, which Brian Rosner organized, with the help of several other people. We ended up accounting for every one of the 58 people who have ever committed to Django, except for somebody named "cell" who was given temporary commit access during a sprint six years ago.
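For illustration, an authors file has one line per Subversion username, in the form `svnname = Full Name <email>`. The sketch below is not the script used for this migration. It assumes bash and awk, reads `svn log --quiet` output on standard input, and lists any committer missing from the authors file. The function names, the "alice" account and the example.com addresses are made up.

```shell
#!/usr/bin/env bash
# Sketch: find Subversion committers that have no entry in an authors file.
# Authors file format (one per line):  svnname = Full Name <email>

# Print the unique committer names found in `svn log --quiet` output (stdin).
# Revision lines look like: r123 | adrian | 2012-04-27 12:00:00 -0500 (...)
svn_committers() {
  awk -F' [|] ' '/^r[0-9]+ [|] /{print $2}' | sort -u
}

# Print committers (stdin, one per line) that are missing from authors file $1.
unmapped_committers() {
  local authors_file=$1 name
  while IFS= read -r name; do
    # Exact match on the text before " = ", so no regex surprises in names.
    awk -F' = ' -v n="$name" '$1 == n {found=1} END {exit !found}' "$authors_file" \
      || printf '%s\n' "$name"
  done
}

# Example with made-up data (names and addresses are placeholders).
authors=$(mktemp)
cat > "$authors" <<'EOF'
adrian = Adrian Holovaty <adrian@example.com>
alice = Alice Example <alice@example.com>
EOF

log_sample='------------------------------------------------------------------------
r3 | cell | 2006-05-01 10:00:00 -0500 (Mon, 01 May 2006)
------------------------------------------------------------------------
r2 | alice | 2006-04-30 09:00:00 -0500 (Sun, 30 Apr 2006)
------------------------------------------------------------------------
r1 | adrian | 2005-07-13 08:00:00 -0500 (Wed, 13 Jul 2005)'

missing=$(printf '%s\n' "$log_sample" | svn_committers | unmapped_committers "$authors")
echo "unmapped: $missing"   # prints "unmapped: cell"
rm -f "$authors"
```

Against a real repository, the same functions could be fed with `svn log --quiet <repository-url> | svn_committers | unmapped_committers authors.txt`.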

One crucial detail is that we couldn't simply change the commit data retroactively in the existing GitHub repository. That's because Git includes the author and committer data in what it hashes to identify each commit. Changing that data would change the hashes of those commits and of every commit after them, which would break all existing forks of that repository. (We ended up breaking existing forks anyway, of course, but it was cleaner to do it from scratch.)
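To make the hashing point concrete, here is a minimal sketch, assuming bash and the coreutils `sha1sum` command. It builds the text of a commit object by hand and hashes it the way Git does: SHA-1 over `commit <size>`, a NUL byte, then the commit text. The identities, timestamp and commit message are placeholders, and `commit_id` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sketch: a Git commit ID is the SHA-1 of "commit <size>\0" plus the commit
# text, and that text includes the author and committer lines.

# ID of Git's well-known empty tree, used so the example needs no files.
EMPTY_TREE=4b825dc642cb6eb9a060e54bf8d69288fbee4904

# Print the commit ID for a fixed commit whose only variable part is the
# identity string in $1, e.g. "Name <email>". The timestamp is arbitrary.
commit_id() {
  local identity=$1 body size
  # The trailing "x" protects the final newline from $(...) stripping.
  body=$(printf 'tree %s\nauthor %s 1335571200 +0000\ncommitter %s 1335571200 +0000\n\nExample commit\n' \
    "$EMPTY_TREE" "$identity" "$identity"; printf x)
  body=${body%x}
  size=$(printf '%s' "$body" | wc -c | tr -d ' ')
  { printf 'commit %s\0' "$size"; printf '%s' "$body"; } | sha1sum | cut -d' ' -f1
}

# Same tree, same message, same time: only the identity differs.
before=$(commit_id 'adrian <adrian@localhost>')
after=$(commit_id 'Adrian Holovaty <adrian@example.com>')
echo "before: $before"
echo "after:  $after"
```

The two printed IDs differ even though nothing about the code changed, which is why fixing author data after the fact amounts to publishing a new history.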

Nuts and bolts of the process

Once we finalized the authors file, doing the migration was actually kind of easy, thanks to git-svn. I took many missteps along the way, got a lot of help from people in #django-dev on IRC and ended up doing three dry runs. Here are the final steps I ended up taking:

1. Copied the Subversion repository from code.djangoproject.com to my laptop, to make the migration faster.

# On the server:
svnadmin dump /home/svn/django | gzip > svndump.gz

# On my laptop:
scp djangoproject.com:svndump.gz .
gunzip svndump.gz
svnadmin create /Users/adrian/code/django-svn
svnadmin load /Users/adrian/code/django-svn < svndump

On my first run of git-svn, I ran it from my laptop and pointed it at code.djangoproject.com, and it took 3.5 hours! After I copied the repo to my laptop and tried it again, it took a little over an hour. But the caveat here is that I also changed the git-svn command between those two runs, so I'm not sure how much of the speed improvement was because of the local SVN repo.

2. Ran git-svn (with the correct arguments!).

git svn --authors-file=authors.txt --trunk=trunk clone file:///Users/adrian/code/django-svn/django/ django-dry-run

This took a little over an hour, and it created a Git repository called django-dry-run. Note that authors.txt is the authors file, as explained above.

The trickiest thing about this was determining the correct arguments to use -- specifically, whether to use --branches explicitly or --stdlayout. As you can see, I ended up using neither.
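For reference, a small sketch of how the layout options relate to each other. It only prints the candidate commands rather than running them, so it needs neither Subversion nor Git. `layout_flags`, the layout labels and the `<svn-url>` and `<target-dir>` placeholders are illustrative names, not anything from git-svn itself. The flags are: `--stdlayout` is shorthand for a repository with `trunk`, `branches` and `tags` directories, and passing `--trunk` alone imports only trunk.

```shell
#!/usr/bin/env bash
# Sketch: the git-svn layout flags for three ways of cloning.
# Prints each command instead of running it, so no repository is needed.

layout_flags() {
  case $1 in
    trunk-only) echo "--trunk=trunk" ;;    # trunk history only, no branches or tags
    std)        echo "--stdlayout" ;;      # shorthand for the explicit form below
    explicit)   echo "--trunk=trunk --branches=branches --tags=tags" ;;
    *)          echo "unknown layout: $1" >&2; return 1 ;;
  esac
}

for layout in trunk-only std explicit; do
  echo "git svn clone --authors-file=authors.txt $(layout_flags "$layout") <svn-url> <target-dir>"
done
```

With the trunk-only form, the resulting repository has a single line of history, and branch work shows up only as the commits that merged it into trunk.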

Originally, the plan was to migrate all of the branches from our Subversion history -- classics such as magic-removal, new-admin, newforms-admin, unicode, queryset-refactor and multidb -- so that the branches' commit histories (which have all since been merged to trunk) could be preserved in our new Git history. Many of those branches were very involved, with a lot of commits, and there's a lot of value in being able to isolate specific commits in the branch, rather than one large merge commit. (Imagine you're investigating the original reason we added a line of code, for example.)

But as we discussed this over IRC, we decided it wasn't worth the effort: we could always do it later, and git-svn wouldn't actually do it the way we wanted. Ideally, I'd like these branches' histories to be migrated such that they're treated like merged branches in Git -- a merge commit that knows the individual commits on the branch. If you know how to pull this off, and it can be done without altering the Git hashes, please let me know.

3. Changed git-svn-id to point at code.djangoproject.com instead of my laptop.

git filter-branch --msg-filter "sed \"s|^git-svn-id: file:///Users/adrian/code/django-svn/django/trunk|git-svn-id: http://code.djangoproject.com/svn/django/trunk|g\"" -- master

git-svn adds a "git-svn-id" section to each commit message in the resulting Git repository. It includes a URL pointing to the commit in the original Subversion repository, which is very useful.

But, because I did the import from a local repository, the git-svn-id's were all pointing at my laptop. So I ran git filter-branch to clean it up.
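Since git filter-branch rewrites every commit, it can be worth trying the sed expression on a single message first. The sketch below assumes bash and sed and does that with a made-up commit message. It also pulls the revision number out of the git-svn-id line to confirm the rewrite leaves it alone. The revision number, the all-zero UUID and the helper names are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: dry-run the git-svn-id rewrite on one sample message before
# handing the same sed expression to `git filter-branch --msg-filter`.
OLD_URL='file:///Users/adrian/code/django-svn/django/trunk'
NEW_URL='http://code.djangoproject.com/svn/django/trunk'

# Rewrite the git-svn-id line of a commit message read from stdin.
rewrite_svn_id() {
  sed "s|^git-svn-id: ${OLD_URL}|git-svn-id: ${NEW_URL}|"
}

# Print the Subversion revision number from a message's git-svn-id line,
# which has the form: git-svn-id: <url>@<revision> <repository-uuid>
svn_revision() {
  sed -n 's|^git-svn-id: .*@\([0-9][0-9]*\) [0-9a-f-]*$|\1|p'
}

# Made-up message; the revision number and all-zero UUID are placeholders.
sample='Fixed a typo in the docs.

git-svn-id: file:///Users/adrian/code/django-svn/django/trunk@12345 00000000-0000-0000-0000-000000000000'

rewritten=$(printf '%s\n' "$sample" | rewrite_svn_id)
printf '%s\n' "$rewritten"
echo "revision before: $(printf '%s\n' "$sample" | svn_revision)"     # 12345
echo "revision after:  $(printf '%s\n' "$rewritten" | svn_revision)"  # 12345
```

In a real repository, `git log -1 --format=%B | svn_revision` would print the Subversion revision of the latest commit.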

4. Renamed old GitHub django repository to django-old.

(Done via the GitHub Web site.) This was the scary part, because it meant there was no turning back. :-)

Originally we'd talked about deleting the repository outright, but that would have deleted all pull requests and likely would have broken some other things. So I just renamed it to django-old. Not sure how long we'll keep this around.

5. Imported the new repository into GitHub.

git remote add origin git@github.com:django/django.git
git push -u origin master

I spotted an error in the repository after the first time I did it, so I had to delete it -- which I thought made for a rare and amusing screenshot:

Screenshot of GitHub deletion step

Then I cleaned up the repository and did it again. I mistakenly created it as a private repository, so I marked it as public, which led GitHub to believe I had just open-sourced Django. :-)

Screenshot of GitHub upload

And that's it!

Stats

For posterity:

Going forward

Filing bugs / pull requests / the ticket system

GitHub's ticket system is a bit too simple for our needs, given the Django triage process, so we're sticking with our Trac installation, at least for the time being.

But, of course, we want to take advantage of GitHub pull requests at the same time. So we'll need to figure out the right balance between pull requests and Trac tickets, such that we maintain our sanity, we don't make people jump through hoops, and we optimize for contributor and committer productivity.

Personally, I want to avoid a situation (and culture) where we force contributors to use Trac if they post pull requests, especially ones that contain trivial changes. But at the same time, it'll likely become a maintenance nightmare if we have lots of tickets in two places, with no coordination. So, this is an open issue we'll be working to figure out. Jacob has been working on a technological solution.

Thanks to all the people who helped with this transition, and I look forward to the much happier development and collaboration experiences we get with GitHub. The commits and pull requests I've already handled have been a pleasure.

Comments

Posted by Dana Woodman on April 28, 2012 at 7:55 p.m.:

Thanks for the write up Adrian! It's nice to hear the details of why certain decisions were made. I appreciate your - and the Django community as a whole - openness and transparency. Hopefully the process of getting all the kinks worked out goes smoothly.

Posted by Brad on April 28, 2012 at 8:36 p.m.:

Thanks Adrian. I echo Dana's statement. We appreciate the work of you and others in the django community.

Posted by Brian Ray on April 28, 2012 at 9:11 p.m.:

That is pretty interesting. I do like git and github. I wonder how well it will work for such a large scale project like Django. Good luck.

Posted by Matt Todd on April 29, 2012 at 12:11 a.m.:

We're happy to have Django on GitHub! We think open source contributors will take to heart the old adage "when in Rome" and start forking and opening pull requests. :)

Posted by Animuchan on April 29, 2012 at 1:25 a.m.:

Nice. Subversion community mourns the loss :3 but the decision clearly was right.

Now that it became way easier to fork and pull-request, I'd guess there will be more incoming patches, too.
