- The difficulty of branching. We used tools like svnmerge to keep track of which parts of branches had been updated from trunk, and some of us on the core team used Git/Mercurial on top of Subversion, but this was all unnecessarily complicated -- to the point of being stifling.
- Lack of decentralization. When I would hack on Django on an airplane, for example, I couldn't make a bunch of commits locally, then push all of those to the master repository; I'd have to put everything in a single commit. With Subversion, it's all or nothing -- you push everything to a centralized server as you do it (or you use a branch, but that's painful, as noted above).
- Slowness. After you use Git for a while, Subversion feels sluggish. This is due to a bunch of design and implementation differences.
(Of course, it's 2012 now, and these are all obvious, well-documented points. To the people who responded to our GitHub news by saying "finally!" -- I totally agree.)
Aside from that, we had set up a GitHub mirror (now called django-old) a few years ago, and lots of people were getting code and forking it there anyway.
Why Git/GitHub, as opposed to Mercurial/Bitbucket or some other system? Because it's very well-made, and it's where the people are. Clearly GitHub has won the majority of open-source developers' mindshare. John Lennon said: "If I'd lived in Roman times, I'd have lived in Rome. Where else?" GitHub is Rome.
The authors file
The first thing we considered was to simply start using our existing GitHub mirror -- turn off the Subversion stuff and start committing there directly. But the problem there was that we'd never set up an authors file.
Basically, an authors file maps Subversion committer names to standard names and email addresses, so that GitHub knows that a commit by "adrian" in Subversion maps to the adrianholovaty GitHub account. With that mapping established, you get niceties like GitHub commits linking to appropriate GitHub user pages and displaying proper user avatar images. More importantly, it gives all of our contributors proper credit within the GitHub ecosystem for the full history of their work on Django -- which has value these days, considering companies are looking at GitHub involvement for job applicants, etc.
So the first step was creating that authors file, which Brian Rosner organized, with the help of several other people. We ended up accounting for every one of the 58 people who have ever committed to Django, except for somebody named "cell" who was given temporary commit access during a sprint six years ago.
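For readers who haven't seen one, here is a rough sketch of how a skeleton authors file could be generated from `svn log -q` output. This is not the process Brian used. The `authors_skeleton` function name and the `@example.com` addresses are placeholders made up for illustration; only the `svnname = Full Name <email>` line format is what git-svn expects. The real names and addresses still have to be filled in by hand.

```shell
# Hypothetical helper: read `svn log -q` output on stdin and print one
# "svnname = Name <email>" line per distinct committer.
# The name and email on the right-hand side are placeholders to edit by hand.
authors_skeleton() {
    awk -F'|' '/^r[0-9]+ / {
        name = $2
        gsub(/^ +| +$/, "", name)   # trim the padding around the username
        print name " = " name " <" name "@example.com>"
    }' | sort -u
}

# Against a real repository this would be something like:
#   svn log -q <repository-url> | authors_skeleton > authors.txt

# Self-contained demo with made-up log lines:
printf '%s\n' \
    '------------------------------------------------------------------------' \
    'r2 | adrian | 2005-07-13 11:25:57 -0500 (Wed, 13 Jul 2005)' \
    '------------------------------------------------------------------------' \
    'r1 | jacob | 2005-07-13 10:00:00 -0500 (Wed, 13 Jul 2005)' \
    | authors_skeleton
```

Each output line then gets edited so the right-hand side carries the person's real name and the email address tied to their GitHub account.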
One crucial detail is that we couldn't simply change the commit data retroactively in the existing GitHub repository. That's because Git uses the committer data in creating hashes. Changing the commit data would change the hashes, which would break all existing forks of that repository. (We ended up breaking existing forks anyway, of course, but it was cleaner to do it from scratch.)
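To make that concrete, here is a minimal sketch that hashes two otherwise identical commit objects by hand, differing only in the author/committer identity. It assumes `sha1sum` is available (on a Mac, `shasum` is the equivalent). The function names, the timestamp and the identities are made up for illustration; the tree ID is Git's well-known empty tree.

```shell
# Print the body of a commit object for the identity given in $1.
# Everything except the identity is held constant.
commit_object() {
    printf 'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n'
    printf 'author %s 1335830400 +0000\n' "$1"
    printf 'committer %s 1335830400 +0000\n' "$1"
    printf '\nInitial import\n'
}

# Git's object ID is the SHA-1 of "commit <size>\0" followed by the body.
commit_hash() {
    size=$(commit_object "$1" | wc -c | tr -d ' ')
    { printf 'commit %s\0' "$size"; commit_object "$1"; } | sha1sum | cut -d' ' -f1
}

# Same tree, same message, same timestamp; only the identity differs.
commit_hash 'adrian <adrian@example.com>'
commit_hash 'Adrian Holovaty <adrian@example.com>'
```

The two printed hashes differ, and since every child commit records its parent's hash, rewriting the identity on an early commit changes the hash of every commit after it.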
Nuts and bolts of the process
Once we finalized the authors file, doing the migration was actually kind of easy, thanks to git-svn. I took many missteps along the way, got a lot of help from people in #django-dev on IRC and ended up doing three dry runs. Here are the final steps I ended up taking:
1. Copied the Subversion repository from code.djangoproject.com to my laptop, to make the migration faster.
# On the server:
svnadmin dump /home/svn/django | gzip > svndump.gz

# On my laptop:
scp djangoproject.com:svndump.gz .
gunzip svndump.gz
svnadmin create /Users/adrian/code/django-svn
svnadmin load /Users/adrian/code/django-svn < svndump
On my first run of git-svn, I ran it from my laptop and pointed it at code.djangoproject.com, and it took 3.5 hours! After I copied the repo to my laptop and tried it again, it took a little over an hour. But the caveat here is that I also changed the git-svn command between those two runs, so I'm not sure how much of the speed improvement was because of the local SVN repo.
2. Ran git-svn (with the correct arguments!).
git svn --authors-file=authors.txt --trunk=trunk clone file:///Users/adrian/code/django-svn/django/ django-dry-run
This took a little over an hour, and it created a Git repository called django-dry-run. Note that authors.txt is the authors file, as explained above.
The trickiest thing about this was determining the correct arguments to use -- specifically, whether to use --branches explicitly or --stdlayout. As you can see, I ended up using neither.
Originally, the plan was to migrate all of the branches from our Subversion history -- classics such as magic-removal, new-admin, newforms-admin, unicode, queryset-refactor and multidb -- so that the branches' commit histories (which have all since been merged to trunk) could be preserved in our new Git history. Many of those branches were very involved, with a lot of commits, and there's a lot of value in being able to isolate specific commits in the branch, rather than one large merge commit. (Imagine you're investigating the original reason we added a line of code, for example.)
But as we discussed this over IRC, we decided it wasn't worth the effort, we could always do it later and git-svn wouldn't actually do it the way we wanted. Ideally, I'd like these branches' histories to be migrated such that they're treated like merged branches in Git -- a merge commit that knows the individual commits on the branch. If you know how to pull this off, and it can be done without altering the Git hashes, please let me know.
3. Changed git-svn-id to point at code.djangoproject.com instead of my laptop.
git filter-branch --msg-filter "sed \"s|^git-svn-id: file:///Users/adrian/code/django-svn/django/trunk|git-svn-id: http://code.djangoproject.com/svn/django/trunk|g\"" -- master
git-svn adds a "git-svn-id" section to each commit message in the resulting Git repository. It includes a URL pointing to the commit in the original Subversion repository, which is very useful.
But, because I did the import from a local repository, the git-svn-id's were all pointing at my laptop. So I ran git filter-branch to clean it up.
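To show what that filter does to a single commit message, here is the same sed substitution from step 3 wrapped in a small function and run on a made-up message. The `rewrite_svn_id` name, the sample message and the all-zero UUID are placeholders for illustration only.

```shell
# Same substitution as the --msg-filter in step 3; reads a commit message
# on stdin and rewrites only lines that begin with the local git-svn-id URL.
rewrite_svn_id() {
    sed 's|^git-svn-id: file:///Users/adrian/code/django-svn/django/trunk|git-svn-id: http://code.djangoproject.com/svn/django/trunk|g'
}

# Made-up commit message; the UUID is a placeholder.
printf '%s\n' \
    'Fixed a typo in the docs.' \
    '' \
    'git-svn-id: file:///Users/adrian/code/django-svn/django/trunk@17942 00000000-0000-0000-0000-000000000000' \
    | rewrite_svn_id
```

Because the pattern is anchored with `^git-svn-id:`, a commit message that merely mentions the local path somewhere else is left alone.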
4. Renamed old GitHub django repository to django-old.
(Done via the GitHub Web site.) This was the scary part, because it meant there was no turning back. :-)
Originally we'd talked about deleting the repository outright, but that would have deleted all pull requests and likely would have broken some other things. So I just renamed it to django-old. Not sure how long we'll keep this around.
5. Imported the new repository into GitHub.
git remote add origin git@github.com:django/django.git
git push -u origin master
I spotted an error in the repository after the first time I did it, so I had to delete it -- which I thought made for a rare and amusing screenshot:
Then I cleaned up the repository and did it again. I mistakenly created it as a private repository, so I marked it as public, which led GitHub to believe I had just open-sourced Django. :-)
And that's it!
- Final number of commits in our Subversion repository: 17,942.
- Size of Subversion repository: 339 MB gzipped. (That's for the dump file as generated by svnadmin dump.)
- Number of commits created in Git by git-svn: 11,883. (This is less than 17,942 because we only migrated trunk. Any commit to our repository that didn't touch Django trunk -- such as commits to the django_website project or commits to branches -- did not get migrated.)
- Number of forks of the old (mirror) GitHub repository, as of this writing: 783.
- The old Subversion repository will remain indefinitely, for the benefit of scripts out there that do automatic updates, and general stability of the Django world. There won't be any more commits there, obviously.
- If we ever need to dive into the history of one of the big merged branches -- such as magic-removal -- we can do so in the Subversion history. Or we can consider copying the branch history into Git somehow (see above).
- I'd like us to provide some documentation on how to convert your previous Django fork (from the django-old repository) to track the new repository. Any volunteers?
- We still have a bunch of work to do fixing places in our documentation and code.djangoproject.com that refer to Subversion. Bear with us.
Filing bugs / pull requests / the ticket system
But, of course, we want to take advantage of GitHub pull requests at the same time. So we'll need to figure out the right balance between pull requests and Trac tickets, such that we maintain our sanity, we don't make people jump through hoops, and we optimize for contributor and committer productivity.
Personally, I want to avoid a situation (and culture) where we force contributors to use Trac if they post pull requests, especially ones that contain trivial changes. But at the same time, it'll likely become a maintenance nightmare if we have lots of tickets in two places, with no coordination. So, this is an open issue we'll be working to figure out. Jacob has been working on a technological solution.
Thanks to all the people who helped with this transition, and I look forward to the much happier development and collaboration experiences we get with GitHub. The commits and pull requests I've already handled have been a pleasure.