If you don’t git it first, try again

They say that necessity is the mother of invention and we have to be thankful to Larry McVoy for making it necessary for Linus Torvalds to invent git.

Larry McVoy sells a powerful version control system called BitKeeper and, many years ago, provided it free of cost to certain open-source projects. Linus Torvalds and his lieutenants used it for the Linux kernel. Then there was a disagreement between Linus and McVoy over the use of BitKeeper, and Linus decided to shop around for an alternative. Finding none that satisfied his requirements, he decided to write his own distributed version control system (dvcs), and git was born. Once git was mature enough, the Linux kernel’s version control was switched to git.

Some of the advantages of dvcs are:

  • You can work offline and make incremental commits.
  • You have the full history and changes to your project from the time you started.
  • Since every clone of the project has all the history and changes, each is a backup of the complete repository. The Linux kernel (like other projects using a dvcs) is therefore backed up in several thousand places all over the world. There is no need to worry about loss of data (fire, earthquake, flood, etc.) the way we have to with a centralized version control system like svn.
  • Merges are generally much easier in a dvcs than in a centralized vcs like svn, where they can be a pain.

The 4 most popular open source dvcs available are git, mercurial (hg), bazaar (bzr) and darcs. git is written in C, the next 2 mainly in Python and the last one in Haskell. Bazaar is sponsored by Canonical, the company behind Ubuntu. git is the fastest [^1] of them all. The initial version of darcs was quite sluggish compared to the others, but this has improved considerably. Recently, Linus essentially rewrote the SHA1 function taken from Mozilla and made it a lot faster. I assume that mercurial and bazaar will also benefit from this (though they are written in Python, a lot of their performance-critical code is written in C).

Projects using git

  • Linux Kernel, Gnome, Android, postgresql, Ruby on Rails, Qt, Perl, Wine, Fedora, Debian, X.org, VLC, flightgear, sympy, Rpm, Prototype, …

Projects using mercurial

  • Mozilla, OpenOffice, OpenJdk, Netbeans, Vim, Xen, Symbian, OpenSolaris, Sage, Go language, …

Projects using bazaar

  • Ubuntu, MySql, Emacs, Gnash, Squid, Mailman, Gwibber, Inkscape, Stellarium, apt, …

Note that though Mercurial and Bazaar are written in Python, the Python repo still uses subversion. It will move to Mercurial soon.

I wouldn’t recommend darcs. You won’t go wrong if you choose git, mercurial or bazaar. These dvcs have a lot in common, including the use of the SHA1 cryptographic hash. They have a steeper learning curve than centralized version control systems like svn and cvs.

My recommendation is to learn and use git for new projects that you own, for example your personal projects. git has a lot more commands than mercurial. You need to put in more effort to learn git than mercurial or bazaar, but it is well worth the effort. It is a bit easier now, as the documentation and tools have improved a lot. There are a lot of tutorials and articles on git. The Git Community Book by Scott Chacon is a good place to start. There is an excellent 2-part video lecture on git by Bart Trojanowski. The #git IRC channel on irc.freenode.net is active, and you can get help there if you run into an issue or have questions. The basic commands will suffice in the beginning, and the advanced commands will give you unmatched power and flexibility. If you are in a position to influence your manager or whoever is in charge of selecting a vcs, you can suggest git and be able to justify it.

Note that git runs perfectly fine on Windows (it didn’t a few years back), so that is no longer an excuse for not using git. Cloning large repositories (to work on many features at the same time) may not be space-optimal on Windows (if the filesystem doesn’t support hardlinks), but this is not usually necessary as git has light-weight branches. Hardlinks are a non-issue with newer filesystems on Windows.

git is a content-addressable system. A secure cryptographic hash function called SHA1, which generates a 160-bit hash value, is used for this purpose. If a file, whatever its name, has the content hello git! then its SHA1 hash [^2], or key, is

d1c64694584cf480b01273f2c729fd8b6b7c320c
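
As a rough illustration of how such a key comes about, the sketch below mimics in Python what git does when it hashes a blob: it prefixes the content with a small header (object type, length, NUL byte) and runs SHA1 over the result. The function name git_blob_hash is made up for this example, and the content is assumed to end with a newline, as a file saved from a text editor normally would.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA1 key git would assign to a blob with this content."""
    # git hashes the header "blob <length>" plus a NUL byte,
    # followed by the raw bytes of the file.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The file name plays no part: only the bytes are hashed.
    print(git_blob_hash(b"hello git!\n"))
```

If the key quoted above is right, the printed value should agree with it. The same bytes always give the same key, whatever the file is called.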

The content hello git! is stored (in compressed form) in a file whose name is the key [^3]. This is called a blob. So, you can look up the content from the key:

git show d1c6469
hello git!
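
The storage and lookup just described can be sketched in Python as well. This is only a toy model under a few assumptions: it writes into a temporary directory rather than a real .git/objects, it handles blobs only, and it needs the full 40-character key (real git also accepts abbreviations such as d1c6469). The function names are invented for the example.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, content: bytes) -> str:
    """Store content as a blob under objects_dir and return its key."""
    store = b"blob %d\x00" % len(content) + content
    key = hashlib.sha1(store).hexdigest()
    # The first two hex digits name the directory, the other 38 the file.
    path = os.path.join(objects_dir, key[:2], key[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))  # objects are kept zlib-compressed
    return key


def read_loose_object(objects_dir: str, key: str) -> bytes:
    """Look up a blob by its full key and return the original content."""
    path = os.path.join(objects_dir, key[:2], key[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    # Everything up to the first NUL byte is the header; the rest is content.
    _header, _, content = store.partition(b"\x00")
    return content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as objects_dir:
        key = write_loose_object(objects_dir, b"hello git!\n")
        print(key[:2] + "/" + key[2:])
        print(read_loose_object(objects_dir, key).decode(), end="")
```

Because the file name is derived from the content, storing the same content twice lands on the same file, so identical files cost no extra space.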

A cryptographic hash behaves essentially like a random mapping [^4]: even a tiny change in the content gives a completely different key. For example, git computes these SHA1 hashes:

hello git! ==> d1c64694584cf480b01273f2c729fd8b6b7c320c
hello git  ==> 8d0e41234f24b6da002d962a26c2495ea16a425f
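
To get a feel for this randomness, here is a small Python sketch that counts how many of the 160 output bits differ between the keys of the two contents above. The helper names are invented for this example, and both contents are assumed to end with a newline.

```python
import hashlib


def blob_digest(content: bytes) -> bytes:
    """SHA1 digest of content wrapped in git's blob header."""
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).digest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which the two blob keys differ."""
    x = int.from_bytes(blob_digest(a), "big")
    y = int.from_bytes(blob_digest(b), "big")
    return bin(x ^ y).count("1")


if __name__ == "__main__":
    n = differing_bits(b"hello git!\n", b"hello git\n")
    print(n, "of 160 bits differ")
```

For a hash that behaves like a random mapping, you would expect roughly half of the bits to differ, even though the inputs differ by a single character.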

Say that your project has 3 files: a.c, b.c and c.c. You will have 3 blob objects for the content, indexed by 3 SHA1 hashes. As mentioned earlier, the file names are not stored in the blobs. The hierarchy of content (in this case a single level) is stored in a tree object, which is itself indexed by a SHA1 hash. The tree lists the file names (a.c, b.c, c.c) along with the SHA1 value of each file’s content. When you have sub-directories, the tree associates each subdirectory name with the SHA1 of that subdirectory’s own tree object.
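
A single-level tree like this one can be modelled in a few lines of Python. The sketch assumes plain, non-executable files (mode 100644) with ASCII names and no sub-directories; tree_hash and hash_object are names chosen for this example, not git commands, and the file contents are placeholders.

```python
import hashlib
from typing import Dict


def hash_object(kind: bytes, body: bytes) -> str:
    """SHA1 over '<kind> <length>', a NUL byte, and the body."""
    header = kind + b" %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


def tree_hash(files: Dict[str, bytes]) -> str:
    """Hash a single-level tree that maps file names to file contents."""
    entries = b""
    # Entries are sorted by name, so insertion order does not matter.
    for name in sorted(files, key=lambda n: n.encode()):
        blob_key = hash_object(b"blob", files[name])
        # Each entry is "<mode> <name>", a NUL byte, then the 20-byte binary key.
        entries += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(blob_key)
    return hash_object(b"tree", entries)


if __name__ == "__main__":
    project = {"a.c": b"int a;\n", "b.c": b"int b;\n", "c.c": b"int c;\n"}
    print(tree_hash(project))
    # Renaming a file changes the tree key but not the blob key.
    renamed = {"a.c": b"int a;\n", "b.c": b"int b;\n", "d.c": b"int c;\n"}
    print(tree_hash(renamed))
```

The names live only in the tree, which is why a rename produces a new tree object while the blob for the unchanged content is reused.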

git doesn’t track individual files. It tracks the whole project. When you want to checkpoint the current state of the project, you do an operation called commit. This creates a commit object which, as you can guess, is indexed by another SHA1 hash. The input to that hash is the SHA1 of the tree object representing the hierarchy, plus meta-data such as the author, committer, etc. If you make more changes and commit again, another SHA1 hash is computed, and this time the input also includes the SHA1 hash of the previous commit. Essentially you have a hash of a hash of a hash, and so on. Say that your project has hundreds of thousands of files and several hundred commits. You ask your remote colleague to make a clone of it via http.

All you have to do to verify that there has been no corruption (accidental or deliberate) is to ask your colleague to read out the first several bytes of the SHA1 hash of the top-most commit. If they match yours, then you know that your colleague has an identical clone of your project. The importance of this cannot be overstated.
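
Why one hash vouches for the whole history can be seen in a small Python model of the commit chain. It is a simplified sketch: the author, committer and timestamp are fixed placeholder values, every commit points at an empty tree, and the function names are invented for the example.

```python
import hashlib
from typing import Optional


def hash_object(kind: bytes, body: bytes) -> str:
    """SHA1 over '<kind> <length>', a NUL byte, and the body."""
    header = kind + b" %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


def commit_hash(tree: str, parent: Optional[str], message: str) -> str:
    """Hash a commit that records a tree, an optional parent and a message."""
    # Placeholder identity and timestamp, assumed for this sketch only.
    who = "A U Thor <author@example.com> 0 +0000"
    lines = ["tree " + tree]
    if parent is not None:
        lines.append("parent " + parent)  # links this commit to its predecessor
    lines += ["author " + who, "committer " + who, "", message, ""]
    return hash_object(b"commit", "\n".join(lines).encode())


if __name__ == "__main__":
    empty_tree = hash_object(b"tree", b"")
    first = commit_hash(empty_tree, None, "first commit")
    second = commit_hash(empty_tree, first, "second commit")
    # Tamper with the oldest commit: every later hash changes too.
    forged_first = commit_hash(empty_tree, None, "first commit (forged)")
    forged_second = commit_hash(empty_tree, forged_first, "second commit")
    print(second[:8], forged_second[:8], second == forged_second)
```

Since each commit hash covers its parent's hash, a change anywhere in the history shows up in the top-most hash, and comparing that one value is enough.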

Reasons I prefer git:

  • index (staging area): Another benefit of content addressability.
    • If you are happy with your changes so far, you can stage those files and continue working. If you then mess things up, you can go back to your staged state. If instead you are happy with your further changes, you can stage them too (this overwrites the previously staged data) and continue. You can even stage some of your changes (hunks) and not others. Some people (especially advocates of Mercurial) don’t like staging. You don’t need to use this feature, but it is there if you want it. Of course, you can always perform a commit.
  • can have many (light-weight) branches in a repository
    • mercurial’s multiple-heads and named branches are not as convenient. [^5]
  • “Commit early and often” is a good mantra, but you can end up with many commits that logically belong together. You can squash these commits into one.
  • Amending a commit: after a commit, if it turns out that you forgot to include a file, or the commit comment has a typo, etc., you can amend the commit. Note that a commit is immutable, so effectively you are creating a new commit.
  • You can make multiple commits and later rewrite history and reorder commits by doing a rebase. Trust me, you will need this sooner or later.
  • You can stash your changes and pop them back later. You can have several stashes.
  • Did I mention git is very fast?
  • Availability of github, a web-based hosting service for projects that use Git.

If the project you are working on uses subversion (svn) and you cannot convince the project manager to switch to git, you can use git-svn, which acts as a gateway between git and svn. That way you can still use git for this project.

To learn more about the origins of git, check out the Google Tech Talk from the predictably arrogant Linus Torvalds!

The only reason a company may want to use a centralized version control system like ClearCase is if it wants more control over the codebase and history, as there is a single, centralized repository. After all, with a dvcs, every employee has a complete repository, with the full history of commits. More disk space is also needed for every developer. For example, the Linux kernel git repo is over 700 megabytes.

git (like other dvcs) is not well suited to projects that consist primarily of large binary files.

I hope I got you interested. Spend some time and learn to use a dvcs — git or mercurial or bazaar. Have fun.

[^1]: git needs periodic repacking of its objects (for example with git gc). This is easy to do.

[^2]: Actually, the SHA1 hash is computed on blob 11\NULhello git!\n where \NUL is a zero byte and 11 is the length of hello git! plus 1 for the trailing newline.

[^3]: The content will be stored in file .git/objects/d1/c64694584cf480b01273f2c729fd8b6b7c320c

[^4]: Though there are theoretically an infinite number of collisions (where 2 strings hash to the same 160-bit value), none had been found at the time of writing. By the birthday bound, you would have to hash on the order of 2^80 pieces of data before an accidental collision became likely. So, no need to lose sleep over this.
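
The arithmetic behind this footnote can be checked with the usual birthday approximation, p ≈ 1 - exp(-n^2 / (2 * 2^bits)). The Python sketch below assumes the hash output is uniformly random, which covers accidental collisions only and says nothing about deliberate attacks; the function name is made up for the example.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items random hashes collide."""
    # Birthday approximation: p = 1 - exp(-n^2 / (2 * 2^bits)).
    exponent = num_items * num_items / (2 * 2 ** hash_bits)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    print(collision_probability(2 ** 80, 160))   # n = 2^80 objects
    print(collision_probability(10 ** 9, 160))   # a billion objects
```

Under this approximation, 2^80 hashes give a probability of about 0.39, while a billion objects give a chance far too small to matter in practice.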

[^5]: You can have multiple heads in mercurial, or have named branches but it is not the same. See this
