git internals

Most of us use git every day for source code management. But have you ever wondered how git manages all those files under the hood? Or how it stays blazing fast when you ask for a diff between a very old commit and a recent one, or when you hop between branches?

Let's start by understanding how git stores and manages its objects (called git objects). We will look mainly at blob objects, tree objects and commit objects. Knowing these concepts gives a little insight into how git works.

Object Database
Blob objects
git cat-file utility
Tree objects
Commit objects
TL;DR

Object Database

All git objects are stored in the .git/objects folder. Every object, whatever its kind, is given a hash id and placed here. A hash looks something like 37d4e6c5c48ba0d245164c4e10d5f41140cab980, but git can identify an object from a prefix of the hash, as short as its first 4 characters, as long as that prefix is unambiguous. The first two characters of the hash are used as a directory name so that the objects do not all end up in a single directory.

objects

find .git/objects
# Output
# .git/objects
# .git/objects/pack
# .git/objects/info
# .git/objects/37
# .git/objects/37/d4e6c5c48ba0d245164c4e10d5f41140cab980
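
The layout above can be sketched in a few lines of Python. This is only an illustration: loose_object_path and resolve_prefix are hypothetical helper names, not part of git, and the list of known ids stands in for whatever is in the object database.

```python
from pathlib import Path


def loose_object_path(object_id: str, git_dir: str = ".git") -> Path:
    """Return where a loose object with this full hash would live."""
    # First two characters become the directory, the rest the file name.
    return Path(git_dir) / "objects" / object_id[:2] / object_id[2:]


def resolve_prefix(prefix: str, known_ids: list[str]) -> str:
    """Expand an abbreviated hash; fail if it is too short or ambiguous."""
    if len(prefix) < 4:
        raise ValueError("need at least 4 characters")
    matches = [oid for oid in known_ids if oid.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"{prefix!r} matches {len(matches)} objects")
    return matches[0]


ids = ["37d4e6c5c48ba0d245164c4e10d5f41140cab980"]
full = resolve_prefix("37d4", ids)
print(loose_object_path(full).as_posix())
# prints .git/objects/37/d4e6c5c48ba0d245164c4e10d5f41140cab980
```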

Blob objects

Git hashes each file based only on its content (prefixed with a small header giving the object type and size) and stores the result in the object database. The file name plays no part, so two files with different names but the same content have the same hash. We can use the git hash-object command to hash a file manually.

hash-object

git init test && cd test
echo 'hi there' > 1.txt
cp 1.txt 2.txt
git hash-object 1.txt 2.txt

# Output
# 37d4e6c5c48ba0d245164c4e10d5f41140cab980
# 37d4e6c5c48ba0d245164c4e10d5f41140cab980

We just hashed the files!
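
To see that there is no magic involved, here is a minimal sketch of the same computation in Python. It assumes a repository using the default SHA-1 object format; blob_id is a name made up for this example. The id is the SHA-1 of the header "blob <size>" followed by a NUL byte and then the file content.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute the id git would assign to a blob with this content."""
    # Header: object type, a space, the size in bytes, then a NUL byte.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Same bytes that `echo 'hi there' > 1.txt` writes (note the trailing newline).
print(blob_id(b"hi there\n"))
```

If the format is as described, this should print the same id that git hash-object printed for 1.txt and 2.txt above. Nothing about the file name goes into the hash, which is why both files got the same id.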

You can ask git to store this in the object database using the -w flag. Note that in the example below we hashed and stored two files (1.txt and 2.txt), but git has only one object, with hash 37d4e6. Because the contents were the same, it stored a single object.

git hash-object -w 1.txt 2.txt
find .git/objects
# Output
# .git/objects/37
# .git/objects/37/d4e6c5c48ba0d245164c4e10d5f41140cab980
# 37d4, the first 4 chars, can be used to refer to this object.
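
What -w does can be approximated as follows. This is a sketch under assumptions, not git's actual code: write_blob is a hypothetical helper, and it writes into a scratch directory rather than a real repository. The stored file is the header plus content, compressed with zlib.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_blob(content: bytes, git_dir: Path) -> str:
    """Store content as a loose blob under git_dir/objects and return its id."""
    store = b"blob %d\x00" % len(content) + content
    object_id = hashlib.sha1(store).hexdigest()
    path = git_dir / "objects" / object_id[:2] / object_id[2:]
    if not path.exists():  # same content, same id: nothing new to write
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(store))
    return object_id


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"  # scratch directory, not a real repo
    first = write_blob(b"hi there\n", git_dir)   # plays the role of 1.txt
    second = write_blob(b"hi there\n", git_dir)  # plays the role of 2.txt
    files = [p for p in (git_dir / "objects").rglob("*") if p.is_file()]
    print(first == second, len(files))  # prints: True 1
```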

git cat-file utility

Git provides a way to read git objects -- using the cat-file sub-command. The signature of this command is

git cat-file <function> <object>

The function parameter can be one of the following:
-t
Print the type of the object. The type can be blob, tree, commit or tag.

-s
Print the size of the object in bytes.

-p
Pretty print the contents of the object. When you pretty print a blob object, only its contents are displayed. A blob captures only the content and not the file name. We will see how git tracks file names in the tree objects section.

git cat-file -t 37d4
# blob
git cat-file -s 37d4
# 9
git cat-file -p 37d4
# hi there
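
The three options map neatly onto the on-disk format. The sketch below assumes a loose object is a zlib-compressed "<type> <size>" header, a NUL byte and the body; parse_object is a hypothetical name, and the raw bytes are built in memory instead of being read from .git/objects.

```python
import zlib


def parse_object(raw: bytes) -> tuple[str, int, bytes]:
    """Split a compressed loose object into (type, size, body)."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\x00")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode(), int(size), body


# Stand-in for the bytes stored at .git/objects/37/d4e6...
raw = zlib.compress(b"blob 9\x00hi there\n")
kind, size, body = parse_object(raw)
print(kind)                   # blob      (what cat-file -t reports)
print(size)                   # 9         (what cat-file -s reports)
print(body.decode(), end="")  # hi there  (what cat-file -p shows for a blob)
```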

Tree objects

A tree object solves the problem of storing file names, directory layout and permissions, and it keeps track of multiple objects. Its structure is similar to a directory in a Unix filesystem.

Before creating a tree, you have to tell git which files/objects should go into it. For this, git uses the index (also called the staging area). We can add an object to the staging area with the command below, specifying the file mode (type and permissions), the object hash and the file name.

git update-index --add --cacheinfo 100644 37d4e6c5c48ba0d245164c4e10d5f41140cab980 1.txt

This is essentially what git add does internally: it writes the blob to the object database and records it in the index. After that, the write-tree sub-command builds a tree from the staged files and prints the hash of the tree, which is now stored in the object database.

mkdir dir
echo 'My third file' > dir/3.txt
git update-index --add --cacheinfo 100644 37d4e6c5c48ba0d245164c4e10d5f41140cab980 1.txt # Staged 1.txt file
git add dir/3.txt # Staged 3.txt file
git write-tree
# 8668baaf3fce89aff6f06a6558b8e76da062725b  -- Hash of the tree we just built

# Traverse our new tree 8668
# See the contents of tree 8668
git cat-file -p 8668
# 100644 blob 37d4e6c5c48ba0d245164c4e10d5f41140cab980	1.txt
# 040000 tree 3570b712655f57e89e8691987621fc62a334a837	dir

# See the contents of tree 3570
git cat-file -p 3570
# 100644 blob dc2f4460d8f0ef118cb169442639f5dda8e14e82	3.txt
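
A tree is hashed the same way as a blob, only the body differs. The sketch below assumes the usual binary layout: one record per entry made of the mode, a space, the name, a NUL byte and the 20 raw bytes of the object id, with entries sorted by name. tree_id is a hypothetical helper. Note that git stores the directory mode as 40000 and cat-file -p pads it to 040000 for display.

```python
import hashlib


def tree_id(entries: list[tuple[str, str, str]]) -> str:
    """Compute a tree id from (mode, name, hex object id) entries."""

    def sort_key(entry: tuple[str, str, str]) -> bytes:
        mode, name, _ = entry
        # Git sorts directory names as if they ended with "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, object_id in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\x00"
        body += bytes.fromhex(object_id)  # raw 20 bytes, not hex text
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()


# The two entries listed for tree 8668 above.
print(tree_id([
    ("100644", "1.txt", "37d4e6c5c48ba0d245164c4e10d5f41140cab980"),
    ("40000", "dir", "3570b712655f57e89e8691987621fc62a334a837"),
]))
```

Because the subtree's id is part of the parent's body, any change inside dir changes the id of dir, which in turn changes the id of the root tree.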

Each entry in a tree has 4 columns.
The first column stores the mode, for example 100644. The first three digits give the kind of entry: regular file (100), directory (040), symbolic link (120), and so on. The last three digits are the Unix style permissions (in this case rw-r--r--).

The second column is the type of the object, followed by the third column, which is the object hash.

The fourth column has the file or directory name.
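
The mode column is an ordinary Unix mode written in octal, so Python's stat module can decode it. A small sketch, where describe_mode is a name invented for this example:

```python
import stat


def describe_mode(mode_text: str) -> tuple[str, str]:
    """Split a tree-entry mode such as '100644' into (kind, permissions)."""
    mode = int(mode_text, 8)  # the column is octal text
    kinds = {
        stat.S_IFREG: "regular file",
        stat.S_IFDIR: "directory",
        stat.S_IFLNK: "symbolic link",
    }
    kind = kinds.get(stat.S_IFMT(mode), "other")
    # filemode() returns e.g. '-rw-r--r--'; drop the leading type character.
    perms = stat.filemode(mode)[1:]
    return kind, perms


print(describe_mode("100644"))  # ('regular file', 'rw-r--r--')
print(describe_mode("040000"))  # ('directory', '---------')
```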

[Figure: visualizing the tree 8668b]

When we do a git add, git captures everything necessary to create a tree. It captures the file permission, file name and object hash.

At this point, the working directory looks like this:

├── 1.txt  # hi there
├── 2.txt  # hi there
└── dir
    └── 3.txt # My third file

find .git/objects
.git/objects
.git/objects/35
.git/objects/35/70b712655f57e89e8691987621fc62a334a837
.git/objects/pack
.git/objects/86
.git/objects/86/68baaf3fce89aff6f06a6558b8e76da062725b
.git/objects/info
.git/objects/37
.git/objects/37/d4e6c5c48ba0d245164c4e10d5f41140cab980
.git/objects/dc
.git/objects/dc/43ed8669d01ce53843e2fc282718ebe5d81232

Commit objects

A commit object records what the repository looked like at a point in time and when that snapshot was taken. Every commit references a tree object, plus some metadata.

Now the tree, 8668baa, has all the info about 1.txt and dir/3.txt (2.txt was never staged, so it is not part of the tree). Let's commit it.

echo 'First commit' | git commit-tree 8668baa
# 05660af31d0832b3abc2f8c5bd5e8f84cdd2e19c

git cat-file -p 05660
tree 8668baaf3fce89aff6f06a6558b8e76da062725b
author nethish <nethish259@gmail.com> 1690046888 +0530
committer nethish <nethish259@gmail.com> 1690046888 +0530

First commit

The commit 05660 captures everything: the author, the committer, the commit message, the time of the commit, and the tree from which we can retrieve the snapshot of our files. We can keep creating new commits with commit-tree.
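
A commit is plain text, hashed like every other object. Here is a rough sketch of how the text shown by cat-file -p could be assembled and hashed; build_commit is a hypothetical helper, and the layout (tree, optional parent lines, author, committer, a blank line, then the message) is taken from the output above.

```python
import hashlib


def build_commit(tree: str, parents: list[str], author: str,
                 committer: str, message: str) -> tuple[str, bytes]:
    """Return (commit id, commit body) for the given fields."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # none for the first commit
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    object_id = hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()
    return object_id, body


who = "nethish <nethish259@gmail.com> 1690046888 +0530"
object_id, body = build_commit(
    "8668baaf3fce89aff6f06a6558b8e76da062725b", [], who, who, "First commit\n"
)
print(body.decode(), end="")
```

Since the timestamp is part of the hashed text, running commit-tree again a second later gives a different commit id even though the tree is the same.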

We can link two commits using the -p (parent) option. This creates commit history.

echo 'File 4' > 4.txt
git hash-object -w 4.txt
# 442dd61341346eca30ac164c2e8e4e22276d2a04
git update-index --add --cacheinfo 100644 442dd61341346eca30ac164c2e8e4e22276d2a04 4.txt
git write-tree
# 1d3b56cf688292dc32464167bd89f7f09f0ab867
echo 'Second commit' | git commit-tree 1d3b56c -p 05660
# ab2df061a117000f5584c7ec9786dcee3de0332b

# Pretty print second commit
git cat-file -p ab2df
# tree 1d3b56cf688292dc32464167bd89f7f09f0ab867
# parent 05660af31d0832b3abc2f8c5bd5e8f84cdd2e19c -- This references the parent commit 05660 we created earlier.
# author nethish <nethish259@gmail.com> 1690047040 +0530
# committer nethish <nethish259@gmail.com> 1690047040 +0530

# Second commit


echo 'File 4 Changed!' > 4.txt
git hash-object -w 4.txt
# f58760d9441421439d06e7079b2292efffe64bb5
git update-index --add --cacheinfo 100644 f58760d9441421439d06e7079b2292efffe64bb5 4.txt
git write-tree
# 0544288d632f2d5714779587e2baa721f60d3501
echo 'Third commit' | git commit-tree 054428 -p ab2df
# 4c74203ba3995ebbfcd193d2660bbb94fe09bd28

The git log command starts from a given commit and follows the parent links to display its history. We made history!

git log 4c742
commit 4c74203ba3995ebbfcd193d2660bbb94fe09bd28
Author: nethish <nethish259@gmail.com>
Date:   Sat Jul 22 23:01:16 2023 +0530

    Third commit

commit ab2df061a117000f5584c7ec9786dcee3de0332b
Author: nethish <nethish259@gmail.com>
Date:   Sat Jul 22 23:00:40 2023 +0530

    Second commit

commit 05660af31d0832b3abc2f8c5bd5e8f84cdd2e19c
Author: nethish <nethish259@gmail.com>
Date:   Sat Jul 22 22:58:08 2023 +0530

    First commit
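
The traversal itself is just a walk over parent pointers. The sketch below models the three commits as an in-memory dictionary keyed by abbreviated ids. It is a simplification: it assumes a single parent per commit, while the real git log also handles merges and ordering by date.

```python
# Abbreviated id -> commit data, a stand-in for the object database.
commits = {
    "4c74203": {"parent": "ab2df06", "message": "Third commit"},
    "ab2df06": {"parent": "05660af", "message": "Second commit"},
    "05660af": {"parent": None, "message": "First commit"},
}


def log(start: str):
    """Yield (id, message) from start back to the root commit."""
    current = start
    while current is not None:
        commit = commits[current]
        yield current, commit["message"]
        current = commit["parent"]  # follow the parent link


for commit_id, message in log("4c74203"):
    print(commit_id, message)
# 4c74203 Third commit
# ab2df06 Second commit
# 05660af First commit
```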

[Figure: visualizing the git commit 4c742]

The hashes of 1.txt and dir remain the same since no change was made to them, so these objects are re-used across commits. Now we can say that each commit has a complete snapshot of the repository.

Did you notice that when we changed 4.txt, an entirely new blob object got created (blob f5876)? Yes, git doesn't store the diff; it stores each version of the file in full, as a separate object. But wouldn't this cost space? Can't git store only the diff between versions to be more space efficient?

It can!

Initially git stores all of these as loose objects (a separate object for every changed version). When too many loose objects accumulate, git packs them into a "packfile", where similar objects (found using heuristics such as file name and size) are stored as deltas against one another, saving space. More on this in the official doc.

TL;DR

Each commit has info on the author, the exact time of the commit, a tree object and, except for the first commit, its parent commit.
Each tree object points to blobs and other trees, along with metadata such as mode and file name.
Each blob object has the full contents of a file.

Hence, each git commit is a complete snapshot of the repository. But if a file or a folder is unchanged, its objects are reused because their hashes don't change.

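To view the diff of commit X and commit Y for file Z, git takes the relevant object for file Z from both commit X and commit Y, then performs a diff on them. If the two hashes are equal, it already knows the file is unchanged without reading its contents. It's fast. Very fast.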
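
As a rough sketch of that idea, the snippet below compares the snapshots of our second and third commits. The trees are flattened to a path -> object id dictionary for brevity (real trees are nested, and an unchanged directory is skipped as a whole because its tree hash is equal). The blobs dictionary and both helper names are made up for this example.

```python
import difflib

# Object id -> content, a stand-in for the object database.
blobs = {
    "442dd61341346eca30ac164c2e8e4e22276d2a04": "File 4\n",
    "f58760d9441421439d06e7079b2292efffe64bb5": "File 4 Changed!\n",
}
second = {
    "1.txt": "37d4e6c5c48ba0d245164c4e10d5f41140cab980",
    "4.txt": "442dd61341346eca30ac164c2e8e4e22276d2a04",
}
third = {
    "1.txt": "37d4e6c5c48ba0d245164c4e10d5f41140cab980",
    "4.txt": "f58760d9441421439d06e7079b2292efffe64bb5",
}


def changed_paths(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Paths whose object id differs; equal ids are skipped unread."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


def diff_file(path: str, old: dict[str, str], new: dict[str, str]) -> str:
    """Unified diff of one path between two snapshots."""
    before = blobs[old[path]].splitlines(keepends=True) if path in old else []
    after = blobs[new[path]].splitlines(keepends=True) if path in new else []
    return "".join(difflib.unified_diff(before, after, f"a/{path}", f"b/{path}"))


for path in changed_paths(second, third):  # only 4.txt is touched
    print(diff_file(path, second, third), end="")
```

The content of 1.txt is never loaded here. Comparing two 40-character ids is enough to rule it out, and the same trick works no matter how far apart the two commits are.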

Next time you commit your changes, next time you pull changes from remote, next time you view a diff, think about this...