Let's imagine we need a database that would allow us to query for all the versions of a record since that record was created. For instance, we've got a note-taking app and we want to provide a history feature to the users. This could be solved by creating a history table that contains all previous versions of a record or by using a DBMS that already provides this, such as Microsoft SQL Server. However, what if Git was used instead?
Think about it. Git has refs (tables) that point to commits (DB operations), which contain blobs (records) and information about the previous one (history). In other words, given a table, we can get the most recent state of a record and use it to reach previous versions. That's exactly what we need and what we'll talk about in this post.
First, let's review a few things about Git.
Object database and index
As stated in the Pro Git book, Git is a content-addressable filesystem. So, we can store a file and refer to it by the hash of its content. Git has several object types, and the one that holds a file's content is called a blob. We cannot represent our whole project solely with blobs, though. We also need the filenames and a way to represent directories, and that's where the tree object is used. A node in a tree contains a name, a file mode (similar, but not equal, to Unix's) and the hash of an object, which can be a blob or another tree.
In a general Git workflow, the user doesn't create blobs and trees directly. Instead, the user creates a commit, which is another type of object. A commit contains the hash of a tree, an author and a committer, a date for each of them, an optional message and, unless it's a root commit, the hash of each of its parents. That's how git log is able to show us a history of commits. The command takes a revision range (e.g. a ref, a commit, a range of commits) and uses it as the starting point for traversing the commit graph.
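To make the parent chain concrete, here's a minimal sketch that builds two commits in a throwaway repository and follows the parent link by hand. The temporary directory, the file name and the identity values are placeholders chosen for illustration.

```shell
# Throwaway repository; the identity values below are placeholders.
repo=$(mktemp -d) && cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Foo" GIT_AUTHOR_EMAIL="foo@bar.com"
export GIT_COMMITTER_NAME="Foo" GIT_COMMITTER_EMAIL="foo@bar.com"

echo one > bar && git add bar && git commit -q -m "First"
echo two > bar && git add bar && git commit -q -m "Second"

# A commit stores its tree, its parent(s), the author, the committer
# and the message.
git cat-file -p HEAD

# HEAD^ is the first parent of HEAD; git log follows these links.
parent=$(git rev-parse HEAD^)
count=$(git rev-list --count HEAD)
echo "parent=$parent commits=$count"
```

The first commit has no parent line at all, which is where the traversal stops.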
So, if the user doesn't create blobs and trees directly, but commits instead, where does the tree used by the next commit come from? The answer is another data structure managed by Git: the index. The index is located at .git/index and contains a representation of the tree used by the next commit. When git commit is called, a tree is created from the index, then this tree is used to create a commit. The format of the file is specified in index-format.txt, and its specifics are outside the scope of this post.
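The index can be inspected without reading the binary file directly. A small sketch, run in a throwaway repository (the directory and file name are placeholders):

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
echo foo > bar
git add bar

# Each index entry lists the mode, the blob hash, a stage number and the path.
git ls-files --stage

# write-tree turns the current index into a tree object and prints its hash.
tree=$(git write-tree)
git cat-file -p "$tree"
```

Both commands list the same entry for bar: one reads it from the index, the other from the tree that was built from it.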
The last type of object is the tag. Tags in Git are used to refer to an object by another name. There are two types of tags: lightweight and annotated. The former is just a ref pointing to a commit and doesn't create an object, while the latter is an object containing when the tag was created, who created it and a message. An annotated tag also creates a ref the same way as a lightweight one does.
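The difference between the two kinds of tags shows up when asking Git what each ref points to. A sketch in a throwaway repository, where the tag names light and annotated and the identity values are made up for the example:

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Foo" GIT_AUTHOR_EMAIL="foo@bar.com"
export GIT_COMMITTER_NAME="Foo" GIT_COMMITTER_EMAIL="foo@bar.com"
echo foo > bar && git add bar && git commit -q -m "Add bar"

git tag light                        # lightweight: only a ref
git tag -a annotated -m "Release"    # annotated: a tag object plus a ref

# The lightweight tag's ref resolves straight to the commit...
light_type=$(git cat-file -t "$(git rev-parse light)")
# ...while the annotated tag's ref resolves to a tag object.
annotated_type=$(git cat-file -t "$(git rev-parse annotated)")
echo "$light_type $annotated_type"
# commit tag
```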
Refs
Git has an object database whose items can be addressed by the hash of their contents, but there's still something missing. How can you know the hash of the last commit, or how can Git know which commit to use as the parent of your next one? For that, there are refs. Refs are files that contain the hash of an object (usually a commit) or the name of another ref (symbolic ref) and, apart from a few, are located at .git/refs.
There are some special refs, including:
- refs/heads/*: branches.
- refs/remotes/*: remote-tracking branches.
- refs/tags/*: tag refs.
- HEAD: ref that either points to a branch, making it the current/active branch, or to a commit, in which case the repository would be in detached HEAD state.
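Both states of HEAD can be observed directly. The sketch below assumes a throwaway repository that stores refs as plain files (the default); the file name and identity values are placeholders.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Foo" GIT_AUTHOR_EMAIL="foo@bar.com"
export GIT_COMMITTER_NAME="Foo" GIT_COMMITTER_EMAIL="foo@bar.com"
echo foo > bar && git add bar && git commit -q -m "Add bar"

# HEAD is a symbolic ref: it holds the name of the current branch.
branch=$(git symbolic-ref HEAD)
# The branch ref holds the hash of the commit it points to.
commit=$(git rev-parse "$branch")
cat .git/HEAD                  # ref: <name of the branch>
git for-each-ref refs/heads/   # <hash> commit <name of the branch>

# Detaching HEAD replaces the branch name with the commit hash.
git checkout -q --detach
cat .git/HEAD
```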
Bare repositories
Up until this point we considered Git-related files to be located under the .git directory, and that's generally the case, but we can also create a repository that has no working tree and keeps those files at the top level of the repository directory instead. The working tree is the current state of the directory of your project, but sometimes this tree is not necessary. Sometimes you don't need to be able to edit the files and just want to use this repository as a remote. That's what a bare repository is for.
A bare repository is created by calling git init --bare, and you typically interact with it by using it as a remote or by using plumbing commands.
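A quick way to see the difference in layout, sketched with a temporary directory as a placeholder location:

```shell
bare=$(mktemp -d)
git init -q --bare "$bare"

# The files that normally sit under .git are at the top level.
ls "$bare"

is_bare=$(git -C "$bare" rev-parse --is-bare-repository)
echo "$is_bare"
# true
```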
Plumbing and porcelain
You usually interact with a repository by using commands like commit, branch, checkout etc. These commands are called porcelain commands and abstract some operations done by Git so that it's easier to use. The other type of command is called plumbing, and each of these commands is more focused on performing a specific operation on the filesystem. As an example:
Using porcelain commands
# Create the file
echo foo > bar
# Create an object from the content of the bar file.
# Add the new object to the index.
git add bar
# Create a tree from the current index.
# Create a commit object from the new tree whose parent is
# the commit pointed by HEAD after dereferencing it, if there's one.
# Update the branch pointed by HEAD to point to the new commit.
git commit -m "Add bar"
Using plumbing commands
Note: If you run the commands below, the hashes output by git commit-tree won't be the same. Remember, the hash of a commit is based on its content, which includes the author, the committer and their dates.
# Create the file
echo foo > bar
# Create an object from the content of the bar file.
git hash-object -w -- bar # 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
# Add the new object to the index.
git update-index --add \
--cacheinfo 100644,257cc5642cb1a054f08cc83f2d943e56fd3ebe99,bar
# Create a tree from the current index.
git write-tree # efbc17e61e746dad5c834bcb94869ba66b6264f9
# Create a commit object from the new tree whose parent is
# the commit pointed by HEAD after dereferencing it, if there's one.
if [ "$(git show-ref --head -s HEAD)" ]; then
    git commit-tree efbc17e61e746dad5c834bcb94869ba66b6264f9 \
        -m "Add bar" \
        -p "$(git show-ref --head -s HEAD)"
    # ca541533f0062a19e4dfc21663c1c9d8eebba127
else
    git commit-tree efbc17e61e746dad5c834bcb94869ba66b6264f9 \
        -m "Add bar"
    # ca541533f0062a19e4dfc21663c1c9d8eebba127
fi
# Update the branch pointed by HEAD to point to the new commit.
git update-ref HEAD ca541533f0062a19e4dfc21663c1c9d8eebba127
Note: The 100644 passed to git update-index refers to the mode of the index entry, as specified in index-format.txt.
I think this recap of some Git concepts is enough to get us started on the database. Note that there are many more Git concepts than the ones presented above, so take a look at the Git reference or the Pro Git book if you want to know more.
The DB
Let's start by creating the database itself:
git init --bare
It's a bare repository because there's no point in having two versions of the data. We just need the one stored in the object database. Speaking of which, let's create our first record:
echo '{"name": "Foo"}' | git hash-object -w --stdin
# 11a78487c8c5924f7d05f05ee223898fc6608cf4
We can use the cat-file command to check if the record was saved correctly:
git cat-file -p 11a78487c8c5924f7d05f05ee223898fc6608cf4
# {"name": "Foo"}
The cat-file command provides information about stored objects, like type, size and content. The -p flag asks for the content of the object to be printed based on its type, e.g. print a list of nodes when given a tree or print the raw content given a blob.
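The type and size can be queried with the -t and -s flags. A short sketch against a throwaway bare repository (the temporary directory is a placeholder):

```shell
repo=$(mktemp -d)
git init -q --bare "$repo" && cd "$repo"
hash=$(echo '{"name": "Foo"}' | git hash-object -w --stdin)

type=$(git cat-file -t "$hash")   # object type
size=$(git cat-file -s "$hash")   # content size in bytes, header excluded
echo "$type $size"
# blob 16
```

The size is 16 rather than 15 because echo appends a newline to the record.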
Let's not forget that we've stored the content of the record, but not its name. For that, we need a tree, and the most straightforward way to do that is to create one from the index. So, let's add our new object to the index:
git update-index --add \
--cacheinfo 100644,11a78487c8c5924f7d05f05ee223898fc6608cf4,data
We touched on it briefly before, but let's now understand where the 100644 number, the mode of the index entry, comes from. The mode of an index entry is a 32-bit number whose 16 low-order bits are split, from high to low, as follows:
- 4 bits for the object type.
- 3 bits unused.
- 9 bits for the Unix permission (only 0000, 0755 and 0644 are accepted).
The 100644 number is an octal number and its binary representation is 1000000110100100. It means:
- 1000: object type that represents a regular file.
- 000: three unused bits.
- 110100100: a file that can be read and written by the owner and only read by the group and other users (0644 in octal).
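The same split can be checked with shell arithmetic. This is just a sketch for verifying the bit layout described above; the variable names are arbitrary.

```shell
mode=$((0100644))                 # a leading 0 marks an octal literal
type=$(( (mode >> 12) & 0xF ))    # top 4 bits: object type
unused=$(( (mode >> 9) & 0x7 ))   # next 3 bits: unused
perm=$(( mode & 0x1FF ))          # low 9 bits: Unix permission
printf 'type=%d unused=%d perm=%04o\n' "$type" "$unused" "$perm"
# type=8 unused=0 perm=0644
```

The type value 8 is 1000 in binary, the regular-file type from the list.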
After adding the file to the index, we create a tree from it:
git write-tree
# f1db34daa05612f5e50f855715065cf26c929b19
Now we make our first commit. We don't need to check whether HEAD resolves to an existing commit because we know this is the first one. If we did, we'd use the version with an if statement presented when talking about plumbing commands.
echo -n | GIT_AUTHOR_NAME="Foo" \
GIT_AUTHOR_EMAIL="foo@bar.com" \
GIT_AUTHOR_DATE="2020-06-16T13:00:00Z" \
GIT_COMMITTER_NAME="Foo" \
GIT_COMMITTER_EMAIL="foo@bar.com" \
GIT_COMMITTER_DATE="2020-06-16T13:00:00Z" \
git commit-tree f1db34daa05612f5e50f855715065cf26c929b19
# 383f6fb5445bd2dd84b5c2b52d80565b8973d111
This time, we used commit-tree with environment variables. These variables tell commit-tree not to use the default values when building the commit, making the resulting hash the same no matter where or when the command is run. Instead of passing a message using -m, we passed it through stdin by piping the output of the echo -n command into git commit-tree. We're not passing any argument to echo so that the commit has an empty message, and -n tells the command not to append a \n character to the output. To be sure, let's check that the commit indeed has an empty message by doing a hexdump of the commit object's file.
pigz -c -z -d objects/38/3f6fb5445bd2dd84b5c2b52d80565b8973d111 | hexdump -C
# 00000000 63 6f 6d 6d 69 74 20 31 33 34 00 74 72 65 65 20 |commit 134.tree |
# 00000010 66 31 64 62 33 34 64 61 61 30 35 36 31 32 66 35 |f1db34daa05612f5|
# 00000020 65 35 30 66 38 35 35 37 31 35 30 36 35 63 66 32 |e50f855715065cf2|
# 00000030 36 63 39 32 39 62 31 39 0a 61 75 74 68 6f 72 20 |6c929b19.author |
# 00000040 46 6f 6f 20 3c 66 6f 6f 40 62 61 72 2e 63 6f 6d |Foo <foo@bar.com|
# 00000050 3e 20 31 35 39 32 33 31 32 34 30 30 20 2b 30 30 |> 1592312400 +00|
# 00000060 30 30 0a 63 6f 6d 6d 69 74 74 65 72 20 46 6f 6f |00.committer Foo|
# 00000070 20 3c 66 6f 6f 40 62 61 72 2e 63 6f 6d 3e 20 31 | <foo@bar.com> 1|
# 00000080 35 39 32 33 31 32 34 30 30 20 2b 30 30 30 30 0a |592312400 +0000.|
# 00000090 0a |.|
# 00000091
Any object file in Git starts with a header containing the object type (commit), a space, the size of the object's content in bytes (134) and a NUL character (the . after 134). The concatenation of the header and the object's content is then compressed by zlib; that's what ends up being stored in the file, and it's why we piped pigz to hexdump. After the committer's date, there are two . characters. If you look at the hex table, you'll see that both of these dots refer to the same byte, 0x0a, which represents the \n (new line, line feed) character. The commit message is placed after these two line feeds and, as you can see, there's nothing after them in our newly-created commit.
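The header also explains where an object's hash comes from: it is the hash of the header followed by the content. A minimal sketch that rebuilds the blob hash from the plumbing example by hand, assuming a repository using SHA-1 (the default) and that the sha1sum tool from GNU coreutils is available:

```shell
content='foo'
# echo appends a newline, so the stored content is "foo\n" (4 bytes).
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')
# Header ("blob 4" plus a NUL byte), then the content; hash the concatenation.
hash=$( { printf 'blob %s\0' "$size"; printf '%s\n' "$content"; } \
    | sha1sum | cut -d ' ' -f 1)
echo "$hash"
# 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
```

This is the same value git hash-object printed for the bar file earlier, and no Git command was needed to compute it.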
The last step is to create the ref:
git update-ref refs/note1 383f6fb5445bd2dd84b5c2b52d80565b8973d111
And that's it for the database. If we were to create a new version for note1, we'd follow the same steps, but would also pass a parent commit when running commit-tree.
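Putting the steps together, creating versions and reading the history back could look like the sketch below. The save_version function is a hypothetical helper written for this example, not something Git provides, and the temporary directory and identity values are placeholders.

```shell
# Throwaway bare repository standing in for the database.
db=$(mktemp -d)
git init -q --bare "$db" && cd "$db"
export GIT_AUTHOR_NAME="Foo" GIT_AUTHOR_EMAIL="foo@bar.com"
export GIT_COMMITTER_NAME="Foo" GIT_COMMITTER_EMAIL="foo@bar.com"

# Hypothetical helper: stores stdin as a new version of the record
# kept under the given ref.
save_version() {
    ref=$1
    blob=$(git hash-object -w --stdin)
    git update-index --add --cacheinfo "100644,$blob,data"
    tree=$(git write-tree)
    # Pass a parent only if the ref already points to a commit.
    if parent=$(git rev-parse -q --verify "$ref"); then
        commit=$(git commit-tree "$tree" -p "$parent" </dev/null)
    else
        commit=$(git commit-tree "$tree" </dev/null)
    fi
    git update-ref "$ref" "$commit"
}

echo '{"name": "Foo"}' | save_version refs/note1
echo '{"name": "Bar"}' | save_version refs/note1

# Print every version of the record, newest first.
for c in $(git rev-list refs/note1); do
    git cat-file -p "$c:data"
done
```

The loop prints the "Bar" version followed by the "Foo" version. Note that the sketch reuses one index for every record, which is only safe while each record lives at the same data path and writes don't happen concurrently.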
About libgit2
This post didn't cover how to integrate the database with the note-taking app. That's because we wouldn't use Git commands for that, but a Git implementation, such as libgit2, instead. Using commands would be very error-prone because of, among other things, having to parse the output of a command like git log. I've started a project that uses git2go, a Go package that provides bindings for libgit2, and this will be the subject of the next post.