Learn system design from Git - Immutable content-addressable datastore
“Hi Jin, could you recommend a good source to learn system design?” Tommy asked one day.

“Sure. I would say Git. Learn the design decisions behind Git and its evolution from a specific tool for Linux kernel development to the general-purpose source control system that dominates software development nowadays,” I said. “I have been rereading some old books about Git recently. Twenty years after Git’s inception, I am as amazed as ever by the grace of its design.”

“Really? The Git we use every day? I never thought of it as a subject for system design study.” Tommy looked unconvinced.

“Well, let’s spend a few talks on some of the unique architectural ideas behind Git.”

Today, we will focus on Git as a content-addressable, immutable datastore with a version control user interface written on top of it.

Git stores each piece of content as a single object file, named with the SHA-1 checksum of the content and its header. The SHA-1 is written as 40 hexadecimal characters: the subdirectory that holds the file is named with the first 2 characters, and the filename is the remaining 38 characters.

The file’s name is derived from its content, not from the name users give to the content. With this scheme, Git’s author made two profound architectural decisions:

1. Decouple object content from the external, mutable name given by users. This is the so-called “content-addressable” design.
2. Content and its addressable ID are both immutable. Of the CRUD operations, Git objects support only Create and Read; there is no Update or Delete (Delete is an edge case we can ignore for now).

The datastore is organized in cleanly defined layers, which you can explore under .git/.

The objects directory is a key-value datastore that holds all the content for the repository, including blobs, trees, and commits. Blobs represent file data, trees represent directories, and commits represent the state of the repository at a given point in time. These objects are immutable.
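The naming scheme and the create-and-read-only rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Git's actual implementation: `object_id`, `object_path`, and `ObjectStore` are hypothetical names, the store is an in-memory dictionary rather than files on disk, and only blobs are shown. The one detail taken from Git itself is the header, which has the form `<type> <size>` followed by a NUL byte.

```python
import hashlib


def object_id(content: bytes, obj_type: str = "blob") -> str:
    """Hash the header plus the content, as Git does for a loose object."""
    # Header layout: "<type> <size in bytes>" followed by a NUL byte.
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def object_path(oid: str) -> str:
    """Split the 40-character ID into a 2-character directory and a 38-character file name."""
    return f".git/objects/{oid[:2]}/{oid[2:]}"


class ObjectStore:
    """Hypothetical in-memory stand-in for .git/objects: Create and Read only."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        oid = object_id(content)
        # Storing the same content twice is a no-op: same content, same key.
        self._objects.setdefault(oid, content)
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]


store = ObjectStore()
v1 = store.put(b"test content\n")
v2 = store.put(b"test content, edited\n")  # an "edit" creates a new object

print(v1)               # d670460b4b4aece5915caf5c68d12f560a9fe3e4
print(object_path(v1))  # .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
print(store.get(v1))    # the first version is still there, untouched
```

In a repository that uses SHA-1, running `echo 'test content' | git hash-object --stdin` should print the same ID as `v1`, which is a quick way to check the sketch against the real tool. Note that the user-visible file name never enters the hash, so renaming a file does not change the ID of its blob.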
Instead of altering existing objects, Git creates new objects to represent updated data and preserves the old ones. This design is crucial to ensuring the consistency and integrity of the repository's history.

The refs directory stores pointers to commit objects; these pointers represent branches, tags, remotes, and other references. The HEAD file points to the branch that is currently checked out, providing a reference to the latest commit on the active branch.

…

That’s it! A layer of immutable, content-addressable objects and a layer of mutable pointers.

“A complex system that works is invariably found to have evolved from a simple system that works. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work.” - John Gall
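To make the pointer layer concrete, here is a minimal sketch of how HEAD can be followed through refs to a commit ID. It rests on several assumptions: `resolve_head` and `demo` are hypothetical names, the sketch reads only plain ref files (it ignores packed refs and any other ref storage), and the demo builds a throwaway directory by hand with made-up 40-character IDs instead of using a real repository.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to a commit ID, reading plain ref files only."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Symbolic ref: HEAD names a branch, and the branch file holds the ID.
        return (git_dir / head[len("ref: "):]).read_text().strip()
    # Detached HEAD: the file holds a commit ID directly.
    return head


def demo() -> tuple[str, str]:
    """Move a branch pointer and observe that HEAD follows it."""
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = Path(tmp) / ".git"
        branch = git_dir / "refs" / "heads" / "main"
        branch.parent.mkdir(parents=True)
        (git_dir / "HEAD").write_text("ref: refs/heads/main\n")

        branch.write_text("a" * 40 + "\n")  # made-up ID of the first commit
        before = resolve_head(git_dir)

        # A new commit only rewrites this one small pointer file;
        # nothing in the object layer is modified.
        branch.write_text("b" * 40 + "\n")  # made-up ID of the second commit
        after = resolve_head(git_dir)
        return before, after


before, after = demo()
print(before != after)  # True: the pointer moved
```

The only mutation in the demo is the overwrite of the branch file, which matches the split described above: objects are write-once, while refs are small, cheap pointers that move.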