Monorepos have grown in popularity in recent years. For many of us, they are a
part of our daily Git workflows. The trouble is that working with them can be slow. Speeding up
a developer's workflow can reap huge savings in the long run for any team.
First, a word about monorepos. What does it mean for a repository to be a
monorepo anyway? It depends on who you ask, and the definition has become
more flexible over time, but a few characteristics come up again and again.
Characteristics of monorepos
Monorepos have the following characteristics.
Multiple sub-projects
The typical definition of "monorepo" is a repository that contains multiple sub-projects. For instance, let's imagine a repository with a web-facing front end,
a backend, an iOS app directory, and an Android app directory:
awesome-app/
|
|--backend/
|
|--web-frontend/
|
|--app-ios/
|
|--app-android/
awesome-app is a single repository:
git clone https://my-favorite-git-hosting-service.com/awesome-app.git
The Chromium repository is a good
example of this.
Large files
Repositories can also grow to be very large if large files are checked in. In
some cases, binaries or other large assets such as images are checked into the
repository to have their history tracked. Other times, large files are inadvertently
introduced into the repository. Because of the way Git stores history, even if these files are
removed immediately, the version that was checked in remains in the repository's history.
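To make this concrete, here is a minimal sketch that uses a throwaway local repository. The file name big-asset.bin, the 1 MiB size, and the commit identity are made up for illustration. It commits a large file, removes it in the next commit, and then lists the largest object still reachable from history.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Dev"
git config user.email "dev@example.com"

# Commit 1 MiB of random data, then delete it in the very next commit.
head -c 1048576 /dev/urandom > big-asset.bin
git add big-asset.bin
git commit -q -m "Add large asset by mistake"
git rm -q big-asset.bin
git commit -q -m "Remove large asset"

# The file is gone from the working tree...
ls

# ...but its blob is still reachable from history.
# List every reachable object with its type and size, largest first.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  sort -k2,2nr | head -n 1)
echo "$largest"
```

The same rev-list and cat-file pipeline can be run in an existing repository to see which paths account for most of its size.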
Old projects with deep histories
While Git is very good at compressing text files, when a Git repository has a deep history,
the need to keep all versions of a file around can cause the size of the repository to be huge.
The Linux repository is a good example of this.
For instance, the Linux project's first Git commit is from April 2005, and running
git rev-list --all --count gives us 1,120,826 commits! That's a lot of
history! Getting into Git internals a little bit: for every commit, Git stores a commit object,
tree objects describing the directory layout at that snapshot, and a blob object for each
new version of a file's contents. Unchanged files are shared between snapshots, but
a deep Git history still means a lot of Git data.
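As a rough illustration of how history turns into stored objects, the sketch below builds a tiny throwaway repository (the file name and commit messages are invented) and counts both commits and total reachable objects. Each of the three commits changes one file, so each adds a new commit, a new tree, and a new blob.

```shell
#!/bin/sh
set -eu

# Throwaway repository; names and contents are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Dev"
git config user.email "dev@example.com"

# Three commits, each changing the same file.
for i in 1 2 3; do
  echo "version $i" > notes.txt
  git add notes.txt
  git commit -q -m "Edit $i"
done

# Number of commits reachable from any ref.
commits=$(git rev-list --all --count)

# Number of commits, trees, and blobs reachable from any ref.
objects=$(git rev-list --all --objects | wc -l | tr -d ' ')

echo "commits: $commits"
echo "objects: $objects"

# Summary of the object store's size on disk.
git count-objects -vH
```

Running the last two commands in a long-lived repository gives a quick sense of how much data its history carries.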
Speeding up your Git workflow
Here are some features to help speed up your Git workflow.
Sparse checkout
Git's sparse checkout feature (the git sparse-checkout command) reduces the
number of files you check out to a subset of the repository. (NOTE: This feature
in Git is still marked experimental.) This is especially useful when a repository
contains many sub-projects.
Taking our example of a monorepo with multiple
sub-projects, let's say that as a front-end web developer I only need to make
changes to web-frontend/.
> git clone --no-checkout https://my-favorite-git-hosting-service.com/awesome-app.git
> cd awesome-app
> git sparse-checkout set web-frontend
> git checkout
Your branch is up to date with 'origin/master'.
> ls
web-frontend README.md
Or, if you've already checked out a worktree, sparse checkout can be used to remove
files from the worktree.
> git clone https://my-favorite-git-hosting-service.com/awesome-app.git
> cd awesome-app
> ls
backend web-frontend app-ios app-android README.md
> git sparse-checkout set web-frontend
Updating files: 100% (103452/103452), done.
> ls
web-frontend README.md
Sparse checkout will only include the directories indicated, plus all files
directly under the root repository directory.
This way, we only check out the directories that we need, which saves space locally.
It also saves time: each time git pull is run, only the files that are checked out
need to be updated.
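For readers who want to try this without a real remote, the following sketch builds a small local stand-in for awesome-app and applies sparse checkout to a clone of it. The placeholder files and the origin-repo directory name are invented. It also assumes Git 2.25 or newer, and it enables cone mode explicitly with init --cone so that files in the repository root stay checked out regardless of the Git version's default.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Build a small stand-in for awesome-app (contents are made up).
git init -q origin-repo
cd origin-repo
git config user.name "Example Dev"
git config user.email "dev@example.com"
for d in backend web-frontend app-ios app-android; do
  mkdir "$d"
  echo "placeholder" > "$d/file.txt"
done
echo "# awesome-app" > README.md
git add .
git commit -q -m "Initial layout"
cd ..

# Clone it, then restrict the working tree to web-frontend/.
git clone -q origin-repo awesome-app
cd awesome-app
git sparse-checkout init --cone
git sparse-checkout set web-frontend

# Show which directories sparse checkout is configured to include.
git sparse-checkout list

# Only web-frontend/ and root-level files remain in the working tree.
ls
```

Running git sparse-checkout disable afterwards restores the full working tree.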
More information can be found in the docs
for sparse checkout.
Partial clone
Partial clone has a similar goal to sparse checkout: reducing the amount of data
in your local Git repository. It provides the option to filter out
certain files, such as those above a given size, when cloning.
Partial clone is used by passing the --filter option to git clone.
git clone --filter=blob:limit=10m https://my-favorite-git-hosting-service.com/awesome-app.git
This will exclude any files of 10 megabytes or more from being copied to the local
repository. A full list of supported filters is included in the
[docs for git-rev-list](https://git-scm.com/docs/git-rev-list#Documentation/git-rev-list.txt