Remove files from Git history using git-filter-repo

Many of you have probably committed a file to a repository that should never have been there: a file with credentials, for example, or a huge file that made every clone of the repository painfully slow. Plenty of blogs and guides already explain how to get such files completely removed, usually with git filter-branch or BFG sorcery. In this blog I’m going to show you the newer, recommended way of doing this using git-filter-repo, which simplifies the process a lot.

Recently I had to rewrite a repository that slowed down the CI pipeline because some huge movies had been committed to it. Those movies were removed again in later commits, but they still exist in the Git history and keep the repository big and slow to clone. In the past I have also committed a credential that I had to remove. Trust me, we have all been there. To get the files removed, I started with the git filter-branch approach I have used for years. Then I noticed the following message.

$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch my-huge-video.mp4' HEAD
WARNING: git-filter-branch has a glut of gotchas generating mangled history
rewrites. Hit Ctrl-C before proceeding to abort, then use an
alternative filtering tool such as 'git filter-repo'
(https://github.com/newren/git-filter-repo/) instead. See the
filter-branch manual page for more details; to squelch this warning,
set FILTER_BRANCH_SQUELCH_WARNING=1.

Although the Git CLI gives us this warning, not much has been written about git-filter-repo yet. At the time of writing, the documentation on GitHub also still refers to filter-branch. Reason enough for me to write about this tool and raise some awareness of it.

After reading up on git-filter-repo, I figured it is well suited to straightforward rewriting use cases such as removing a file entirely from history. You get a simpler command at your fingertips, and you won’t have to run things like BFG for the final cleanup on the remote. git filter-branch is still available for other use cases, but for simple rewrites git-filter-repo is the recommended way of handling these kinds of things today.

So how do we get started? First we have to install git-filter-repo. According to the documentation, this is as simple as running the install command of your package manager. Below is how I did this on my MacBook using Homebrew.

macOS
$ brew install git-filter-repo
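
After installing, it can be worth confirming that Git actually finds the new subcommand before you start rewriting anything. The following is a minimal sketch that assumes your installed version supports the --version flag; the status variable is just an illustrative name.

```shell
#!/bin/sh
# Sketch: confirm that git can find the filter-repo subcommand.
# Assumption: the installed git-filter-repo supports --version.
if git filter-repo --version >/dev/null 2>&1; then
    status="installed"
else
    status="missing"
fi
echo "git-filter-repo: $status"
```

If this reports "missing", check that the directory your package manager installed into is on your PATH.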

With the tooling installed, I simply restarted the process of removing the files entirely from the repository. Before showing that specific step, though, I want to guide you through the process from the beginning.

I’m not responsible for any mistakes or loss of data when you perform this process on your own. To become a master at any skill, it takes the total effort of your heart, mind, and soul.

We start by cloning the repository using the --mirror option, which copies all refs (branches, tags, and so on) into a local repository.

Cloning via --mirror gives you a bare repository, so there is no working tree to edit files in locally.

$ git clone --mirror git@github.com:marcofranssen/my-repo-to-be-rewritten.git
Cloning into bare repository 'my-repo-to-be-rewritten'...
remote: Enumerating objects: 211, done.
remote: Counting objects: 100% (211/211), done.
remote: Compressing objects: 100% (108/108), done.
remote: Total 211 (delta 98), reused 205 (delta 95), pack-reused 0
Receiving objects: 100% (211/211), 50.09 MiB | 5.21 MiB/s, done.
Resolving deltas: 100% (98/98), done.
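
If you want to see what a mirror clone looks like without touching a real remote, here is a minimal sketch that uses a throwaway local repository as the "remote". The directory names source and my-repo.git and the Demo identity are made up for illustration and are not part of the original walkthrough.

```shell
#!/bin/sh
# Sketch: inspect a --mirror clone, using a throwaway local "remote".
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# Build a tiny source repository to stand in for the real remote (hypothetical).
git init -q source
git -C source -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "Initial commit"

# Mirror-clone it; by convention the target directory ends in .git.
git clone -q --mirror source my-repo.git
cd my-repo.git

# A mirror clone is bare: there is no working tree to check files out into.
is_bare=$(git rev-parse --is-bare-repository)
echo "bare repository: $is_bare"

# All refs from the source are copied as-is.
git for-each-ref --format='%(refname)'

# The remote is flagged as a mirror, so a later push mirrors every ref back.
mirror_flag=$(git config --get remote.origin.mirror)
echo "remote.origin.mirror: $mirror_flag"
```

All following commands in this walkthrough are meant to be run from inside the cloned directory.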

Next up, inside the cloned repository, I want to find the files that have been deleted in the past, as I don’t remember all of them. The following command lists the files that were removed in previous commits.

$ git log --diff-filter=D --summary
commit 22a5d9613f51821b08d54e1d96cdae7c7b15652a
Author: Marco Franssen <marco.franssen@gmail.com>
Date: Tue Jul 28 13:24:24 2020 +0100

Removed processed video data from repo and changed code to write outside of repo

delete mode 100644 data/my-video.mp4
delete mode 100644 data/my-other-video.mp4

commit 910a4e00df2044ad77feed05c1d1627490fa02b1
Author: Marco Franssen <marco.franssen@gmail.com>
Date: Wed Feb 26 15:09:31 2020 +0100

Renaming video-processor.py to video_processor.py

delete mode 100644 src/video-processor.py

commit ba73d5ecb5b18137f90e3eb186bb244a9368d7a5
Author: Marco Franssen <marco.franssen@gmail.com>
Date: Thu Mar 7 10:09:31 2020 +0100

Removed service credentials

delete mode 100644 .env

What we can see from the above output is that I removed a .env file that apparently contained some credentials, and that I removed some processed videos from the repository. A file was also renamed, but that is fine. Now that we know which files to remove, we can start using git filter-repo.

$ git filter-repo --invert-paths --path .env
Parsed 197 commits
New history written in 0.11 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
Enumerating objects: 210, done.
Counting objects: 100% (210/210), done.
Delta compression using up to 12 threads
Compressing objects: 100% (127/127), done.
Writing objects: 100% (210/210), done.
Building bitmaps: 100% (48/48), done.
Total 210 (delta 98), reused 144 (delta 75), pack-reused 0
Completely finished after 0.64 seconds.
$ git filter-repo --invert-paths --path data/my-video.mp4
Parsed 197 commits
New history written in 0.09 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
Enumerating objects: 210, done.
Counting objects: 100% (210/210), done.
Delta compression using up to 12 threads
Compressing objects: 100% (127/127), done.
Writing objects: 100% (210/210), done.
Building bitmaps: 100% (48/48), done.
Total 210 (delta 98), reused 144 (delta 75), pack-reused 0
Completely finished after 0.64 seconds.
$ git filter-repo --invert-paths --path data/my-other-video.mp4
Parsed 197 commits
New history written in 0.10 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
Enumerating objects: 210, done.
Counting objects: 100% (210/210), done.
Delta compression using up to 12 threads
Compressing objects: 100% (127/127), done.
Writing objects: 100% (210/210), done.
Building bitmaps: 100% (48/48), done.
Total 210 (delta 98), reused 144 (delta 75), pack-reused 0
Completely finished after 0.64 seconds.
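
To double-check a rewrite, you can count how many commits on any ref still touch a given path; after a successful rewrite that count should be zero. The sketch below tries this in a throwaway repository. The commits_touching helper, the demo repository, and the fake API key are assumptions for illustration. As far as I can tell, --path may be given several times, so the sketch also removes all three paths in one invocation instead of three; that part only runs when git-filter-repo is installed.

```shell
#!/bin/sh
# Sketch: count commits that still touch a path, before and after a rewrite.
set -eu

# Number of commits on any ref that added, changed, or deleted the given path.
commits_touching() {
    git log --all --oneline -- "$1" | wc -l | tr -d ' '
}

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"
git init -q demo
cd demo
git config user.name "Demo"
git config user.email "demo@example.com"

# Commit a credentials file, then delete it again in a later commit.
printf 'API_KEY=not-a-real-key\n' > .env
printf 'print("hello")\n' > app.py
git add .env app.py
git commit -q -m "Add app and credentials"
git rm -q .env
git commit -q -m "Removed service credentials"

before=$(commits_touching .env)
echo "commits touching .env before rewrite: $before"

# Only attempt the rewrite when git-filter-repo is available.
if git filter-repo --version >/dev/null 2>&1; then
    # --force is used here only because this demo repo is not a fresh clone.
    git filter-repo --force --invert-paths \
        --path .env \
        --path data/my-video.mp4 \
        --path data/my-other-video.mp4
    after=$(commits_touching .env)
    echo "commits touching .env after rewrite: $after"
fi
```

In a real mirror clone you would leave out --force and simply run the count for each removed path after the rewrite.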

Now we have fully rewritten our repository and got rid of the files we accidentally committed. There are two steps left: pushing the repository and informing your team members that they need to make a fresh clone of the repository.

Yes, this time it will be fast, as the big files are gone!

$ git push --no-verify --mirror
Enumerating objects: 199, done.
Writing objects: 100% (199/199), 266.73 KiB | 3.03 MiB/s, done.
Total 199 (delta 0), reused 0 (delta 0), pack-reused 199
remote: Resolving deltas: 100% (96/96), done.
To github.com:marcofranssen/my-repo-to-be-rewritten.git
+ 0b09ac1...3c843c0 feature/ci -> feature/ci (forced update)
+ 175089f...7fa3dfb master -> master (forced update)
* [new reference] refs/replace/0362424174271cda267832b0ef3e78dfeb59ec72 -> refs/replace/0362424174271cda267832b0ef3e78dfeb59ec72
* [new reference] refs/replace/22a5d9613f51821b08d54e1d96cdae7c7b15652a -> refs/replace/22a5d9613f51821b08d54e1d96cdae7c7b15652a
* [new reference] refs/replace/e09dd373762b51f361d0033326982db62e3a524f -> refs/replace/e09dd373762b51f361d0033326982db62e3a524f
* [new reference] refs/replace/f1831654ceb3cad707d9d3f579a6aac14144e139 -> refs/replace/f1831654ceb3cad707d9d3f579a6aac14144e139
! [remote rejected] refs/pull/1/head -> refs/pull/1/head (deny updating a hidden ref)
! [remote rejected] refs/pull/2/head -> refs/pull/2/head (deny updating a hidden ref)
! [remote rejected] refs/pull/2/merge -> refs/pull/2/merge (deny updating a hidden ref)
! [remote rejected] refs/pull/3/head -> refs/pull/3/head (deny updating a hidden ref)
error: failed to push some refs to 'github.com:marcofranssen/my-repo-to-be-rewritten.git'

You can ignore the “deny updating a hidden ref” messages. Those concern refs that GitHub maintains for pull requests, which cannot be updated by a push. If you check the repository on GitHub, you will see that the new history is there, and when cloning you will notice that it clones much faster because the huge files have been removed entirely.
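
If you want to put a number on the difference, one option is to compare the packed object size of the mirror clone before and after the rewrite. The following is a minimal sketch in a throwaway repository; the pack_size_kib helper and the random video.bin file are made up for illustration.

```shell
#!/bin/sh
# Sketch: report how much disk space the packed objects of a repository use.
set -eu

# Print the packed object size in KiB for the repository in the current directory.
pack_size_kib() {
    git count-objects -v | awk '/^size-pack:/ { print $2 }'
}

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"
git init -q demo
cd demo
git config user.name "Demo"
git config user.email "demo@example.com"

# Commit 1 MiB of random (incompressible) data to stand in for a video file.
head -c 1048576 /dev/urandom > video.bin
git add video.bin
git commit -q -m "Add large binary"

# Pack everything so that size-pack reflects the whole history.
git gc -q

size=$(pack_size_kib)
echo "size-pack: $size KiB"
```

Running git count-objects -vH in your own clone prints the same figures in human-readable units.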

Bonus

To make things a bit easier, I have also added an alias to my .gitconfig that lets me find removed files in my Git history with a shorter command that is easier to remember.

$ git config --global alias.deleted "log --diff-filter=D --summary"

This allows me to type git deleted as opposed to git log --diff-filter=D --summary.
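
When the full log is more detail than you need, a variation on the same idea is to print each deleted path only once, across all refs. This is a minimal sketch, not part of my .gitconfig; the deleted_paths function and the demo repository are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: list each path that was ever deleted, once, without commit details.
set -eu

# Unique paths removed by any commit reachable from any ref.
deleted_paths() {
    git log --all --diff-filter=D --name-only --format= |
        sed '/^$/d' |
        LC_ALL=C sort -u
}

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"
git init -q demo
cd demo
git config user.name "Demo"
git config user.email "demo@example.com"

# Add two files and delete them again in a later commit.
mkdir data
printf 'fake video bytes\n' > data/my-video.mp4
printf 'API_KEY=not-a-real-key\n' > .env
git add data/my-video.mp4 .env
git commit -q -m "Add video and credentials"
git rm -q data/my-video.mp4 .env
git commit -q -m "Remove video and credentials"

deleted=$(deleted_paths)
printf '%s\n' "$deleted"
```

The resulting list can be used directly as the set of --path arguments for git filter-repo.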

You can find my entire .gitconfig, containing a whole bunch of aliases, in my dotfiles GitHub repository. These aliases save me a lot of typing and speed up my development process.

Thanks a lot for reading my blog. Please raise some awareness of this tool so others also have a less error-prone way to clean up their Git history.
