
Remove files from Git history using git-filter-repo


Marco Franssen /

7 min read, 1399 words


Many of you have probably been in a situation where you committed a file to your repository that you shouldn't have committed in the first place, for example a file with credentials or a huge file that made cloning your repository very slow. There are already a lot of blogs and guides available on how to get such files completely removed, usually involving git filter-branch or BFG sorcery. In this blog I'm going to show you the new recommended way of doing this using git-filter-repo, which simplifies the process a lot.

Recently I had to rewrite a repository that slowed down the CI pipeline because some huge movies had been committed. In later commits those movies were removed again, but they still exist in the Git history and make the repository big and slow. In the past I have also committed some credentials that I had to remove. Trust me, we have all been there. To get the files removed, I started with the git filter-branch approach I have been using for years. Then I noticed the following message.

$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch my-huge-video.mp4' HEAD
WARNING: git-filter-branch has a glut of gotchas generating mangled history
     rewrites.  Hit Ctrl-C before proceeding to abort, then use an
     alternative filtering tool such as 'git filter-repo'
     (https://github.com/newren/git-filter-repo/) instead.  See the
     filter-branch manual page for more details; to squelch this warning,
     set FILTER_BRANCH_SQUELCH_WARNING=1.

Although the Git CLI gives us this warning, not much has been written about git-filter-repo yet. The documentation on GitHub also still refers to filter-branch. That was reason enough for me to write about this tool and raise awareness of it.

After reading a bit about git-filter-repo I figured it is well suited to straightforward rewriting use cases such as removing a file entirely from history. It simplifies that job: you have a simpler command at your fingertips, and you won't have to run things like BFG for the final cleanup on the remote. If you rely on the more advanced features of git filter-branch you can still use them, but for simple rewrites git-filter-repo is the recommended way to handle this kind of thing today.

So how do we get started? We first have to install git-filter-repo. This is as simple as running your package manager's install command, as described in the documentation. Below is how I did this on my MacBook using Homebrew.

brew install git-filter-repo
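A quick way to confirm the install worked is to check that the git-filter-repo executable can be found. This is only a minimal sketch: it assumes the tool was placed on your PATH (as Homebrew normally does), and the variable name filter_repo_status is just an illustration.

```shell
# Git runs `git filter-repo` by looking for an executable named git-filter-repo.
# Assumption: the package manager put it on PATH rather than in Git's exec-path.
if command -v git-filter-repo >/dev/null 2>&1; then
    filter_repo_status="installed"
else
    filter_repo_status="missing"
fi
echo "git-filter-repo: $filter_repo_status"
```

If this reports missing even though the install succeeded, the executable may live in a directory that is not on your PATH.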

With the tooling installed, I simply restarted my process of removing the files entirely from the repository. But before showing this specific step, I want to guide you through the process from the beginning.

I'm not responsible for any mistakes or loss of data when you perform this process on your own.

We start by cloning the repository using the --mirror option, which pulls all refs (branches, tags, and so on) to your local machine.

Cloning with --mirror creates a bare repository, so it does not give you a working tree to work in locally.

$ git clone --mirror git@github.com:marcofranssen/my-repo-to-be-rewritten.git
Cloning into bare repository 'my-repo-to-be-rewritten'...
remote: Enumerating objects: 211, done.
remote: Counting objects: 100% (211/211), done.
remote: Compressing objects: 100% (108/108), done.
remote: Total 211 (delta 98), reused 205 (delta 95), pack-reused 0
Receiving objects: 100% (211/211), 50.09 MiB | 5.21 MiB/s, done.
Resolving deltas: 100% (98/98), done.
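To see what "no workspace" means in practice, you can try a mirror clone on a throwaway repository first. The sketch below is only an illustration: the temporary directory, the source repository, and the Example identity are all made up and stand in for the real GitHub remote.

```shell
# Throwaway source repository standing in for the real remote (hypothetical).
work=$(mktemp -d)
git init --quiet "$work/source"
echo "hello" > "$work/source/README.md"
git -C "$work/source" add README.md
git -C "$work/source" -c user.name="Example" -c user.email="example@example.com" \
    commit --quiet -m "Initial commit"

# Mirror-clone it, the same way as the clone from GitHub above.
git clone --quiet --mirror "$work/source" "$work/mirror.git"

# A mirror clone is bare: Git's own files (HEAD, objects, refs) sit at the
# top level and README.md is not checked out anywhere.
is_bare=$(git -C "$work/mirror.git" rev-parse --is-bare-repository)
echo "bare repository: $is_bare"
ls "$work/mirror.git"
```

All Git commands still work inside the mirror, which is all we need for rewriting history.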

Next up I want to find the files that have been deleted in the past, as I don't remember all of them. With the following command, run from inside the cloned repository, I can easily find which files have been removed in previous commits.

$ git log --diff-filter=D --summary
commit 22a5d9613f51821b08d54e1d96cdae7c7b15652a
Author: Marco Franssen <[email protected]>
Date:   Tue Jul 28 13:24:24 2020 +0100

    Removed processed video data from repo and changed code to write outside of repo

 delete mode 100644 data/my-video.mp4
 delete mode 100644 data/my-other-video.mp4

commit 910a4e00df2044ad77feed05c1d1627490fa02b1
Author: Marco Franssen <[email protected]>
Date:   Wed Feb 26 15:09:31 2020 +0100

    Renaming video-processor.py to video_processor.py

 delete mode 100644 src/video-processor.py

commit ba73d5ecb5b18137f90e3eb186bb244a9368d7a5
Author: Marco Franssen <[email protected]>
Date:   Thu Mar 7 10:09:31 2020 +0100

    Removed service credentials

 delete mode 100644 .env

What we can see from the output above is that I removed a .env file that apparently contained some credentials, and that I removed some processed videos from the repository. A file was also renamed, but that is fine. Now that we know which files to remove, we can start using git filter-repo.
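On a repository with a long history the full log can get noisy. If you only want the paths, something like the sketch below might help. It builds a small throwaway repository (the file names and the Example identity are made up for the demonstration) and prints each deleted path once.

```shell
# Throwaway repository with one commit that adds files and one that deletes two.
work=$(mktemp -d)
repo="$work/repo"
git init --quiet "$repo"
commit() {
    git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
        commit --quiet -m "$1"
}

mkdir "$repo/data"
echo "SECRET=1" > "$repo/.env"
echo "fake video" > "$repo/data/my-video.mp4"
echo "code" > "$repo/app.py"
git -C "$repo" add .
commit "Add everything"

git -C "$repo" rm --quiet .env data/my-video.mp4
commit "Remove credentials and video"

# --name-only with an empty --format prints only the deleted paths;
# sed drops blank lines and sort -u removes duplicates.
deleted=$(git -C "$repo" log --all --diff-filter=D --name-only --format= \
    | sed '/^$/d' | LC_ALL=C sort -u)
echo "$deleted"
```

Note that a renamed file can show up in this list as well, just like in the summary output above, so review the list before removing anything. The list then feeds straight into the git filter-repo runs that follow.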

$ git filter-repo --invert-paths --path .env
Parsed 197 commits
New history written in 0.11 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
Enumerating objects: 210, done.
Counting objects: 100% (210/210), done.
Delta compression using up to 12 threads
Compressing objects: 100% (127/127), done.
Writing objects: 100% (210/210), done.
Building bitmaps: 100% (48/48), done.
Total 210 (delta 98), reused 144 (delta 75), pack-reused 0
Completely finished after 0.64 seconds.
$ git filter-repo --invert-paths --path data/my-video.mp4
Parsed 197 commits
New history written in 0.09 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
Enumerating objects: 210, done.
Counting objects: 100% (210/210), done.
Delta compression using up to 12 threads
Compressing objects: 100% (127/127), done.
Writing objects: 100% (210/210), done.
Building bitmaps: 100% (48/48), done.
Total 210 (delta 98), reused 144 (delta 75), pack-reused 0
Completely finished after 0.64 seconds.
$ git filter-repo --invert-paths --path data/my-other-video.mp4
Parsed 197 commits
New history written in 0.10 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
Enumerating objects: 210, done.
Counting objects: 100% (210/210), done.
Delta compression using up to 12 threads
Compressing objects: 100% (127/127), done.
Writing objects: 100% (210/210), done.
Building bitmaps: 100% (48/48), done.
Total 210 (delta 98), reused 144 (delta 75), pack-reused 0
Completely finished after 0.64 seconds.
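I ran git filter-repo once per file, but the --path option can be repeated, so the three runs can probably be combined into a single pass over the history. The sketch below only builds and prints that command so you can review it first. The variable names are illustrative, and it assumes none of the paths contain whitespace.

```shell
# Paths to drop, taken from the `git log --diff-filter=D --summary` output above.
# Assumption: no path contains whitespace.
targets=".env data/my-video.mp4 data/my-other-video.mp4"

# Start with --invert-paths, then add one --path per file.
set -- --invert-paths
for p in $targets; do
    set -- "$@" --path "$p"
done

# Print the command for review; run it from inside the mirror clone.
filter_cmd="git filter-repo $*"
echo "$filter_cmd"
```

This prints git filter-repo --invert-paths --path .env --path data/my-video.mp4 --path data/my-other-video.mp4, which you can then run yourself inside the mirror clone.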

We have now fully rewritten our repository and got rid of the files we accidentally committed. There are two steps left: pushing the repository and informing your team members that they need to make a fresh clone of the repository.

Yes, this time it will be fast, as the big files are gone!

$ git push --no-verify --mirror
Enumerating objects: 199, done.
Writing objects: 100% (199/199), 266.73 KiB | 3.03 MiB/s, done.
Total 199 (delta 0), reused 0 (delta 0), pack-reused 199
remote: Resolving deltas: 100% (96/96), done.
To github.com:marcofranssen/my-repo-to-be-rewritten.git
 + 0b09ac1...3c843c0 feature/ci -> feature/ci (forced update)
 + 175089f...7fa3dfb master -> master (forced update)
 * [new reference]   refs/replace/0362424174271cda267832b0ef3e78dfeb59ec72 -> refs/replace/0362424174271cda267832b0ef3e78dfeb59ec72
 * [new reference]   refs/replace/22a5d9613f51821b08d54e1d96cdae7c7b15652a -> refs/replace/22a5d9613f51821b08d54e1d96cdae7c7b15652a
 * [new reference]   refs/replace/e09dd373762b51f361d0033326982db62e3a524f -> refs/replace/e09dd373762b51f361d0033326982db62e3a524f
 * [new reference]   refs/replace/f1831654ceb3cad707d9d3f579a6aac14144e139 -> refs/replace/f1831654ceb3cad707d9d3f579a6aac14144e139
 ! [remote rejected] refs/pull/1/head -> refs/pull/1/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/2/head -> refs/pull/2/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/2/merge -> refs/pull/2/merge (deny updating a hidden ref)
 ! [remote rejected] refs/pull/3/head -> refs/pull/3/head (deny updating a hidden ref)
error: failed to push some refs to 'github.com:marcofranssen/my-repo-to-be-rewritten.git'

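You can ignore the deny updating a hidden ref messages. Those are related to GitHub pull requests, whose refs are read-only. If you check the repository on GitHub you will see that the new history is there, and when cloning you will notice it clones much faster because the huge files have been removed entirely. Because those pull request refs still point at the old commits, treat any credentials that were ever pushed as compromised and rotate them.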
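If you want to double check the result, you can ask Git whether any commit still touches a given path. The sketch below uses a hypothetical helper, path_in_history, and a throwaway repository (made-up files and Example identity) to show that simply deleting a file in a later commit leaves it in history. Run against a fresh clone of the rewritten repository, the same check should report that the path is no longer there.

```shell
# Hypothetical helper: succeeds when at least one commit on any ref touches the path.
path_in_history() {
    [ -n "$(git -C "$1" log --all --format=%H -- "$2")" ]
}

# Throwaway repository where .env is committed and then deleted again.
work=$(mktemp -d)
repo="$work/repo"
git init --quiet "$repo"
echo "SECRET=1" > "$repo/.env"
echo "code" > "$repo/app.py"
git -C "$repo" add .
git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
    commit --quiet -m "Add files"
git -C "$repo" rm --quiet .env
git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
    commit --quiet -m "Remove credentials"

# The deleted file is still reachable through history.
if path_in_history "$repo" .env; then
    env_status="still in history"
else
    env_status="not in history"
fi
# A path that was never committed is not found.
if path_in_history "$repo" never-committed.txt; then
    other_status="still in history"
else
    other_status="not in history"
fi
echo ".env: $env_status"
echo "never-committed.txt: $other_status"
```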

Bonus

To make things a bit easier I have also added an alias to my .gitconfig, which lets me find removed files in my Git history with a shorter command that is easier to remember.

git config --global alias.deleted "log --diff-filter=D --summary"

This allows me to type git deleted as opposed to git log --diff-filter=D --summary.
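If you would rather try the alias before adding it to your global configuration, you can set it in a single repository by leaving out --global. A minimal sketch, using a throwaway repository created just for the demonstration:

```shell
# Throwaway repository, so the global .gitconfig stays untouched.
work=$(mktemp -d)
git init --quiet "$work/repo"

# Same alias as above, but stored in this repository's own config only.
git -C "$work/repo" config alias.deleted "log --diff-filter=D --summary"

# Read it back to confirm what `git deleted` will expand to.
alias_value=$(git -C "$work/repo" config --get alias.deleted)
echo "alias.deleted = $alias_value"
```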

You can find my entire .gitconfig, containing a whole bunch of aliases, in my dotfiles GitHub repository. These aliases save me a lot of typing and speed up my development process.


Thanks a lot for reading my blog. Please raise some awareness of this tool so others also have a less error-prone way to clean up their Git history.
