Many of you have probably committed a file to your repository that you shouldn’t have: a file with credentials, for example, or a huge file that makes cloning the repository very slow. There are already plenty of blogs and guides on how to remove such files completely. They usually involve git filter-branch or BFG sorcery. In this blog I’m going to show you the new recommended way of doing this using git-filter-repo, which simplifies the process a lot.
Recently I had to rewrite a repository that slowed down the CI pipeline because some huge movies had been committed. In later commits those movies were removed again, but they still exist in the Git history and keep the repository big and slow. In the past I have also committed credentials that I had to remove. Trust me, we have all been there. To get the files removed, I started with the git filter-branch approach I have used for years. Then I noticed the following message.
$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch my-huge-video.mp4' HEAD
Although the Git CLI gives us this warning, not much has been written about git-filter-repo yet. The documentation on GitHub also still refers to filter-branch. That was reason enough for me to write this post and raise awareness of the tool.
After reading a bit about git-filter-repo I figured it is meant for trivial rewriting use cases, like removing a file entirely from history. It gives you a simpler command for that, and you won’t have to run things like BFG for the final cleanup on the remote. For other use cases you can still use the more advanced features of git filter-branch, but for simple rewrites git-filter-repo is the recommended way of handling this kind of thing today.
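To see why deleting a file in a later commit is not enough, here is a minimal sketch in a throwaway repository. The file names and the 1 MiB size are made up for the demo; the pipeline at the end is one common way to list the largest blobs that are still reachable from history, not the only one.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository; all names here are assumptions for the demo.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

# Commit a 1 MiB file, then delete it again in a second commit.
echo "hello" > README.md
head -c 1048576 /dev/zero > big.bin
git add -f README.md big.bin
commit "add README and a big file"
git rm -q big.bin
commit "remove the big file"

# The working tree no longer has big.bin, but its blob is still reachable
# from history. List the largest blobs across all refs, biggest first.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob"' |
  sort -k2,2nr |
  head -n 5)
echo "$largest"
```

The deleted file still shows up at the top of the list, which is why every clone keeps paying for it until the history itself is rewritten.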
So how do we get started? We first have to install git-filter-repo. This is as simple as running your package manager’s install command, as described in the documentation. Below is how I did this on my MacBook using Homebrew.
brew install git-filter-repo
With the tooling installed, I simply restarted my process of removing the files entirely from the repository. But before showing that specific step, I want to guide you through the process from the beginning.
I’m not responsible for any mistakes or loss of data when you perform this process on your own. To become a master at any skill, it takes the total effort of your heart, mind, and soul.
We start by cloning the repository using the --mirror option, which takes care of pulling all branch information etc. locally. Note that --mirror does not give you a working tree to work in locally.
$ git clone --mirror firstname.lastname@example.org:marcofranssen/my-repo-to-be-rewritten.git
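If you want to see what a mirror clone looks like before touching the real repository, the following sketch mirrors a small local repository instead of a remote one. The directory names and the extra `feature` branch are assumptions for the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
src="$work/source"
mirror="$work/mirror.git"

# A small source repository standing in for the real remote.
git init -q "$src"
echo "hello" > "$src/README.md"
git -C "$src" add README.md
git -C "$src" -c user.name=demo -c user.email=demo@example.com commit -q -m "initial commit"
git -C "$src" branch feature

# Mirror-clone it, the same way as the real repository above.
git clone -q --mirror "$src" "$mirror"

# A mirror clone is bare: it has history and refs, but no checked-out files.
is_bare=$(git -C "$mirror" rev-parse --is-bare-repository)
branches=$(git -C "$mirror" for-each-ref --format='%(refname:short)' refs/heads | wc -l | tr -d ' ')
echo "bare: $is_bare, branches: $branches"
```

All branches come along, but there is no README.md on disk in the mirror, so every following command works on the history only.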
Next up I want to find the files that have been deleted in the past as I don’t remember all of them. With the following command I can easily find which files have been removed in previous commits.
$ git log --diff-filter=D --summary
What we can see from the above output is that I removed a .env file that apparently contained some credentials, and that I removed some processed videos from the repository. There was also a file renamed, but that is all fine. Now that we know which files to remove, we can start utilizing git-filter-repo.
$ git filter-repo --invert-paths --path .env
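The command above removes a single path. As a hedged sketch, several paths can be removed in one pass by repeating --path, and a quick log check afterwards confirms nothing in history touches them anymore. The path `videos/clip.mp4` and the throwaway repository are assumptions for illustration, and the rewrite step only runs when git-filter-repo is found on the PATH.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical paths to purge; replace them with the ones from your own log.
unwanted=(.env videos/clip.mp4)

# --invert-paths keeps everything EXCEPT the listed --path entries.
args=(--invert-paths)
for p in "${unwanted[@]}"; do
  args+=(--path "$p")
done
echo "git filter-repo ${args[*]}"

# Try it on a throwaway repository, only if git-filter-repo is available.
remaining="skipped"
if command -v git-filter-repo >/dev/null 2>&1; then
  demo=$(mktemp -d)
  git init -q "$demo"
  cd "$demo"
  commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }
  mkdir videos
  echo "hello" > README.md
  echo "SECRET=1" > .env
  echo "fake video" > videos/clip.mp4
  git add -f README.md .env videos/clip.mp4
  commit "add files"
  git rm -q .env videos/clip.mp4
  commit "remove files"

  # --force is needed here because this demo repo is not a fresh clone.
  git filter-repo --force "${args[@]}" >/dev/null 2>&1
  # Count commits on any ref that still touch the unwanted paths.
  remaining=$(git log --all --format=%H -- "${unwanted[@]}" | wc -l | tr -d ' ')
fi
echo "commits still touching unwanted paths: $remaining"
```

On a fresh mirror clone, as used in this post, the --force flag should not be necessary; it is only there because the demo repository was created locally.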
Now we have fully rewritten our repository and got rid of the files we accidentally committed. There are two steps left: pushing the repository and informing your team members that they need to make a fresh clone of the repository.
Yes, this time it will be fast as the big files are gone!
git push --no-verify --mirror
You can ignore the “deny updating a hidden ref” messages. Those are related to GitHub pull requests. If you check the repository on GitHub you will see the new history is there, and when cloning you will notice it is much faster because the huge files have been removed entirely.
To make things a bit easier I have also added an alias to my .gitconfig, which lets me find removed files in my Git history with a shorter command that is easier to remember.
git config --global alias.deleted "log --diff-filter=D --summary"
This allows me to type git deleted as opposed to git log --diff-filter=D --summary.
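When the summary output is more than you need, a variant that prints only the unique deleted paths can be handy as input for the filter-repo step. This is a minimal sketch: the function name `deleted_paths`, the added --all flag, and the throwaway repository are assumptions, not part of the alias above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each path that was deleted somewhere in history, once, sorted.
# Takes an optional repository directory (defaults to the current one).
deleted_paths() {
  git -C "${1:-.}" log --all --diff-filter=D --name-only --pretty=format: |
    sed '/^$/d' |
    sort -u
}

# Throwaway repository to try it on.
demo=$(mktemp -d)
git init -q "$demo"
commit() { git -C "$demo" -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }
echo "hello" > "$demo/README.md"
echo "SECRET=1" > "$demo/.env"
git -C "$demo" add -f README.md .env
commit "add files"
git -C "$demo" rm -q .env
commit "remove .env"

deleted_paths "$demo"
```

In the demo repository this lists only the .env file, since README.md was never deleted.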
You can find my entire .gitconfig, containing a whole bunch of aliases, in my dotfiles GitHub repository. These aliases save me a lot of typing and speed up my development process.
Thanks a lot for reading my blog. Please raise some awareness of this tool so others also have a less error-prone way to clean up their Git history.