Many of you have probably committed a file to your repository that should never have been there, for example a file with credentials or a huge file that makes cloning the repository very slow. Plenty of blogs and guides already explain how to get such files completely removed, usually with git filter-branch or BFG sorcery. In this blog I'm going to show you the new recommended way of doing this using git-filter-repo, which simplifies the process a lot.
Recently I had to rewrite a repository that slowed down the CI pipeline because some huge movies had been committed. Those movies were removed again in later commits, but they still exist in the Git history and keep the repository big and slow. In the past I have also committed credentials that I had to remove. Trust me, we have all been there. To get the files removed, I started with the git filter-branch approach I have used for years. Then I noticed a warning when running the following command.
$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch my-huge-video.mp4' HEAD
Although the Git CLI gives us this warning, not much has been written about git-filter-repo yet. The documentation on GitHub also still refers to filter-branch. That is reason enough for me to write about this tool and raise awareness of it.
After reading a bit about git-filter-repo I figured out that it makes common rewriting use cases, like removing a file entirely from history, trivial. You get a simpler command at your fingertips, and you won't have to run things like BFG for the final cleanup on the remote. git filter-branch is still available for other use cases, but for rewrites like these git-filter-repo is the recommended way of handling things today.
So how do we get started? First we have to install git-filter-repo. This is as simple as running your package manager's install command, as described in the documentation. Below is how I did this on my MacBook using Homebrew.
$ brew install git-filter-repo
With the tooling installed, I simply started the process of removing the files from the repository all over again. Before showing that specific step, I want to guide you through the process from the beginning.
I'm not responsible for any mistakes or loss of data when you perform this process on your own. To become a master at any skill, it takes the total effort of your heart, mind, and soul.
We start by cloning the repository using the --mirror option, which takes care of pulling all branches, tags, and other refs locally.
Cloning via --mirror creates a bare repository, so it does not give you a working tree to work in locally.
$ git clone --mirror git@github.com:marcofranssen/my-repo-to-be-rewritten.git
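To get a feel for what a mirror clone looks like before touching a real repository, here is a minimal sketch that mirror-clones a throwaway local repository. The directory names, the demo user, and the README.md file are made up for illustration; only the git commands themselves are real.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory; everything in here is throwaway (hypothetical names).
work=$(mktemp -d)

# Build a tiny source repository with a single commit.
git init -q "$work/source"
git -C "$work/source" config user.name "Demo User"
git -C "$work/source" config user.email "demo@example.com"
echo "hello" > "$work/source/README.md"
git -C "$work/source" add README.md
git -C "$work/source" commit -q -m "Initial commit"

# Mirror-clone it, the same way as the clone from GitHub.
git clone -q --mirror "$work/source" "$work/mirror.git"

# A mirror clone is a bare repository: there is no working tree.
is_bare=$(git -C "$work/mirror.git" rev-parse --is-bare-repository)
echo "bare repository: $is_bare"

# The refs of the source are copied one-to-one into the mirror.
git -C "$work/mirror.git" for-each-ref --format='%(refname)'
```

Because the mirror is bare, commands that need a working tree (like git status) will not work in it, but history commands such as git log work fine.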
Next up I want to find the files that have been deleted in the past as I don’t remember all of them. With the following command I can easily find which files have been removed in previous commits.
$ git log --diff-filter=D --summary
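If the summary output gets long, the same log filter can be condensed into a plain list of deleted paths. The sketch below tries this on a scratch repository; the file contents and the demo user are assumptions for illustration, and adding --all (to look at every ref instead of only the current branch) is my own choice, not part of the command above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository with two files that get committed and then deleted.
work=$(mktemp -d)
git init -q "$work/repo"
git -C "$work/repo" config user.name "Demo User"
git -C "$work/repo" config user.email "demo@example.com"

echo "# demo" > "$work/repo/README.md"
echo "API_TOKEN=not-a-real-token" > "$work/repo/.env"
echo "fake video bytes" > "$work/repo/my-huge-video.mp4"
git -C "$work/repo" add -A
git -C "$work/repo" commit -q -m "Add files"

git -C "$work/repo" rm -q .env my-huge-video.mp4
git -C "$work/repo" commit -q -m "Remove files that should not be here"

# --name-only prints just the paths, the empty format suppresses the
# commit headers, sed drops the blank separator lines, sort -u dedupes.
deleted=$(git -C "$work/repo" log --all --diff-filter=D --name-only --pretty=format: \
  | sed '/^$/d' | sort -u)

echo "$deleted"
```

The resulting list is a handy starting point for deciding which paths to hand to git filter-repo.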
The git log output shows that I removed a .env file that apparently contained some credentials, and that I removed some processed videos from the repository. A file was also renamed, but that is fine. Now that we know which files to remove, we can start using git filter-repo.
$ git filter-repo --invert-paths --path .env
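The --path option can be repeated, so several files can be removed in one pass. Below is a rough sketch you could run on a scratch repository before touching the real one. The repository layout and file contents are made up; --force is only there because this scratch repository is not a fresh clone, which git filter-repo otherwise refuses to rewrite. The script skips the rewrite when git-filter-repo is not installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository containing the two unwanted files in its history.
work=$(mktemp -d)
git init -q "$work/repo"
git -C "$work/repo" config user.name "Demo User"
git -C "$work/repo" config user.email "demo@example.com"

echo "# demo" > "$work/repo/README.md"
echo "API_TOKEN=not-a-real-token" > "$work/repo/.env"
echo "fake video bytes" > "$work/repo/my-huge-video.mp4"
git -C "$work/repo" add -A
git -C "$work/repo" commit -q -m "Add files"
git -C "$work/repo" rm -q .env my-huge-video.mp4
git -C "$work/repo" commit -q -m "Remove files that should not be here"

status="skipped"
leftover=""
if git filter-repo --version >/dev/null 2>&1; then
  # --invert-paths keeps everything EXCEPT the listed paths.
  (cd "$work/repo" && git filter-repo --force --invert-paths \
    --path .env --path my-huge-video.mp4)
  status="rewritten"
  # Count the commits on any ref that still touch the removed paths.
  leftover=$(git -C "$work/repo" log --all --oneline -- .env my-huge-video.mp4 \
    | wc -l | tr -d ' ')
  echo "commits still touching the removed files: $leftover"
else
  echo "git-filter-repo is not installed, skipping the rewrite"
fi
```

For a long list of files, git filter-repo also accepts --paths-from-file with one path per line, which is easier to review than a long command line.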
Now we have fully rewritten our repository and got rid of the files we accidentally committed. Two steps are left: pushing the repository and informing your team members to make a fresh clone of the repository.
Yes, this time it will be fast as the big files are gone!
$ git push --no-verify --mirror
You can ignore the "deny updating a hidden ref" messages. Those are related to GitHub pull requests. If you check the repository on GitHub you will see that the new history is there, and a fresh clone is much faster because the huge files have been removed entirely.
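To double-check a fresh clone, you can ask Git whether any commit on any ref still touches a given path. The helper below, path_in_history, is a hypothetical name of my own and not something Git ships. The sketch runs it against a scratch repository where .env was committed and later deleted, which shows that a plain deletion leaves the file in history.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Succeeds when at least one commit on any ref touches the given path.
# (Hypothetical helper, not part of Git.)
path_in_history() {
  local repo=$1 path=$2
  [ -n "$(git -C "$repo" rev-list --all -- "$path")" ]
}

# Scratch repository: .env is committed and later deleted again.
work=$(mktemp -d)
git init -q "$work/repo"
git -C "$work/repo" config user.name "Demo User"
git -C "$work/repo" config user.email "demo@example.com"
echo "API_TOKEN=not-a-real-token" > "$work/repo/.env"
echo "# demo" > "$work/repo/README.md"
git -C "$work/repo" add -A
git -C "$work/repo" commit -q -m "Add files"
git -C "$work/repo" rm -q .env
git -C "$work/repo" commit -q -m "Remove .env"

# Deleted in a later commit, but still present in history.
if path_in_history "$work/repo" .env; then env_found=yes; else env_found=no; fi
# A path that was never committed is not found.
if path_in_history "$work/repo" never-committed.txt; then other_found=yes; else other_found=no; fi

echo ".env in history: $env_found"
echo "never-committed.txt in history: $other_found"

# Object count and on-disk size of the repository, human-readable.
git -C "$work/repo" count-objects -vH
```

On a repository that has been rewritten and freshly cloned, the same check for .env should report that the path is no longer found.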
Bonus
To make things a bit easier I have also added an alias to my .gitconfig, which lets me find removed files in my Git history with a shorter command that is easier to remember.
$ git config --global alias.deleted "log --diff-filter=D --summary"
This allows me to type git deleted as opposed to git log --diff-filter=D --summary.
You can find my entire .gitconfig, containing a whole bunch of aliases, in my dotfiles GitHub repository. These aliases save me a lot of typing and speed up my development process.
Thanks a lot for reading my blog. Please raise some awareness about this tool so others also have a less error-prone way to clean up their Git history.