Git – How to Handle Large SVN History When Moving to Git

git svn

Edit: unlike some similar questions, such as Moving a multi-GB SVN repo to Git or https://stackoverflow.com/questions/540535/managing-large-binary-files-with-git, my scenario doesn't involve several subprojects that can easily be converted into git submodules, nor a few very large binary files that are well suited for git-annex. It is a single repository where the binaries are the test suite, which is tightly coupled to the main source code of the same revision, much as if they were compile-time assets such as graphics.

I'm investigating switching an old, medium/large-sized (50 users, 60k revisions, 80 GB history, 2 GB working copy) code repository from SVN. As the number of users has grown, there is a lot of churn in trunk, and features are often spread out over multiple commits, making code review hard to do. Also, without branching there is no way to "gate" bad code out; reviews can only be done after it has been committed to trunk. So I'm investigating alternatives. I was hoping we could move to git, but I'm having some problems.

The problem with the current repo, as far as git goes, is size. There is a lot of old cruft in there, and cleaning it with --filter-branch when converting to git can cut its size by an order of magnitude, to around 5-10 GB. This is still too big. The biggest reason for the large repository size is that there are a lot of binary documents used as inputs to tests. These files vary between 0.5 MB and 30 MB, and there are hundreds of them. They also change quite a lot. I have looked at submodules, git-annex, etc., but having the tests in a submodule feels wrong, as does using the annex for many files for which you want full history.
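For reference, a conversion-and-cleanup pass along these lines is what I have in mind (the SVN URL and the removed path are placeholders):

    # One-time conversion from SVN (URL and layout are placeholders)
    git svn clone --stdlayout https://svn.example.com/repo repo-git
    cd repo-git

    # Strip obsolete paths from all history to shrink the repository
    git filter-branch --index-filter \
        'git rm -r --cached --ignore-unmatch old/obsolete-dir' \
        --prune-empty -- --all

    # Remove the backup refs and repack so the space is actually reclaimed
    git for-each-ref --format='%(refname)' refs/original/ \
        | xargs -n 1 git update-ref -d
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive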

So the distributed nature of git is really what's blocking me from adopting it. I don't really care about distributed; I just want the cheap branching and powerful merging features. Like, I assume, 99.9% of git users, we will use a blessed, bare central repository.

I'm not sure I understand why each user has to have a full local history when using git. If the workflow isn't decentralized, what is that data doing on the users' disks? I know that recent versions of git let you make a shallow clone with only recent history. My question is: is it viable to do this as the standard mode of operation for an entire team? Can git be configured to always be shallow, so that full history exists only centrally and users by default have only 1000 revisions of history? The alternative, of course, would be to convert just the last 1000 revisions to git and keep the SVN repo around for archaeology. In that scenario, however, we'd hit the same problem again after the next several thousand revisions to the test documents.
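To make that concrete, what I'm imagining is that every developer clones roughly like this (the URL is a placeholder):

    # Clone only the most recent 1000 commits
    git clone --depth 1000 https://git.example.com/central-repo.git

    # History can be deepened later if someone needs it...
    git fetch --deepen=1000

    # ...or fetched in full
    git fetch --unshallow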

  • What is a good best practice for using git with large repos containing many binary files that you do want history for? Most best practices and tutorials seem to avoid this case. They solve the problem of a few huge binaries, or propose dropping the binaries entirely.
  • Is shallow cloning usable as a normal mode of operation or is it a "hack"?
  • Could submodules be used for code where you have a tight dependency between the main source revision and the submodule revision (such as compile-time binary dependencies, or a unit test suite)? See the sketch after this list for what I have in mind.
  • How big is "too big" for a git repository (on premises)? Should we avoid switching if we can get it down to 4GB? 2GB?
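For the submodule question, what I have in mind would look roughly like this (the URLs and paths are placeholders):

    # In the main repository: add the test-data repo as a submodule
    git submodule add https://git.example.com/test-data.git tests/data
    git commit -m "Pin test data as a submodule"

    # The superproject records the exact submodule commit, so bumping the
    # test data is an explicit, reviewable commit in the main repo
    git -C tests/data pull origin master
    git add tests/data
    git commit -m "Bump test data to latest revision"

    # Fresh clones need to fetch the submodule contents as well
    git clone --recurse-submodules https://git.example.com/central-repo.git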

Best Answer

Wow, that's a long question (and a complex problem). I'll try to have a go at it.

I'm not sure I understand why each user has to have a full local history when using git?

This is a central design decision with git. For the exact reasons you'd need to ask the author (Linus Torvalds), but as far as I know, the main reason is speed: Having everything local (on a fast disk or even cached in RAM) makes operations on history much faster by avoiding network access.

The biggest reason for the large repository size is that there are a lot of binary documents used as inputs to tests. These files vary between 0.5 MB and 30 MB, and there are hundreds of them. They also change quite a lot.

That is the point I would think about first. Having so many constantly changing binary files in source control seems problematic to me (even with SVN). Can't you use a different approach? Ideas:

  • Unlike source code, a 3 MB binary file is probably not written by hand. If some tool/process generates it, consider integrating that into your build, instead of storing the data.

  • If that is not practical, binary files are typically better off in an artifact repository (such as Artifactory for Maven & co.). Maybe that is an option for you; a sketch of how a build could pull them follows below.
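For example, a build step that pulls the test documents from such a store could be as simple as the following sketch (the URL, version file and archive name are made up):

    #!/bin/sh
    # Hypothetical fetch step: download versioned test documents from an
    # artifact store instead of keeping them in the repository.
    set -e

    ARTIFACT_BASE="https://artifacts.example.com/test-data"   # placeholder
    VERSION="$(cat tests/data.version)"                        # version pinned in the repo

    mkdir -p tests/data
    curl -fSL "$ARTIFACT_BASE/$VERSION/test-data.tar.gz" | tar -xz -C tests/data

The small version file stays in git, so each source revision still pins exactly which test data it belongs with.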

I have looked at submodules, git-annex etc, but having the tests in a submodule feels wrong, as does having annex for many files for which you want full history.

Actually, this looks like git-annex would fit perfectly. git-annex basically allows you to store file contents outside a git repository (the repository contains a placeholder instead). You can store the file contents in a variety of ways (central git repo, shared drive, cloud storage...), and you can control which contents you want to have locally.

Did you maybe misunderstand how git-annex works? git-annex does store full history for all the files it manages - it just lets you choose which file contents you want to have locally.
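A minimal sketch of that workflow (the paths are just examples; the actual content can live on a central server, a shared drive, or cloud storage):

    # Make the clone annex-aware
    git annex init "my laptop"

    # Large test documents go into the annex; git itself only tracks a
    # small pointer, the content is stored and transferred separately
    git annex add tests/data/*.pdf
    git commit -m "Add test documents via git-annex"

    # On another clone: fetch only the contents you actually need...
    git annex get tests/data/some-case.pdf

    # ...and drop local copies you no longer need (history stays intact)
    git annex drop tests/data/some-case.pdf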

Finally, about your questions:

What is a good best practice for using git with large repos containing many binary files that you do want history for?

In my experience, the options usually are:

  • avoid the need for binaries in the repo (generate them on demand, store them elsewhere)
  • use git-annex (or a similar solution, such as Git LFS; see the sketch after this list)
  • live with a big repo (not all git operations are affected by big files, and if you have a fast computer and drive, it can be quite workable)
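As mentioned in the list above, Git LFS works along similar lines; a minimal sketch (the tracked pattern and file names are just examples):

    # One-time setup per machine
    git lfs install

    # Tell LFS which files to manage; this writes .gitattributes
    git lfs track "*.pdf"
    git add .gitattributes

    # From now on, matching files are committed as small pointers, and
    # their content is uploaded to the LFS store when you push
    git add tests/data/big-input.pdf
    git commit -m "Track large test inputs with Git LFS"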

Is shallow cloning usable as a normal mode of operation or is it a "hack"?

That might be doable; however, I don't think this will solve your problem:

  • you'd lose some of git's benefits that come from having full history, such as fast searching of the history
  • merges can become tricky, because AFAIK you must have at least the history back to the branch point in order to merge
  • users would need to re-clone periodically to keep the size of their clone small
  • it's just an uncommon way of using git, so you'd likely run into problems with many tools

How big is "too big" for a git repository (on premises)? Should we avoid switching if we can get it down to 4GB? 2GB?

That depends on the structure of the repo (few/many files etc.), on what you want to do, on how beefy your computers are, and on your patience :-).

To give you a quick idea: on my (newish, but low-spec) laptop, committing a 500 MB file takes 30-60 s. Just listing history (git log etc.) is not affected by big files; things like "git log -S", which must scan file contents, are very slow - however, the speed is mainly dominated by I/O, so it's not really git's fault.

On a 3 GB repo with a handful of revisions, "git log -S" takes about a minute.

So I'd say a couple of GB is ok, though not ideal. More than 10-20 GB is probably pushing it, but it might be doable - you'd have to try it.
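If you do a trial conversion, it's easy to measure where you actually land; for example (the search string and path are just placeholders):

    # After converting and cleaning, repack and check the real on-disk size
    git gc --aggressive --prune=now
    git count-objects -vH

    # Then time the operations you care about on realistic hardware
    time git log --oneline | wc -l
    time git log -S"someFunction" -- src/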
