bitbucket.org lets you create unlimited private repos.
Git does not let you check out just part of a repository, so you would either need a separate repo for each project or accept cloning all the projects together. In reality I don't see a problem with putting all your small projects in a single repo: you clone it once and you are done.
With Git you don't ever have to "checkout" the code again unless you blow away your local repo or move to another machine. You'll just synchronize all your changes.
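To make that concrete, day-to-day synchronization is just a pull and a push (a minimal sketch, assuming a remote named origin and a branch named master):

    git pull origin master    # fetch and merge the latest changes from the shared repo
    # ... work and commit locally ...
    git push origin master    # publish your local commits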
I have a similar issue with a large number of repositories. The reason I cannot store them all in a single repository is that I need to branch different versions off each repository, and that is very difficult to manage.
Wow, that's a long question (and a complex problem). I'll try to have a
go at it.
I'm not sure I understand why each user has to have a full local
history when using git?
This is a central design decision with git. For the exact reasons you'd
need to ask the author (Linus Torvalds), but as far as I know, the main
reason is speed: Having everything local (on a fast disk or even cached
in RAM) makes operations on history much faster by avoiding network
access.
The biggest reason for the large repository size is that there are a
lot of binary documents being inputs to tests. These files vary
between 0.5 MB and 30 MB, and there are hundreds. They also have quite a
lot of changes.
That is the point I would think about first. Having so many constantly
changing binary files in source control seems problematic to me (even
with SVN). Can't you use a different approach? Ideas:
Unlike source code, a 3 MB binary file is probably not written by
hand. If some tool/process generates it, consider integrating that
into your build, instead of storing the data.
If that is not practical, binary files are typically better off in an
artifact repository (such as Artifactory for Maven & co.). Maybe that
is an option for you.
I have looked at submodules, git-annex etc, but having
the tests in a submodule feels wrong, as does having annex for many
files for which you want full history.
Actually, this looks like a case where git-annex would fit perfectly. git-annex
basically allows you to store file contents outside a git repository
(the repository contains a placeholder instead). You can store the file
contents in a variety of ways (central git repo, shared drive, cloud storage...), and you can control which contents you want to have locally.
Did you maybe misunderstand how git-annex works? git-annex does store
full history for all the files it manages - it just lets you choose
which file contents you want to have locally.
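To make the workflow concrete, here is a minimal git-annex sketch (the file names are just placeholders, and I'm assuming a plain local setup):

    git annex init "my laptop"                # run once inside an existing git repo
    git annex add tests/data/input-01.bin     # content goes to the annex; by default a symlink is committed in its place
    git commit -m "Add test input"
    git annex get tests/data/input-01.bin     # fetch the content locally when you need it
    git annex drop tests/data/input-01.bin    # free local space; the history stays intact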
Finally, about your questions:
What is a good best practice for using git with large repos containing
many binary files that you do want history for?
In my experience, the options usually are:
- avoid the need for binaries in the repo (generate them on demand,
store them elsewhere)
- use git-annex (or a similar solution, such as Git LFS; see the sketch after this list)
- live with a big repo (not all git operations are affected by big
files, and if you have a fast computer and drive, it can be quite
workable)
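For the Git LFS option, a minimal sketch might look like this (the file pattern and paths are just examples):

    git lfs install                    # one-time setup per machine
    git lfs track "*.bin"              # store matching files as LFS pointers
    git add .gitattributes
    git add tests/data/input-01.bin
    git commit -m "Track test data with Git LFS"
    git push origin master             # uploads the LFS objects alongside the commit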
Is shallow cloning usable as a normal mode of operation or is it a
"hack"?
That might be doable; however, I don't think this will solve your
problem:
- you'd lose some of git's benefits that come from having full history, such
as quick searching of the history
- merges can become tricky, because AFAIK you must have at least the
history back to the branch point to merge
- users would need to re-clone periodically to keep the size of their
clone small
- it's just an uncommon way of using git, so you'd likely run into
problems with many tools
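For reference, this is what the shallow-clone workflow looks like (the URL is a placeholder):

    git clone --depth 1 https://example.com/big-repo.git   # only the latest commit
    cd big-repo
    git fetch --deepen=100                                  # pull in more history later if needed
    git fetch --unshallow                                   # or convert to a full clone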
How big is "too big" for a git repository (on premises)? Should we
avoid switching if we can get it down to 4GB? 2GB?
That depends on the structure of the repo (few/many files etc.), on what
you want to do, on how beefy your computers are, and on your patience
:-).
To give you a quick idea: On my (newish, but low-spec) laptop,
committing a 500 MB file takes 30-60s. Just listing history (git log
etc.) is not affected by big files; things like "git log -S" which must
scan file content are very slow - however, the speed is mainly dominated
by I/O, so it's not really git's fault.
On a 3 GB repo with a handful of revisions, "git log -S" takes about a
minute.
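(For reference, the kind of content-scanning search I mean is the "pickaxe" form; the search string here is just an example:

    git log -S"expected_checksum" --oneline   # list commits that add or remove this string
)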
So I'd say a couple of GB is ok, though not ideal. More than 10-20 GB is
probably pushing it, but it might be doable - you'd have to try it.
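If you want to measure where you land after a test import, git can tell you the packed size directly (a quick check, nothing more):

    git gc                    # repack first so the numbers are meaningful
    git count-objects -vH     # "size-pack" is roughly the on-disk size of the history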
Best Answer
Yes, that's exactly the beauty of a DVCS such as git. You can have any number of different repos with the same state as the one on Bitbucket or GitHub.
Even your local copy (the repository on your computer) is usually a full clone of the remote repo.
The only thing you have to do to keep multiple repos in sync is pull from one (usually called origin or upstream) and push to the backup copies.
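A minimal sketch of that setup, assuming a second remote called backup (the URL and branch name are placeholders):

    git remote add backup git@bitbucket.org:yourteam/project-backup.git
    git pull origin master        # stay in sync with the main repo
    git push backup master        # mirror the same state to the backup
    git push backup --tags        # don't forget tags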