At my company, we use a separate SVN repository for every component of the system. I can tell you that it gets extremely frustrating. Our build process has so many layers of abstraction.
We do this with Java, so we have a heavy build process with javac compilation, JibX binding compilation, XML validation, etc.
For your site, it may not be a big deal if you don't really "build it" (such as vanilla PHP).
Downsides to splitting a product into multiple repositories
- Build management - I can't just checkout code, run a self-contained build script and have a runnable / installable / deployable product. I need an external build system that goes out to multiple repos, runs multiple inner build scripts, then assembles the artifacts.
- Change tracking - Seeing who changed what, when, and why. If a bug fix in the frontend requires a backend change, there are now two separate commit histories to correlate later.
- Administration - do you really want to double the number of user accounts, password policies, etc. that need to be managed?
- Merging - New features are likely to change a lot of code. By splitting your project into multiple repositories, you are multiplying the number of merges needed.
- Branch creation - Same deal: to create a branch, you now have to create one in each repository.
- Tagging - After a successful test of your code, you want to tag a version for release. Now you have multiple tags to create, one in each repository (see the sketch after this list).
- Hard to find something - Maybe frontend/backend is straightforward, but it becomes a slippery slope. If you split into enough modules, developers may have to investigate where some piece of code lives in source control.
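To make the branching/tagging overhead concrete, here is a rough sketch (repository names and URLs are made up for illustration) of what tagging a release looks like with one repository versus several:

    # Single repository: one server-side copy tags the whole product
    svn copy https://svn.example.com/product/trunk \
             https://svn.example.com/product/tags/release-1.4 \
             -m "Tag release 1.4"

    # Multiple repositories: the same step repeated for every repo
    for repo in frontend backend db-scripts build-tools; do
        svn copy "https://svn.example.com/$repo/trunk" \
                 "https://svn.example.com/$repo/tags/release-1.4" \
                 -m "Tag release 1.4"
    done

Branch creation looks exactly the same, just with branches/ instead of tags/ - and every extra repository is one more place where someone can forget to do it.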
My case is a bit extreme, as our product is split across 14 different repos, and each repo is then divided into 4-8 modules. If I remember correctly, we have somewhere around 80 "packages", all of which need to be checked out individually and then assembled.
Your case with just backend/frontend may be less complicated, but I still advise against it.
Extreme examples can be compelling arguments for or against pretty much anything :)
Criteria I would use to decide
I would consider splitting a product into multiple source code repositories after considering the following factors:
- Build - Do the results of building each component merge together to form a product? Like combining .class files from a bunch of components into a series of .jar or .war files.
- Deployment - Do you end up with components that get deployed together as one unit or different units that go to different servers? For example, database scripts go to your DB server, while javascript goes to your web server.
- Co-change - Do they tend to change together, and how frequently? In your case, they may change separately, but still frequently.
- Frequency of branching/merging - if everybody checks into trunk and branches are rare, you may be able to get away with it. If you frequently branch and merge, this may turn into a nightmare.
- Agility - if you need to develop, test, release and deploy a change on a moment's notice (likely with SaaS), can you do it without spending precious time juggling branches and repos?
Your arguments
I also don't agree with most of your arguments for this splitting. I won't dispute them all because this long answer will get even longer, but a few that stand out:
We have two modules that don't depend on each other.
Nonsense. If you take your backend away, will your frontend still work? That's what I thought.
Having source history of both projects in the long term may complicate things (try searching in the history for something in the frontend while you have half of the commits that are completely unrelated to the bug you're looking for)
If your project root is broken into frontend/ and backend/, then you can look at the history of those hierarchies independently.
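For example (assuming the repository root contains frontend/ and backend/ directories), both git and SVN can scope history to a subtree:

    # git: history of the frontend only
    git log -- frontend/

    # SVN: same idea, limited to one directory of the working copy
    svn log frontend/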
Conflict and merging (This shouldn't happen, but having someone pushing to the backend will force other developers to pull backend changes to push frontend changes.) One developer might work only on the backend but will always have to pull the frontend, or the other way around.
Splitting your project into different repos doesn't solve this. A frontend conflict and a backend conflict still leaves you with 2 conflicts, whether it's 1 repository times 2 conflicts or 2 repositories times 1 conflict. Somebody still needs to resolve them.
If the concern is that 2 repos means a frontend dev can merge frontend code while a backend dev merges backend code, you can still do that with a single repository using SVN. SVN can merge at any level. Maybe that is a Git or Mercurial limitation (you tagged both, so I'm not sure which SCM you use)?
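As a hedged illustration (the branch name and layout are made up), a subtree merge in SVN could look like this, run from inside a trunk working copy:

    # Merge only the frontend changes from a feature branch
    cd product-trunk/frontend
    svn merge ^/branches/feature-x/frontend .
    svn commit -m "Merge frontend changes from feature-x"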
On the other hand
With all this said, I have seen cases where splitting a project into multiple modules or repositories works. I even advocated for it once for a particular project where we integrated Solr into our product. Solr, of course, runs on separate servers, only changes when a changeset is related to search (our product does much more than search), has its own build process, and shares no code or build artifacts with the rest of the product.
Wow, that's a long question (and a complex problem). I'll try to have a go at it.
I'm not sure I understand why each user has to have a full local history when using git?
This is a central design decision with git. For the exact reasons you'd need to ask the author (Linus Torvalds), but as far as I know, the main reason is speed: having everything local (on a fast disk or even cached in RAM) makes operations on history much faster by avoiding network access.
The biggest reason for the large repository size is that there are a lot of binary documents being inputs to tests. These files vary between .5mb and 30mb, and there are hundreds. They also have quite a lot of changes.
That is the point I would think about first. Having so many constantly changing binary files in source control seems problematic to me (even with SVN). Can't you use a different approach? Ideas:
- Unlike source code, a 3 MB binary file is probably not written by hand. If some tool/process generates it, consider integrating that into your build instead of storing the data.
- If that is not practical, binary files are typically better off in an artifact repository (such as Artifactory for Maven & co.). Maybe that is an option for you (a rough sketch follows after this list).
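As a rough sketch of that second idea (assuming a Maven build and an Artifactory-style repository; the coordinates and paths are made up), the test data could be published once as a versioned artifact and fetched during the build instead of living in source control:

    # Fetch a versioned test-data bundle from the artifact repository
    mvn dependency:copy \
        -Dartifact=com.example.tests:test-data:1.7:zip \
        -DoutputDirectory=target/test-data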
I have looked at submodules, git-annex etc, but having the tests in a submodule feels wrong, as does having annex for many files for which you want full history.
Actually, it looks like git-annex would fit perfectly. git-annex basically allows you to store file contents outside a git repository (the repository contains a placeholder instead). You can store the file contents in a variety of ways (central git repo, shared drive, cloud storage...), and you can control which contents you want to have locally.
Did you maybe misunderstand how git-annex works? git-annex does store full history for all the files it manages - it just lets you choose which file contents you want to have locally.
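In practice, a typical git-annex workflow looks roughly like this (the file name is illustrative, and the remote must itself support git-annex):

    git annex init "my laptop"
    git annex add tests/data/big-input.bin     # content goes into the annex, git tracks a pointer
    git commit -m "Add test input via git-annex"
    git annex copy --to=origin tests/data/big-input.bin   # upload the content to a remote
    git annex drop tests/data/big-input.bin    # free local disk space; history stays intact
    git annex get tests/data/big-input.bin     # fetch the content again when you need it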
Finally, about your questions:
What is a good best practice for using git with large repos containing many binary files that you do want history for?
In my experience, the options usually are:
- avoid the need for binaries in the repo (generate them on demand, or store them elsewhere)
- use git-annex (or a similar solution, such as Git LFS; see the sketch after this list)
- live with a big repo (not all git operations are affected by big files, and if you have a fast computer and drive, it can be quite workable)
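For completeness, the Git LFS variant of the second option looks roughly like this (the file patterns are just examples):

    git lfs install
    git lfs track "*.pdf" "*.docx"    # matching files are stored in LFS, not in the repo itself
    git add .gitattributes
    git add tests/data/report.pdf
    git commit -m "Track binary test inputs with Git LFS"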
Is shallow cloning usable as a normal mode of operation or is it a "hack"?
That might be doable; however, I don't think this will solve your problem:
- you'd lose some of git's benefits that come from having full history, such as quick searching of the history
- merges can become tricky, because AFAIK you must have at least the history back to the branch point to merge
- users would need to re-clone periodically to keep the size of their clone small
- it's just an uncommon way of using git, so you'd likely run into problems with many tools
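For reference, the shallow-clone workflow itself is simple enough (the URL is made up):

    git clone --depth 50 https://git.example.com/product.git   # only the last 50 commits
    git fetch --deepen=100     # fetch more history later if it turns out to be needed
    git fetch --unshallow      # or turn it into a full clone after all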
How big is "too big" for a git repository (on premises)? Should we avoid switching if we can get it down to 4GB? 2GB?
That depends on the structure of the repo (few/many files etc.), on what you want to do, on how beefy your computers are, and on your patience :-).
To give you a quick idea: on my (newish, but low-spec) laptop, committing a 500 MB file takes 30-60 seconds. Just listing history (git log etc.) is not affected by big files; things like "git log -S", which must scan file contents, are very slow - however, the speed is mostly dominated by I/O, so it's not really git's fault.
On a 3 GB repo with a handful of revisions, "git log -S" takes about a minute.
So I'd say a couple of GB is OK, though not ideal. More than 10-20 GB is probably pushing it, but it might be doable - you'd have to try it.
Best Answer
Git submodules are broken. (The link that originally backed this up has unfortunately gone down.)
We have been using them for something similar to what you describe, and it is hell now. You modify a piece of code, and you're not sure whether it will be the easy-to-commit one or the one that needs some acrobatics: push to the submodule, then update the reference in the parent repo, and then push that (the sequence is spelled out after the list below). From time to time each of us spends hours trying to figure out what's happening and why:
a) updating submodules (--recursive? --remote?) doesn't update
b) git complains about changes, but committing them doesn't commit them
c) you are stuck on a branch, unable to do anything until you resolve changes in submodules (stashing doesn't work, nor do other tricks to ignore changes)
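For the record, the "acrobatics" mentioned above, spelled out as the sequence we end up running (paths and branch names are made up):

    cd libs/shared                 # the submodule working tree
    git checkout main              # submodules usually sit on a detached HEAD
    git add -A
    git commit -m "Fix shared code"
    git push                       # push the submodule first
    cd ../..                       # back to the parent repository
    git add libs/shared            # stage the new submodule commit (the "reference")
    git commit -m "Bump libs/shared reference"
    git push                       # and only then push the parent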
Now, I admit - none of us is a git expert, and perhaps we do something wrong from time to time. However, most of us have no problem using advanced git features like rebase and we still cannot wrap our heads around "this submodules thing".
I will not even continue my rant about what happens when you have submodules within submodules...
What seems like a better solution is to create an internal package manager (in our case, we run a NuGet server in our company).
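As a rough idea of what that looks like on the consuming side (the feed URL and package name are made up), each component is published as a package and pulled in like any other dependency:

    # Register the company's internal feed once
    dotnet nuget add source https://nuget.example.internal/v3/index.json --name company-feed

    # Reference another component as a versioned package instead of a submodule
    dotnet add package Company.Backend.Client --version 2.3.0 --source company-feed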
To be honest: there are projects where submodules seem to work well, for example Qt.