SCM – How to Split a Repository with Multiple Projects Sharing the Same Build System

Tags: git, mercurial, scm, svn

I've been working for the past several years on a research compiler suite that builds several executables and libraries. Its build system (called bootstrapper) looks in the ./src/bin directory for tools to build, and in ./src/lib, ./src/rtl and ./src/std for libraries, and manages the dependencies between them (internal dependencies only!). It works much like a single makefile for multiple tools (as GCC does). Because of this, everything shares the same codebase, and I've been using a single git repository so far.

Most tools and libraries are self-contained and don't depend on each other (there are no cyclic dependencies at all, only a couple of hierarchical ones!), so I'd like to split the repository into several modules.

For example, I'd have a core module with the very basic system, which would look something like this:

./README
./src/bootstrapper.krc
./src/bin/cpp.c
./src/bin/cc.c
./src/lib/cpp/...
./src/lib/cc/...
./src/lib/foo/...
./src/lib/bar/...

This module contains the builder and is self-contained; it should generate cpp, cc, and the libcpp, libcc, libfoo and libbar libraries.

I'd then like to have, for example, a separate ruby module that would look something like:

./src/bin/rbc.c
./src/lib/rbc/...
./src/std/ruby/...

Which would build rbc, librbc and libstdruby… and then a python module like:

./src/bin/pyc.c
./src/lib/pyc/...
./src/std/python/...

Which would build pyc, libpyc, and libstdpython.

Though both the hypothetical ruby and python modules would depend on the core module, they have no dependency at all on each other (and this is why I'm thinking about splitting my current repository).

How could I achieve this? I thought about using a different git branch for each module (but I would need several hooks to automatically rebase everything according to the dependencies, which amounts to having multiple copies of the core module and keeping them up to date) or multiple git repositories (same problem). Git submodules won't help either, because with the build system I have I'd need them to work on a per-file basis…

(Though I prefer git, because I'd like to publish it on GitHub, I don't mind switching to another SCM.)

All tools (bin/example) depend on the same ./include/bin/main.h (which would be part of the core module), which has a lot of macros. Each folder in ./src/ also has a corresponding ./include/ folder (e.g., ./include/std/ruby/...).

Here's a picture of roughly what I described above, as it currently stands:

Best Answer

The requirements
The following points are important to consider when you are thinking of splitting the repository.

Each new split has its own trunk: A product can't just be another branch in this case, because each of these products will now have its own versions (tags) and its own dev branches, and hence each must have its own trunk.
Maintain history: This is perhaps the most important thing when splitting the repo. We shouldn't lose the history of how things interacted in the past. For example, you could take a complete export of the source and create four fresh repos, but that would lose all history. Each part of the product should therefore keep its history when it is used from its own independent repo.
Open working branches: If there are feature branches or support branches out there, closing up shop would mean winding them all up and merging them back to trunk first. For any reasonably sized project this is hard to pull off. Instead, branches with the same history and snapshots should be available in the new repo, so that work can resume where it was suspended rather than restart.

But the solution can be simple:

The first step is to clone the repo once per product, giving each clone a different name. The procedure for each repo type is listed below. With four clones you now have four repos, each with an identical structure (trunk/tags/branches), identical history and an identical set of branches.
Now, in the primary trunk of each repo, simply delete all source code that isn't required in that repo. For example, in one repo delete all Ruby files and keep only the C files, whereas in another repo do the reverse. You will then have one repo with the library produced from C files and another that has the Ruby files.
Merge this change into all subordinate branches so that the deletion is reflected there as well.
Modify the common files (Makefile etc.), which at this point have the same structure in every repo.
Optionally, you might want to move files to reorganize the structure.
Tag the new release with a new number series, e.g. 2.0 if it was 1.x before.

Following this, all your history remains available in each of the four product repos. Critically, you will therefore be able to go back to a snapshot in which everything worked together.
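The steps above can be sketched with git roughly as follows. This is only an illustration under stated assumptions: the repository names (mono, core, ruby), the file names inside the elided directories, the commit messages and the tag names are all placeholders, and the merge into subordinate branches is left out for brevity.

```shell
#!/bin/sh
# Sketch only: every path, file name and repository name is a placeholder.
set -eu

# Fixed identity so commits work in a clean environment (illustrative values).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

work=$(mktemp -d)
cd "$work"

# Stand-in for the existing combined repository, with one 1.x release.
git init -q mono
cd mono
mkdir -p src/bin src/lib/cc src/lib/rbc src/std/ruby
echo "/* C compiler driver */"    > src/bin/cc.c
echo "/* libcc */"                > src/lib/cc/cc.c
echo "/* Ruby compiler driver */" > src/bin/rbc.c
echo "/* librbc */"               > src/lib/rbc/rbc.c
echo "# Ruby standard library"    > src/std/ruby/core.rb
git add .
git commit -q -m "Combined 1.x tree"
git tag v1.0
cd "$work"

# Step 1: one full clone per product; each carries the complete history.
git clone -q mono core
git clone -q mono ruby

# Step 2: in each clone, delete what belongs to the other product.
cd "$work/core"
git rm -q -r src/bin/rbc.c src/lib/rbc src/std/ruby
git commit -q -m "Split: drop Ruby sources from core"
git tag v2.0   # last step: start a new version series

cd "$work/ruby"
git rm -q -r src/bin/cc.c src/lib/cc
git commit -q -m "Split: drop C compiler sources from ruby"
git tag v2.0

# The pre-split commit and the v1.0 tag are still reachable in both clones.
git -C "$work/core" log --oneline
git -C "$work/ruby" log --oneline
```

Both clones end up with two commits: the shared pre-split commit and their own deletion commit, so checking out v1.0 in either one restores the combined tree.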

Repo specific stuff

SVN: For SVN this is extremely simple. Just copy the main folder where the repo is hosted, i.e. the one that was generated with the svnadmin create command (not the working copy). See Chapter 5 of the SVN book for more details. The copy has everything the original repo has.
Git: Git repos are decentralized by default, so every time you run git clone you are actually cloning the whole repo. However, to have the clones identified as different products, you have two other options: (A) forking and (B) mirroring. Refer to this page for duplicating without mirroring. One more reference is this SO question. Forking is another option; refer to this page for the basics. However, forking is not exactly the same as cloning/mirroring. I would prefer mirroring.
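As a rough sketch of the mirroring option, the commands below duplicate a repository, including all branches and tags, into a new empty one. The names mono, mono-mirror.git and ruby.git are placeholders, and a local bare repository stands in for what would normally be a freshly created, empty repo on a hosting service.

```shell
#!/bin/sh
# Sketch only: repository names are placeholders.
set -eu

# Fixed identity so the commit works in a clean environment (illustrative values).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

work=$(mktemp -d)
cd "$work"

# Stand-in for the existing repository: one commit, one tag.
git init -q mono
echo "research compiler suite" > mono/README
git -C mono add README
git -C mono commit -q -m "Initial import"
git -C mono tag v1.0

# Empty bare repository standing in for the new product's remote.
git init -q --bare ruby.git

# A mirror clone is bare and copies every ref: all branches and all tags.
git clone -q --mirror mono mono-mirror.git

# Push every ref to the new remote, then discard the temporary mirror.
git -C mono-mirror.git push -q --mirror "$work/ruby.git"
rm -rf mono-mirror.git

# The new repository now lists the old tag.
git -C ruby.git tag -l
```

Unlike a plain clone, which only creates a local branch for the checked-out HEAD, the mirror carries every branch across, which matches the "open working branches" requirement above.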

Related Solutions

Git – Organizing Git repositories with common nested sub-modules

I'm very late to this party, but your question still doesn't seem to have a complete answer, and it's a pretty prominent hit on Google.

I have the exact same problem with C++/CMake/Git/Submodules and I have a similar problem with MATLAB/Git/Submodules, which gets some extra weirdness because MATLAB isn't compiled. I came across this video recently, which seems to propose a "solution". I don't like the solution, because it essentially means throwing away submodules, but it does eliminate the problem. It is just as @errordeveloper recommends. Each project has no submodules. To build a project, create a super-project to build it, and include it as a sibling to its dependencies.

So your project for developing graph might look like:

buildgraph/graph
buildgraph/core

and then your project for studio could be:

buildstudio/studio
buildstudio/graph
buildstudio/network
buildstudio/core

The super-projects are just a main CMakeLists.txt and a bunch of submodules. But none of the projects have any submodules themselves.
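A super-project of this kind could be assembled roughly like this. It is a sketch under assumptions: the local repositories core and graph are throwaway stand-ins for the real projects, the CMakeLists.txt contents are illustrative (each project is assumed to ship its own CMakeLists.txt), and the protocol.file.allow setting is only there because the sketch adds submodules from local paths, which recent git versions refuse by default.

```shell
#!/bin/sh
# Sketch only: project names and file contents are placeholders.
set -eu

# Fixed identity so commits work in a clean environment (illustrative values).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

work=$(mktemp -d)
cd "$work"

# Stand-ins for the real projects; neither contains any submodules.
for name in core graph; do
    git init -q "$name"
    echo "$name library" > "$name/README"
    git -C "$name" add README
    git -C "$name" commit -q -m "Initial $name commit"
done

# The super-project: the project under development and its dependency
# sit side by side as submodules.
git init -q buildgraph
cd buildgraph
git -c protocol.file.allow=always submodule --quiet add "$work/core" core
git -c protocol.file.allow=always submodule --quiet add "$work/graph" graph

# Top-level build file that simply pulls in the siblings.
cat > CMakeLists.txt <<'EOF'
cmake_minimum_required(VERSION 3.10)
project(buildgraph)
add_subdirectory(core)
add_subdirectory(graph)
EOF
git add CMakeLists.txt
git commit -q -m "Super-project: graph plus its dependency core"

# Shows the pinned commit of each sibling.
git submodule status
```

The dependency information lives only in the super-project's .gitmodules and CMakeLists.txt, which is exactly the cost discussed next.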

The only cost I see to this approach is the proliferation of trivial "super-projects" that are dedicated solely to building your real projects. Also, if someone gets hold of one of your projects, there is no easy way to tell what its dependencies are without finding the super-project as well. That might make it look really ugly on GitHub, for example.

SCM – How to Search in a Repository Using Git or Mercurial

In git, that's one of the central features, called pickaxe:

git log -Stext
git log -Gregexp

That will look up revisions that add or remove matching text anywhere. You can of course combine it with path filters and revision ranges and any other options.
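A small, self-contained illustration of combining the pickaxe options with a path filter and a revision range follows. The repository, file names and search strings are made up for the example.

```shell
#!/bin/sh
# Sketch only: repository contents and search terms are placeholders.
set -eu

# Fixed identity so commits work in a clean environment (illustrative values).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
mkdir -p src/bin src/lib

echo 'int parse_expr(void);' > src/lib/parser.h
git add .
git commit -q -m "Add parse_expr declaration"

echo 'int main(void) { return 0; }' > src/bin/cc.c
git add .
git commit -q -m "Add cc driver"

echo 'int parse_stmt(void);' > src/lib/parser.h
git add .
git commit -q -m "Replace parse_expr with parse_stmt"

# -S: commits that change how many times the literal string occurs
# (here the first and third commits: one adds it, one removes it).
git log --oneline -S'parse_expr'

# -G: commits whose added or removed lines match the regex,
# restricted to paths under src/lib.
git log --oneline -G'parse_[a-z]*' -- src/lib

# Same -S search, limited to a revision range (only the latest commit).
git log --oneline -S'parse_expr' HEAD~1..HEAD -- src/lib
```

The difference between the two options is that -S only reports commits where the number of occurrences changes, while -G reports any commit whose diff touches a matching line.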
