SCM – How to Split a Repository with Multiple Projects Sharing the Same Build System

git mercurial scm svn

I've been working over the past years on a research compiler suite that builds several executables and libraries. It has a build system (namely bootstrapper) that looks in the ./src/bin directory for tools to build, and in ./src/lib, ./src/rtl and ./src/std for libraries, managing the dependencies (internal dependencies only!). It works pretty much like a single makefile for multiple tools (as GCC does). Because of this, they all share the same codebase. I've been using a single git repository so far.

Most tools and libraries are self-contained and don't depend on each other (there are no cyclical dependencies at all, only a couple of hierarchical ones!), so I'd like to split the repository into several modules.

For example, I'd have a core module with the very basic system, which would look something like this:

./README
./src/bootstrapper.krc
./src/bin/cpp.c
./src/bin/cc.c
./src/lib/cpp/...
./src/lib/cc/...
./src/lib/foo/...
./src/lib/bar/...

This contains the builder and is self-contained; it should generate cpp, cc, and the libcpp, libcc, libfoo and libbar libraries.

I'd then like to have, for example, a separate ruby module that would look something like:

./src/bin/rbc.c
./src/lib/rbc/...
./src/std/ruby/...

Which would build rbc, librbc and libstdruby… and then a python module like:

./src/bin/pyc.c
./src/lib/pyc/...
./src/std/python/...

Which would build pyc, libpyc, and libstdpython.

Though both the hypothetical ruby and python modules would depend on the core module, they have no dependency at all on each other (and this is why I'm thinking about splitting my current repository).

How could I achieve this? I thought about using different git branches for each module (but I would need several hooks to automatically rebase everything based on their dependencies, like having multiple copies of the core module and keeping them up to date), multiple git repositories (same problem), and git submodules won't help because I'd need to split on a per-file basis because of the build system I have…

(Though I prefer git, because I'd like to publish it on GitHub, I don't mind switching to another SCM.)

All tools (bin/example) depend on the same ./include/bin/main.h (which would be part of the core module), which has a lot of macros… each folder under ./src/ actually has a corresponding ./include/ folder too (e.g., ./include/std/ruby/...).

Here's a picture of roughly what I described above, as it currently is:

Best Answer

The requirements
The following points are important to consider when thinking of splitting the repository.

  1. Each new split has its own trunk now. A product can't just be another branch in this case, because each of these products will now have its own versions (tags) and its own dev branches, and hence each must have its own trunk.

  2. Maintain history. This is perhaps the most important thing when splitting the repo. We shouldn't lose the history of how things interacted in the past. For example, you could take a complete export of the source and create four fresh repos, but that would lose all history. So each part of the product should keep its history when it is used from its own independent repo.

  3. Open working branches. If there are feature branches or support branches out there and we wanted to shut them down, we would have to wind them all up and merge them back to trunk. However, for any reasonably sized project this is hard to do. Rather, the branches, with the same history and snapshots, should be available in the new repo so that work can resume where it was suspended instead of having to restart.

But the solution can be simple:

  1. The first step is to clone the repo once per product, each clone under a different name. The procedure for each repo type is listed below. With four clones you now have four repos, each with the identical structure (trunk/tags/branches), identical history and identical set of branches.

  2. Now, from the primary trunk of each repo, simply delete all source code that isn't required in that repo. For example, in one repo delete all Ruby files and keep only the C files, and do the reverse in the other repo. You will then have one repo with the library produced from C files and another with the Ruby files.

  3. Merge this change into all subordinate branches so that the deletion is reflected there as well.

  4. Modify the common files (Makefile etc.), which at this point still have the same structure in every repo.

  5. Optionally you might want to "move" files to re-organize the structure.

  6. Tag the new release with a new number series, e.g. 2.0 (if it was 1.x before).

Following this, the full history remains available in each of the 4 product repos. Critically, you will therefore be able to go back to the snapshot in which everything worked together.
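As a rough illustration of steps 1, 2 and 6 with git, here is a minimal sketch. It builds a toy combined repository first so that it can be run as-is; the directory names (combined, core, ruby), the file contents and the committer identity are all made up for the demo, and steps 3 to 5 (merging into other branches, adjusting the build files, moving files) are left out.

```shell
#!/bin/sh
set -eu

# Identity for the demo commits (hypothetical values).
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# Toy stand-in for the existing combined repository.
git init -q "$work/combined"
cd "$work/combined"
mkdir -p src/bin src/lib/cc src/lib/rbc
echo '/* cc driver */'  > src/bin/cc.c
echo '/* rbc driver */' > src/bin/rbc.c
echo '/* libcc */'      > src/lib/cc/cc.c
echo '/* librbc */'     > src/lib/rbc/rbc.c
git add .
git commit -q -m "Combined tree"
git tag v1.0

# Step 1: one full clone per product; each clone carries the whole history.
git clone -q "$work/combined" "$work/core"
git clone -q "$work/combined" "$work/ruby"

# Step 2: in each clone, delete what belongs to the other product.
cd "$work/core"
git rm -q -r src/bin/rbc.c src/lib/rbc
git commit -q -m "Drop Ruby sources from core"
git tag v2.0   # Step 6: start a new number series

cd "$work/ruby"
git rm -q -r src/bin/cc.c src/lib/cc
git commit -q -m "Drop C compiler sources from ruby"
git tag v2.0

# The pre-split snapshot is still reachable from both repos.
git -C "$work/core" log --oneline
git -C "$work/ruby" ls-tree -r --name-only v1.0
```

The last two commands show the point of the approach: each product repo still lists the original commit, and checking out v1.0 in either one restores the full combined tree.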

Repo specific stuff

  1. SVN. For SVN this is extremely simple: copy the main folder where the repo is hosted, i.e. the one that was generated with the svnadmin create command (not a working copy). See the SVN book, Ch. 5, for more details. Once copied, the clone has everything the original repo has.

  2. Git. Repos are decentralized by default, so every time you run git clone you are actually cloning the whole repo. However, to give each product its own identity you have two other options: A. forking and B. mirroring. Refer to this page for duplicating without mirroring. One more reference is this SO question. Forking is another option; refer to this page for the basics. However, forking is not exactly the same as cloning/mirroring. I would prefer mirroring.
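For the git case, one common way to duplicate a repository with all of its branches and tags is a bare clone followed by git push --mirror. The sketch below assumes that approach and uses local paths so that it can be run as-is. The names combined, transfer.git, ruby.git and feature-x are hypothetical; in practice the destination would be the URL of an empty repository created on the hosting service.

```shell
#!/bin/sh
set -eu

# Identity for the demo commit (hypothetical values).
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# Toy stand-in for the existing repository, with one tag and one extra branch.
git init -q "$work/combined"
cd "$work/combined"
echo 'demo' > README
git add README
git commit -q -m "Initial commit"
git tag v1.0
git branch feature-x

# Empty bare repository that will become the new product repo
# (stands in for an empty repository on the hosting service).
git init -q --bare "$work/ruby.git"

# A bare clone copies every branch and tag, not just the default branch.
git clone -q --bare "$work/combined" "$work/transfer.git"

# Push all refs (branches and tags) to the new home.
cd "$work/transfer.git"
git push -q --mirror "$work/ruby.git"

# The temporary bare clone is no longer needed.
cd "$work"
rm -rf "$work/transfer.git"

# List what arrived in the new repository.
git -C "$work/ruby.git" for-each-ref --format='%(refname)'
```

Unlike a plain git clone, which only creates a local branch for the default branch, this keeps feature-x and v1.0 as first-class refs in the new repository, so open working branches carry over.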