Transferring a legacy code base from CVS to a distributed repository (e.g. Git or Mercurial): suggestions needed for initial repository design [closed]

Just a quick comment to remind you that:

  • those migrations often offer the opportunity to reorganize the sources, not module by module (one repository each) but along functional domains (several modules of the same functional domain being put in the same repository).

Submodules are then used as a way to define a configuration (which commit of each component a given project uses).

[…] CVS, ie it really ends up being pretty much oriented to a “one file
at a time” model.

Which is nice in that you can have a million files, and then only check
out a few of them – you’ll never even see the impact of the other
999,995 files.

Git fundamentally never really looks at less than the whole repo. Even if you
limit things a bit (ie check out just a portion, or have the history go
back just a bit), git ends up still always caring about the whole thing,
and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one
huge repository. I don’t think that part is really fixable, although we
can probably improve on it.

And yes, then there’s the “big file” issues. I really don’t know what to
do about huge files. We suck at them, I know.


Those two points argue for a more component-oriented approach for large systems (and large legacy repositories).

With Git submodules, you can check those components out in your project (even if it is a two-step process). There are, however, tools that can make submodule management easier (git.rake, for instance).
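To make the "two-step process" concrete, here is a minimal sketch using throwaway local repositories. The repository names (`framework`, `app`), the `vendor/framework` path and the demo identity are all made up for the illustration. The `-c protocol.file.allow=always` setting is only there because the sketch uses local file paths as submodule URLs, which recent Git versions refuse by default; with a real remote URL it should not be needed.

```shell
#!/bin/sh
# Sketch: cloning a project that uses a submodule takes two steps.
# All repository names and paths below are hypothetical.
set -eu

work=$(mktemp -d)
cd "$work"

# Throwaway identity so commits succeed without any global Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# A shared module living in its own repository.
git init -q framework
( cd framework
  echo "v1" > lib.txt
  git add lib.txt
  git commit -q -m "framework v1" )

# A project that references the module as a submodule.
git init -q app
( cd app
  echo "app" > README
  git add README
  git commit -q -m "app init"
  git -c protocol.file.allow=always submodule --quiet add "$work/framework" vendor/framework
  git commit -q -m "add framework as a submodule" )

# Step 1: clone the project. The submodule directory exists but is still empty.
git clone -q app app-clone
before=$(ls -A app-clone/vendor/framework | wc -l)

# Step 2: fetch the submodule and check out the commit recorded by the project.
( cd app-clone
  git -c protocol.file.allow=always submodule --quiet update --init )
after=$(cat app-clone/vendor/framework/lib.txt)

echo "files in submodule after step 1: $before"
echo "lib.txt after step 2: $after"
```

With a reasonably recent Git, `git clone --recurse-submodules` should collapse both steps into one command, which is roughly the kind of convenience the helper tools provide.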


When I’m thinking of fixing a bug in a module that’s shared between several projects, I just fix the bug and commit it and all just do their updates

That is what I describe in the post Vendor Branch as the “system approach”: everyone works on the latest (HEAD) of everything, and it is effective for a small number of projects.
For a large number of modules though, the notion of “module” is still very useful, but its management is not the same with a DVCS:

  • for closely related modules (i.e. “in the same functional domain”, such as all modules related to “PNL” (Profit and Loss) or to “Risk analysis” in a financial domain), you do need to work with the latest (HEAD) of all components involved.
    That can be achieved with a subtree strategy, not so that you can publish (push) corrections to those other modules, but to track the work done by other teams.
    Git allows that, with the added bonus that this “tracking” does not have to take place between your repository and one “central” repository: it can also take place between you and the local repository of the other team, allowing for very quick back-and-forth integration and testing between projects of a similar nature.

  • however, for modules which are not directly in your functional domain, submodules are a better option, because they refer to a fixed version of a module (a commit):
    when a low-level framework changes, you do not want the change to be propagated instantaneously, since it would impact all the other teams, which would then have to drop what they were doing to adapt their code to that new version (you do, however, want all the other teams to be aware of this new version, so that they do not forget to update that low-level component or “module”).
    That allows you to work only with official, stable, identified versions of other modules, and not with potentially unstable or not fully tested HEADs.
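The two cases above can be sketched side by side. This is only an illustration under assumptions, not a recommended layout: the repositories `risk` (a peer module in the same functional domain), `framework` (a low-level module) and `pnl` (your own project) are hypothetical, and the subtree import uses the classic `read-tree --prefix` recipe followed by merges with `-X subtree=<path>`. As in any local demo, `-c protocol.file.allow=always` is only needed because the submodule URL is a file path.

```shell
#!/bin/sh
# Sketch: track a peer module as a subtree, pin a low-level one as a submodule.
# Repository names (risk, framework, pnl) and file names are hypothetical.
set -eu

work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Helper: create a one-file repository named $1.
new_repo() {
  git init -q "$1"
  ( cd "$1" && echo "v1" > data.txt && git add data.txt && git commit -q -m "$1 v1" )
}
# Helper: publish a new version of data.txt in the repository at path $1.
bump() {
  ( cd "$1" && echo "v2" > data.txt && git commit -q -am "$1 v2" )
}

new_repo risk        # same functional domain: follow its HEAD
new_repo framework   # low-level module: use a fixed commit
new_repo pnl         # our own project
cd pnl

# Import risk under risk/ (subtree merge recipe), straight from the other
# team's repository rather than from a central one.
git fetch -q ../risk HEAD
git merge -q -s ours --no-commit --allow-unrelated-histories FETCH_HEAD
git read-tree --prefix=risk/ -u FETCH_HEAD
git commit -q -m "import risk as a subtree"

# Reference framework as a submodule: pnl records only a commit id.
git -c protocol.file.allow=always submodule --quiet add "$work/framework" framework
git commit -q -m "pin framework"

# Both upstream teams publish a new version.
bump ../risk
bump ../framework

# Tracking: merge the peer team's latest work into our risk/ directory.
git fetch -q ../risk HEAD
git merge -q -X subtree=risk -m "merge latest risk" FETCH_HEAD

tracked=$(cat risk/data.txt)
pinned=$(cat framework/data.txt)
echo "risk (subtree):        $tracked"
echo "framework (submodule): $pinned"

# Moving the pin is a separate, explicit commit in pnl.
( cd framework && git fetch -q origin HEAD && git merge -q --ff-only FETCH_HEAD )
git add framework
git commit -q -m "move framework pin to v2"
upgraded=$(cat framework/data.txt)
echo "framework after explicit update: $upgraded"
```

After the merge, `risk/data.txt` should follow the other team's latest commit, while `framework/data.txt` stays at the pinned version until someone deliberately moves the pin and commits that change. That last commit is what makes a framework upgrade visible to, and reviewable by, the rest of the team.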
