Monday, January 5, 2015

Versioning Is A Nuanced Social Fiction; SemVer Is A Blunt Instrument

David Heinemeier Hansson said something relatively lucid and wise on Twitter recently:

To his credit, he also realized that somebody else had already said it better.

Here's the nub and the gist of Jeremy Ashkenas's Gist:

SemVer tries to compress a huge amount of information — the nature of the change, the percentage of users that will be affected by the change, the severity of the change (Is it easy to fix my code? Or do I have to rewrite everything?) — into a single number. And unsurprisingly, it's impossible for that single number to contain enough meaningful information...

Ultimately, SemVer is a false promise that appeals to many developers — the promise of pain-free, don't-have-to-think-about-it, updates to dependencies. But it simply isn't true.

It's extremely worthwhile to read the whole thing.

Here's how I see version numbers: they predate Git, and Git makes version numbers pretty stupid if you take those numbers literally, because we now use hashes like 64f2a2451381c80dff1 to identify specific versions of our code bases. Strictly speaking, version numbers are fictional. If you really want to know what version you're looking at, the answer to that question is not a number at all, but a Git hash.
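To make that concrete, here's a minimal Python sketch of how Git derives an identifier from content rather than from anybody's opinion. It assumes a repository using Git's SHA-1 object format, and it hashes a blob (a file's contents) because that's the simplest object type; commits are hashed the same way with a different header. The function name `git_object_id` and the `tags` table are my own inventions for illustration, not part of Git.

```python
import hashlib


def git_object_id(body: bytes, kind: str = "blob") -> str:
    """Return the SHA-1 ID Git assigns to `body` stored as an object of `kind`.

    Git hashes a short header ("<kind> <size>" plus a NUL byte) followed by
    the raw content, so the ID is determined entirely by the content.
    """
    header = f"{kind} {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


# The ID of an empty file, computed from nothing but its (empty) content.
empty_id = git_object_id(b"")

# A version number, by contrast, is a label somebody chose to attach to an ID.
# This table is hypothetical; nothing about "v1.0.0" follows from the content.
tags = {"v1.0.0": empty_id}

print(empty_id)
```

Change one byte of the content and the identifier changes; change the label and nothing about the code changes at all. That asymmetry is what I mean by "fictional."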

But we still use version numbers. We do this for the same reason that, even if we one day replace every car on the road with an error-proof robot which is only capable of perfect driving, we will still have speed limits, brake lights, and traffic signs. It's the same reason there's an urban legend that the width of the Space Shuttle ultimately derives from the width of roads in Imperial Rome: systems often outlive their original purposes.

Version numbers were originally used to identify specific versions of a code base, but that hasn't been strictly accurate since the invention of version control systems, whose history goes back at least 43 years, to 1972. As version control systems became more and more fine-grained, version numbers diverged further and further from the actual identifiers we use to index our versioning systems, and thus "version numbers" became more and more a social fiction.

Note that this is not necessarily a bad thing. Money is a social fiction, and an incredibly useful one. But SemVer is an attempt to treat the complexities of a social fiction as if they were very deterministic and controlled.

They are not.

Which means SemVer is an attempt to brutally oversimplify an inherently complex problem.

There's a lot of good commentary on these complexities. Justin Searls gave a very good presentation which goes into why these problems are inherently complex, and inherently social.

I'm not saying SemVer's goals are unimportant. But I do think SemVer's a clumsy replacement for nuanced versioning, and an incomplete answer to the question "how do we demarcate incompatibility risks in systems made up of extremely numerous libraries written by extremely numerous people?"

Because version numbers are a social fiction, entirely distinct from the "numbers" we use to actually version our software in modern version control systems, choosing new version numbers is primarily a matter of communicating with your users. Like all communication, it is inherently complex and nuanced. If it is possible at all to reliably automate the communication of nuance, the medium of communication will probably not be a trio of numbers, because the problem space simply has far more dimensions than three.
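Here's a rough Python sketch of that dimensionality problem. Everything in it is hypothetical: the `Change` fields, the `bump` function, and the numbers are made up for illustration, and `bump` is only a simplified reading of the SemVer rules, not an official implementation. The point is that two wildly different breaking changes come out the other side as the same version number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    description: str
    breaks_api: bool
    adds_feature: bool
    share_of_users_affected: float  # 0.0 to 1.0, a made-up estimate
    hours_to_migrate: float         # a made-up estimate


def bump(version: tuple[int, int, int], change: Change) -> tuple[int, int, int]:
    """Simplified SemVer bump: only two booleans influence the result."""
    major, minor, patch = version
    if change.breaks_api:
        return (major + 1, 0, 0)
    if change.adds_feature:
        return (major, minor + 1, 0)
    return (major, minor, patch + 1)


rename = Change("rename a rarely used option", True, False, 0.01, 0.1)
rewrite = Change("replace the whole public API", True, False, 1.0, 80.0)

# Both print (2, 0, 0): the affected share and migration cost are discarded.
print(bump((1, 4, 2), rename))
print(bump((1, 4, 2), rewrite))
```

A user reading "2.0.0" can't tell which of those two changes happened without going and reading the changelog, which is to say, without the actual communication.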

But for the same reason, I kind of think version numbers verge on ridiculous whether they're trying to color within the SemVer lines or not. There's only so much weight you can expect a social fiction to carry before it cracks at the seams and falls apart. Even the idea of a canonical repo is a little silly in my opinion.

You can see why the canonical repo is a mistake if you look at a common antipattern on GitHub: a project is abandoned, but development continues within multiple forks of the project. Which repo is now canonical? You have to examine each fork, and discover how well it keeps up with the overall, now-decentralized progress of the project. You'll often find that Fork A does a better job with updates related to one aspect of the project, while Fork B does a better job with updates related to another aspect. And it's a manual process; no GitHub view exists which will make it particularly easy for you to determine which of the still-in-progress forks are continuing ahead of the "canonical" repo.
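The comparison you end up doing by hand looks roughly like this Python sketch. It models each repo as a plain set of commit IDs, which is a big simplification of real history, and the repo names and IDs are invented. The `compare` helper is hypothetical too; it's in the spirit of the ahead/behind counts that `git rev-list --left-right --count` reports for two refs.

```python
def compare(base: set[str], fork: set[str]) -> tuple[int, int]:
    """Return (ahead, behind): commits only in the fork, commits only in base."""
    return (len(fork - base), len(base - fork))


# Hypothetical commit IDs for an abandoned original and two active forks.
original = {"a1", "b2", "c3"}
forks = {
    "fork-a": {"a1", "b2", "c3", "d4", "e5"},
    "fork-b": {"a1", "b2", "f6"},
}

for name, commits in sorted(forks.items()):
    ahead, behind = compare(original, commits)
    print(f"{name}: {ahead} ahead, {behind} behind")
```

Even with the counts in hand, you still don't know which fork is "right." The numbers tell you who has been busy, not whose changes you want.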

At the very least, in a situation like this, you have to differentiate between the original repo and the canonical one. I think that much is indisputable. But I'd argue also that the basic idea of a canonical repo operates in defiance of the entire history of human language. In fact, rumor has it that GitHub itself runs on a private fork of Rails 2, which illustrates my point perfectly, by constituting a local dialect.

(Update: GitHub ran on a private fork of Rails 2 for many years, but moved to Rails 3 in September 2014. Thanks to Florian Gilcher for the details.)

I'd like to see some anthropologists and linguists research our industry, because the modern dev world, with its countless and intricately interwoven dependencies, presents some really complex and subtle problems.