Friday, January 30, 2015

ForwardJS Next Week: JSCK Talk, Free Tickets

I'll be giving a talk on JSCK at ForwardJS next week. And I have two free tickets to give away — first come, first served.

I'm looking forward to this conf a lot, mostly because there's a class on functional programming in JavaScript. I know there are skeptics, but I recently read Fogus's book about that and it was pretty great.

Tuesday, January 27, 2015

Superhero Comics About Modern Tech

There are two great superhero comics which explore social media and NSA surveillance in interesting ways.

I really think you should read these comics, if you're a programmer. Programming gives you incredible power to shape the way the world is changing, as code takes over nearly everything. But both the culture around programming and the education which typically shapes programmers' perspectives emphasize technical details at the expense of subjects like ethics, history, anthropology, and psychology, leading to incredibly obvious and idiotic mistakes with terrible consequences. With great power comes great responsibility, but at Google and Facebook, with great free snacks come great opportunities for utterly unnecessary douchebaggery.

A lot of people in the music industry talk about Google as evil. I don’t think they are evil. I think they, like other tech companies, are just idealistic in a way that works best for them... The people who work at Google, Facebook, etc can’t imagine how everything they make is not, like, totally awesome. If it’s not awesome for you it’s because you just don’t understand it yet and you’ll come around. They can’t imagine scenarios outside their reality and that is how they inadvertently unleash things like the algorithmic cruelty of Facebook’s yearly review (which showed me a picture I had posted after a doctor told me my husband had 6-8 weeks to live).

Fiction exists to explore issues like these, and in particular, fantastical fiction like sci-fi and superhero comics is extremely useful for exploring the impact of new technologies on a society. This is one of the major reasons fiction exists and has value, and these comics are doing an important job very effectively. (There's sci-fi to recommend here as well, but a lot of the people who were writing sci-fi about these topics seem to have almost given up.)

So here these comics are.

In 2013 and 2014, Peter Parker was dead. (Not really dead, just superhero dead.) The megalomaniac genius Otto Octavius, aka Dr. Octopus, was on the verge of dying from terminal injuries racked up during his career as a supervillain. So he tricked Peter Parker into swapping bodies with him, so that Parker died in Octavius's body and Octavius lived on inside Parker's. But in so doing, he acquired all of Parker's memories, and saw why Parker dedicated his life to being a hero. Octavius then chose to follow his example, but to do so with greater competence and intelligence, becoming the Superior Spider-Man.

The resulting comic book series was amazing. It's some of the best stuff I've ever seen in a whole lifetime reading comics.

Given that his competence and intelligence were indeed both superior, Octavius did actually do a much better job of being Spider-Man than Spider-Man himself had ever done, in some respects. (Likewise, as Peter Parker, he swiftly obtained a doctorate, launched a successful tech startup, and turned Parker's messy love life into something much simpler and healthier.) But given that Octavius was a megalomaniac asshole with no reservations about murdering people, he did a much worse job, in other respects.

In the comics, the Superior Spider-Man assassinates prominent criminals, blankets the entire city in a surveillance network composed of creepy little eight-legged camera robots, taps into every communications network in all of New York, and uses giant robots to completely flatten a crime-ridden area, killing every inhabitant. (He also rants constantly, in hilariously overblown terms, like the verbose and condescending supervillain who he was for his entire previous lifetime.)

Along the way, Octavius meets "supervillains" who are merely pranksters -- kids who hit the mayor with a cream pie so they can tweet about it -- and he nearly kills them.

As every superhero does, of course, Peter Parker eventually comes back from the dead and saves the day. But during the course of the series' nearly two-year run, The Superior Spider-Man did an absolutely amazing job of illustrating how terrible it can be for a city to have a protector with incredible power and no ethical boundaries. Anybody who works for the NSA should read these comics before quitting their terrible careers in shame.

DC Comics, meanwhile, has rebooted Batgirl and made social media a major element of her life.

I only just started reading this series, and it's a fairly new reboot, but as far as I can tell, these new Batgirl comics are comics about social media which just happen to feature a superhero (or superheroine?) as their protagonist. She promotes her own activities on an Instagram-like site, uses it to track down criminals, faces impostors trying to leverage her fame for their own ends, and meets dates in her civilian life as Barbara Gordon through Hooq, a fictional Tinder equivalent.

The most important difference between these two series is that one ran for two years and is now over, while the other is just getting started. But here's my impression so far. Where Superior Spider-Man tackled robotics, ubiquitous surveillance, and an unethical "guardian" of law and order, Batgirl seems to be about the weird cultural changes that social media are creating.

Peter Parker's a photographer who works for a newspaper, and Clark Kent's a reporter, but this is their legacy as cultural icons created many decades ago. Nobody thinks of journalism as a logical career for a hero any more. Batgirl's a hipster in her civilian life; she beats up douchebag DJs, and I think she might work at a tech startup, but maybe she's in grad school. There's a fun contrast here; while the Superior Spider-Man's alter ego "Peter Parker," really Otto Octavius, basically represents the conflict between how Google and the NSA see themselves vs. how they look to everyone else — super genius vs. supervillain — Barbara Gordon looks a lot more like real life, or at least, the real life of people outside of that nightmarish power complex.

Update: sadly, Batgirl made a surprising transition from good to bad, followed by an unsurprising transition from bad to terrible.

Thursday, January 15, 2015

Why Panda Strike Wrote the Fastest JSON Schema Validator for Node.js

Update: We're not the fastest any more, but we still have the best benchmarks.

After reading this post, you will know why those who do not understand HTTP are doomed to re-implement it on top of itself, why your API needs a schema, and why JSCK is the fastest JSON Schema validator for Node.js.

Because this is a very long blog post, I've followed the GitHub README convention of making every header a link.

Those who do not understand HTTP are doomed to re-implement it on top of itself

Not everybody understands HTTP correctly. For instance, consider the /chunked_upload endpoint in the Dropbox API:

Uploads large files to Dropbox in multiple chunks. Also has the ability to resume if the upload is interrupted. This allows for uploads larger than the /files_put maximum of 150 MB.

Since this is an alternative to /files_put, you might wonder what the deal is with /files_put.

Uploads a file using PUT semantics. Note that this call goes to instead of

The preferred HTTP method for this call is PUT. For compatibility with browser environments, the POST HTTP method is also recognized.

To be fair to Dropbox, "for compatibility with browser environments" refers to the fact that, of the people I previously mentioned - the ones who do not understand HTTP - many have day jobs where they implement the three major browsers. I think "for compatibility with browser environments" also refers to the related fact that the three major browsers often implement HTTP incorrectly. Over the past 20 years, many people have noticed that their lives would be less stressful if the people who implemented the major browsers understood the standards they were implementing.

Consider HTTP Basic Auth. It's good enough for the GitHub API. Tons of people are perfectly happy to use it on the back end. But nobody uses it on the front end, because browsers built a totally unnecessary restriction into the model - namely, a hideous and unprofessional user experience. Consequently, people have been manually rebuilding their own branded, styled, and usable equivalents to Basic Auth for almost every app, ever since the Web began.
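
The protocol piece everybody keeps rebuilding is tiny. Basic Auth is a single header: the client base64-encodes user:password and sends it along. A minimal Node.js sketch:

```javascript
// Build an HTTP Basic Auth header, per the classic scheme (RFC 2617):
// base64("user:password"), prefixed with "Basic".
function basicAuthHeader(user, password) {
  const credentials = Buffer.from(`${user}:${password}`).toString("base64");
  return `Basic ${credentials}`;
}

basicAuthHeader("aladdin", "opensesame");
// → "Basic YWxhZGRpbjpvcGVuc2VzYW1l"
```

Everything else - the login form, the styling, the session machinery - is the re-implementation layered on top.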

By pushing authentication towards the front end and away from an otherwise perfectly viable aspect of the fundamental protocol, browser vendors encouraged PHP developers to handle cryptographic issues, and discouraged HTTP server developers from doing so. This was perhaps not the most responsible move they could have made. Also, the total dollar value of the effort expended to re-implement HTTP Basic Auth on top of HTTP, in countless instances, over the course of twenty years, is probably an immense amount of money.

Returning to Dropbox, consider this part here again:

Uploads large files to Dropbox in multiple chunks. Also has the ability to resume if the upload is interrupted. This allows for uploads larger than the /files_put maximum of 150 MB.

Compare that to the Accept-Ranges header, from the HTTP spec.

One use case for this header is a chunked upload. Your server tells you the acceptable range of bytes to send along, your client sends the appropriate range of bytes, and you thereby chunk your upload.

Dropbox decided to take exactly this approach, with the caveat that the Dropbox API communicates an acceptable range of bytes using a JSON payload instead of an HTTP header.

Typical usage:

  • Send a PUT request to /chunked_upload with the first chunk of the file without setting upload_id, and receive an upload_id in return.
  • Repeatedly PUT subsequent chunks using the upload_id to identify the upload in progress and an offset representing the number of bytes transferred so far.
  • After each chunk has been uploaded, the server returns a new offset representing the total amount transferred.
  • After the last chunk, POST to /commit_chunked_upload to complete the upload.
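
Those steps amount to a short loop. Here's a sketch in Node.js - the endpoint names come from the Dropbox docs quoted above, but the put and post functions are hypothetical stand-ins for real HTTP calls:

```javascript
// Sketch of the chunked-upload flow described above. `put` and `post`
// are injected stand-ins for real HTTP requests, so the protocol logic
// stays independent of any particular HTTP library.
async function chunkedUpload(file, chunkSize, { put, post }) {
  let uploadId = null;
  let offset = 0;
  while (offset < file.length) {
    const chunk = file.slice(offset, offset + chunkSize);
    const res = await put("/chunked_upload", {
      upload_id: uploadId, // null on the first chunk; the server assigns one
      offset,              // bytes transferred so far
      chunk,
    });
    uploadId = res.upload_id;
    offset = res.offset;   // the server reports the new running total
  }
  return post("/commit_chunked_upload", { upload_id: uploadId });
}
```

If the connection drops, you resume by asking the server for its current offset and re-entering the loop - which is exactly the bookkeeping HTTP's own range machinery already standardizes.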

Google Maps does something similar with its API. It differs from the Dropbox approach in that, instead of an endpoint, it uses a CGI query parameter. But Google Maps went a little further than Dropbox here. They decided that ignoring a perfectly good HTTP header was not good enough, and instead went so far as to invent new HTTP headers which serve the exact same purpose:

To initiate a resumable upload, make a POST or PUT request to the method's /upload URI, including an uploadType=resumable parameter:


For this initiating request, the body is either empty or it contains the metadata only; you'll transfer the actual contents of the file you want to upload in subsequent requests.

Use the following HTTP headers with the initial request:

  • X-Upload-Content-Type. Set to the media MIME type of the upload data to be transferred in subsequent requests.
  • X-Upload-Content-Length. Set to the number of bytes of upload data to be transferred in subsequent requests. If the length is unknown at the time of this request, you can omit this header.
  • Content-Length. Set to the number of bytes provided in the body of this initial request. Not required if you are using chunked transfer encoding.

It's possible that the engineers at Google and Dropbox know some limitation of Accept-Ranges that I don't. They're great companies, of course. But it's also possible they just don't know what they're doing, and that's my assumption here. If you've ever been to Silicon Valley and met some of these people, you're probably already assuming the same thing. Hiring great engineers is very difficult, even for companies like Google and Dropbox. Netflix faces terrific scaling challenges and its engineers are still only human.

Anyway, combine this with the decades-running example of HTTP Basic Auth, and it becomes painfully obvious that those who do not understand HTTP are doomed to re-implement it on top of itself.

If you're a developer who understands HTTP, you've probably seen many similar examples already. If not, trust me: they're out there. And this widespread propagation of HTTP-illiterate APIs imposes unnecessary and expensive problems in scaling, maintenance, and technical debt.

One example: you should version with the Accept header, not your URI, because:

Tying your clients into a pre-set understanding of URIs tightly couples the client implementation to the server; in practice, this makes your interface fragile, because any change can inadvertently break things, and people tend to like to change URIs over time.
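
Concretely, Accept-header versioning means the client names the representation it wants and the server dispatches on that header, leaving URIs alone. A sketch, using a hypothetical vendor media type:

```javascript
// The client sends, e.g.:  Accept: application/vnd.example.v2+json
// The server routes on the version embedded in that media type,
// so URIs never have to change when the representation does.
function selectVersion(acceptHeader, handlers) {
  const match = /vnd\.example\.(v\d+)\+json/.exec(acceptHeader || "");
  const version = match ? match[1] : "v1"; // plain clients get v1
  return handlers[version] || handlers.v1;
}

const handlers = {
  v1: () => ({ name: "Widget" }),
  v2: () => ({ name: "Widget", sku: "W-1001" }),
};

selectVersion("application/vnd.example.v2+json", handlers)();
// → { name: "Widget", sku: "W-1001" }
```

The vendor type and handler names here are invented; the point is only that the version lives in content negotiation, not the URI.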

But this opens up some broader questions about APIs, so let's take a step back for a second.

APIs and JSON Schema

If you're working on a modern web app, with the usual message queues and microservices, you're working on a distributed system.

Not long ago, a company had a bug in their app, which was a modern web app, with the usual message queues and microservices. In other words, they had a bug in their distributed system. Attempts to debug the issue turned into meetings to figure out how to debug the issue. The meetings grew bigger and bigger, bringing in more and more developers, until somebody finally discovered that one microservice was passing invalid data to another microservice.

So a Panda Strike developer told this company about JSON Schema.

Distributed systems often use schemas to prevent small bugs in data transmission from metastasizing into paralyzing mysteries or dangerous security failures. The Rails and Rubygems YAML bugs of 2013 provide a particularly alarming example of how badly things can go wrong when a distributed system's input is not type-safe. Rails used an attr_accessible/attr_protected system for most of its existence - at least as early as 2005 - but switched to its new "strong parameters" system with the release of Rails 4 in 2013.

Here's a representative line of "strong parameters" code - an odd thing to find in a controller:

params.require(:email).permit(:first_name, :last_name, :shoe_size)

With verbs like require and permit, this is basically a half-assed, bolted-on implementation of a schema. It's a document, written in Ruby for some insane reason, located in a controller file for some even more insane reason, which articulates what data's required, and what data's permitted. That's a schema. attr_accessible and attr_protected served a similar purpose more crudely - the one defining a whitelist, the other a blacklist.

In Rails 3, you defined your schema with attr_accessible, which lived in the model. In Rails 4, you use "strong parameters," which go in the controller. (In fact, I believe most Rails developers today define their schema in Ruby twice - via "strong parameters," for input, and via ActiveModel::Serializer, for output.) When you see people struggling to figure out where to shoehorn some functionality into their system, it usually means they haven't figured out what that functionality is.

But we know it's a schema. So we can make more educated decisions about where to put it. In my opinion, whether you're using Rails or any other technology, you should solve this problem by providing a schema for your API, using the JSON Schema standard. Don't put schema-based input-filtering in your controller or your model, because data which fails to conform to the schema should never even reach application code in the first place.

There's a good reason that schemas have been part of distributed systems for decades. A schema formalizes your API, making life much easier for your API consumers - which realistically includes not only all your client developers, but also you yourself, and all your company's developers as well.

JSON Schema is great for this. JSON Schema provides a thorough and extensible vocabulary for defining the data your API can use. With it, any developer can very easily determine if their data's legit, without first swamping your servers in useless requests. JSON Schema's on draft 4, and draft 5 is being discussed. From draft 3 onwards, there's an automated test suite which anyone can use to validate their validators; JSON Schema is in fact itself a JSON schema which complies with JSON Schema.

A trivial JSON Schema schema fits in a few lines, and you can even write it in CSON, which is just JSON with CoffeeScript syntax.
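
For illustration, here's a minimal draft 4 schema, rendered as the plain JavaScript object that CSON compiles down to. The property names are invented for the example:

```javascript
// A trivial draft 4 schema: a user object with one required property.
// Property names here are made up for illustration.
const userSchema = {
  $schema: "http://json-schema.org/draft-04/schema#",
  type: "object",
  properties: {
    email: { type: "string" },
    shoe_size: { type: "number" },
  },
  required: ["email"],
};
```

In CSON you'd drop the braces, quotes, and commas, but it's exactly the same data.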

One really astonishing benefit of JSON Schema is that it makes it possible to create libraries which auto-generate API clients from JSON Schema definitions. Panda Strike has one such library, called Patchboard, which we've had terrific results with, and which I hope to blog about in future. Heroku also has a similar technology, written in Ruby, although their documentation contains a funny error:

We’ve also seen interest in this toolchain from API developers outside of Heroku, for example [reference customer]. We’d love to see more external adoption of this toolkit and welcome discussion and feedback about it.

That's an actual quote. Typos aside, JSON Schema makes life easier for ops at scale, both in Panda Strike's experience, and apparently in Heroku's experience as well.

JSON Schema vs Joi's proprietary format

However, although JSON Schema's got an active developer and user community, Walmart Labs has also had significant results with their Joi project, which leverages the benefits of an API schema, but defines that schema in JavaScript rather than JSON - as chains of method calls like Joi.object() and Joi.string().required(), rather than as a data structure.

As part of the Hapi framework, Joi apparently powered Black Friday traffic for Walmart very successfully.

Hapi was able to handle all of Walmart mobile Black Friday traffic with about 10 CPU cores and 28Gb RAM (of course we used more but they were sitting idle at 0.75% load most of the time). This is mind blowing traffic going through VERY little resources.

(The Joi developers haven't explicitly stated what year this was, but my guess is 2013, because this quote was available before Black Friday this past year. Likewise, we don't know exactly how many requests they're talking about here, but it's pretty reasonable to assume "mind-blowing traffic" means a lot of traffic. And it's pretty reasonable to assume they were happy with Joi on Black Friday 2014 as well.)

I love this success story because it validates the general strategy of schema validation with APIs. But at the same time, Joi's developers aren't fans of JSON Schema.

On json-schema - we don't like it. It is hard to read, write, and maintain. It also doesn't support some of the relationships joi supports. We have no intention of supporting it. However, hapi will soon allow you to use whatever you want.

At Panda Strike, we haven't really had these problems, and JSON Schema has a couple advantages that Joi's custom format lacks.

The most important advantage: multi-language support. JSON's universality is quickly making it the default data language for HTTP, which is the default data transport for more or less everything in the world built after 1995. Defining your API schema in JSON means you can consume and validate it in any language you wish.

It might even be fair to leave off the "JS" and call it ON Schema, because in practice, JSON Schema validators will often allow you to pass them an object in their native languages. In Ruby, for example, you can hand a validator a plain Ruby hash rather than a string of JSON. You still have to use strings for the keys, but it'd be easy to circumvent that, in the classic Rails way, with the ActiveSupport library. The same goes for Python. If you've built something with Python and JSON Schema, and you decide to rebuild in Ruby, you won't have to port the schema.

It's equally true for Clojure, Go, or Node.js. And it's not at all difficult to imagine that a company might port services from Python or Ruby to Clojure, Go, or Node, especially if speed's essential for those services. At a certain point in a project's lifecycle, it's actually quite common to isolate some very specific piece of your system for a performance boost, and to rewrite some important slice of your app as a microservice, with a new focus on speed and scalability. Because of this, it makes a lot of sense to decouple an API's schema from the implementation language for any particular service which uses the API.
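
To make the portability point concrete: because a JSON Schema is plain data, a consumer in any language only has to walk it. Here's a deliberately tiny validator covering just the type, required, and properties keywords - a toy to show the mechanics, not a substitute for a real validator like JSCK:

```javascript
// A toy validator for a tiny subset of JSON Schema draft 4:
// handles only `type`, `required`, and nested `properties`.
function validate(schema, data) {
  const types = { object: "object", string: "string", number: "number", boolean: "boolean" };
  if (schema.type && typeof data !== types[schema.type]) return false;
  if (schema.type === "object" && (data === null || Array.isArray(data))) return false;
  for (const key of schema.required || []) {
    if (!(key in data)) return false;
  }
  for (const [key, subschema] of Object.entries(schema.properties || {})) {
    if (key in data && !validate(subschema, data[key])) return false;
  }
  return true;
}

const schema = {
  type: "object",
  required: ["email"],
  properties: { email: { type: "string" }, shoe_size: { type: "number" } },
};

validate(schema, { email: "a@example.com", shoe_size: 11 }); // → true
validate(schema, { shoe_size: "eleven" });                   // → false
```

The same schema object could be serialized to JSON and fed, unchanged, to a validator written in Ruby, Python, or Go.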

JSON Schema's universality makes it portable in a way that Joi's pure JavaScript schemas cannot achieve. (This is also true for the half-implemented pure-Ruby schemas buried inside Rails's "strong parameters" system.)

Another fun use case for JSON Schema: describing valid config files for any service written in any language. This might be annoying for those of you who prefer writing your config files in Ruby, or Clojure, or whatever language you prefer, but it has a lot of practical utility. The most obvious argument for JSON Schema is that it's a standard, which has a lot of inherent benefits, but the free bonus prize is that it's built on top of an essentially universal data description language.

And one final quibble with Joi: it throws some random, miscellaneous text munging into the mix, which doesn't make perfect sense as part of a schema validation and definition library.

JSCK: Fast as fuck

If it seems like I'm picking on Joi, there's a reason. Panda Strike's written a very fast JSON Schema validator, and in terms of performance, Joi is its only serious competitor.

Discussing a blog post on Cosmic Realms which benchmarked JSON Schema validators and found Joi to be too slow, a member of the Joi community said this:

Joi is actually a lot faster, from what I can tell, than any json schema validator. I question the above blog's benchmark and wonder if they were creating the joi schema as part of the iteration (which would be slower than creating it as setup).

The benchmark in question did make exactly that mistake in the case of JSV, one of the earliest JSON Schema validators for Node.js. I know this because Panda Strike built another of the very earliest JSON Schema validators for Node. It's called JSCK, and we've been benchmarking JSCK against every other Node.js JSON Schema validator we can find. Not only is it easily the fastest option available, in some cases it is faster by multiple orders of magnitude.

We initially thought that JSV was one of these cases, but we double-checked to be sure, and it turns out that the JSV README encourages the mistake of re-creating the schema on every iteration, as opposed to only during setup. We had thought JSCK was about 10,000 times faster than JSV, but when we corrected for this, we found that JSCK was only about 100 times faster.
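
The difference is easy to see in code. Here's the shape of the mistake, with a stand-in validator whose constructor simulates the cost of compiling a schema (real validators pay that cost during setup):

```javascript
// A stand-in for a real validator: construction "compiles" the schema
// (simulated with busywork); validation itself is cheap.
class FakeValidator {
  constructor(schema) {
    this.schema = schema;
    for (let i = 0; i < 1e5; i++); // simulated compilation cost
  }
  validate(doc) { return typeof doc === "object" && doc !== null; }
}

const schema = { type: "object" };

// Wrong: recreates the validator inside the measured loop, so the
// benchmark mostly measures schema compilation, not validation.
function badIteration(doc) {
  return new FakeValidator(schema).validate(doc);
}

// Right: compile once in setup, then measure only validation.
const compiled = new FakeValidator(schema);
function goodIteration(doc) {
  return compiled.validate(doc);
}
```

Both iterations return the same answers; they just charge the compilation cost to different places, which is exactly how a benchmark ends up off by a factor of a hundred.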

(I filed a pull request to make the JSV README clearer, to prevent similar misunderstandings, but the project appears to be abandoned.)

So, indeed, the Cosmic Realms benchmarks do under-represent JSV's speed in this way, which means it's possible they under-represent Joi's speed in the same way also. I'm not actually sure. I hope to investigate in future, and I go into some relevant numbers further down in this blog post.

However, this statement seems very unlikely to me:

Joi is actually a lot faster, from what I can tell, than any json schema validator.

It is not impossible that Joi might turn out to be a few fractions of a millisecond faster than JSCK, under certain conditions, but Joi is almost definitely not "a lot faster" than JSCK.

Let's look at this in more detail.

JSCK benchmarks

The Cosmic Realms benchmarks use a trivial example schema; our benchmarks for JSCK use a trivial schema too, but we also use a more medium-complexity schema, and a very complex schema with nesting and other subtleties. We used a multi-schema benchmarking strategy to make the data more meaningful.

I'm going to show you these benchmarks, but first, here's the short version: JSCK is the fastest JSON Schema validator for Node.js - for both draft 3 and draft 4 of the spec, and for all three levels of complexity that I just mentioned.

Here's the long version. It's a matrix of libraries and schemas. We present the maximum, minimum, and median number of validations per second, for each library, against each schema, with the schemas organized by their complexity and JSON Schema draft. We also calculate the relative speed of each library, which basically means how many times slower than JSCK a given library is. For instance, json-gate clocks in at 3.4x to 3.9x slower than JSCK.
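
That relative-speed figure is just a ratio of throughputs. With invented numbers:

```javascript
// "Relative speed" is JSCK's validations-per-second divided by the
// other library's. The figures below are invented for illustration.
function timesSlower(jsckOpsPerSec, otherOpsPerSec) {
  return jsckOpsPerSec / otherOpsPerSec;
}

timesSlower(100000, 29000).toFixed(1); // → "3.4"
```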

The jayschema results are an outlier, but JSCK is basically faster than anything.

When Panda Strike first created JSCK, few other JSON Schema validation libraries existed for Node.js. Now that there are so many new alternatives, it's pretty exciting to see that JSCK remains the fastest option.

However, if you're also considering Joi, my best guess is that, for trivial schemas, Joi is about the same speed as JSCK, which is obviously pretty damn fast. I can't currently say anything about its relative performance on complex schemas, but I can say that much.

Here's why. There's a project called enjoi which automatically converts trivial JSON Schemas to Joi's format. It ships with benchmarks against tv4. The benchmarks run a trivial schema, and this is how they look on my box:

tv4 vs joi benchmark:

  tv4: 22732 operations/second. (0.0439918ms)
  joi: 48115 operations/second. (0.0207834ms)

For a trivial draft 4 schema, Joi is more than twice as fast as tv4. Our benchmarks show that for trivial draft 4 schemas, JSCK is also more than twice as fast as tv4. So, until I've done further investigation, I'm happy to say they look to be roughly the same speed.

However, JSCK's speed advantage over tv4 increases to 5x with a more complex schema. As far as I can tell, nobody's done the work to translate a complex JSON Schema into Joi's format and benchmark the results. So there's no conclusive answer yet for the question of how Joi's speed holds up against greater complexity.

Also, of course, these specific results are dependent on the implementation details of enjoi's schema translation, and if you make any comparison between Joi and a JSON Schema validator, you should remember there's an apples-to-oranges factor.

Nonetheless, JSCK is very easily the fastest JSON Schema validator for Node.js, and although Joi might be able to keep up in terms of performance, a) it might not, and b) either way, its format locks you into a specific language, whereas JSON Schema gives you wide portability and an extraordinary diversity of options.

We are therefore very proud to recommend that you use JSCK if you want fast JSON Schema validation in Node.js.

I'm doing a presentation about JSCK at ForwardJS in early February. Check it out if you're in San Francisco.

Wednesday, January 7, 2015

One Major Difference Between Clojure And Common Lisp

In the summer of 2013, I attended an awesome workshop called WACM (Workshop on Algorithmic Computer Music) at the University of California at Santa Cruz. Quoting from the WACM site:

Students will learn the Lisp computer programming language and create their own composition and analysis software. The instruction team will be led by professor emeritus David Cope, noted composer, author, and programmer...

The program features intensive classes on the basic techniques of algorithmic composition and algorithmic music analysis, learning and using the computer programming language Lisp. Students will learn about Markov-based rules programs, genetic algorithms, and software modeled on the Experiments in Musical Intelligence program. Music analysis software and techniques will also be covered in depth. Many compositional approaches will be discussed in detail, including rules-based techniques, data-driven models, genetic algorithms, neural networks, fuzzy logic, mathematical modeling, and sonification. Software programs such as Max, Open Music, and others will also be presented.

It was as awesome as it sounds, with some caveats; for instance, it was a lot to learn inside of two weeks. I was one of a very small number of people there with actual programming experience; most of the attendees either had music degrees or were in the process of getting them. We worked in Common Lisp, but I came with a bunch of Clojure books (in digital form) and the goal of building stuff using Overtone.

I figured I could just convert Common Lisp code almost directly into Clojure, but it didn't work. I posted a gist of my attempt during the workshop, and it failed for a couple of different reasons. First, my code assumed that (if (null list1)) in Common Lisp would be equivalent to (if (nil? list1)) in Clojure, but Clojure doesn't consider an empty list to have a nil value. Second, my code tried to handle lists in the classic Lisp way, with recursion, and that's not what you typically do in Clojure.

Clojure's reliance on the JVM makes recursion inconvenient. And Clojure uses list comprehensions, along with very sophisticated, terse destructuring assignments, to churn through lists much more gracefully than my Common Lisp code did. Seven lines of Common Lisp compress to 2 lines of Clojure:

(defn build [seq1 seq2]
  (for [elem1 seq1 elem2 seq2] [elem1 elem2]))

A friend of mine once said at a meetup that Clojure isn't really a Lisp; it's "a fucked-up functional language" with all kinds of weird quirks which uses Lisp syntax out of nostalgia more than anything else. To me, those quirks aren't enough to earn Clojure that judgement, which was kinda harsh anyway. I think I like Clojure more than he does. But, at the same time, if you're looking to translate stuff from other Lisps into Clojure, it's not going to be just copying and pasting. Beyond inconsequential, dialect-level differences like defn vs. defun, there are deeper differences which steepen the learning curve a little.

Monday, January 5, 2015

Versioning Is A Nuanced Social Fiction; SemVer Is A Blunt Instrument

David Heinemeier Hansson said something relatively lucid and wise on Twitter recently.

To his credit, he also realized that somebody else had already said it better.

Here's the nub and the gist of Jeremy Ashkenas's Gist:

SemVer tries to compress a huge amount of information — the nature of the change, the percentage of users that will be affected by the change, the severity of the change (Is it easy to fix my code? Or do I have to rewrite everything?) — into a single number. And unsurprisingly, it's impossible for that single number to contain enough meaningful information...

Ultimately, SemVer is a false promise that appeals to many developers — the promise of pain-free, don't-have-to-think-about-it, updates to dependencies. But it simply isn't true.

It's extremely worthwhile to read the whole thing.

Here's how I see version numbers: they predate Git, and Git makes version numbers pretty stupid if you take those numbers literally, because we now use hashes like 64f2a2451381c80dff1 to identify specific versions of our code bases. Strictly speaking, version numbers are fictional. If you really want to know what version you're looking at, the answer to that question is not a number at all, but a Git hash.

But we still use version numbers. We do this for the same reason that, even if we one day replace every car on the road with an error-proof robot which is only capable of perfect driving, we will still have speed limits, brake lights, and traffic signs. It's the same reason there's an urban legend that the width of the Space Shuttle ultimately derives from the width of roads in Imperial Rome: systems often outlive their original purposes.

Version numbers were originally used to identify specific versions of a code base, but that hasn't been strictly accurate since the invention of version control systems, whose history goes back at least 43 years, to 1972. As version control systems became more and more fine-grained, version numbers diverged further and further from the actual identifiers we use to index our versioning systems, and thus "version numbers" became more and more a social fiction.

Note that this is not necessarily a bad thing. Money is a social fiction, and an incredibly useful one. But SemVer is an attempt to treat the complexities of a social fiction as if they were very deterministic and controlled.

They are not.

Which means SemVer is an attempt to brutally oversimplify an inherently complex problem.

There's a lot of good commentary on these complexities. Justin Searls gave a very good presentation which goes into why these problems are inherently complex, and inherently social.

I'm not saying that I don't think SemVer's goals are important. But I do think SemVer's a clumsy replacement for nuanced versioning, and an incomplete answer for "how do we demarcate incompatibility risks in systems made up of extremely numerous libraries written by extremely numerous people?"

Because version numbers are a social fiction, entirely distinct from the "numbers" we use to actually version our software in modern version control systems, choosing new version numbers is primarily a matter of communicating with your users. Like all communication, it is inherently complex and nuanced. If it is possible at all to reliably automate the communication of nuance, the medium of communication will probably not be a trio of numbers, because the problem space simply has far more dimensions than three.

But for the same reason, I kind of think version numbers verge on ridiculous whether they're trying to color within the SemVer lines or not. There's only so much weight you can expect a social fiction to carry before it cracks at the seams and falls apart. Even the idea of a canonical repo is a little silly in my opinion.

You can see why the canonical repo is a mistake if you look at a common antipattern on GitHub: a project is abandoned, but development continues within multiple forks of the project. Which repo is now canonical? You have to examine each fork, and discover how well it keeps up with the overall, now-decentralized progress of the project. You'll often find that Fork A does a better job with updates related to one aspect of the project, while Fork B does a better job with updates related to another aspect. And it's a manual process; no GitHub view exists which will make it particularly easy for you to determine which of the still-in-progress forks are continuing ahead of the "canonical" repo.

At the very least, in a situation like this, you have to differentiate between the original repo and the canonical one. I think that much is indisputable. But I'd argue also that the basic idea of a canonical repo operates in defiance of the entire history of human language. In fact, rumor has it that GitHub itself runs on a private fork of Rails 2, which illustrates my point perfectly, by constituting a local dialect.

(Update: GitHub ran on a private fork of Rails 2 for many years, but moved to Rails 3 in September 2014. Thanks to Florian Gilcher for the details.)

I'd like to see some anthropologists and linguists research our industry, because the modern dev world, with its countless and intricately interwoven dependencies, presents some really complex and subtle problems.