Thursday, January 15, 2015

Why Panda Strike Wrote the Fastest JSON Schema Validator for Node.js

Update: We're not the fastest any more, but we still have the best benchmarks.

After reading this post, you will know:

  • why those who do not understand HTTP are doomed to re-implement it on top of itself
  • why your API needs a schema, and why JSON Schema is a good way to define one
  • how JSON Schema compares to Joi's proprietary format
  • why JSCK is the fastest JSON Schema validator for Node.js, and how we benchmark it

Because this is a very long blog post, I've followed the GitHub README convention of making every header a link.

Those who do not understand HTTP are doomed to re-implement it on top of itself


Not everybody understands HTTP correctly. For instance, consider the /chunked_upload endpoint in the Dropbox API:

Uploads large files to Dropbox in multiple chunks. Also has the ability to resume if the upload is interrupted. This allows for uploads larger than the /files_put maximum of 150 MB.

Since this is an alternative to /files_put, you might wonder what the deal is with /files_put.

Uploads a file using PUT semantics. Note that this call goes to api-content.dropbox.com instead of api.dropbox.com.

The preferred HTTP method for this call is PUT. For compatibility with browser environments, the POST HTTP method is also recognized.

To be fair to Dropbox, "for compatibility with browser environments" refers to the fact that, of the people I previously mentioned - the ones who do not understand HTTP - many have day jobs where they implement the three major browsers. I think "for compatibility with browser environments" also refers to the related fact that the three major browsers often implement HTTP incorrectly. Over the past 20 years, many people have noticed that their lives would be less stressful if the people who implemented the major browsers understood the standards they were implementing.

Consider HTTP Basic Auth. It's good enough for the GitHub API. Tons of people are perfectly happy to use it on the back end. But nobody uses it on the front end, because browsers built a totally unnecessary restriction into the model - namely, a hideous and unprofessional user experience. Consequently, people have been manually rebuilding their own branded, styled, and usable equivalents to Basic Auth for almost every app, ever since the Web began.
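
Mechanically, there's almost nothing to it: the client sends an Authorization header containing the base64-encoded "username:password" pair, and curl will build that header for you. A minimal sketch against the GitHub API (the credentials are placeholders; a personal access token works as the password):

    # curl constructs "Authorization: Basic dXNlcjpwYXNzd29yZA==" from user:password
    curl -u user:password https://api.github.com/user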

By pushing authentication towards the front end and away from an otherwise perfectly viable aspect of the fundamental protocol, browser vendors encouraged PHP developers to handle cryptographic issues, and discouraged HTTP server developers from doing so. This was perhaps not the most responsible move they could have made. Also, the total dollar value of the effort expended to re-implement HTTP Basic Auth on top of HTTP, in countless instances, over the course of twenty years, is probably an immense amount of money.

Returning to Dropbox, consider this part here again:

Uploads large files to Dropbox in multiple chunks. Also has the ability to resume if the upload is interrupted. This allows for uploads larger than the /files_put maximum of 150 MB.

Compare that to the Accept-Ranges header from the HTTP spec, which lets a server advertise that it will accept range requests for a given resource.

One use case for this header is a chunked upload. Your server tells you the acceptable range of bytes to send along, your client sends the appropriate range of bytes, and you thereby chunk your upload.
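
Here's a hedged sketch of what that exchange could look like with nothing but standard headers - the URI, chunk sizes, and the 308 status line are purely illustrative:

    (server, after an interrupted upload: "I have the first 4 MB")
    HTTP/1.1 308 Resume Incomplete
    Range: bytes=0-4194303

    (client, sending the next 4 MB chunk of a 20 MB file)
    PUT /uploads/video.mp4 HTTP/1.1
    Content-Range: bytes 4194304-8388607/20971520
    Content-Length: 4194304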

Dropbox decided to take exactly this approach, with the caveat that the Dropbox API communicates an acceptable range of bytes using a JSON payload instead of an HTTP header.

Typical usage:

  • Send a PUT request to /chunked_upload with the first chunk of the file without setting upload_id, and receive an upload_id in return.
  • Repeatedly PUT subsequent chunks using the upload_id to identify the upload in progress and an offset representing the number of bytes transferred so far.
  • After each chunk has been uploaded, the server returns a new offset representing the total amount transferred.
  • After the last chunk, POST to /commit_chunked_upload to complete the upload.
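
Strung together with curl, and leaving out authentication and error handling, that flow sketches out roughly like this (the version prefix and the root/path segment on the commit call are illustrative; see Dropbox's docs for the exact URL shapes):

    # first chunk: no upload_id yet; the response returns one, plus an offset
    curl -X PUT --data-binary @chunk.0 \
      "https://api-content.dropbox.com/1/chunked_upload"

    # subsequent chunks: identify the upload, and say how many bytes you've sent so far
    curl -X PUT --data-binary @chunk.1 \
      "https://api-content.dropbox.com/1/chunked_upload?upload_id=SOME_ID&offset=4194304"

    # after the last chunk, commit the upload to a path
    curl -X POST \
      "https://api-content.dropbox.com/1/commit_chunked_upload/auto/video.mp4?upload_id=SOME_ID"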

Google Maps does something similar with its API. It differs from the Dropbox approach in that, instead of an endpoint, it uses a CGI query parameter. But Google Maps went a little further than Dropbox here. They decided that ignoring a perfectly good HTTP header was not good enough, and instead went so far as to invent new HTTP headers which serve the exact same purpose:

To initiate a resumable upload, make a POST or PUT request to the method's /upload URI, including an uploadType=resumable parameter:

POST https://www.googleapis.com/upload/mapsengine/v1/rasters/{asset_id}/files
 ?filename={filename}
 &uploadType=resumable

For this initiating request, the body is either empty or it contains the metadata only; you'll transfer the actual contents of the file you want to upload in subsequent requests.

Use the following HTTP headers with the initial request:

  • X-Upload-Content-Type. Set to the media MIME type of the upload data to be transferred in subsequent requests.
  • X-Upload-Content-Length. Set to the number of bytes of upload data to be transferred in subsequent requests. If the length is unknown at the time of this request, you can omit this header.
  • Content-Length. Set to the number of bytes provided in the body of this initial request. Not required if you are using chunked transfer encoding.
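
Putting the quoted pieces together, an initiating request might look roughly like this (the bearer token, MIME type, and byte count are placeholders):

    POST /upload/mapsengine/v1/rasters/{asset_id}/files?filename={filename}&uploadType=resumable HTTP/1.1
    Host: www.googleapis.com
    Authorization: Bearer {access_token}
    Content-Length: 0
    X-Upload-Content-Type: image/tiff
    X-Upload-Content-Length: 157286400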

It's possible that the engineers at Google and Dropbox know some limitation of Accept-Ranges that I don't. They're great companies, of course. But it's also possible they just don't know what they're doing, and that's my assumption here. If you've ever been to Silicon Valley and met some of these people, you're probably already assuming the same thing. Hiring great engineers is very difficult, even for companies like Google and Dropbox. Netflix faces terrific scaling challenges, and its engineers are still only human.


Anyway, combine this with the decades-running example of HTTP Basic Auth, and it becomes painfully obvious that those who do not understand HTTP are doomed to re-implement it on top of itself.

If you're a developer who understands HTTP, you've probably seen many similar examples already. If not, trust me: they're out there. And this widespread propagation of HTTP-illiterate APIs imposes unnecessary and expensive problems in scaling, maintenance, and technical debt.

One example: you should version with the Accept header, not your URI, because:

Tying your clients into a pre-set understanding of URIs tightly couples the client implementation to the server; in practice, this makes your interface fragile, because any change can inadvertently break things, and people tend to like to change URIs over time.
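
Concretely, the difference looks like this; the vendor media type is made up, just for illustration:

    # versioning in the URI couples every client to a URI layout:
    GET /v2/users/42 HTTP/1.1
    Host: api.example.com

    # versioning in the Accept header leaves the URI alone:
    GET /users/42 HTTP/1.1
    Host: api.example.com
    Accept: application/vnd.example.v2+json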

But this opens up some broader questions about APIs, so let's take a step back for a second.

APIs and JSON Schema


If you're working on a modern web app, with the usual message queues and microservices, you're working on a distributed system.

Not long ago, a company had a bug in their app, which was a modern web app, with the usual message queues and microservices. In other words, they had a bug in their distributed system. Attempts to debug the issue turned into meetings to figure out how to debug the issue. The meetings grew bigger and bigger, bringing in more and more developers, until somebody finally discovered that one microservice was passing invalid data to another microservice.

So a Panda Strike developer told this company about JSON Schema.

Distributed systems often use schemas to prevent small bugs in data transmission from metastasizing into paralyzing mysteries or dangerous security failures. The Rails and Rubygems YAML bugs of 2013 provide a particularly alarming example of how badly things can go wrong when a distributed system's input is not type-safe. Rails used an attr_accessible/attr_protected system for most of its existence - at least as early as 2005 - but switched to its new "strong parameters" system with the release of Rails 4 in 2013.

Here's some "strong parameters" code:
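
Something along these lines - a minimal sketch, where the controller, model, and field names are invented purely for illustration:

    # app/controllers/signups_controller.rb
    class SignupsController < ApplicationController
      def create
        signup = Signup.create!(signup_params)
        render json: signup
      end

      private

      # the whitelist: what's required, and what's permitted
      def signup_params
        params.require(:email).permit(:first_name, :last_name, :shoe_size)
      end
    end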

This line in particular stands out as an odd choice for a line of code in a controller:

params.require(:email).permit(:first_name, :last_name, :shoe_size)

With verbs like require and permit, this is basically a half-assed, bolted-on implementation of a schema. It's a document, written in Ruby for some insane reason, located in a controller file for some even more insane reason, which articulates what data's required, and what data's permitted. That's a schema. attr_accessible and attr_protected served a similar purpose more crudely - the one defining a whitelist, the other a blacklist.

In Rails 3, you defined your schema with attr_accessible, which lived in the model. In Rails 4, you use "strong parameters," which go in the controller. (In fact, I believe most Rails developers today define their schema in Ruby twice - via "strong parameters," for input, and via ActiveModel::Serializer, for output.) When you see people struggling to figure out where to shoehorn some functionality into their system, it usually means they haven't figured out what that functionality is.

But we know it's a schema. So we can make more educated decisions about where to put it. In my opinion, whether you're using Rails or any other technology, you should solve this problem by providing a schema for your API, using the JSON Schema standard. Don't put schema-based input-filtering in your controller or your model, because data which fails to conform to the schema should never even reach application code in the first place.
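
To make that concrete, here's one way the shape can look in a Node.js service: a thin validation layer in front of the routes, so handlers only ever see data that has already passed the schema. This is a sketch, not a prescription - validate() stands in for whichever JSON Schema validator you use, and Express is just one convenient way to show the wiring:

    var express = require('express');
    var bodyParser = require('body-parser');

    // placeholder for a real JSON Schema validator's validate step
    function validate(schema, document) {
      return { valid: typeof document.name === 'string', errors: [] };
    }

    var userSchema = {
      type: 'object',
      properties: { name: { type: 'string' } },
      required: ['name']
    };

    function validateAgainst(schema) {
      return function (req, res, next) {
        var result = validate(schema, req.body);
        if (!result.valid) {
          // non-conforming data is rejected here and never reaches the handler
          return res.status(422).json({ errors: result.errors });
        }
        next();
      };
    }

    var app = express();
    app.use(bodyParser.json());

    app.post('/users', validateAgainst(userSchema), function (req, res) {
      // application code only ever sees schema-conformant data
      res.status(201).json(req.body);
    });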

There's a good reason that schemas have been part of distributed systems for decades. A schema formalizes your API, making life much easier for your API consumers - which realistically includes not only all your client developers, but also you yourself, and all your company's developers as well.

JSON Schema is great for this. JSON Schema provides a thorough and extensible vocabulary for defining the data your API can use. With it, any developer can very easily determine if their data's legit, without first swamping your servers in useless requests. JSON Schema's on draft 4, and draft 5 is being discussed. From draft 3 onwards, there's an automated test suite which anyone can use to validate their validators; JSON Schema is in fact itself a JSON schema which complies with JSON Schema.

Here's a trivial JSON Schema schema in CSON, which is just CoffeeScript JSON:
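
For instance, a schema for a user record - a required name, an optional non-negative age - might look like this, with the field names chosen purely for illustration:

    # user.schema.cson
    type: "object"
    properties:
      name:
        type: "string"
      age:
        type: "integer"
        minimum: 0
    required: ["name"]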

One really astonishing benefit of JSON Schema is that it makes it possible to create libraries which auto-generate API clients from JSON Schema definitions. Panda Strike has one such library, called Patchboard, which we've had terrific results with, and which I hope to blog about in future. Heroku also has a similar technology, written in Ruby, although their documentation contains a funny error:

We’ve also seen interest in this toolchain from API developers outside of Heroku, for example [reference customer]. We’d love to see more external adoption of this toolkit and welcome discussion and feedback about it.

That's an actual quote; the docs shipped with the "[reference customer]" placeholder still in place. That glitch aside, JSON Schema makes life easier for ops at scale, both in Panda Strike's experience, and apparently in Heroku's experience as well.

JSON Schema vs Joi's proprietary format


However, although JSON Schema's got an active developer and user community, Walmart Labs has also had significant results with their Joi project, which leverages the benefits of an API schema, but defines that schema in JavaScript rather than JSON. Here's an example:
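
A small Joi schema for the same kind of user record might look like this (the field names are, again, just for illustration):

    var Joi = require('joi');

    var schema = Joi.object().keys({
      name: Joi.string().required(),
      age: Joi.number().integer().min(0)
    });

    Joi.validate({ name: 'Amy', age: 37 }, schema, function (err, value) {
      // err is null when the object conforms to the schema
    });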

As part of the Hapi framework, Joi apparently powered 2013 Black Friday traffic for Walmart very successfully.

Hapi was able to handle all of Walmart mobile Black Friday traffic with about 10 CPU cores and 28Gb RAM (of course we used more but they were sitting idle at 0.75% load most of the time). This is mind blowing traffic going through VERY little resources.

(The Joi developers haven't explicitly stated what year this was, but my guess is 2013, because this quote was available before Black Friday this past year. Likewise, we don't know exactly how many requests they're talking about here, but it's pretty reasonable to assume "mind-blowing traffic" means a lot of traffic. And it's pretty reasonable to assume they were happy with Joi on Black Friday 2014 as well.)

I love this success story because it validates the general strategy of schema validation with APIs. But at the same time, Joi's developers aren't fans of JSON Schema.

On json-schema - we don't like it. It is hard to read, write, and maintain. It also doesn't support some of the relationships joi supports. We have no intention of supporting it. However, hapi will soon allow you to use whatever you want.

At Panda Strike, we haven't really had these problems, and JSON Schema has a couple advantages that Joi's custom format lacks.

The most important advantage: multi-language support. JSON's universality is quickly making it the default data language for HTTP, which is the default data transport for more or less everything in the world built after 1995. Defining your API schema in JSON means you can consume and validate it in any language you wish.

It might even be fair to leave off the "JS" and call it ON Schema, because in practice, JSON Schema validators will often allow you to pass them an object in their native languages. Here's a Ruby example:
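
With the json-schema gem, for instance, you can hand the validator an ordinary Ruby hash (the schema and data here are, once again, illustrative):

    require "json-schema"

    schema = {
      "type" => "object",
      "properties" => {
        "name" => { "type" => "string" }
      },
      "required" => ["name"]
    }

    JSON::Validator.validate(schema, "name" => "Amy")      # => true
    JSON::Validator.validate(schema, "shoe_size" => 9)     # => false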

That was not JSON; that was Ruby. In this example, you still have to use string keys, but it'd be easy to get around that, in the classic Rails way, with the ActiveSupport library. Similar Python examples exist. If you've built something with Python and JSON Schema, and you decide to rebuild in Ruby, you won't have to port the schema.

Crazy example, but it's equally true for Clojure, Go, or Node.js. And it's not at all difficult to imagine that a company might port services from Python or Ruby to Clojure, Go, or Node, especially if speed's essential for those services. At a certain point in a project's lifecycle, it's actually quite common to isolate some very specific piece of your system for a performance boost, and to rewrite some important slice of your app as a microservice, with a new focus on speed and scalability. Because of this, it makes a lot of sense to decouple an API's schema from the implementation language for any particular service which uses the API.

JSON Schema's universality makes it portable in a way that Joi's pure JavaScript schemas cannot achieve. (This is also true for the half-implemented pure-Ruby schemas buried inside Rails's "strong parameters" system.)

Another fun use case for JSON Schema: describing valid config files for any service written in any language. This might be annoying for those of you who prefer writing config files in Ruby, or Clojure, or whatever else, but it has a lot of practical utility. The most obvious argument for JSON Schema is that it's a standard, which has a lot of inherent benefits, but the free bonus prize is that it's built on top of an essentially universal data description language.

And one final quibble with Joi: it throws some random, miscellaneous text munging into the mix, which doesn't make perfect sense as part of a schema validation and definition library.

JSCK: Fast as fuck


If it seems like I'm picking on Joi, there's a reason. Panda Strike's written a very fast JSON Schema validator, and in terms of performance, Joi is its only serious competitor.

Discussing a blog post on cosmicrealms.com which benchmarked JSON Schema validators and found Joi to be too slow, a member of the Joi community said this:

Joi is actually a lot faster, from what I can tell, than any json schema validator. I question the above blog's benchmark and wonder if they were creating the joi schema as part of the iteration (which would be slower than creating it as setup).

The benchmark in question did make exactly that mistake in the case of JSV, one of the earliest JSON Schema validators for Node.js. I know this because Panda Strike built another of the very earliest JSON Schema validators for Node. It's called JSCK, and we've been benchmarking JSCK against every other Node.js JSON Schema validator we can find. Not only is it easily the fastest option available, in some cases it is faster by multiple orders of magnitude.

We initially thought that JSV was one of these cases, but we double-checked to be sure, and it turns out that the JSV README encourages the mistake of re-creating the schema on every iteration, as opposed to only during setup. We had thought JSCK was about 10,000 times faster than JSV, but when we corrected for this, we found that JSCK was only about 100 times faster.
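
In code, the difference between the two benchmark shapes is roughly this - createValidator here is a stand-in, not any particular library's API:

    // createValidator stands in for a real validator's schema-compilation step,
    // which is where much of the setup cost lives.
    function createValidator(schema) {
      return {
        validate: function (doc) { return typeof doc.name === 'string'; }
      };
    }

    var schema = { type: 'object', properties: { name: { type: 'string' } } };
    var doc = { name: 'Amy' };
    var iterations = 100000;

    // Wrong: re-creates the validator on every iteration, so the benchmark
    // measures schema compilation as much as validation.
    for (var i = 0; i < iterations; i++) {
      createValidator(schema).validate(doc);
    }

    // Right: create the validator once during setup, then time validation alone.
    var validator = createValidator(schema);
    for (var j = 0; j < iterations; j++) {
      validator.validate(doc);
    }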

(I filed a pull request to make the JSV README clearer, to prevent similar misunderstandings, but the project appears to be abandoned.)

So, indeed, the Cosmic Realms benchmarks do under-represent JSV's speed in this way, which means it's possible they under-represent Joi's speed in the same way also. I'm not actually sure. I hope to investigate in future, and I go into some relevant numbers further down in this blog post.

However, this statement seems very unlikely to me:

Joi is actually a lot faster, from what I can tell, than any json schema validator.

It is not impossible that Joi might turn out to be a few fractions of a millisecond faster than JSCK, under certain conditions, but Joi is almost definitely not "a lot faster" than JSCK.

Let's look at this in more detail.

JSCK benchmarks


The Cosmic Realms benchmarks use a trivial example schema; our benchmarks for JSCK use a trivial schema too, but we also use a medium-complexity schema, and a very complex schema with nesting and other subtleties. We used a multi-schema benchmarking strategy to make the data more meaningful.

I'm going to show you these benchmarks, but first, here's the short version: JSCK is the fastest JSON Schema validator for Node.js - for both draft 3 and draft 4 of the spec, and for all three levels of complexity that I just mentioned.

Here's the long version. It's a matrix of libraries and schemas. We present the maximum, minimum, and median number of validations per second, for each library, against each schema, with the schemas organized by their complexity and JSON Schema draft. We also calculate the relative speed of each library, which basically means how many times slower than JSCK a given library is. For instance, in the chart below, json-gate is 3.4x to 3.9x slower than JSCK.

The jayschema results are an outlier, but JSCK is basically faster than anything.

When Panda Strike first created JSCK, few other JSON Schema validation libraries existed for Node.js. Now that there are so many new alternatives, it's pretty exciting to see that JSCK remains the fastest option.

However, if you're also considering Joi, my best guess is that, for trivial schemas, Joi is about the same speed as JSCK, which is obviously pretty damn fast. I can't currently say anything about its relative performance on complex schemas, but I can say that much.

Here's why. There's a project called enjoi which automatically converts trivial JSON Schemas to Joi's format. It ships with benchmarks against tv4. The benchmarks run a trivial schema, and this is how they look on my box:

tv4 vs joi benchmark:

  tv4: 22732 operations/second. (0.0439918ms)
  joi: 48115 operations/second. (0.0207834ms)

For a trivial draft 4 schema, Joi is more than twice as fast as tv4. Our benchmarks show that for trivial draft 4 schemas, JSCK is also more than twice as fast as tv4. So, until I've done further investigation, I'm happy to say they look to be roughly the same speed.

However, JSCK's speed advantage over tv4 increases to 5x with a more complex schema. As far as I can tell, nobody's done the work to translate a complex JSON Schema into Joi's format and benchmark the results. So there's no conclusive answer yet for the question of how Joi's speed holds up against greater complexity.

Also, of course, these specific results are dependent on the implementation details of enjoi's schema translation, and if you make any comparison between Joi and a JSON Schema validator, you should remember there's an apples-to-oranges factor.

Nonetheless, JSCK is very easily the fastest JSON Schema validator for Node.js, and although Joi might be able to keep up in terms of performance, a) it might not, and b) either way, its format locks you into a specific language, whereas JSON Schema gives you wide portability and an extraordinary diversity of options.

We are therefore very proud to recommend that you use JSCK if you want fast JSON Schema validation in Node.js.

I'm doing a presentation about JSCK at ForwardJS in early February. Check it out if you're in San Francisco.