Friday, March 7, 2008

Do Users Really Even Exist?

It might sound like a crazy question, so here's an alternate version: is it even sane to assume a one-to-one mapping between usernames and people?

If that one's too provocative, how about this one: do you have more than one e-mail address?

OpenID is a cool idea, and some of its aspects are brilliant. But let's take an example. LiveJournal supports OpenID. So does Digg. The idea that I would want the same login at LiveJournal, where I post my personal soap operas semi-privately to very old friends with too much time on their hands, and Digg, where I post my own blog entries to promote them to the worldwide professional developer community, is flawed at best. (I don't actually post my own stuff to Digg, or in fact anybody's stuff, but I know there are bloggers who do, and that's kind of what the system is for.)

The other day I was reading this very interesting Google white paper on Google News and was struck by the staggering contrast between wizardly mathematical technique and complete idiot foolishness.

The paper describes efforts to recommend stories to users, in the context of a news site (duh), which obviously means a stream of continually changing stories. It says that the usual Bayesian filtering suffers here, because the newness of news items is kind of a fundamental aspect, and stories which were interesting become less interesting when they are less new. It goes on to describe a very complex method for defining user clusters, mapping likely interest to clusters rather than users, and then mapping users to clusters.

The fundamental failure here is the failure to recognize that communities are users. The clusters Google News identifies are not just clusters of users, they are also users in their own right. A cluster of users is a user.

The fundamental task of making code elegant is to make only those distinctions which are logically necessary. There is absolutely no reason for a site which makes recommendations based on user preferences to distinguish between users and clusters of users, especially when a system with high enough resolution will be able to identify when a cluster of users corresponds to an individual human being.
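To make that concrete, here is a minimal sketch of a recommender that refuses to make the distinction. Everything in it is my own illustration, not anything from the Google paper: the `Audience` class, the `merge` and `recommend` functions, and the choice of smoothed click counts as the "preference" model are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Audience:
    """Anything we recommend to: one login, a cluster of logins, a community.

    Hypothetical type for illustration only.
    """
    name: str
    clicks: dict = field(default_factory=dict)  # topic -> times clicked
    views: dict = field(default_factory=dict)   # topic -> times shown

    def record(self, topic, clicked):
        self.views[topic] = self.views.get(topic, 0) + 1
        if clicked:
            self.clicks[topic] = self.clicks.get(topic, 0) + 1

    def click_rate(self, topic):
        # Laplace-smoothed estimate: an unseen topic scores 0.5
        # instead of raising a division-by-zero error.
        return (self.clicks.get(topic, 0) + 1) / (self.views.get(topic, 0) + 2)


def merge(name, members):
    """A cluster is just another Audience whose counts sum its members'."""
    merged = Audience(name)
    for member in members:
        for topic, n in member.views.items():
            merged.views[topic] = merged.views.get(topic, 0) + n
        for topic, n in member.clicks.items():
            merged.clicks[topic] = merged.clicks.get(topic, 0) + n
    return merged


def recommend(audience, topics):
    """Same code path whether the audience is one login or a thousand."""
    return max(topics, key=audience.click_rate)
```

Note that `recommend` never asks what kind of thing it was handed, and `merge` accepts clusters as members too, so clusters of clusters cost nothing extra.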

How many "users" are in this picture?

The fundamental strength of Bayesian analysis is that a set of probabilities is only a set of probabilities. Two users with identical likelihood of clicking the same link are the same thing to a system performing Bayesian analysis. They will sometimes be the same thing in real life. So say I'm a system which separates out individual "users" (e-mail addresses) from a cluster, and the cluster in fact corresponds exactly to this guy at his computer. I've just thrown away valuable data.
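A rough sketch of what "the same thing" means to the system, assuming a toy model in which each login is reduced to a tuple of smoothed click rates. The function names, the rounding, and the sample logins in the usage note are all made up for illustration.

```python
from collections import defaultdict


def profile(history, topics, ndigits=2):
    """Reduce a history of (topic, clicked) pairs to a tuple of click rates."""
    rates = []
    for topic in topics:
        shown = sum(1 for t, _ in history if t == topic)
        clicked = sum(1 for t, c in history if t == topic and c)
        # Laplace smoothing so a never-shown topic gets 0.5.
        rates.append(round((clicked + 1) / (shown + 2), ndigits))
    return tuple(rates)


def group_by_profile(histories, topics):
    """Group logins whose probability profiles are identical."""
    groups = defaultdict(list)
    for login, history in histories.items():
        groups[profile(history, topics)].append(login)
    return sorted(sorted(logins) for logins in groups.values())
```

Feed it two hypothetical logins with the same history, say `giles@work` and `giles@home`, and they land in the same group. Nothing in the numbers says whether that group is one person with two addresses or two people with one taste, and nothing in the numbers needs to.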

There's a great deal of truly complex math in this Google paper, but the authors seem to have never taken even a moment to ask themselves what the formulae represent. A cluster of users can be a user. Even in the statistically more likely scenario where a cluster of users represents a set of people, however, it's still more valuable to target the cluster as your user than it is to target any one specific human, because there are a lot of situations where the person doesn't make the decision and the community does. In social software, societies are users. Google's following a truly insane strategy here: we have Google Search, which is one size fits all, and we have Google News, which is customized for you personally, and we have absolutely nothing in between.

Is it really more valuable to know what Giles Bowkett is interested in at any given moment than it is to know what the overall Ruby community is interested in? Of course not. Part of the reason people read the news is to know what other people are going to be talking about. Tracking what communities are reading is at least as useful for a news site as tracking what users are reading. The clusters part of the Google strategy made perfect sense, but factoring it back out to the user level was just wasted effort.

After all, it works the other way around, too. Just as a human can have more than one login, a login can have more than one human. My parents share the e-mail address they use to buy and sell on eBay, and any calculation Google News might make about that "user's" interests in the news will be skewed by whether, on that particular day, it happens to be my mom selling a first-edition James Bond novel or my dad buying solar panels. A small business I sometimes freelance for shares its web hosting login and its Google Checkout login throughout the company. A few months ago I was sharing my Campfire login, on a particular domain, with a robot I was writing. Most people who read my blog are probably Ruby programmers - which means that on Twitter you're probably following both Err The Blog and the two guys who write it.

Err The Blog is "following" its authors.

Assuming a one-to-one mapping between users and people utterly disregards practical experience and common sense. One human could represent themselves to your system with several logins. Several humans could represent themselves as one login. Both these things could happen. Both these things probably will happen.
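If you did want to model this honestly, the data structure is a many-to-many relation, not a foreign key. A minimal sketch, with a hypothetical `Identities` class and made-up names; a real system rarely knows the "person" side at all, which is rather the point.

```python
from collections import defaultdict


class Identities:
    """Many-to-many mapping between people and logins (illustrative only)."""

    def __init__(self):
        self.people_by_login = defaultdict(set)
        self.logins_by_person = defaultdict(set)

    def link(self, person, login):
        self.people_by_login[login].add(person)
        self.logins_by_person[person].add(login)

    def shared_logins(self):
        """Logins with more than one human behind them."""
        return sorted(
            login for login, people in self.people_by_login.items()
            if len(people) > 1
        )

    def people_with_several_logins(self):
        """Humans who show up under more than one login."""
        return sorted(
            person for person, logins in self.logins_by_person.items()
            if len(logins) > 1
        )
```

Link `mom` and `dad` to one eBay address and `giles` to both a LiveJournal and a Digg login, and both lists come back non-empty. A schema with a single `user_id` column cannot represent either case.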

If you're designing a system which performs machine learning against the behavior of individuals, you don't need to know whether those individuals are humans, groups of humans, or either one only under particular circumstances. Mathematically, Bayes eliminates that question anyway. So you really don't need to know, and it's a waste of energy to even guess. Even if you could say logins and users mapped predictably to one another, it wouldn't be that useful. In between "one size fits all" and personal customization (for delusional values of "personal"), there's a huge range of community sizes and scales, all of which would give Google News a much better edge than its silly "Recommended stories" functionality.