Thursday, January 21, 2010

January AI/Machine Learning Wishlist

Apropos of the last link, I actually forked an NLP project written in C at Carnegie Mellon and put it on GitHub last year. I don't know C; I just made a tweak or two to make it easy to install the Ruby library which accesses this C project. I needed the library for a project to detect coherent comments on YouTube, which, as many people realize, is an epic needle-in-a-haystack problem.

I solved this problem, but I did it in a very hacky way. I had read just enough NLP to make me dangerous. I used Ruby-LinkParser to count the number of linkages in the link grammar and then simply set a minimum threshold, below which all comments would be considered incoherent. It wasn't really a measure of sentence coherence so much as of grammatical complexity, but on YouTube - where most "sentences" fail to contain even one linkage, or indeed one word of actual English - it came close enough to get the job done.

A sentence like:

It is, upon reflection, a discovery which could, under some circumstances, but not all, prove useful to other programmers and to other projects.

has many, many more places where a grammatical fragment links to another grammatical fragment than does a sentence like:

omg lol
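The threshold idea can be sketched in a few lines of Ruby. Everything here is illustrative rather than the project's actual code: the cutoff value is made up, and the `counter` lambda is a crude word-count stand-in for Ruby-LinkParser's real linkage count, just to keep the sketch self-contained.

```ruby
# Hypothetical sketch of the threshold filter. MIN_LINKAGES is an
# illustrative cutoff, not the real tuned value.
MIN_LINKAGES = 3

# A comment passes if its linkage count clears the threshold.
def coherent?(comment, counter, min = MIN_LINKAGES)
  counter.call(comment) >= min
end

# Crude stand-in: the real code would parse the comment with
# Ruby-LinkParser and count grammatical linkages, not words.
crude_counter = ->(text) { text.split.size }

coherent?("It is, upon reflection, a discovery which could prove useful.", crude_counter)
# => true
coherent?("omg lol", crude_counter)
# => false
```

Swapping the crude counter for a real linkage count changes nothing about the structure - the whole hack is just "compute a number, compare it to a cutoff."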

We kicked around the idea of using Bayesian classifiers for this project, but I steered the discussion away from that approach as far and as fast as I could. I don't know Bayes inside-out, but I do know a researcher who worked on some of the top Bayesian research out there, and I have it on very good authority that Bayesian networks require a lot of data to deduce even the simplest, most obvious correlations.

Peter Norvig, the Google scientist who wrote two fantastic books on AI and Lisp, is so into Bayes that he's popularized the "philosophical Bayesian" viewpoint. At a company like Google - which has more data on its hands than any other non-human entity has ever had in the known history of the planet - it's probably very easy for probability to become that magic hammer which makes every problem look like a nail. For example, many people have read Norvig's brief and powerful explanation of how it's easier to write a spellchecker around probability than it is to write it around a dictionary.
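The core of that probabilistic spellchecker fits in a few lines of Ruby. This is a toy sketch of the idea, not Norvig's code: generate every candidate one edit away from the misspelled word, keep the ones the corpus has seen, and pick the most frequent. The tiny `CORPUS` here is made up for illustration; the real thing trains on a large body of text.

```ruby
# Toy word-frequency "model" - illustrative only; Norvig's version
# counts words in a large training corpus.
CORPUS = Hash.new(0)
%w[the quick brown fox jumps over the lazy dog the].each { |w| CORPUS[w] += 1 }

LETTERS = ('a'..'z').to_a

# All strings one edit (delete, transpose, replace, insert) away from `word`.
def edits1(word)
  splits = (0..word.length).map { |i| [word[0...i], word[i..]] }
  deletes    = splits.reject { |_, r| r.empty? }.map { |l, r| l + r[1..] }
  transposes = splits.select { |_, r| r.length > 1 }.map { |l, r| l + r[1] + r[0] + r[2..] }
  replaces   = splits.reject { |_, r| r.empty? }
                     .flat_map { |l, r| LETTERS.map { |c| l + c + r[1..] } }
  inserts    = splits.flat_map { |l, r| LETTERS.map { |c| l + c + r } }
  (deletes + transposes + replaces + inserts).uniq
end

# Pick the known candidate the corpus has seen most often; fall back
# to the word itself if nothing better is known.
def correct(word)
  return word if CORPUS.key?(word)
  known = edits1(word).select { |w| CORPUS.key?(w) }
  known.max_by { |w| CORPUS[w] } || word
end

correct("teh") # => "the"
```

No dictionary of spelling rules anywhere - just counts and a max. That's the whole trick, and it works astonishingly well when you have enough data.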

However, if you've noticed that Google has in the past few years gotten less useful, it's because they lean too hard on Bayes and other probabilistic methods. Try googling "-background CSS" and finding a search result with the string '-background' in it. Even if you set the advanced search options to "this exact string", it won't happen. The probability that you typed that dash by accident is so much higher than the probability that you meant it literally that Google doesn't bother telling you - especially since the probability that you literally meant "this exact string" when you requested it is actually lower than the probability that the little dash was a typo. The result: Google's useless if you need to find an unusual string which is very similar to a much more common one.

I didn't want to write this YouTube coherency detector around probability - partly because we weren't dealing with a Google-style volume of data, partly because a system which requires lots of data to draw conclusions will never be able to draw conclusions about coherent YouTube comments, since coherent YouTube comments are rare, but mostly because the problem space had literally nothing to do with Bayesian classifiers or what they're good at. I was able to get the desired result with a totally ghetto hack because I had spent a lot of time before that reading about AI and machine learning. Totally ghetto hacks can get you a long way in AI, as long as you know where and when to use them.

Anyway, I love AI, so I can't wait to get these books, but one word of caution: getting into this type of programming can make everything else look kinda petty and stupid.

Some links in this blog post are affiliate links, which pay small sales commissions.