Tuesday, July 27, 2010

How To Annoy Co-Workers With Extremely Lazy AI

I have annoyed many co-workers in many different ways over a long and varied career. I once annoyed a co-worker by obstructing his path with a Furby, a lightsaber, and a small crowd of people who had gathered to see me strike the Furby with the lightsaber. I only annoyed this co-worker slightly; on other occasions, I annoyed other co-workers more.

One way I've annoyed a co-worker in the past: I failed to open source a particular piece of code. We had a client which, for various reasons, needed to find coherent, grammatical, and positive comments on YouTube videos. This was an epic needle-in-a-haystack problem - and we had to solve it before the invention of the first-ever working sarcasm detector.



When I say "we had to solve it," what I really mean is "I had to solve it." Here's how I did it. First, I ran a comment downloader. YouTube rate-limits API calls; to get around this, the comment downloader slept briefly between API calls and kept downloading until no more comments were available. This often produced corpora of thousands of comments. I used Python for the comment downloader, because of the easy availability of Google API client libraries, and switched to Ruby to process the corpus. (If you know anything about AI, you know that's crazy, but hey, I did it, and it worked.)
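The download loop itself was dead simple. The original was Python, for the Google client libraries; here's its shape sketched in Ruby to match the rest of this post, with the GData feed URL and JSON layout reconstructed from memory rather than from the actual code - illustrative, not authoritative:

require 'net/http'
require 'uri'
require 'json'

# Page through a video's comments, sleeping between requests to stay
# under the rate limit, until no more comments come back. The URL and
# JSON shape below are YouTube's long-retired GData feed as I remember
# it, so treat them as stand-ins.
def download_comments(video_id, delay = 1.0, page_size = 50)
  comments = []
  start_index = 1
  loop do
    uri = URI("http://gdata.youtube.com/feeds/api/videos/#{video_id}/comments" \
              "?alt=json&start-index=#{start_index}&max-results=#{page_size}")
    feed = JSON.parse(Net::HTTP.get(uri))
    entries = feed.fetch('feed', {}).fetch('entry', [])
    break if entries.empty? # no more comments available
    comments.concat(entries.map { |e| e['content']['$t'] })
    start_index += entries.size
    sleep delay # don't hammer the API
  end
  comments
end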

You might imagine that Bayesian networks would be the next place to take this large corpus. You would be imagining wrong. We tried that, and it was an epic timesink. Instead, I first ran the comments against a blacklist filled with slurs, zomgs, and <3s; grading for positiveness also used regexes.
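Something in this spirit - the real lists were longer and ruder, and these patterns are stand-ins rather than the originals:

# Illustrative stand-ins; the actual word lists are long gone.
BLACKLIST = [
  /\bzomg+\b/i,
  /<3/,
  # ...the slurs went here; I'll spare you...
]

POSITIVE = [
  /\b(great|awesome|love(d|ly)?|beautiful|amazing)\b/i,
]

def clean?(comment)
  BLACKLIST.none? { |re| comment =~ re }
end

def positive?(comment)
  POSITIVE.any? { |re| comment =~ re }
end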

Next I ran it through linkparser, my fork of Ruby-LinkParser, which differs from the original only in the addition of some install documentation and a few changes to the C code which make installation possible (and which I really just guessed my way through). Ruby-LinkParser describes itself thus:

A high-level interface to the CMU Link Grammar.

This binding wraps the link-grammar shared library provided by the AbiWord project for their grammar-checker.


Using this wrapper, it's trivial to generate counts of grammatical linkages for arbitrary sentences. The AbiWord docs provide links to the original white papers; here, I'll just summarize. In a sentence like "I farted and my brain exploded," "and" is a grammatical link. It links the two sub-sentences, or, more accurately, the two sentence-like sub-structures. There aren't a lot of grammatical links there, because it's a simple sentence. Many, many more grammatical links lurk in a sentence like "I farted, which I did to expel gas (which relieved cramping in my stomach and anus), and my brain, which I use mainly to remember or imagine what various women look like with no clothes on, exploded, which inconvenienced me considerably, and incurred substantial medical costs, because, despite the trivial and/or foolish uses to which my brain is usually put, it remains a finely tuned machine which is difficult to replace and expensive to repair."
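Counting those links looks something like this - with the caveat that I'm reconstructing the gem's method names from memory, so treat it as a sketch:

require 'linkparser'

dict = LinkParser::Dictionary.new

# A simple sentence yields only a handful of links; the monster
# sentence above yields a great many more. (If memory serves, the
# gem's Sentence delegates #links to its first linkage; if not,
# sentence.linkages.first.links gets you the same array.)
sentence = dict.parse("I farted and my brain exploded.")
puts sentence.links.size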

I'm no linguist, and it's been a while since I read the papers, but I believe the grammatical linkages in that sentence hang mostly off its connectives: the "which"es, the "and"s, the "because," the "despite," and so on.

I'm actually over-simplifying here - linkages can be grammatical constructs as well as words - but you get the idea: the longer sentence has many more grammatical linkages. (It would also have failed to qualify on positivity grounds, due to the presence of the word "fart," but it's just an example.) What I found, after isolating a ton of good comments and bad comments from our corpus to use as examples, was that the only comments with significant numbers of grammatical linkages also happened to be coherent. A very few coherent sentences had low link counts, but the vast majority had high ones, and it was very, very easy to isolate a threshold value for link counts above which no incoherent sentence ever passed the filter.
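The threshold-hunting itself was nothing fancy: run the hand-sorted examples through the parser and see where the piles separate. Something like this, where the two example arrays stand in for the real hand-labeled piles:

require 'linkparser'

dict = LinkParser::Dictionary.new

good_comments = ["This documentary changed the way I think about food, and I loved every minute of it."]
bad_comments  = ["zomg lol", "FIRST!!!1"]

# Unparseable comments count as zero links.
link_count = lambda do |text|
  begin
    dict.parse(text).links.size
  rescue StandardError
    0
  end
end

puts "good: #{good_comments.map(&link_count).min} and up"
puts "bad:  #{bad_comments.map(&link_count).max} and down"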

So that's what I did. And my co-workers were like, hey, you should open source that, and I was like, well, it's a bit too dumb. Which was kind of obnoxious of me, because at the same time, my boss was like, holy shit, it's magic. This was a while ago, and I don't think I have the code any more, but I can at least give the world an explanation of how it worked, so that's what I just did. The magic happens in one line that looks roughly like this:

parse(sentence).links.size > 30

I think I also had a begin/rescue block in there, because "zomg lol" has zero grammatical linkages and will in fact make the link-parser crap its pants. One further caveat: the process needed occasional restarts, because the link-parser was an academic research project, not designed for production use, and seemed to spring a memory leak every now and then.
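Put together, the whole coherence check was roughly the following - a reconstruction, not the original code:

require 'linkparser'

DICT = LinkParser::Dictionary.new
LINK_THRESHOLD = 30 # "roughly", per the line above

def coherent?(comment)
  # "zomg lol" produces no linkage at all, and the parser raises
  # rather than returning zero - hence the rescue. I rescue broadly
  # here because I no longer remember the gem's exact error class.
  DICT.parse(comment).links.size > LINK_THRESHOLD
rescue StandardError
  false
end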

By the way, I apologize for the fairly blatant bait-and-switch in my title; I didn't use the extremely lazy AI to directly annoy my co-workers. However, if you want to see how I used extremely lazy AI to directly annoy my former roommate, by making fun of her, you should check out TiffyDialogBot, a Python experiment from 2005.



Update:



...and if you want to see actual code, you should check out SkippyTalkBot.