Saturday, November 6, 2010

Blog Comment Similarity Detector (Free Code For Disqus)

A little while back, a Disqus plugin annoyed me in some way. I don't remember how, but I do remember that I tweeted Disqus about it, and they fixed it, mostly. I've decided to return the favor.

Yesterday, the MetaOptimize blog post "NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?" attracted a ton of retweets on Twitter. A Disqus plugin on the blog post adds those RTs to the post as "comments," using a system called BackType.

For instance, in this screenshot, the top "comment" comes from a guy called turian, and the bottom comment retweets him.

It would be easy to eliminate the pure, classic, literal RTs with a regular expression:

next if alleged_comment =~ /^RT @/

But if you look in the middle, there's a nearly identical tweet with no actual "RT" string. That's because it comes from @hntweets, which apparently tweets links found on Hacker News, and is apparently not the only such account. Here's a tweet from another such account, which Disqus also posted as a "comment" on the MetaOptimize blog post.

There was a third one, too, which I spotted, and probably others that I didn't. I don't know why so many people want to build Twitter bots that retweet links on Hacker News, but I don't see them stopping any time soon, either. Likewise, systems like BackType are systematically vulnerable to spam and noise; for instance, Disqus picked up my tweet complaining about their signal/noise problem as a "trackback."

A spammer could easily tweet a link to the post along with "ch3ap vi4gra here," and, more to the point, a spammer could easily write a script to identify every site that uses Disqus and another script to tweet "ch3ap vi4gra here" along with a link to every blog post on every such site.

There are also two tweets from the same guy, both referring to the same URL (once via a URL shortener and once directly):

This means that a spammer could easily fill an entire screen with "ch3ap vi4gra here," just by submitting the same URL to a variety of URL shorteners.

Any code which combs the firehose needs a noise filter, and the question is how to build it. The regular expression solution won't work here. You can't test for string equality, either. If you take a look at the text of these tweets, you'll notice a ton of very minor variations. Here's one in all lower case:

Here's another screenshot from the same blog post, further down the page, where Disqus posted a ton more tweets as alleged "trackbacks." Again, many feature very minor variations, and none are worth reading.
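To make the point about regular expressions and string equality concrete, here's a small JavaScript sketch. The tweet texts are made-up stand-ins modeled on the ones in the screenshots, the example.com URLs are placeholders, and the function names are hypothetical. It only shows how little those two filters catch; it isn't anything Disqus or BackType actually runs.

```javascript
// Made-up sample data modeled on the tweets described above.
var original = "NLP Challenge: Find semantically related terms http://example.com/a";

var tweets = [
  // A classic, literal retweet.
  "RT @turian: NLP Challenge: Find semantically related terms http://example.com/a",
  // The same text in all lower case.
  "nlp challenge: find semantically related terms http://example.com/a",
  // A bot-style repost: slightly different punctuation, different short URL.
  "NLP Challenge - Find semantically related terms http://example.org/b"
];

// Filter 1: catches only tweets that literally start with "RT @".
function isLiteralRetweet(text) {
  return /^RT @/.test(text);
}

// Filter 2: catches only character-for-character copies of the original.
function isExactCopy(text) {
  return text === original;
}

// Whatever gets past both filters would still show up as a "comment".
var survivors = tweets.filter(function (text) {
  return !isLiteralRetweet(text) && !isExactCopy(text);
});

console.log(survivors.length); // 2
```

Two of the three near-duplicates get through, which is why the filter has to measure similarity rather than test for a pattern or an exact match.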

I've had frustrations with Disqus before, so I should probably just solve this problem for myself with a Zepto bookmarklet that auto-hides the elements with the HTML ids Disqus typically uses. But that's not much help to Disqus, which means it's no way to return a favor. Plus, it's so easy that it's not worth blogging about. For the record, I've looked at the HTML, and it would probably be as simple as:


And the only bit you'd have to think about would be the process of including the Zepto library in a bookmarklet. I've never done that, but it's probably easy.

Because I like a challenge, and I want to create something Disqus can get some use out of, I've written a simple blog comment similarity detector. Since the Disqus code sets up the comments data as JSON, I wrote the similarity detector in Node.js, but I didn't bother to build a server; this is just command-line JavaScript. (Node.js has excellent command-line features.) I used the excellent underscore.js to supplement JavaScript's weaknesses as a language. I used Node v0.2.3, so if you have any trouble running this, just install the old version of Node. (I'm sorry but Node's API changes way too fast for me to give a shit.)

The code knows how to skip highly similar tweets like "foo http://bar" and "RT @baz: foo http://bar", but it doesn't know how to split text strings on hyphens, which would eliminate even more repetition, so if you want to understand this code, get in there and add that trivial feature. The code is simple, brief, and well-commented. It's 146 lines with sample data and comments, 37 lines without. Check it out on GitHub.
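If you just want the general idea without reading the repo, here's a rough sketch of one way to do this kind of filtering. To be clear, this is not the code on GitHub: it assumes a modern Node (it uses Set, which v0.2.3 doesn't have), it skips underscore.js, and the names wordSet, similarity, and dropNearDuplicates, the Jaccard word-overlap measure, and the 0.8 threshold are all my assumptions for illustration. Apart from the "foo http://bar" pair, the sample comments are made up. Because it splits on anything that isn't a letter or digit, it also splits on hyphens.

```javascript
// Reduce a comment to its distinct words: lowercase it, drop a leading
// "RT @user:" prefix and any URLs, then split on non-alphanumerics
// (which also splits on hyphens).
function wordSet(text) {
  var cleaned = text
    .toLowerCase()
    .replace(/^rt @\w+:?\s*/, "")
    .replace(/https?:\/\/\S+/g, " ");
  return new Set(cleaned.split(/[^a-z0-9]+/).filter(Boolean));
}

// Jaccard similarity: shared distinct words over total distinct words.
function similarity(a, b) {
  var shared = 0;
  a.forEach(function (word) { if (b.has(word)) shared += 1; });
  var total = a.size + b.size - shared;
  return total === 0 ? 1 : shared / total;
}

// Keep a comment only if it is not too similar to one already kept.
// The 0.8 default threshold is an arbitrary assumption.
function dropNearDuplicates(comments, threshold) {
  var limit = threshold === undefined ? 0.8 : threshold;
  var kept = [];
  comments.forEach(function (comment) {
    var words = wordSet(comment);
    var isRepeat = kept.some(function (entry) {
      return similarity(words, entry.words) >= limit;
    });
    if (!isRepeat) kept.push({ text: comment, words: words });
  });
  return kept.map(function (entry) { return entry.text; });
}

var sample = [
  "foo http://bar",
  "RT @baz: foo http://bar",
  "Semantically-related terms, explained http://example.com/1",
  "semantically related terms explained http://example.com/2",
  "an unrelated comment about something else"
];

// Keeps the first, third, and fifth comments.
console.log(dropNearDuplicates(sample));
```

Comparing every comment against every kept comment is quadratic, which is fine for one blog post's worth of tweets but would need rethinking for anything firehose-sized.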