Thursday, November 30, 2006

Clay Shirky on RIAA & Encryption

Old post, I think from 2003, but utterly excellent. Clay Shirky, it's like if he picks up a bat, you know he's gonna hit it out of the park.

The RIAA Succeeds Where The Cypherpunks Failed

Wednesday, November 29, 2006

Refactoring Rails Code

Today I had a request for a code sample. I kind of went overboard with it and sent three samples. The first, I won't describe it yet. The second was my JavaScript Matrix falling text effect from 2004. The third was some code from a recent project which had some very cool aspects, but in the process of submitting it, I noticed some flaws, and they've been bugging me for hours, so I'm going to clean them up here, and blog it, and maybe it's going to be interesting.

Before I go any further, tho, Blogger seems to be doing all kinds of things that are making this a bit inconvenient, like going insane whenever it sees a less-than sign. So I had to use ? for less-than.

Anyway, here's the original code:

# http://justinfrench.com/index.php?id=122
class UserAreaController ? ApplicationController

  def index
  end

  def drugs
  end

  def genes
  end

  def references
  end

  def render_drugs
    drugs, genes = tokenize(params[:function_call])
    drugnames = []
    drugs.each do |drug|
      if drug.is_a? Array
        drugnames.push(drug[0])
      end
    end
    # this creates an instance var for each involvement type; e.g., Inducer --> @inducers
    Interaction.find(:all, :include => "drug").group_by(&:involvement_type).each do |involvement, list|
      list.reject! {|interaction| not drugnames.include? interaction.drug.name.downcase}
      instance_variable_set( "@#{involvement.downcase.pluralize}", create_drug_links(list) )
    end
  end

  def render_genes
    # we only want to see genes shown in the Flash display
    drugs, genes = tokenize(params[:function_call])
    all_genes = Gene.find_all
    genes_in_display = []
    genes.each do |gene|
      genes_in_display.push(all_genes.reject {|x| x.name != gene})
    end
    # hack! the above returns an array of arrays, this fixes it...
    @genes = []
    genes_in_display.each {|x| @genes.push(x[0])}
  end

  def render_references
    # whatever the central thing in the Flash display is, we only
    # want to see references for that
    drugs, genes = tokenize(params[:function_call])
    @references = []
    case params[:function_call][0..0]
    when "p" # patients diagram
    when "g" # genes diagram
      @references = Gene.find_by_name(genes[0]).referencelinks
    when "d" # drugs diagram
      @references = Drug.find_by_name(drugs[0]).referencelinks
    end
  end

  private
  def tokenize(function_call)
    drugs = []
    genes = []
    case function_call[0..0]
    when "p" # patients diagram
      function_call.scan(/name:'([^']+)',interaction:'([^']+)/).each do |name, interaction|
        drugs.push(name)
      end
    when "g" # genes diagram
      matchdata = function_call.match(/geneDrugDiagram\('(.+)',\[/)
      genes.push(matchdata[1][3..99])
      function_call.scan(/name:'([^']+)',interaction:'([^']+)/).each do |name, interaction|
        drugs.push([name, interaction])
      end
    when "d" # drugs diagram
      matchdata = function_call.match(/drugGeneDrugDiagram\('(.+)',\[/)
      drugs.push(matchdata[1])
      function_call.scan(/name:'([^']+)',interaction:'([^']+)/).each do |name, interaction|
        case name
        when /CYP|PGP|UGT/ # use phase to identify genes, then...
          genes.push(name[3..99]) # remove phase, irrelevant for our purposes here
        else
          drugs.push([name, interaction])
        end
      end
    end
    return drugs, genes
  end

end


The first change I made was kind of a show-off thing, using Procs.


  case function_call[0..0]
  when "p" # patients diagram
    get = Proc.new {|name, interaction| drugs.push(name)}
  when "g" # genes diagram
    matchdata = function_call.match(/geneDrugDiagram\('(.+)',\[/)
    genes.push(matchdata[1][3..99])
    get = Proc.new {|name, interaction| drugs.push([name, interaction])}
  when "d" # drugs diagram
    matchdata = function_call.match(/drugGeneDrugDiagram\('(.+)',\[/)
    drugs.push(matchdata[1])
    get = Proc.new do |name, interaction|
      case name
      when /CYP|PGP|UGT/ # use phase to identify genes, then...
        genes.push(name[3..99]) # remove phase, irrelevant for our purposes here
      else
        drugs.push([name, interaction])
      end
    end
  end
  function_call.scan(/name:'([^']+)',interaction:'([^']+)/).each &get
  return drugs, genes
end


This eliminates the repetition of the regular expression. Good thing. But as soon as I did it, I realized it would be nicer if I could do something like:


function_call.scan(/name:'([^']+)',interaction:'([^']+)/).each filter


...where filter would be a method just returning whichever Proc was most appropriate.
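
A rough plain-Ruby sketch of that idea, outside Rails: a method returns whichever Proc fits, and the `&` operator hands it to `each` as a block. The names `filter_for`, `collect_drugs` and `SAMPLE` are made up for illustration, and the sample string is only a guess at the shape the regular expression expects. Note that the `&` is needed; without it the Proc would be passed as an ordinary argument rather than as the block.

```ruby
PAIR_PATTERN = /name:'([^']+)',interaction:'([^']+)/

# Returns a different Proc depending on the diagram type.
# `kind` and `drugs` are hypothetical parameters for this sketch.
def filter_for(kind, drugs)
  case kind
  when "p" then proc { |name, _interaction| drugs.push(name) }
  else          proc { |name, interaction| drugs.push([name, interaction]) }
  end
end

# Scans the string once and lets the chosen Proc decide what to keep.
def collect_drugs(function_call, kind)
  drugs = []
  function_call.scan(PAIR_PATTERN).each(&filter_for(kind, drugs))
  drugs
end

# Made-up input in the shape the regular expression expects.
SAMPLE = "x([{name:'aspirin',interaction:'Inducer'},{name:'codeine',interaction:'Substrate'}])"

collect_drugs(SAMPLE, "p") # => ["aspirin", "codeine"]
collect_drugs(SAMPLE, "g") # => [["aspirin", "Inducer"], ["codeine", "Substrate"]]
```

The sketch uses `proc` rather than `lambda` so that each two-element pair coming out of `scan` is spread across the two block parameters.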

I also noticed some really obvious flaws.

The first thing, if you look at the action names, they go like this: index, drugs, genes, references; render_drugs, render_genes, render_references. So the obvious thing is to separate those render_* actions. Make it a RenderController.

The next thing is that right now it uses a system where the UserAreaController inherits from ApplicationController and then controllers in the user area inherit from UserAreaController. That approach is actually very much discouraged by the Rails core team. Additionally, I've seen it work perfectly on one server and then blow up on another server, with no obvious reason for the change in behavior. So, the first question is, how do you preserve that utility, while using a different approach?

I don't have time to test this part, but my theory is you can do it with modules named UserArea and AdminArea, and instead of inheriting from UserAreaController or AdminAreaController, you just mix in the area-specific code with include AdminArea or whatever.
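
Untested against Rails itself, but the mixin idea can be sketched in plain Ruby. Everything here is a stand-in: the tiny `ApplicationController` only imitates `before_filter` so the example runs on its own, and `require_user` and `area_name` are hypothetical helpers. The code uses a real less-than sign for inheritance.

```ruby
# Stand-in for Rails' ApplicationController so the sketch runs on its own.
class ApplicationController
  def self.before_filters
    @before_filters ||= []
  end

  def self.before_filter(name)
    before_filters << name
  end
end

# Hypothetical area module: shared helpers plus any class-level setup.
module UserArea
  # Ruby calls this hook when a class does `include UserArea`.
  def self.included(base)
    base.before_filter :require_user
  end

  def require_user
    :user_checked
  end

  def area_name
    "user"
  end
end

class RenderController < ApplicationController
  include UserArea
end

RenderController.before_filters      # => [:require_user]
RenderController.new.area_name       # => "user"
```

The `included` hook is what lets the module carry class-level setup as well as instance methods, which is roughly the utility the intermediate controller class was providing.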

Anyway, let's say for the sake of argument we get that sorted out and we have it nicely refactored down to this.


class RenderController ? ApplicationController
  include UserArea

  def drugs
    drugs, genes = tokenize(params[:function_call])
    drugnames = []
    drugs.each do |drug|
      if drug.is_a? Array
        drugnames.push(drug[0])
      end
    end
    # this creates an instance var for each involvement type; e.g., Inducer --> @inducers
    Interaction.find(:all, :include => "drug").group_by(&:involvement_type).each do |involvement, list|
      list.reject! {|interaction| not drugnames.include? interaction.drug.name.downcase}
      instance_variable_set( "@#{involvement.downcase.pluralize}", create_drug_links(list) )
    end
  end

  def genes
    # we only want to see genes shown in the Flash display
    drugs, genes = tokenize(params[:function_call])
    all_genes = Gene.find_all
    genes_in_display = []
    genes.each do |gene|
      genes_in_display.push(all_genes.reject {|x| x.name != gene})
    end
    # hack! the above returns an array of arrays, this fixes it...
    @genes = []
    genes_in_display.each {|x| @genes.push(x[0])}
  end

  def references
    # whatever the central thing in the Flash display is, we only
    # want to see references for that
    drugs, genes = tokenize(params[:function_call])
    @references = []
    case params[:function_call][0..0]
    when "p" # patients diagram
    when "g" # genes diagram
      @references = Gene.find_by_name(genes[0]).referencelinks
    when "d" # drugs diagram
      @references = Drug.find_by_name(drugs[0]).referencelinks
    end
  end

  private
  def tokenize(function_call)
    drugs = []
    genes = []
    case function_call[0..0]
    when "p" # patients diagram
      get = Proc.new {|name, interaction| drugs.push(name)}
    when "g" # genes diagram
      matchdata = function_call.match(/geneDrugDiagram\('(.+)',\[/)
      genes.push(matchdata[1][3..99])
      get = Proc.new {|name, interaction| drugs.push([name, interaction])}
    when "d" # drugs diagram
      matchdata = function_call.match(/drugGeneDrugDiagram\('(.+)',\[/)
      drugs.push(matchdata[1])
      get = Proc.new do |name, interaction|
        case name
        when /CYP|PGP|UGT/ # use phase to identify genes, then...
          genes.push(name[3..99]) # remove phase, irrelevant for our purposes here
        else
          drugs.push([name, interaction])
        end
      end
    end
    function_call.scan(/name:'([^']+)',interaction:'([^']+)/).each &get
    return drugs, genes
  end

end


One thing that's easy to clean up: the tokenize() call happens in every action. So let's just put that in a before filter.


class RenderController ? ApplicationController
  include UserArea
  before_filter :tokenize

  ...

  private
  def tokenize(function_call = params[:function_call])
    @drugs = []
    @genes = []
    case function_call[0..0]
    when "p" # patients diagram
      get = Proc.new {|name, interaction| @drugs.push(name)}
    when "g" # genes diagram
      matchdata = function_call.match(/geneDrugDiagram\('(.+)',\[/)
      @genes.push(matchdata[1][3..99])
      get = Proc.new {|name, interaction| @drugs.push([name, interaction])}
    when "d" # drugs diagram
      matchdata = function_call.match(/drugGeneDrugDiagram\('(.+)',\[/)
      @drugs.push(matchdata[1])
      get = Proc.new do |name, interaction|
        case name
        when /CYP|PGP|UGT/ # use phase to identify genes, then...
          @genes.push(name[3..99]) # remove phase, irrelevant for our purposes here
        else
          @drugs.push([name, interaction])
        end
      end
    end
    function_call.scan(/name:'([^']+)',interaction:'([^']+)/).each &get
  end

end


That was pretty easy. The only complications: I had to add a default parameter for tokenize(), and the drugs and genes arrays had to become instance variables, which changed some names in the actions that use them. No big deal, though. The genes action, that's still pretty hacky, but we can fix it later.
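
For what it's worth, the array-of-arrays hack in the genes action could probably be avoided by looking up each displayed name directly. A minimal sketch in plain Ruby, with a Struct standing in for the Gene model (so no database call here) and a hypothetical helper name, `genes_in_display`:

```ruby
# Stand-in for the ActiveRecord Gene model; only `name` matters here.
Gene = Struct.new(:name)

# For each name shown in the display, picks the first gene with that name.
# Returns a flat array, so no second pass is needed to unwrap nested arrays.
def genes_in_display(all_genes, displayed_names)
  displayed_names.map { |name| all_genes.find { |gene| gene.name == name } }.compact
end

all_genes = [Gene.new("2D6"), Gene.new("3A4"), Gene.new("1A2")]
genes_in_display(all_genes, ["1A2", "2D6"]).map(&:name) # => ["1A2", "2D6"]
```

One behavioral difference to be aware of: `compact` drops displayed names with no matching gene, where the original code would leave a nil in `@genes`.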

The cool part is, now that there's structure to support it, the Lispy approach becomes viable. To use the Lispy style, all you really need is lambda.


  private
  def tokenize(function_call = params[:function_call])
    @drugs = []
    @genes = []
    function_call.scan(/name:'([^']+)',interaction:'([^']+)/).each &filter
  end

  def filter
    function_call = params[:function_call]
    case function_call[0..0]
    when "p" # patients diagram
      lambda {|name, interaction| @drugs.push(name)}
    when "g" # genes diagram
      matchdata = function_call.match(/geneDrugDiagram\('(.+)',\[/)
      @genes.push(matchdata[1][3..99])
      lambda {|name, interaction| @drugs.push([name, interaction])}
    when "d" # drugs diagram
      matchdata = function_call.match(/drugGeneDrugDiagram\('(.+)',\[/)
      @drugs.push(matchdata[1])
      lambda do |name, interaction|
        case name
        when /CYP|PGP|UGT/ # use phase to identify genes, then...
          @genes.push(name[3..99]) # remove phase, irrelevant for our purposes here
        else
          @drugs.push([name, interaction])
        end
      end
    end
  end

end


Now it's still kinda messy, and in fact I'm thinking a more OOP approach might be cleaner. The problem there is that this system has diagrams in Flash, and it really needs diagram objects in ActionScript, on the Flash side, and it may possibly need them in both Ruby and JavaScript as well. But all those diagram objects imply a lot of repetition. That's an interesting question, a complex UI can kind of hammer Rails' super-DRY-ness, but it's a totally different topic, so let's put it aside for now.

Getting back to this code, the reason the Lispy style might not be appropriate here is that there's a lot of dependence on side effects. Even though I am returning lambdas, there are initialization-ish steps that take place prior to the lambdas being returned, and those steps might make more sense in an object instead. Might be fewer loose ends that way. I'm not sure.
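
One possible shape for the object version, purely as a sketch: the setup steps and the per-pair handling both live in one small class, and the results come out through plain readers. The class name `FunctionCallTokens` and the sample strings are made up; the regular expressions and the `[3..99]` slicing are copied from the controller code above.

```ruby
# Hypothetical object version of tokenize/filter.
class FunctionCallTokens
  PAIR = /name:'([^']+)',interaction:'([^']+)/

  attr_reader :drugs, :genes

  def initialize(function_call)
    @drugs = []
    @genes = []
    @kind = function_call[0..0] # "p", "g" or "d", as in the controller
    add_central_item(function_call)
    function_call.scan(PAIR).each { |name, interaction| add_pair(name, interaction) }
  end

  private

  # The "initialization-ish" step: record whatever sits at the center of the diagram.
  def add_central_item(function_call)
    case @kind
    when "g"
      @genes.push(function_call.match(/geneDrugDiagram\('(.+)',\[/)[1][3..99])
    when "d"
      @drugs.push(function_call.match(/drugGeneDrugDiagram\('(.+)',\[/)[1])
    end
  end

  # The part the lambdas used to do, one name/interaction pair at a time.
  def add_pair(name, interaction)
    case @kind
    when "p"
      @drugs.push(name)
    when "g"
      @drugs.push([name, interaction])
    when "d"
      if name =~ /CYP|PGP|UGT/
        @genes.push(name[3..99]) # drop the three-character prefix, as above
      else
        @drugs.push([name, interaction])
      end
    end
  end
end

# Made-up input in the shape the regular expressions expect.
tokens = FunctionCallTokens.new(
  "drugGeneDrugDiagram('warfarin',[{name:'CYP2C9',interaction:'Substrate'},{name:'aspirin',interaction:'Inhibitor'}])"
)
tokens.drugs # => ["warfarin", ["aspirin", "Inhibitor"]]
tokens.genes # => ["2C9"]
```

A before filter could then build one of these and keep it in a single instance variable, instead of the filter method mutating `@drugs` and `@genes` on its way to returning a lambda.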

On the other hand, I'm definitely happier with it than I was when I started.

I'd do some more, but Blogger's formatting issues are making me nutty.

Going back to the multiple objects in multiple languages thing, I actually did a little presentation on why I think a complex UI can present problems for Rails' otherwise very elegant structure, and I think it's going to be turned into a podcast, so I'll link it here when that happens.

The funny thing is, I fired the clients I wrote this code for, but I'm probably going to continue working on it for months. The stuff in the podcast, I'll blog about it at some point, I think, it's pretty challenging. It raises some really interesting questions and I don't know what the answers will be. I do know it'll involve Object#to_json.

Anyway, I should point out, the stuff with group_by() and instance_variable_set(), I got a lot of help from the Ruby-Talk mailing list on that one, especially from David Black (of Ruby For Rails). So, some gratitude there, a shout-out. Ruby-Talk in the house. It's actually pretty interesting, so I'm going to take a quick look at it.


# this creates an instance var for each involvement type; e.g., Inducer --> @inducers
Interaction.find(:all, :include => "drug").group_by(&:involvement_type).each do |involvement, list|
  list.reject! {|interaction| not drugnames.include? interaction.drug.name.downcase}
  instance_variable_set( "@#{involvement.downcase.pluralize}", create_drug_links(list) )
end


What this does is very simple. First line, it finds all interactions, eager-loads the drug each one is linked to, and groups them by involvement type, so the block gets each involvement type along with its list of interactions. Second line prunes interactions whose drugs aren't in the display. Third line creates an instance variable named after the involvement type, and assigns it the drug links built from the list of interactions which have that involvement type. So you automatically get instance vars representing the category you want to deal with, and containing all the relevant data for that category.
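
The same pattern can be tried outside Rails. This is a sketch under a few assumptions: Structs stand in for the models, appending "s" is a crude substitute for ActiveSupport's pluralize, and since create_drug_links isn't shown in this post, the sketch just stores the drug names instead. `DrugsPage` is a made-up holder class.

```ruby
# Stand-ins for the ActiveRecord models; field names follow the controller code.
Drug = Struct.new(:name)
Interaction = Struct.new(:involvement_type, :drug)

class DrugsPage
  def initialize(interactions, drugnames)
    interactions.group_by(&:involvement_type).each do |involvement, list|
      # Keep only interactions whose drug is in the display.
      relevant = list.select { |interaction| drugnames.include?(interaction.drug.name.downcase) }
      # "Inducer" becomes @inducers, "Inhibitor" becomes @inhibitors, and so on.
      instance_variable_set("@#{involvement.downcase}s", relevant.map { |interaction| interaction.drug.name })
    end
  end
end

INTERACTIONS = [
  Interaction.new("Inducer", Drug.new("Rifampin")),
  Interaction.new("Inhibitor", Drug.new("Ketoconazole")),
  Interaction.new("Inducer", Drug.new("Aspirin"))
]

page = DrugsPage.new(INTERACTIONS, ["rifampin", "ketoconazole"])
page.instance_variables                  # => [:@inducers, :@inhibitors]
page.instance_variable_get(:@inducers)   # => ["Rifampin"]
```

No variable is created for an involvement type that never appears in the data, which is the property described above.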

I re-used this pattern later on, in another piece of code, which I hope to blog about later. It's a very handy way to build variables for Rails views, because it automatically provides you with only the variables that will actually contain data. The original version didn't include the list-pruning line, so it was only three lines of code, one block. It replaced something like 15 lines, and that's counting whitespace.

The only pitfall with this pattern, of course, is that if you use it, you have to watch out for views which contain references to instance vars which don't exist. I do have a solution for this in the other piece of code based on this idea, but it's a whole nother topic.
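
This is not the solution mentioned above, which isn't shown in this post. It's just one simple way the pitfall could be handled, sketched in plain Ruby: seed every variable the views refer to with an empty array before filling in the ones that have data. The list of involvement types and the `SafeDrugsPage` class are hypothetical.

```ruby
# Hypothetical list of every involvement type the views refer to.
INVOLVEMENT_TYPES = %w[Inducer Inhibitor Substrate]

class SafeDrugsPage
  # `grouped` maps an involvement type to its list, e.g. the result of group_by.
  def initialize(grouped)
    # Seed every expected variable first, so a missing category is [] rather than nil.
    INVOLVEMENT_TYPES.each { |type| instance_variable_set("@#{type.downcase}s", []) }
    grouped.each { |type, list| instance_variable_set("@#{type.downcase}s", list) }
  end
end

page = SafeDrugsPage.new("Inducer" => ["rifampin"])
page.instance_variable_get(:@inducers)   # => ["rifampin"]
page.instance_variable_get(:@inhibitors) # => []
```

The trade-off is that the views can no longer tell "no data" from "variable never set", and the list of types has to be kept in sync by hand.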

Tuesday, November 28, 2006

Rails Scalability: Real-World Solutions

Ezra "Brainsplat" Zygmuntowicz made an absolutely awesome post to the Ruby-Talk list tonight.

If your boss or your clients have asked you about scaling Rails apps, which practically every boss or client does, even the cool ones, you should get his book, forthcoming from the Pragmatic Programmers, but until then you can just quote him from here:


In something like rails you have the session around for state between requests. But you can also run a drb (distributed ruby) daemon to do longer tasks in an asynchronous way to increase speed. In effect offload any time consuming tasks to a background daemon and let the http request return right away thru an xmlhttprequest. Then polling to check the status of jobs. These daemons can be available to all your ruby processes running your application code.

The best way to obtain high throughput in ruby web applications is to add more processes behind a http or fcgi proxy. This is how rails and other frameworks scale. You add more processes to the cluster and they share state through the database or other means like memcached or drb.

...

There is an erb compatible alternative that is 3 times faster than ERB and 10-15% faster than the C eruby and it is written in pure ruby. It's called erubis:

http://www.kuwata-lab.com/erubis/

I also want to mention a project I am working on. It's called Merb (mongrel+erb):

http://merb.devjavu.com/
http://svn.devjavu.com/merb/README

Merb is a faster, lightweight replacement for ActionPack which is the VC layer for the rails MVC. Merb still uses ActiveRecord for database persistence. But it can also use Og or Mongoose (pure ruby db). It is integrated into mongrel for http serving and has its own controller and view abstraction with sessions, filters and erb. It is just a lot smaller and closer to the metal than ActionPack. I wrote it mainly to use in conjunction with rails applications. To have a small merb app stand in for performance sensitive portions of an application.

ActionPack is not thread safe and requires a mutex around the entire dispatch to rails. This can cause problems with file uploads. Because each file upload blocks an entire rails app server for the duration of the upload. This means that if you have numerous users uploading large files all at once, you will need an app server instance for each concurrent upload(!). This was one of the original reasons I made merb. It has its own mime parser and does not use cgi.rb or anything else that makes actionpack non thread safe. So it can process many concurrent file uploads or requests at one time in one multi threaded app server mongrel process. Merb does use a mutex for parts of the request that can be calling out to ActiveRecord code because although ActiveRecord is thread safe, it does not perform better than single threaded mode and does cause some other problems. So all of the header and mime parsing is handled in thread safe sections of the code and only uses a mutex for sections of code that call the database. ActionPack has a mutex around all mime body parsing as well as everything else actionpack does to serve one request.

You mention you would rather build most of your own framework to be closer to the metal. But you may want to look at merb and see if you want to work on it with me. I plan on continuing its development and it is being used in heavy production already. Augmenting rails applications for faster response times and file uploads.

...

I also find that Xen virtualization works very well for scaling ruby applications. Scaling ruby apps usually means adding more application servers and maxing out your database servers. Also caching plays an important role as well. Anything that can be cached to static files or even partial caching or using memcached for expensive sections of code can yield big performance gains. Using a number of Xen virtual machines with a shared filesystem like gfs can make it easy to scale your ruby applications pretty much horizontally. You just end up pushing the persistence into the database, memcached or drb and trying to use the "shared nothing" approach for as many portions of the system as you can.

In an application stack like this adding nodes to the app server cluster is easy and gives you very good scalability up or down. Ruby is really a small part of a technology stack like this. There are lots of other places to optimize performance. We have built a custom Gentoo distribution that is tailored to running ruby application at optimal performance in Xen instances. I hope to release this distro as soon as I get some free time to package it up.

Cheers-

-- Ezra Zygmuntowicz
-- Lead Rails Evangelist
-- Engine Yard, Serious Rails Hosting

(Reposted here with Ezra's permission.)
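
The "hand off the slow work, return right away, then poll" idea from the quoted post can be sketched in a few lines. To be clear about the assumptions: this is not Ezra's code, and it does not use drb. A Thread inside one process stands in for the separate background daemon only so the sketch runs on its own, and `JobRunner` and its methods are made-up names.

```ruby
# In-process stand-in for a background daemon. A real setup would run the
# work in a separate process; a Thread is used here only for illustration.
class JobRunner
  def initialize
    @lock = Mutex.new
    @jobs = {}    # id => { status:, result: }
    @threads = {} # id => Thread
  end

  # Starts the work and returns a job id immediately, without waiting.
  def submit(&work)
    id = @lock.synchronize do
      new_id = @jobs.size + 1
      @jobs[new_id] = { status: :running, result: nil }
      new_id
    end
    @threads[id] = Thread.new do
      value = work.call
      @lock.synchronize { @jobs[id] = { status: :done, result: value } }
    end
    id
  end

  # What a polling request would call to check on a job.
  def status(id)
    @lock.synchronize { @jobs.fetch(id)[:status] }
  end

  def result(id)
    @lock.synchronize { @jobs.fetch(id)[:result] }
  end

  # Only for the demo: block until the job's thread finishes.
  def wait(id)
    @threads.fetch(id).join
  end
end

runner = JobRunner.new
job_id = runner.submit { 6 * 7 } # the request handler would return here
runner.wait(job_id)              # a browser would poll status instead
runner.status(job_id)            # => :done
runner.result(job_id)            # => 42
```

In a web app the request that calls `submit` would respond with the job id, and later XMLHttpRequest polls would call `status` until it reports `:done`.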

Traffic Wave Experiments

Buried in the comments of the Kathy Sierra thing I just posted, Tim O'Reilly recommends this kickass little page from the late 90s. Geek modulates his own driving style and eliminates road rage for tons of people.