Wednesday, October 8, 2008

Data Integrity And Towelie's Array Methods

Code I wrote for Towelie came in handy for me last night.

Towelie, a very modest refactoring console, revolves around ParseTree, which means it revolves around parse trees - nested lists of nodes representing tokens of Ruby code. Because it does so much work with these nested arrays, Towelie ended up with a few specialized methods on Array.
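The real definitions live in Towelie's source, but a rough sketch of the workhorse, comparing_collect, looks something like this:

class Array

  # Sketch of comparing_collect: yield every ordered pair of
  # distinct elements to the block and collect the non-nil results.
  def comparing_collect
    results = []
    each_with_index do |first, i|
      each_with_index do |second, j|
        next if i == j
        result = yield(first, second)
        results << result unless result.nil?
      end
    end
    results.uniq
  end

end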



The task I had last night: I'm working on a big Rails app with a lot of history and a lot of models. I'm still relatively new to the project. We have a model, which here I'll call a Widget. A Widget has a URL. We've realized we need a validates_uniqueness_of here, because we want one Widget per URL, and people have seen multiple Widgets with the same URL.

But we don't know how many Widgets have the same URL. Likewise, we don't know how Widgets with the same URL are otherwise similar or different. A whole ecology of good testing tools exists around Rails, but the big hole in that ecology is data integrity testing.

(As an aside, if Rails continues to grow, legacy Rails apps with poor data integrity will become an issue, and whoever builds kick-ass data integrity testing tools will guarantee themselves work for years to come. If you're worried about the economy, build some kick-ass data integrity testing tools.)

Anyway, the first step in a situation like this is figuring out what data you're working with. The first thing I tried was comparing_collect. I use comparing_collect in Towelie to go through lists of nodes and extract nodes which are similar or different in specific ways - roughly like this (paraphrased, with defs standing in for Towelie's collection of parse-tree nodes):
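def homonyms
  # [:defn, :name, ...] nodes with the same name token
  defs.comparing_collect do |node1, node2|
    node1 if node1[1] == node2[1]
  end
end

def diff(threshold)
  # crude distance metric: how many atoms two flattened
  # parse trees fail to share
  defs.comparing_collect do |node1, node2|
    node1 if (node1.flatten - node2.flatten).size <= threshold
  end
end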



Here homonyms extracts method definitions with identical names, and diff extracts method definitions which differ by a threshold amount or less.

I figured I could gather up the Widgets like this:

Widget.find(:all).comparing_collect {|w1, w2| w1 if w1.url == w2.url}

Unfortunately, as you might expect when you're solving data integrity problems for a large app with a long history, the amount of data just killed my computer. I was using a dump of the production database, and there were more than 15,000 Widgets. Ezra Zygmuntowicz said in a talk on optimizing Rails apps that ActiveRecord models can be quite expensive to instantiate, especially the complicated ones. Widgets are big and complicated, and I instantiated 15,000 of them - and then comparing_collect compared each one against every other one, which is 15,000 squared, or 225 million comparisons. This could be the reason my computer flipped over on its back, gushed green smoke, and made sad, mewling noises. When I tried to approach the problem a different way, my computer actually ran away from me and hid under the sofa.

Eventually I coaxed it back to the desk. I then went to the opposite extreme and extracted just the URLs with raw SQL. First a file called widgets.sql:

select url from widgets;

Then the command line, pointing mysql at the right database:

mysql my_database < widgets.sql > widgets.txt

Then I had a list of all the URLs, so I went into IRB and made an array of it. This told me the number of URLs in the system:

File.read("widgets.txt").split("\n").size

And the number of unique URLs:

File.read("widgets.txt").split("\n").uniq.size

The number of unique URLs was about 14,000, meaning I had over 1,000 duplicates. But looking at the URLs as they came back in IRB, I saw a few NULLs in there. Discarding the NULLs brought it down to a little over 500.
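mysql's batch output prints missing values as the literal string NULL, so rejecting them was another one-liner:

urls = File.read("widgets.txt").split("\n").reject {|url| url == "NULL"}
urls.size - urls.uniq.size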

Tangent: I once worked for a health insurance claims company and had to hunt down a bug where a guy called John Null (yes really) kept getting his legitimate claims rejected.

Anyway, pretty soon my IRB hackery spilled over into script/console. Thanks to my Utility Belt gem, my IRB and Rails console let me edit code in vi and then eval it in the console, so I ended up with a nice little script. In cartoon form - names changed, like everything else here - it boiled down to this:
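# cartoon version: print each duplicated URL, then the
# attributes of every Widget sharing it
urls = File.read("widgets.txt").split("\n").reject {|url| url == "NULL"}
urls.duplicates.each do |url|
  puts url
  Widget.find(:all, :conditions => {:url => url}).each do |widget|
    puts "  #{widget.id}: #{widget.attributes.inspect}"
  end
end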



Tangent: I took the duplicates bit from Towelie too:
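class Array

  # Sketch of Towelie's duplicates: the elements which appear
  # more than once (their first and last indexes differ)
  def duplicates
    select {|element| index(element) != rindex(element)}.uniq
  end

end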



Anyway, the output's proprietary but basically looks like this:
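http://example.com/widgets/42
  101: {"id"=>101, "url"=>"http://example.com/widgets/42", "name"=>"Red Widget"}
  102: {"id"=>102, "url"=>"http://example.com/widgets/42", "name"=>"Red Widget (copy)"}
http://example.com/widgets/99
  103: {"id"=>103, "url"=>"http://example.com/widgets/99", "name"=>"Blue Widget"}
  104: {"id"=>104, "url"=>"http://example.com/widgets/99", "name"=>"Blue Widget"}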



Except there are thousands of lines, and many more attributes. Just a cartoon example here.

Obviously from here you can either scan the output manually, or, if you want to get really crazy, build a tiny Rails app with models based on this output - not Widgets, but URLs, DuplicateWidgets, and their Attributes. Then rewrite the script so it formats its output like this:
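# still a cartoon - these models are hypothetical, and the
# association is called attrs because ActiveRecord already
# owns the name "attributes"
url = Url.create(:address => "http://example.com/widgets/42")
widget = url.duplicate_widgets.create(:legacy_id => 101)
widget.attrs.create(:name => "name", :value => "Red Widget")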



Then you can just take the script's output, which is itself a script, and run that in the console of your new, tiny Rails app. And you'll have a complete set of metadata concerning your legacy database. Totally unnecessary overkill in this instance, but in the event of a truly fucked-up situation, you could use that hypothetical tiny Rails app to chuck all the data into Gruff, which would give you graphs charting the tangledness of your database, enabling you to track your data integrity over time as you cleaned it up.
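The Gruff end of that would be almost trivial - made-up numbers, obviously:

require 'rubygems'
require 'gruff'

g = Gruff::Line.new
g.title = "Duplicate Widgets Over Time"
g.data("duplicates", [523, 340, 187, 42, 0])
g.labels = {0 => "Oct", 2 => "Nov", 4 => "Dec"}
g.write("data_integrity.png")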

Anyway, we're off in the happy land of vaporware now. Just a demonstration of some productive hacking and how Towelie's array methods helped.