Wednesday, October 10, 2007

How The Rubinius Compiler Works: add, @output, And Compiler Specs

I recently wrote a small compiler patch for Rubinius, with massive help from Defiler in #rubinius. My compiler patch seems flawed - I've never done compiler work before - but I've learned just enough to be dangerous. Here's what I know.

The core file I looked at was compiler/bytecode/compiler.rb. The method I was looking at was process_alias(x). If you click the link, you'll see what's going on in there is very simple. The code pops two vars off a stack, and then "adds" a series of small text strings using a method called add. The small strings are instructions.

add appends instructions to a newline-separated string called @output. In the code example linked above, add adds things from the stack and also adds a series of custom instructions. This is a very common idiom in the code for Rubinius' compiler. Very often a process_* method will add two things from its stack, along with a couple custom instructions. I thought the stack might be a call stack, or a context stack - a term I'd glimpsed in a PDF on Smalltalk virtual machines, but not fully understood - but in fact it's a list of s-expressions.

The two things added to @output from the stack in this case are the first two things on the stack, identified as cur and nw. I think that stands for "current" and "next word." The alternate idiom in the code is to name these variables start and fin, and in one place they're name and val. It's a repeating pattern in this code to have a process_* method start by popping two vars off the stack, maybe enough so to justify a refactoring. However, there are plenty of process_* methods that work differently as well.

For a compiler n00b like me, this is all a bit confusing. The good news is that although I still don't know exactly where all this stuff goes further down the line, it is the core information you need to write compiler speces. Rubinius uses a custom RSpec clone. The Rubinius project is essentially on the RSpec bandwagon, but RSpec uses dynamic features unavailable to Rubinius at this time. Obviously when a language implementation is still approaching version 1.0, you're going to be using a subset of the entire language's functionality.

Anyway, if you've played with RSpec, the code will look familiar. Here's a typical sample from spec/compiler/rubinius/compiler_spec.rb. This code really just says that if you compile a given list of s-expressions, you'll get back a given string of instructions. The instructions include some symbolic stuff, but are mostly literal enough that encoder will then translate them into integers (bytecodes) and store them in .rbc files, to be run by Rubinius' bootstrap, the small VM in C which provides the foundation for the more fully-featured VM in Ruby.

This means a compiler change is very easy to spec, because all you have to do is assert that a given list of s-expressions compiles to a given sequence of instructions. (Obviously, it would be a lot harder if you had to debug actual bytecodes.)

Update: thanks to rue and Defiler for corrections.


  1. About your confusion:
    the confusing parts aren't really related to compiler construction. What you're looking at is a SexprProcessor, which is basically a tool shipped with the ParseTree library, which defines the s-expressions.
    If you want to iterate over a ParseTree/s-expressions, you'll need a visitor, and SExprProcessor helps with that.

    In short: you can define process_* methods for the AST node types you want to handle.
    process_alias handles nodes like [:alias, :foo, :bar].

    The reason that you need to do a lot of shifting (x.shift, etc) is a bit of an ugly spot in SexprProcessor. It has nothing to do with stacks, it's simply the AST node the process_alias method gets as argument, formatted as a list [:alias, :foo, :bar].
    So, to get at the 2nd and 3rd argument of the node, you call shift on the x argument. (To make things slightly more confusing, SExprProcessor can be configured to omit the node type in the list, ie. you could get either [:alias, :foo, :bar] or [:foo, :bar] - I'm guessing the 2nd is used here).

    Frankly, this always reminds me of the way Perl (used to?) deal with parameters.

    It'd be nicer to have the process_* methods to have signatures that deconstruct the lists, eg.
    process_alias(from, to)
    Problem is: some of the more complex node types can vary in structure, and lacking overloading or multi-dispatch or pattern matching, this doesn't really work with plain Ruby. (I have a solution hanging around on my HD that uses Pattern Matching for this... need to release it some day - samples can be glanced in this slide set - look for "ParseWeasel" slides:

  2. Stop in on #rubinius. I'm behind on releasing something too, I know how it is, but you should check in on the irc channel, because it could be that your solution may make compiler stuff easier.

  3. Why not just splat the remaining arguments by default, eg:

    type, *args = [:alias, :foo, :bar]
    send "process_#{type}", *args

    And then use

    def process_alias from to

    The mentioned more complex node types can just do

    def process_case(*args)

    And proceed as they currently do.


Note: Only a member of this blog may post a comment.