Friday, January 19, 2007

Background on Symbols

One of the things that takes a bit of getting used to in Ruby is the symbol datatype. It seems like they're the same thing as strings, but they're not. There's a really good explanation on O'Reilly's Ruby blog, but it gets into such an involved digression on metaprogramming while setting up its examples that it's almost more useful for understanding metaprogramming instead.

Another thing that helps -- maybe helps more -- is to look at the background. As far as I can tell, Ruby's symbols are a nearly direct transliteration of Smalltalk's symbols. There's a good post on lazy initialization in Smalltalk where Ramon Leon says, "Symbols are often used where strings might be used in languages that lack them." That's a very partisan description of a language feature, but it also really tells you everything you need to know -- if you take the time to think it through.

Take a quick look at Ramon's example code in C# and Smalltalk. Unfortunately you'll have to actually click the link, because Blogger has a number of flaws that make it essentially not useful for a code blog, and I obviously chose it pretty randomly, since a programmer whose blog is not useful as a code blog cannot be said to be somebody who researched all available blogging platforms before making his choice. But let's skate past that, since it obviously highlights my foolishness, and get to the point.

The point is that in the C# example, Ramon uses a string ("AddressSearch"), and in the Smalltalk example, he uses a symbol (#AddressSearch). Just for completeness, the Smalltalk #AddressSearch is pretty much identical to the Ruby :address_search, the only differences being capitalization and trivial syntax.
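Here's a minimal Ruby sketch of that difference, as I understand it. The helper names (build_string, same_object?) are made up for illustration and aren't from Ramon's post. The strings are built at runtime so that neither one is a shared literal.

```ruby
# Hypothetical helper: builds a fresh String object on every call.
def build_string
  String.new("address_search")
end

# Hypothetical helper: true only when a and b are the very same object.
def same_object?(a, b)
  a.equal?(b)
end

s1 = build_string
s2 = build_string

puts s1 == s2                # true:  same content
puts same_object?(s1, s2)    # false: two separate String objects

# Every mention of a symbol refers to one and the same object,
# even when the symbol is derived from a string at runtime.
puts same_object?(:address_search, :address_search)          # true
puts same_object?(:address_search, "address_search".to_sym)  # true
```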

Now, I don't know C#, but I do know Java, and I know that if you read Josh Bloch, you don't write your Java that way. Using Strings as flags indicating state is a big, big no-no in Bloch's book, and for a very good reason: it's an extremely bad habit. If you have a piece of code which uses Strings to represent state, and this code is invoked very frequently, your program will create a new String with the exact same content every time it needs to represent that state. In a performance-sensitive application, the overhead will be painful: painful in the sense of ridiculously wasteful object initialization and garbage collection, and painful also in the sense that it is ridiculously simple to avoid the problem in the first place.

Bloch's solution is to create a Java Class representing the state information, and to make state or type info within such a class static, so that it's only instantiated once, and never garbage collected. In one of his books he describes creating a very significant performance improvement with this very simple change.

It's pretty easy to see that what Bloch is actually doing is adding symbols to Java. These objects are instantiated once, and since they remain instantiated, if the program needs to represent state the same way again, it doesn't have to create a new String each time. Likewise, you don't have to deal with the peculiarities of String comparison -- which in Java has more pitfalls than you might think -- because when you compare these objects to each other, you're not comparing different String objects that might hold the same value, you're simply comparing two objects which either are exactly the same object, or not. That's a lot faster for the JVM, it's less error-prone for developers, and it's also exactly what you get for free with symbols in Smalltalk and Ruby. Even better, you don't have to watch a mission-critical app slow to a grinding halt and give you pathetic performance, at a consulting gig where you're representing the company behind the language, to figure all this out. It's built in from day one. You never even need to think about it.
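To make the "adding symbols by hand" idea concrete, here's a rough sketch of that kind of class, transliterated into Ruby rather than Java. It's my own approximation of the pattern, not code from Bloch's book, and the SearchKind class and its constants are hypothetical.

```ruby
# Hypothetical hand-rolled enumeration: a fixed set of instances, created once.
class SearchKind
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Nobody outside the class can create more instances.
  private_class_method :new

  # The only instances that will ever exist, held in constants.
  ADDRESS = new("address")
  PHONE   = new("phone")
end

# Comparison is a plain identity check; no string contents are examined.
puts SearchKind::ADDRESS.equal?(SearchKind::ADDRESS)  # true
puts SearchKind::ADDRESS.equal?(SearchKind::PHONE)    # false

# With symbols, the same guarantee needs no class at all.
puts :address.equal?(:address)  # true
puts :address.equal?(:phone)    # false
```

All of that class boilerplate exists only to get one object per name, which is the thing a symbol literal already gives you.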

Ramon's statement, that strings are often used for symbols in languages that lack them, is partisan but totally true. In fact, not only is it true, but further, programmers working in languages without symbols really have to choose. They can either add symbols to their language, or the result will be applications with massively inferior performance. And companies marketing languages which lack symbols have to pay really good programmers lots and lots of money to develop "design patterns" whose sole purpose is bolting on features which Smalltalk had in 1979 (and which, for that matter, Lisp had in the 50s). That's an expensive way to design a language.

Anyway, if you're a Ruby newbie and you're still not quite sure when you should use symbols instead of strings, the answer is really, anywhere that using strings would be a hack. That is to say, if strings are only for text processing, and you're using a string in some internal, programmatic way, such as to represent state or to influence flow control, there you go, that's it, that should be a symbol. Strings are for text processing, and symbols are for everything you used to use strings for because you didn't have symbols.
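As a small sketch of that rule of thumb, here's a made-up order workflow (TRANSITIONS, next_state and label are all hypothetical names). Symbols carry the internal state and drive the flow control. Strings show up only where there's text for a human to read.

```ruby
# Hypothetical workflow: each state (a symbol) maps to the state that follows it.
TRANSITIONS = { pending: :shipped, shipped: :delivered }.freeze

# Returns the next state; a state with no entry (such as :delivered) is returned unchanged.
def next_state(state)
  TRANSITIONS.fetch(state, state)
end

# Symbols drive the branching; strings are only the text shown to a person.
def label(state)
  case state
  when :pending   then "Waiting to ship"
  when :shipped   then "On its way"
  when :delivered then "Delivered"
  else "Unknown"
  end
end

state = :pending
state = next_state(state)
puts label(state)  # On its way
```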

By the way, the actual subject of Ramon's post -- lazy initialization -- is one place where Ruby makes even Smalltalk look clunky. This is lazy initialization in Smalltalk:

^foo ifNil: [foo := #bar]

And this is lazy initialization in Ruby:

foo ||= :bar
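In practice the Ruby version usually lives inside an accessor method. Here's a minimal sketch, assuming a made-up Widget class with a build counter added purely to show that the initialization runs once.

```ruby
# Hypothetical class demonstrating lazy initialization with ||=.
class Widget
  attr_reader :build_count

  def initialize
    @build_count = 0
  end

  # @foo starts out nil, so the first call assigns it; later calls reuse the value.
  def foo
    @foo ||= build_foo
  end

  private

  # Stands in for some expensive setup; counts how often it actually runs.
  def build_foo
    @build_count += 1
    :bar
  end
end

w = Widget.new
puts w.foo          # bar
puts w.foo          # bar
puts w.build_count  # 1
```

One caveat: ||= assigns again whenever the current value is nil or false, so it only works as a cache when the initialized value is neither of those.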


  1. C# guarantees that all string literals will be interned so they basically "act" as symbols, a guaranteed unique instance. However, they still look like strings, a disadvantage when it comes to refactoring.

    Symbols, when used as a separate abstraction, allow different defaults as far as refactoring goes. One can assume two identical symbols can be renamed because the "idea" of a symbol is that it's unique. One can't make that assumption with strings, so it's not safe to rewrite identical strings, since in theory, they may be meant to be different things.

    Static languages tend to use enumerations when available, essentially pre-declared symbols, or single instance classes, Java style, faking enumerations, which fake symbols.

    Your Ruby might be fewer characters, but it misses the elegance of the Smalltalk example, namely, it requires a special form, an "or" with delayed evaluation (which you have to remember), while the Smalltalk example is just objects and messages, no special forms required, the same old abstraction that the entire language is built from, top to bottom. When Smalltalk does delayed evaluation, [], you know it, without having to already "know it".

  2. Woah, woah, I am not ever arguing one language's superiority over another. Especially not on syntax differences. There are plenty of places where Smalltalk makes Ruby look clunky too. That wasn't my intention at all.

    I don't know C#, but that internal symbolization of strings sounds more useful than what Java does, although I think explicit symbols are probably best of all in the long run.

  3. Oh I know you're not man, I was just joking around.

  4. Java strings are also interned so that all references to "something" refer to the exact same object.

  5. You could be right, Larry, but if you look at Josh Bloch's books, you'll find out the details of the example. Java details are kinda intricate, I don't remember them exactly, but whatever the trap was, it was really easy to fall into.

  6. Ah, you are referring to Item 32 in "Effective Java" I believe. That item says to use appropriate data types and not abuse String objects. Not using String instances for enumerated types seems to be the crux of this discussion. Item 21 in the book discusses appropriate solutions to replicate enumerated types in Java before version 5. With Java 5, the new Enum construct would be used instead.

    I imagine that Java Enum instances do not replace all of the uses that symbols can provide in Smalltalk and Ruby. But they work well for the simple use case of providing enumerated types at the language level.

  7. I think that's right. Definitely Item 21 from "Effective Java" is right. I might be remembering something from "Java Puzzlers" too, I'm not sure.

  8. I believe symbols are originally from APL (LISP1 didn't have them). `Symbol is a symbol in APL (random fact: O'Caml uses the same syntax)...

    The easiest way to think of symbols, in my opinion, is "named unique integer constants". Like enums in C.

  9. Just to clarify, C# (and VB.Net) have enumerations, like in C++. They're effectively type-safe symbols, though I believe any value type can be assigned to their elements at initialization.

    I didn't know that strings were automatically interned in .Net and Java. I remember that in both it used to be recommended that programmers use the Equals() method. Even so, I rarely used it because "==" tended to work in most situations. How do they handle situations like:

    string name = obj.firstName + " " + obj.lastName;

    if (name == "Mark Miller") {...}

    Would that work now?

