Profiling Ruby, or, How I Made Rails Start Up Faster

Last week Ruby application startup got about 2-3x faster when four patches of mine were committed to the MRI/CRuby interpreter. The improvements will be in the upcoming Ruby 2.0 release.

The techniques I used to identify what was slow, so I could fix it, are accessible to anyone and are useful for diagnosing any performance problem in MRI or any C extension. But they’re a little obscure, so I thought I’d do my part to make them better known.

Here’s what startup looked like for our frontend Rails application at Solano, before I dug in to fix anything: 12 seconds from hitting enter on rails c to actually getting a prompt. (11.96s when I measure systematically.) And that’s with the latest 1.9.3 release — with 1.9.2 it’s twice as bad. So I decided to do something about it. That means going through the four basic steps of fixing performance problems:

  1. Measure the problem with a benchmark. As they say, you can’t improve what you don’t measure.
  2. Profile to locate the problem.
  3. Fix the problem.
  4. Repeat until fast.

Steps 1 and 4 are relatively easy. Step 2 can be easy with the right tool — but the right tool isn’t easy to find or to start using. So in this post I’ll go into extra technical detail on Step 2, to make that step easy for you, my reader, in the future.

Step 1: Measure

Why is it so slow? First off, to make the problem easier to study we’ll isolate one thing that shouldn’t be slow but is. Solano’s Rails app requires about 2000 different files, including all its dependencies.

So suppose we isolate just that bit — require a few thousand files, and make the files empty so there’s nothing else going on.
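That isolated benchmark can be sketched like this (a minimal reconstruction, not the exact script from the post — the file count, temp directory, and naming are arbitrary choices for illustration):

```ruby
# Minimal reconstruction of the isolated benchmark: generate a pile of empty
# .rb files, then time how long it takes to require them all.
require "benchmark"
require "fileutils"
require "tmpdir"

n = 1000
dir = Dir.mktmpdir("empty-requires")

# Create n empty files to require.
n.times { |i| FileUtils.touch(File.join(dir, "empty_#{i}.rb")) }

elapsed = Benchmark.realtime do
  n.times { |i| require File.join(dir, "empty_#{i}") }
end

puts format("required %d empty files in %.1f ms (%.1f us each)",
            n, elapsed * 1000, elapsed * 1_000_000 / n)
```

Varying n and watching the per-file cost grow is what exposes the quadratic behavior.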

That should be really fast. What actually happens?

Hooray, it’s slow. That means we can focus on fixing this simplified case, and maybe that will fix our real problem.

First, it shouldn’t take a tenth of a millisecond just to require an empty file — that’s 100,000 ns, time enough to fetch 1000 separate items all the way from main memory. We don’t have that much real work to do.

But worse is that the more files we get, the slower and slower we run. We’re taking quadratic time. There’s no reason this shouldn’t be linear time. Why exactly is this so slow?

Step 2: Locate the Problem — Profile

To finger the specific code inside MRI responsible for this sluggish performance, we’ll need a good profiler. The best profiler I’ve found for this purpose is the Linux ‘perf’ tool. It can tell us

  • which C functions we’re spending our time in;
  • which functions called those functions, called those, etc, for the most time;
  • even which code inside each function took time, down to the machine instruction, and annotated with source lines to help us follow what’s what.

Using perf in a basic way is very simple. We run the slow command prefixed with perf record.

That writes to a file called perf.data by default (or you can pass a different filename with -o). Then perf report reads that file (or another, with -i) and organizes the results in a nice in-terminal browser, showing how much time we spend in each function.

The third column of that report is the “DSO”. perf is a system profiler, so it captures time spent not only in our own code but in external libraries and the kernel. We’re interested in studying the Ruby interpreter’s own code, so let’s limit to the “ruby” DSO — we can either select it in the perf browser or say perf report -d ruby.

So we’re spending 79.60% of our time in three functions. Two are part of the require implementation, but the biggest is a general-purpose function used all over MRI. To really nail the issue, we should find out what code is responsible for calling rb_string_value_ptr() so much. Solution: call-graph profiling. This takes a little setup but is well worth it.

Step 2, better: Call-Graph Profiling for MRI

When perf records that we’re spending time in a function, it can also record the whole call stack so we can blame the callers too. To do this it needs a little help from us when we compile the program we’re profiling, i.e. the Ruby interpreter — we need to pass the -pg option to gcc. Unfortunately -pg has a side effect which MRI doesn’t digest well, so we also need a patch to MRI to neutralize the side effect. Here’s the full recipe for beautiful call-graph profiles:

  1. Go to your MRI source checkout (git clone if you don’t already have one), pick a directory, $DIR (say ~/lib/ruby-trunk), to put your Ruby build in, and run configure with --prefix=$DIR and -pg added to the C compiler flags.
  2. Apply my patch to make MRI safe for call-graph profiling. (If you’re not on an up-to-date trunk, git am may fail; try patch -p1 instead and commit the result yourself.)
  3. Build Ruby with make and install it to the $DIR you chose with make install. (The -ggdb flag is there by default.)
  4. Run your slow job under perf record -g. (If you use rbenv, a quick ln -s $DIR ~/.rbenv/versions/experimental; rbenv shell experimental eliminates the fussing with directories here.)
  5. Run perf report. You can also tune the display with the -g option; see perf help report.

And here are the results. I’ve taken advantage of the in-terminal browser to expand a handful of subtrees and leave others unexpanded.

So the bulk of that time in rb_string_value_ptr() is actually the fault of rb_feature_p(). Summing up, we’re spending over 3/4 of our time in just one function, rb_feature_p(), either directly or in a couple of functions it calls. So it’s pretty clear what we have to fix.

Step 3: Fix

Our culprit rb_feature_p() is in load.c, which is the file that implements the require method. Here it is (as it was before my patches) stripped to the essential logic, highlighting the two slow calls it makes:
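The C source isn’t reproduced here, but its pre-patch shape, rendered very loosely in Ruby, looks about like this (an illustrative sketch, not the real code — the real loaded_feature_path() comparison is considerably more involved, handling load-path prefixes and the .rb/.so extensions):

```ruby
# Illustrative Ruby rendering of the old rb_feature_p() logic in load.c.
# On every #require, MRI scanned the whole $LOADED_FEATURES array; the two
# slow spots were fetching each element's C string (rb_string_value_ptr)
# and the path comparison (loaded_feature_path), both simplified away here.
def feature_loaded?(feature)
  $LOADED_FEATURES.each do |loaded|   # O(n) scan, n = files loaded so far
    return true if loaded == feature ||
                   loaded.end_with?("/#{feature}") ||
                   loaded.end_with?("/#{feature}.rb")
  end
  false
end
```

Called once per require, a scan like this makes n requires take O(n²) total time.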

It takes $LOADED_FEATURES, which is the array of all the files we’ve loaded on previous #require calls — thousands of files, in a Rails app with many dependencies — and it iterates through that array, every time #require is called. That’s what makes the startup time slow. And that’s why we slow down as we require more files — because this array gets longer.

Connoisseurs of algorithms and data structures will know how to make that search faster — instead of scanning the whole list of previously loaded files as an array, making each #require cost O(n), we should use a hash table and make each lookup O(1).

In this case, there are complications:

  • We’re not looking for an exact string match, but rather for a match accepted by loaded_feature_path(), which is complex. That means a simple hash table with the elements of $LOADED_FEATURES as keys won’t do the job.
  • $LOADED_FEATURES is exposed as part of the Ruby language, so we can’t just ditch it in favor of a more efficient data structure.
  • $LOADED_FEATURES is mutable in Ruby, so Ruby code could modify the array and expect the behavior of require to follow along. In case anybody out there is doing that, we’d like to follow along too. That means we have to somehow keep our better data structure consistent when the array has been modified.

(Whether these semantics are a good idea is another matter. Compare Python’s approach, which is simple to understand and simple to implement efficiently.)

Fortunately, with some close study of the loaded_feature_path() code on one hand and a clever use of a little-used feature of the Array implementation on the other, it’s possible to solve all three of these problems. For details, see my $LOADED_FEATURES-index patch and the overall patch report.
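The idea behind the index can be sketched at Ruby level like this (a toy model, not the actual C patch: the class and method names are invented, and the real patch detects mutations of the array through Array internals rather than the crude size check used here, which would miss in-place edits):

```ruby
# Toy model of a $LOADED_FEATURES index: map each feature's short name to
# the indices of matching entries, so a lookup inspects a handful of
# candidates instead of scanning thousands of array elements.
class FeatureIndex
  def initialize(features)
    @snapshot_size = 0
    @index = {}
    refresh(features)
  end

  # Rebuild the index if the array seems to have changed behind our back.
  def refresh(features)
    return if features.size == @snapshot_size
    @index.clear
    features.each_with_index do |path, i|
      key = File.basename(path, ".*")   # e.g. "/a/b/foo.rb" => "foo"
      (@index[key] ||= []) << i
    end
    @snapshot_size = features.size
  end

  # Indices into `features` worth checking for `feature`; usually 0 or 1.
  def candidates(feature, features)
    refresh(features)
    @index.fetch(File.basename(feature, ".*"), [])
  end
end
```

Only the few candidate entries then need the expensive loaded_feature_path()-style comparison, so each require becomes roughly constant-time.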

Step 4: Repeat

How fast are we now? In our benchmark, we’re a lot faster than before — over 4x faster in the large cases, which is pretty nice. But we still get slower and slower the more files we require, still taking quadratic time.

And what about the task I really care about — starting our dependency-laden Rails application? 8.44 seconds. That’s 29% shorter than before, three-and-a-half sweet seconds less waiting time, but we should be able to do better.

So let’s rinse and repeat. What’s still slow in the new version? At the top of the new profile, all the functions with ‘gc’, ‘mark’, or ‘sweep’ in their names belong to the garbage collector. Adding them up, we have over 67% of our time consumed by the garbage collector. So if we want to make much more progress, we’re going to have to cut down on that time. That means either making the GC itself more efficient — a tough job, though there is lots of room for improvement — or eliminating whatever allocations are giving it so much exercise.
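A quick way to confirm, from Ruby level, how much of a workload goes to GC is the standard library’s GC::Profiler (the allocation-heavy loop below is just an illustrative stand-in for real work):

```ruby
# Measure GC time alongside wall-clock time for an allocation-heavy loop.
require "benchmark"

GC::Profiler.enable
total = Benchmark.realtime do
  200_000.times { "x" * 50 }   # churn through short-lived strings
end
gc_time = GC::Profiler.total_time   # seconds spent in GC while enabled
GC::Profiler.disable

puts format("GC: %.1f ms of %.1f ms total", gc_time * 1000, total * 1000)
```

This tells you how much time GC takes, though not which allocations cause it — that's where allocation profiling comes in.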

Finding what those allocations are requires its own set of techniques, in “GC profiling” or “allocation profiling”. In my next post, I’ll write about how to do that and what I did with the results to make Ruby application startup another notch faster.

(You should follow me on Twitter, or follow Solano, so you see that next post.)


