18 Sep 08

Why Did Talos Eat My Face?

Shawn has a great post up about some detective work we’ve been doing this week.  He gives me a bunch of credit, but I am convinced that I did the easy parts – he’s the one making our SQLite happier, so send him the laurels.  Nevertheless, Dave asked if I could write down the basic process used to narrow down the regression, which I agreed to do, because I am a sucker.

Back Story

So Shawn has been trying to update our version of SQLite for some time now.  Updating it fixes half a dozen crasher bugs, and should net us some modest perf wins in some places too.  But three times he’s tried to land it, and three times he’s had to back it out because of a sizeable performance regression on Linux.  Specifically, Ts was going from about 1550ms to about 2500ms.  I wouldn’t call that a 65% regression or anything, but I would be pretty comfortable calling it a 61.3% regression so, you know, holy jumping tree squirrels.

The thing is, there was no meaningful hit on Windows or Mac, and even on Linux, Shawn didn’t have much of an idea where to start.  I (as sheriff for the day) didn’t want that sitting in our tree, but understood Shawn not wanting to back it out without any notion of whether the regression was even real, let alone what was causing it.

Enter Standalone Talos

As is so often the case, Alice came to the rescue.  A while ago she put together the standalone talos package, and I keep being happy that she did.  Running talos on your local machine is a bit of a noisy way to gather performance data (depending on the test) but it’s a great way to make your Firefox do exactly the same thing as the real talos does, for the purposes of profiling.  Which is what we did.  And it worked.

How We Dissected The Problem in Six Easy Steps, Only Two of Which Involved Cussin’

  1. Follow the standalone talos setup instructions to get your own local talos set up and pointed at a build without the offending patch.
  2. Edit the sample.config file, drop down to the bottom of the file, and delete every test you don’t care about, since a full talos run is a lot of work.  Tp, in particular, is not a fast test.
  3. Run talos with your profiler of choice.  In our case, the venerable strace. Our command looked like this:
    strace -T -o without-patch-strace.log python run_tests.py sample.config
  4. Swear, because your log is way too short, because you weren’t tracing the child processes, and hence were presumably watching some python crap. Try again with:
    strace -Tf -o without-patch-strace-AGAIN.log python run_tests.py sample.config
  5. Swear, because now you have a gigantic log with everything interleaved stupidly. Read the damned man page this time and come up with the -ff argument to generate a separate trace file for each pid:
    strace -Tff -o without-patch-strace-OMG.log python run_tests.py sample.config
  6. Repeat step 5 after pointing Talos at a build with the patch (and presumably changing the output filename).

This will leave you with a large number of trace files, but it’s pretty easy to recognize the ones you care about.  In Shawn’s case, he just grep’d out the ones that referenced places.sqlite, and behold, he had before and after traces of precisely the same startup sequence that talos was seeing.
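If you want to follow along at home, the filtering step looks roughly like this.  The trace contents and pid suffixes below are fabricated for illustration; the real files are whatever strace -ff wrote out as PREFIX.PID:

```shell
# Fabricated mini-traces standing in for the real per-pid files that
# `strace -ff -o PREFIX ...` produces as PREFIX.PID:
printf 'open("places.sqlite") = 7\nfsync(7) = 0\nfsync(7) = 0\n' > trace.101
printf 'open("cookies.sqlite") = 9\nfsync(9) = 0\n' > trace.102

# Keep only the traces that ever touched places.sqlite:
grep -l places.sqlite trace.*    # -> trace.101

# A quick fsync count for a given trace is then one grep away:
grep -c fsync trace.101          # -> 2
```

Run the same two greps against the with-patch and without-patch trace directories and the difference jumps right out.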

From there it was a short trip to see a massive increase in fsync calls, something we did not expect, but which certainly explains the lag, given the problems we’ve had historically with fsync on Linux.
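Since strace -T appends each call’s duration in angle brackets, you can also total the time spent in fsync rather than just counting calls.  A rough awk one-liner, run here against a made-up trace fragment:

```shell
# Made-up strace -T output; real traces append <seconds> to every line.
printf 'fsync(7) = 0 <0.120000>\nfsync(7) = 0 <0.080000>\nread(3, "", 0) = 0 <0.000010>\n' > trace.demo

# Strip the angle brackets from the last field and sum it for fsync lines:
awk '/^fsync/ { gsub(/[<>]/, "", $NF); t += $NF } END { print t }' trace.demo   # -> 0.2
```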


[Figure: total execution time, before and after]

Needless to say, Shawn is on it.