Tuesday, February 10, 2015

Why is the Hoogle index so out of date?

Summary: Hoogle 4 is out of date. The alpha version Hoogle 5 has fresh code and data every day (and isn't yet ready).

Someone recently asked why Hoogle's index is so out of date. Making the index both more current (updated daily) and larger (indexing all of Stackage) is one of the goals behind my Hoogle 5 rewrite (which still isn't finished). Let's compare the different update processes:

Hoogle 4 updates took about two hours to complete, if they went well, and often had to be aborted. I first compiled the Hoogle binary on the haskell.org machines, which often failed, as typically the version of GHC was very old. Once I'd got a compiled binary, I needed to generate the database, which took about 2 hours, and occasionally failed halfway through. Once I had the new binary and databases I moved everything to correct place for Apache, accepting a small window of downtime during the move. Assuming that worked, I did a few test searches and smiled. Often the new Hoogle binary failed to start (usually failure to find some files, sometimes permissions) and I had to switch back to the old copy. Fixing up such issues took up to an hour. I had a mix of Windows .bat and Linux .sh scripts to automate some of the steps, but they weren't very robust, and required babysitting.

Hoogle 5 updates happen automatically at 8pm every night, take 4 minutes, and have yet to fail. I have a cron script that checks out the latest code and runs an update script. That script clones a fresh repo, compiles Hoogle, builds the databases, runs the test suite, kills the old version and launches the new version. The Hoogle code is all tested on Travis, so I don't expect that to fail very often. The upgrade script is hard to test, but the two failure modes are upgrading to a broken version, or not upgrading. The upgrade script runs checks and fails if anything doesn't work as expected, so it errs on the side of not upgrading. I use Uptime Robot to run searches and check the server is working, along with a canary page which raises an error if no upgrade happens for two days.

Clearly, the Hoogle 5 version update story is better. But why didn't I do it that way with Hoogle 4? The answer is that Hoogle 4 came out over six years ago, and a lot has changed since then:

  • Hoogle 4 is a CGI binary, served through Apache, while Hoogle 5 is a Haskell Warp server. By moving the logic into Haskell, it's far easier for me to configure and manage. Warp was only released on Hackage in 2011.
  • Hoogle 4 runs on the on the main haskell.org server, where my mistakes can easily take out the haskell.org home page (as a result, the haskell.org home page once said "moo" for 10 minutes). Hoogle 5 runs on a dedicated VM where I have root, and no one else runs anything, so I can experiment with settings about swap files, IP tables and cron jobs.
  • My job has provided a lot of practice doing drive-by sysadmining over the last 6 years. I've also had a lot of practice doing critical releases on a nightly basis. In comparison, Hoogle is pretty simple.
  • The revised/rethought approach to Hoogle databases is a lot faster and uses a lot less memory, so it takes under a minute to generate databases, instead of over an hour. That time difference makes it much easier to experiment with different approaches.

When will Hoogle 5 be ready? It doesn't yet do type search, there is no offline version and no API. There are probably lots of other little pieces missing. If you want, feel free to use it now at hoogle.haskell.org. You can still use Hoogle 4 at haskell.org/hoogle, or the more up-to-date FP complete hosted Hoogle 4.

No comments: