Wide Finder 2: The Widening
<movie-trailer-guy>
Many months ago he attempted Tim Bray’s first Wide Finder in C++, mainly as a coding exercise. Back then, the goal was readability and conciseness. This time … it’s performance-al.</movie-trailer-guy>
Targeting the Hardware of the Future
Computer architecture is changing direction. Instead of individual CPUs becoming ever faster, processors are going “wide”: more cores per chip, each of them relatively low-powered. The combined processing power of many cores exceeds what traditional single-core architectures can deliver, but exploiting it requires new programming techniques. These techniques are reasonably well understood in theory; the Wide Finder project is an attempt to put theory into practice on a real-world task, by everyday coders such as myself.
The goal of a Wide Finder 2 implementation is to produce some simple statistics from a very large (42GB) data file. The target machine is a Sun Fire T2000 with 8 cores (32 threads) and 32 GB of RAM. The task is relatively simple: we have to read the file, which contains web server log entries, and produce some elementary statistics such as the top 10 pages by hits, the top 10 clients, and so forth. Obviously it’s I/O bound … or is it?
I attempted this in (hopefully) idiomatic C++, using the Boost libraries. For those unfamiliar with Boost, you can think of it as the non-standard library, or maybe the unofficial standard library. Basically, if you’re not using the Boost libraries today, you soon will be: many of them have gone on to form the basis of the TR1 standard library extensions, and are hence targeted for inclusion in the next official C++ standard, known as C++0x. Thanks to Boost, my code is completely portable, compiling with both gcc (on my Mac) and the Sun Studio compiler (on the T2000) without even a single #ifdef.
The Results, So Far
So, you might think C++ should be able to whip all those other languages, like Java, into submission, right? Well, I think the results speak for themselves. My time of 16 minutes is pretty much in the middle of the pack: not bad, but not outstanding either, and still behind some of the Java (or JVM-based) implementations. I have a few more optimisations up my sleeve, though, so we’ll see how they pan out.
However, the clear winner in my view, and hence worthy of much more recognition than it currently enjoys, is OCaml. With a run time of 5 minutes, it is perilously close to the raw I/O speed sustainable on this box. Not only that, but it was all done in ~150 lines of code, which is frankly amazing (especially compared to my ~500-line C++ implementation). So OCaml is definitely a language to look at, in my humble opinion.
Hit the links from that results page for some often fascinating insights from the other Wide Finder implementers.
How To Go Wide
Fundamentally I think my approach is fairly similar to many of the other Wide Finders: divide the input file into chunks, then walk through them using multiple threads. Each thread accumulates statistics in its own hash tables; the per-thread tables are then merged, and finally sorted to produce the top-10 reports. A sketch of that pipeline follows.
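Here is roughly what that shape looks like in code. This is a minimal, illustrative sketch rather than my actual implementation: the parsing is elided, and the names (process_chunk, Counts, and so on) are made up for the example.

```cpp
// Sketch: per-thread hash tables, a merge, then a partial sort for the top 10.
#include <boost/bind.hpp>
#include <boost/thread.hpp>
#include <boost/unordered_map.hpp>
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef boost::unordered_map<std::string, long> Counts;

// Each worker parses its own chunk into its own table -- no locking needed.
void process_chunk(const char* begin, const char* end, Counts* out) {
    // ... walk the log lines in [begin, end) and bump (*out)[page] ...
}

struct by_count_desc {
    bool operator()(const std::pair<std::string, long>& a,
                    const std::pair<std::string, long>& b) const {
        return a.second > b.second;
    }
};

int main() {
    // Assume the file has already been split on line boundaries into
    // one (begin, end) range per thread (see the mmap discussion below).
    std::vector<std::pair<const char*, const char*> > chunks; // = ...

    std::vector<Counts> partials(chunks.size());
    boost::thread_group workers;
    for (std::size_t i = 0; i != chunks.size(); ++i)
        workers.create_thread(boost::bind(process_chunk,
            chunks[i].first, chunks[i].second, &partials[i]));
    workers.join_all();

    // Merge the per-thread tables into one ...
    Counts total;
    for (std::size_t i = 0; i != partials.size(); ++i)
        for (Counts::const_iterator it = partials[i].begin();
             it != partials[i].end(); ++it)
            total[it->first] += it->second;

    // ... then a partial sort is enough to pull out the top 10.
    std::vector<std::pair<std::string, long> > top(total.begin(), total.end());
    std::size_t n = std::min<std::size_t>(10, top.size());
    std::partial_sort(top.begin(), top.begin() + n, top.end(), by_count_desc());
}
```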
I want to share some of these techniques in detail, but as I said, I’m still refining them. For now let me just point to a couple of key techniques that got me into contention for this project:
- When it comes to this much data, you can’t afford to copy anything. This means doing I/O with mmap, and doing as much processing “in-place” as possible. For C++ this also means throwing out the iostreams library (even though it is otherwise quite well suited to this type of task, as demonstrated previously). And even with a 64-bit binary, you really don’t want to mmap an entire 42GB data file into memory, trust me [Update: Or maybe not. See comment below]. So I ended up with some quite ugly code to deal with mmap-ing segments of the input file while respecting chunk (i.e. line) boundaries and page-alignment boundaries. The first sketch after this list shows the idea.
- Multi-threaded C++ applications are always at risk of contention, and this is especially so when it comes to memory allocation. Doing too much allocation can kill any parallel processing you do in C++, because malloc and free both grab global mutexes (or worse, spinlocks) before doing their work. To solve this problem I ended up using a custom thread-specific memory allocator based on the Boost.Pool library, which minimises the number of times a thread grabs memory from the global heap during the normal course of operation. Code forthcoming; the second sketch after this list gives the flavour.
Mad props to Shark, which is part of the Mac OS X developer suite. It pinpointed my bottlenecks very quickly and simply. An indispensable tool; don’t jump over this Shark.
Future Work
So I’ve got some more ideas on how to go even wider, and I’ll update here with the results. Basically it’s a matter of applying the above two techniques more thoroughly: minimise copying of data, and minimise memory allocation. (In particular, the Boost.Regex library is next on my list to convert to a thread-specific memory pool.) Will let you know how it goes.
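Boost.Regex already has the hook for this: match_results takes an allocator template parameter, so the per-match bookkeeping can come from a thread-local pool. Something along these lines, where tls_pool_allocator is the hypothetical allocator sketched above (again, this is a sketch, not the finished code):

```cpp
#include <boost/regex.hpp>

// match_results whose sub_match storage comes from the thread-local pool.
typedef boost::match_results<
    const char*,
    tls_pool_allocator<boost::sub_match<const char*> > > pool_cmatch;

void scan(const char* begin, const char* end, const boost::regex& re) {
    pool_cmatch what;
    const char* p = begin;
    while (boost::regex_search(p, end, what, re)) {
        // ... update the hash tables from what[1], what[2], ... in place ...
        p = what[0].second;  // resume scanning after the current match
    }
}
```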