Wide Finder in C++
Have you been following Tim Bray’s Wide Finder project? It’s an exercise to express a fairly simple task in a manner that will scale across multiple CPU cores. Some of Tim’s initial progress with Erlang, and other contributors in different languages, is quite fascinating.
Like Tim I was also amused at Pete Kirkham’s C++ implementation which was purported to be shorter than an initial attempt by an Erlang expert (in Erlang obviously). However on closer examination it seems that Pete’s C++ implementation was simply handling the I/O portion with simplified parsing and not the subsequent sorting.
So as another C++ coding Kata I decided to have a go. Whereas Tim’s goal was to evaluate different methods of expressing algoritms for parallel computation, mine was a lot more modest: just get it running concisely in C++ and compare performance with the raw Ruby version. Here’s what I came up with.
#include <iostream> #include <boost/iostreams/device/mapped_file.hpp> #include <boost/regex.hpp> using namespace std; const boost::regex get_re("GET /ongoing/When/\\d{3}x/(\\d{4}/\\d{2}/\\d{2}/[^ .]+) "); typedef map<string, unsigned> counts_by_key_t; typedef multimap<unsigned, string> keys_by_count_t; int main(int argc, char *argv[]) { counts_by_key_t counts_by_key; for(int arg = 1; arg < argc; ++arg) { try { boost::iostreams::mapped_file_source mf(argv[arg]); boost::cregex_iterator regi(mf.begin(), mf.end(), get_re), rege; for(; regi != rege; ++regi) { counts_by_key[(*regi)[1].str()] += 1; } } catch (ios::failure e) { cerr << argv[arg] << ": " << e.what() << endl; return 1; } } keys_by_count_t keys_by_count; for(counts_by_key_t::const_iterator i = counts_by_key.begin(); i != counts_by_key.end(); ++i) { keys_by_count.insert(make_pair(i->second, i->first)); } unsigned n = 10; for(keys_by_count_t::reverse_iterator ri = keys_by_count.rbegin(); n && ri != keys_by_count.rend(); ++ri, --n) { cout << ri->first << ": " << ri->second << endl; } return 0; }
So at 42 lines this one really is shorter than the 84-line Erlang version. Not a million miles away from the Ruby version either, in length if not in readability. However, I’ve made some simplifications, or taken some liberties, depending on your point of view:
-
Firstly, Boost is a third-party library and hence this is not standard C++. Given the (general) difficulty of incorporating libraries into C++ apps this might be more of a problem than on other languages where CPAN/RubyForge/CheeseShop/whatever rules supreme. However I would argue that Boost is such an indispensable part of modern C++ development that relying on it is quite acceptable, even for tasks like this.
-
Also, I’m using Boost’s
mapped_file_stream
, which uses a memory map to iterate through the file. This is very useful and quick but obviously places limits on the size of the file. Fortunately it handles oversize files gracefully. -
That
reverse_iterator
should be aconst_reverse_iterator
, but I had to work around a gcc bug. -
C++ generic programming purists may be scoffing at my explicit iteration through the containers. Try as I might I could not formulate a way of using the standard algorithms without signficantly increasing the complexity of the code. Inverting the
counts_by_key
map to create thekeys_by_count
multimap sounds particularly like the sort of thing that should be possible using the standard algorithms, but I was unable to work it out.
Removing the explicit iteration may prove to be a useful exercise, in order to get better parallelization. I can picture a class of STL algorithms which are smart enough to automatically distribute work amongst different worker threads, coordinate their shared state, etc. A parallel_for_each
algorithm perhaps. Fully utilising existing algorithms such as for_each
seems like a necessary first step towards this goal.
My other goal with this was to look at performance. The C++ version above processes the 200MB file in about 4.5 seconds on my laptop (running WinXP, code compiled with MSVC++ 8.0).
By comparison the Ruby version runs in 5.0 seconds. This is pretty damn impressive if you ask me, given that it’s not using memory-mapped IO.
For an encore I would like to convert the code above to use regular file I/O. An initial attempt to use the file_iterator
class (from the Spirit library, also part of Boost) was not hopeful; boosting the runtime up to 30 seconds. I also looked at using the standard C++ iostream iterators but the regex matching needs bidirectional iterators, and they aren’t.
Oh, and don’t blame me that the code isn’t unicode safe. That was part of the brief!
4 Comments