The Other Kind of Reentrant
Gather around for a tale of adventure in the land of Linux c++ programming.
So there I was. Like a good developer, I was running unit tests on a module that I had been working on. The module was intended to spawn child processes, in order to read and write their stdin and stdout. Such processes are apparently known as co-processes. I had just finished implementing some enhancements to the code which would communicate with the co-processes using a pseudo-terminal instead of pipes, for those applications like
ssh that read and write directly to the terminal. I was pretty pleased with myself, I could drive ssh as a co-process.
Yes, I'm well aware that I had just re-invented
expect. However I had my reasons, namely the fact that the co-process was made available to the rest of the application as a c++ iostream, meaning that I could read and write it just like any other iostream. Besides,
expect also drags in Tcl as a dependency, and I generally try to avoid dependencies where possible.
The unit tests were failing. Intermittently. Urgh.
What was supposed to happen was that I would close the pseudo-terminal master file descriptor. This would cause an EOF in the co-process which would then quit (I was just using
cat as a co-process for unit testing purposes). My test harness would sleep waiting for a quit flag to be set. The quit flag was supposed to be set from within a handler for the
SIGCHLD signal, but that wasn't being called for some reason. Instead, my main routine would wake from sleep, see that the flag still wasn't set, timeout, kill the co-process, and fail the unit test.
At least, that's what was happening 50% of the time. The other 50% of the time, I was getting the signal, setting the flag and passing the test. So I'm thinking some sort of race condition. I added more logging, to try and work out why the signal wasn't arriving.
This made it worse. Instead of failing the unit test, it was now hanging, deadlocked.
The stack trace from the deadlocked unit test harness wasn't very insightful. It consisted of
_dl_sysinfo_int80 at the top, followed by
__lll_mutex_lock_wait and that's about it. Not very enlightening.
strace Points The Way
So then I tried another tack. Use
strace to see if the
SIGCHLD signal was even being delivered to my application. I found it was. I also found that the deadlock was quite reproducible (after a few attempts) and that the deadlock was indeed in a mutex acquisition routine as indicated by the stack trace. In each case, the immediately-preceeding system call was
stat64("/etc/localtime"). Looking back through the syscall history, there were many similar calls, each time followed by an output log entry.
This tipped me off to looking at the logging library that I was using, log4cpp. As it turned out, it was using
localtime, a function that was not on my list (well, W. Richard Stevens' list anyway) of functions that are safe to call from a signal handler. And logging the incoming signal was the first thing that I was doing in my signal handler.
I verified that, after removing all traces of logging from my signal handlers, the unit tests ran perfectly, to completion. But this was still a mildly unsatisfactory explanation. It didn't explain why the signals weren't being delivered or what mutex was locked that could not be acquired by the signal handler. And maybe I just hadn't tested enough to reproduce it? So I went digging further.
localtime_r To The Rescue, Or Not
I came across
localtime_r which was supposed to be a reentrant version of
localtime. On a whim I re-enabled logging in my signal handlers, and converted log4cpp to use
localtime_r instead of
localtime. Result: deadlock.
But this time I was able to get a proper stack trace. Here it is, elided:
#0 0x004fa7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #1 0x005e7e5e in __lll_mutex_lock_wait () from /lib/tls/libc.so.6 #2 0x0058ffa9 in _L_mutex_lock_1947 () from /lib/tls/libc.so.6 ... #10 0x0063b840 in __malloc_initialize_hook () from /lib/tls/libc.so.6 ... #15 0x0058dfbd in localtime_r () from /lib/tls/libc.so.6 #16 0x0058dfbd in localtime_r () from /lib/tls/libc.so.6 #17 0x00286c48 in log4cpp::TimeStampComponent::append (...) at PatternLayout.cpp:158 ... #26 0x0805c6e9 in main_impl::signal_handler (signal_id=17) at signal_handler.cpp:24 #27 <signal handler called> #28 0x004fa7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #29 0x005cb593 in __xstat64@GLIBC_2.1 () from /lib/tls/libc.so.6 #30 0x00590118 in __tzfile_read () from /lib/tls/libc.so.6 ... #34 0x0059480f in strftime () from /lib/tls/libc.so.6 #35 0x00286df9 in log4cpp::TimeStampComponent::append (...) at PatternLayout.cpp:175
Some interesting stuff here.
At the top of the stack, our old friend
__lll_mutex_lock_wait but it's being called from within
__malloc_initialize_hook. Ah-ha! It's trying to do a
malloc from within a signal handler. Naughty Naughty!
A bit further up the stack, there's our call to
localtime_r from within log4cpp. As if to prove a point about its own reentrancy,
localtime_r calls itself.
Still further up, we see that the signal handler is indeed being called, but this time within the
stat64 call. It's reading the timezone file or something, from within
strftime. The call to
strftime is just after the call to
localtime_r, but it's pretty easy to see why the deadlock occurs. It's because the signal handler is trying to acquire a low-level lock on the heap that is already acquired by the main application. Result, deadlock.
As for explaining the "missing" signal that I saw previously, I think what happened there was that the malloc actually failed, which caused the signal handler to exit before it could log (or do) anything. I haven't gone back to test this theory though.
The Meaning Of Reentrant
So the main lesson to be learned from all this: reentrant doesn't necessarily mean reentrant with respect to signals. The glibc manual has the complete details and I'm now memorising every last word.
This is not just of interest to those who work in quaint languages like c++. You advanced Python and Ruby guys have access to signal handlers too! Let this be a cautionary tale for you all.