 From:  "Manuel Kasper" <mk at neon1 dot net>
 To:  list at m0n0wall dot neon1 dot net
 Subject:  pb7 out - crashes fixed!!
 Date:  Sun, 20 Apr 2003 15:50:56 +0200 (CEST)
And here's yet another m0n0wall release announcement: pb7r310 is out! And
I'm happy to report that the reason for the MPD/DHCPD/etc. crashes has
finally been found and fixed! Yes, it's true - read on...

First of all, because of all the debugging involved, the only new features are:

- Diagnostics: Ping function in webGUI (contributed by Bob Zoller)
- WLAN channel auto-select in webGUI (contributed by Bob Zoller)

Now, concerning those crashes... I have reported earlier that they were
due to a SIGPROF being received by MPD (Note: when I refer to MPD, I also
mean any other daemon (e.g. DHCPD) that may have been crashing in earlier
versions of m0n0wall). I've been searching for the reason for the SIGPROF
in some kernel stack corruption issues that were present in earlier
versions of FreeBSD. Unsurprisingly, changing HZ, increasing UPAGES or
removing CPU_ELAN did not help. The reason was much simpler!

I found that PHP internally calls setitimer(2) with ITIMER_PROF (profiling
timer) to enforce the time limits that can be set with max_execution_time
and max_input_time in php.ini. If the time limit is exceeded while
executing a script, the system sends PHP a SIGPROF (profiling timer
alarm). This signal is caught by PHP and tells it to stop executing the
script. Unfortunately, these interval timer values are inherited by all
processes exec'd by PHP. This means that MPD and all other processes
invoked by the m0n0wall boot-time scripts had that interval timer set to
30 seconds (the setting in php.ini). Since ITIMER_PROF decrements both in
process virtual time and while the system is running on behalf of the
process (NOT wallclock time!), the processes received a SIGPROF after
having consumed 30 seconds of CPU time. Because those processes do not use
signal(3) to catch or ignore SIGPROF, the default action was executed on
them, which was to terminate the process. Yikes!
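
The mechanism described above can be reproduced in a few lines. The
following is only a minimal sketch in Python, not the PHP or m0n0wall code:
the function name and the 0.2 second limit are made up for illustration, and
the timer is armed right before execv(), which keeps an armed interval timer
on any POSIX system.

```python
import os
import signal
import sys

def run_with_inherited_prof_timer(cpu_seconds=0.2):
    """Arm ITIMER_PROF, exec a CPU-bound program, and return the number of
    the signal that killed it (None if it exited on its own)."""
    pid = os.fork()
    if pid == 0:
        try:
            # Child: arm the profiling timer (illustrative 0.2 s, not 30 s).
            signal.setitimer(signal.ITIMER_PROF, cpu_seconds)
            # execv() replaces the program image but keeps the armed timer.
            # The new program installs no SIGPROF handler, so the default
            # action (terminate) applies once the CPU budget is used up.
            os.execv(sys.executable, [sys.executable, "-c", "while True: pass"])
        finally:
            os._exit(127)  # only reached if execv() itself failed
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else None

if __name__ == "__main__":
    killed_by = run_with_inherited_prof_timer()
    print("child killed by SIGPROF:", killed_by == signal.SIGPROF)
```

The exec'd program never asked for a timer, yet it is terminated by SIGPROF
as soon as it has used up the CPU budget it inherited.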

Since MPD and DHCPD consume very little CPU time, it was well possible for
them to run for several days until they had consumed their 30 seconds and
got the deadly SIGPROF.
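
To get a feeling for the time scale, here is a small worked calculation. The
CPU share of 0.01% is purely an assumed figure for illustration, not a
measurement from m0n0wall, and the function name is hypothetical.

```python
def days_until_sigprof(cpu_limit_seconds, cpu_fraction):
    """Wallclock days until a process that uses `cpu_fraction` of one CPU
    on average has consumed `cpu_limit_seconds` of CPU time."""
    wallclock_seconds = cpu_limit_seconds / cpu_fraction
    return wallclock_seconds / 86400  # 86400 seconds per day

if __name__ == "__main__":
    # Assumed: a mostly idle daemon averaging 0.01% CPU (fraction 0.0001).
    print(round(days_until_sigprof(30, 0.0001), 2))  # 3.47
```

With that assumed load, the 30 second budget lasts about three and a half
days, which fits the "several days" pattern.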

The fix was simply to set max_execution_time=0 in php.ini - thttpd kills
CGIs that run for more than 60 seconds anyway (and no, it doesn't use the
profiling timer :), so that's no problem.
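
The effect of the fix can be sketched the same way, assuming that
max_execution_time=0 results in no profiling timer being armed (a zero value
passed to setitimer(2) disables the timer). This is a Python illustration
with made-up names and limits, not the actual PHP code path.

```python
import os
import signal
import sys

# Hypothetical stand-in for a daemon: burns about 0.3 s of CPU, then exits 0.
BURNER = "import time\nwhile time.process_time() < 0.3: pass"

def run_child(limit_seconds):
    """Exec the CPU burner with ITIMER_PROF set to `limit_seconds`
    (0 = timer disabled). Returns the exit code, or the negated signal
    number if the child was killed by a signal."""
    pid = os.fork()
    if pid == 0:
        try:
            # A value of 0 leaves the profiling timer disarmed.
            signal.setitimer(signal.ITIMER_PROF, limit_seconds)
            os.execv(sys.executable, [sys.executable, "-c", BURNER])
        finally:
            os._exit(127)  # only reached if execv() itself failed
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return -os.WTERMSIG(status)
    return os.WEXITSTATUS(status)

if __name__ == "__main__":
    print("limit 0.1 s:", run_child(0.1))  # negative: killed by SIGPROF
    print("limit 0    :", run_child(0))    # 0: child ran to completion
```

With a limit of 0 the child finishes normally no matter how much CPU time it
uses; with a nonzero limit it is killed once the budget is exhausted.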

I hope that this change eliminated all sources of instability in m0n0wall
- now that the bug is fixed, I can finally sleep well again and
concentrate on implementing new features. ;)