 From:  "Manuel Kasper" <mk at neon1 dot net>
 To:  <list at m0n0wall dot neon1 dot net>
 Subject:  XP and MPD; PPPoE crashes
 Date:  Wed, 16 Apr 2003 10:52:02 +0200 (CEST)
Hi people,

First of all, I'm happy to report that the problem I mentioned involving
Windows XP and the new PPTP server function in m0n0wall was caused
entirely by my screwed-up XP setup. On every other XP machine I've tried,
I did not experience the mysterious packet loss issue, and reinstalling
my machine from scratch (with XP) finally solved the problem. I still
don't know why it only happened in conjunction with MPD/m0n0wall and
not with other PPTP servers (including poptop), but I couldn't care less.

--> There are no known problems in m0n0wall's PPTP server at all. <--

On a side note, I have just tried MPD with RADIUS (to a Microsoft IAS
server), and it works great, so we may just see RADIUS support in
m0n0wall's PPTP server soon... (BTW, IAS is really nice; makes it
possible to authenticate your PPTP VPN users against Active Directory)

Now, the only remaining issue that affects m0n0wall's stability is the
(once again, mysterious) MPD (and maybe also DHCPD) crashes ("every 3-6
days or so"). Installing truss on my m0n0wall was a good idea; see what
it logged when MPD vanished from the process table once again last
night:

---
Tue Apr 15 23:08:40 2003: SIGNAL 27
Tue Apr 15 23:08:40 2003: SIGNAL 27
Tue Apr 15 23:08:40 2003: process exit, rval = 27
---

Signal 27 is SIGPROF. This (rather old) PR describes exactly the problem
we're experiencing:

http://www.freebsd.org/cgi/query-pr.cgi?pr=23505
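
For anyone who wants to see the same failure mode in isolation, here is a
minimal userland sketch (not MPD and not the kernel bug itself). It
assumes a POSIX system with Python 3; the names CHILD and
run_until_sigprof are made up for this illustration. The child arms
ITIMER_PROF, leaves SIGPROF at its default disposition and spins until
the timer expires, and the parent reports which signal killed it. On
FreeBSD/i386 and Linux/x86 SIGPROF is number 27, which matches the truss
output above.

```python
import signal
import subprocess
import sys

# Hypothetical child program: arm the profiling timer, make sure SIGPROF
# has its default disposition (terminate), then burn CPU until it fires.
CHILD = """
import signal
signal.signal(signal.SIGPROF, signal.SIG_DFL)
signal.setitimer(signal.ITIMER_PROF, 0.05)
while True:
    pass
"""


def run_until_sigprof(timeout=30.0):
    """Run the child; return the number of the signal that killed it,
    or 0 if it exited without being signalled."""
    proc = subprocess.run([sys.executable, "-c", CHILD], timeout=timeout)
    # subprocess reports death-by-signal as a negative return code.
    return -proc.returncode if proc.returncode < 0 else 0


if __name__ == "__main__":
    print("child was killed by signal", run_until_sigprof())
```

The point is only that a process which never asked for profiling has no
handler installed, so a stray SIGPROF terminates it silently, just as
MPD simply vanished from the process table.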

Apparently, the SIGPROF that causes MPD (and probably also DHCPD) to die
occasionally is sent from hardclock() in sys/kern/kern_clock.c. It seems
to be related to overflows in the per-process kernel stack that were
"fixed" before by increasing UPAGES. Since 4.6, UPAGES (in
sys/i386/include/param.h) is set to 3 and I'm not sure if increasing the
value further would make the problem go away (or just extend the intervals
between the crashes). Upgrading to 5.0 would fix this problem, but I don't
want to do this at the moment (lower performance, may not be as stable
[may not have this bug but x others in its place ;]).

The bug is nasty in that whatever you do, you have to wait almost a week
to find out if it changed anything (I can't seem to make it crash any
faster by downloading gigabytes of data; that's probably because MPD
doesn't consume much CPU time at all once it's up and running since all
data is processed in the kernel [netgraph]).
I'll see if I can reproduce the same SIGPROF bug with good ole FreeBSD
userland ppp (which consumes helluvalot CPU time because it processes all
data in userland and as such dies much sooner).

Anyway, I think I'll just comment out the psignal() in kern_clock.c and
replace it with a printf as in the bug report to see if the signal really
originates in that file.

The fact that we're seemingly only seeing this on net45xx boards could
also be related to the setting of HZ (which has been set to 250 in the
m0n0wall kernel because of PHK's Elan clock code, but is set to 100 on
all standard FreeBSD systems).
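
To put rough numbers on the HZ difference, here is a back-of-the-envelope
sketch. The function names are made up for illustration, and it assumes
one hardclock() call per clock tick on a single CPU. It does not show
that HZ is the cause; it only shows how much more often the suspect code
path runs with HZ=250 than with HZ=100.

```python
SECONDS_PER_DAY = 86_400


def tick_usec(hz):
    """Length of one clock tick in microseconds (1000000 / hz)."""
    return 1_000_000 // hz


def hardclock_calls(hz, days):
    """Number of clock ticks, and hence hardclock() calls, in `days` days."""
    return hz * SECONDS_PER_DAY * days


if __name__ == "__main__":
    for hz in (100, 250):
        print("HZ=%d: tick=%d us, calls in 3 days=%d"
              % (hz, tick_usec(hz), hardclock_calls(hz, 3)))
```

So a HZ=250 kernel goes through hardclock() 2.5 times as often as a
standard one over the same "3-6 days".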

Anyway, I'm happy to know that this all means that the problem is most
likely not m0n0wall related (but may be triggered by "exec-from-PHP"). Now
all that's left to be found is a good fix. :)

Greets,

Manuel