 
 From:  "Jonathan De Graeve" <Jonathan dot DeGraeve at imelda dot be>
 To:  "Lee Sharp" <leesharp at hal dash pc dot org>
 Cc:  "m0n0wall" <m0n0wall at lists dot m0n0 dot ch>
 Subject:  RE: [m0n0wall] Version 1.22 freeze
 Date:  Thu, 20 Jul 2006 00:16:49 +0200
>All systems have 256 meg, and strong CPU.  All system web consoles are
>responsive, and all logs viable.  You can not tell anything is wrong from
>the console.
 
256 MB is ideal and should even be enough for at least 200-300 concurrent users.

>> Can you confirm if this only happens on systems with more than
>> 50 concurrent user logins?

>No.  However, now that local managers know a reboot fixes it, it is very
>hard to get "before" snapshots. :-(
Damn it :( 

>> If the HTTP doesn't work anymore: mini_httpd still seems to be running,
>> but even if you wait 10 seconds the page doesn't show up?

>IE, Firefox, and Opera all time out.  The error we get called on is "the
>Internet is down."
 
Is it possible to do a sniff on a newly connected machine and send me the TCP captures, so I can
actually see what's happening?
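
Something along these lines would be enough (run it on the client itself, or on any box that can
see its traffic; the interface name and client IP below are just placeholders, use whatever matches
your setup):

  # capture the client's traffic towards the portal while the page fails to load
  tcpdump -i en0 -n -s 0 -w portal-fail.pcap host 192.168.1.50 and port 80

Then just send me the resulting pcap file.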

>> You can try to raise the maximum number of concurrent sessions for
>> mini_httpd in the config from the default of 16 to 32.

>How is this done?
 
This can be done on the captive portal settings page itself, under the "Maximum concurrent
connections" fields:
 

"This setting limits the number of concurrent connections to the captive portal HTTP(S) server. This
does not set how many users can be logged in
to the captive portal, but rather how many users can load the portal page or authenticate at the
same time!
Default is 4 connections per client IP address, with a total maximum of 16 connections"

Leave the default of 4 connections per client (so leave that box empty) and change the total maximum
from empty to 32 or even something higher. I use this general rule: the maximum number of users
expected on the system divided by 2.5.

For example: 100 users gives 40.
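
If you want to sanity-check the rule of thumb, it is just a division (bc is only used here for the
decimal arithmetic, it has nothing to do with m0n0wall):

  # suggested "total maximum" = expected concurrent users / 2.5
  echo "100 / 2.5" | bc    # -> 40
  echo "250 / 2.5" | bc    # -> 100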


> Also you can try to kill the minicron process ('/usr/local/bin/minicron 60
> /var/run/minicron.pid /etc/rc.prunecaptiveportal') and start it back up
> manually using the exec page, with 300 (= 300 seconds) instead of 60. If
> there is a huge number of users and the RADIUS server is slow, it can happen
> that the timeout values are a little bit too high and that the m0n0wall
> isn't able to update all accounting within a 60-second interval, especially
> when the RADIUS server is configured with a delay when it has to answer with
> an Access-Reject packet.
> You can change this behaviour by editing the config and adding a
> 'hidden' option key to the captiveportal section:
> <croninterval>300</croninterval>
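
In practice (from the exec page) that boils down to something like this, assuming minicron writes
its pid to the pid file named on its command line:

  # stop the pruning cron that currently runs every 60 seconds
  kill `cat /var/run/minicron.pid`
  # start it again with a 300-second interval
  /usr/local/bin/minicron 300 /var/run/minicron.pid /etc/rc.prunecaptiveportal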

>AFAIK, all of the systems doing this are not using RADIUS, only splash-page
>acceptance authentication.  Would this still make a difference?
 
This is actually very helpful information and GREAT news for me (sorry ;) ): it means the problem
isn't related to the RADIUS subroutines I wrote, which is REALLY good news to me.
The bad news is that if it doesn't have something to do with the maximum number of httpd processes,
it is something in the code related to local user authentication, which must be broken somewhere.
(So I need to check all that code, because I didn't write the local user manager stuff.)

You are really 100% sure this only happens on systems WITHOUT RADIUS, right?

>> Since everything else is still accessible, there isn't a problem
>> with buffers.

>Confusing, isn't it? :-) 
Not really. The more information I get, the more things I can exclude as possible causes, and
hopefully the closer we get to the real cause and of course the fix.

> > How stable is it?  This problem only occurs in heavily used production
> > systems.  If you wish, I can give you access to the two locations that
> > have done this the most.

>> You are talking about heavily used: how many users at minimum?

>All are in hotels.  From 0 to 10 users can come on any time.  Sometimes many
>more.  Mostly porn or music downloads.  Light VPN outbound use from clients
>in the LAN from 5:30 to about 8:00 then all porn. :-)  Business users...

>> Well, it contains bugfixes that aren't incorporated into 1.22, so it should
>> be more stable ;)

>I can try it, or wait to see it break again to gather more information.  PS:
>1.23b1 doesn't have a "both" checkbox, does it? ;-)

Since you are talking about having more than one site with this behaviour, I would suggest upgrading
one to 1.23b1 and leaving the other to gather more information on this issue.
 
PS: No. Like I told you before, the authentication subsystem needs to be rewritten from scratch
(deprecating the original one) to allow more flexibility in authentication mechanisms.
Don't expect this rewrite to land within the next couple of months.