[ previous ] [ next ] [ threads ]
 From:  Fred Wright <fw at well dot com>
 To:  m0n0wall dash dev at lists dot m0n0 dot ch
 Subject:  Re: [m0n0wall-dev] Increasing ipfilter NAT/state table sizes by default?
 Date:  Fri, 1 Oct 2004 00:30:54 -0700 (PDT)
On Tue, 14 Sep 2004, Manuel Kasper wrote:

> in case one of you is looking for something to do/help with: I'm
> wondering if it would be a good idea to increase the ipfilter NAT and
> state table sizes by default in m0n0wall, seeing as changing those
> values requires recompiling the kernel. I've only had a quick glance
> at the ip_state and ip_nat code, so please correct me if I'm wrong in
> my following assumptions.
> The state table is a hash table with separate chaining collision
> handling, and the current size (ipfilter defaults) is 5737 buckets
> (IPSTATE_SIZE), although ipfilter limits the number of states held to
> 4013 (IPSTATE_MAX) to keep the load factor below 0.7. Both numbers
> are obviously prime to help with the compression function, modulo
> (although I don't see why IPSTATE_MAX would necessarily have to be
> prime). This means that there's currently a hard limit of 4013
> states, after which it starts to flush TCP connections that have been
> idle for a while.

This is essentially correct.  Since it's a bucket-sorted table, there's no
strict relationship between the number of entries allowed and the size of
the array (though it affects collision probability).  There's also no
reason for the entry limit to be prime.  Perhaps this is a throwback to an
older version of the code that used a linear hash table.

> The NAT table seems to be a different issue. While 127 should be
> enough for the NAT rules list (NAT_SIZE/RDR_SIZE) and probably also
> the hostmap list (HOSTMAP_SIZE), the default for the NAT table itself
> (NAT_TABLE_SZ) is also 127. This looks like another hash table with
> separate chaining, although with no hard limit, so it will just keep
> getting slower as more entries are added, more collisions are
> generated and the load factor exceeds 1.0.

Umm, in the source I have here (ip_nat.h from V3.4.33), the default for
NAT_TABLE_SZ is 2047, or 16383 with the LARGE_NAT defaults.  Note that
neither of these is prime. :-)

The limit on the number of entries (NAT_TABLE_MAX) defaults to 30000, or
180000 with LARGE_NAT.  It's OK that *these* aren't prime.  Even the first
is pretty large for some m0n0wall systems (see below).

> Since both tables only consist of pointers, I'm wondering what keeps
> us from bumping these values considerably? It doesn't look like that
> would affect memory consumption that much as long as the hash buckets
> are unused. I've not checked if the hash functions look like they'd
> make good use of a larger table instead of just clustering up
> everything. The memory for the entries themselves is malloc'd
> dynamically anyway. Although we might have to tune VM_KMEM_SIZE_SCALE
> to make sure that the available physical memory can actually be used.

Yes, the cost is one pointer (4 bytes on a 32-bit machine) per count for
IPSTATE_SIZE, and two pointers per count for NAT_TABLE_SZ (since there are
two hash tables).  The actual entries are only allocated as needed.

I expect that it's fairly common in router applications for almost all
traffic to be NATted, so it makes sense to size the state and NAT tables

There are a number of issues regarding collisions.  First of all there are
three kinds of data that can be compared:

1) The key.  This is the complete data to be matched by the lookup
routine, and isn't even a completely fixed amount of data.

2) The hash.  This is the result of applying the hash function to the key,
in order to compress it into a modest fixed-size value (32 bits in these

3) The index.  This is the hash value modulo the array size.

Because there are three kinds of value, there are two kinds of collision
possible (obviously a "key collision" would be an oxymoron):

1) A hash collision occurs when two key values map to the same hash value.  
With a good hash function, the probability of a hash collision between any
two keys is 1/2^N, where N is the size of the hash in bits.  With a poor
hash function, it can be considerably worse.

2) An index collision occurs when two hash values map to the same index
value.  Again, with a good hash function, the probability of two hashes
colliding is 1/M where M is the array size.  In fact, with a good hash
function, the probability of two *keys* having an index collision is 1/M.
But again, with a poor hash function it can be considerably worse.  One of
the considerations when using a poor hash function is that it's desirable
for likely deltas in hashes to be relatively prime to the array size.  
The easiest way to insure this is to make the array size prime.  However,
there's no need for a prime array size with a good hash.

It's also worth noting that an index collision is substantially less
expensive than a hash collision *if* the hash value is stored in the
entry, since that allows index collisions to be skipped quite quickly just
by comparing hash values.  But if the hash value *isn't* stored in the
entry, then the full-blown key comparison has to be performed for index
collisions as well as for the much less likely hash collisions.

One other thing that can help comparisions with long chains is to move
each matched entry to the front of its chain (at least if it's
sufficiently far in for this to be worthwhile), thereby causing the
"active" entries to cluster ahead of the "stale" entries.

As far as the current code goes:

1) The "hash" function is extremely lame, being basically a simple sum.

2) The hash value isn't stored in the entries.  The state entry has
something referred to as the "hash value", but it's actually the index and
hence useless in helping with collision efficiency.  The NAT entry has no
such field at all.

3) There's no dynamic reordering of the chains, although the chains are
sorted by reverse order of creation, which isn't *too* bad.

The bottom line is that with the current code, the array sizes should be
prime, and should be somewhat generous due to the likelihood and cost of
collisions.  If the code were improved, the arrays could use power-of-two
sizes to avoid the rather expensive modulo operation, and sizes in the
current 2K-4K ballpark, or perhaps even just 1K, would be fine.

> Again, I haven't spent much time trying to understand the details of
> how ipfilter handles this, and I'm simply looking for an answer
> whether it would be wise to bump the state/NAT table sizes by default
> (advantages, disadvantages...), and if so, which values would be
> reasonable. Also, some rough calculations on how much memory is
> needed to handle x states would be handy for documentation purposes.

In the current build, an ipstate_t is 252 bytes, and a nat_t is 140
bytes.  Since the kernel's malloc() rounds those up to powers of 2, the
actual cost of each is 256 bytes.  Each also has an optional additional
structure that can hang off of it, but AFAICT from a cursory look at the
code, this wouldn't happen in the common cases.

Note that with the current parameters, the 30000 NAT entries allowed would
tie up about 7.5MB, which is a bit much on a 32M Soekris.  There's some
code to reduce the limit if it has trouble with allocations, but I suspect
the system would be pretty sick by the time it actually reached that code.
So this limit may actually be too *large* for the Soekris (and perhaps
even WRAP) builds.  The 4013 filter entries would use about 1MB.

All these parameters are designed to be run-time configurable, but there's
apprently no mechanism to actually set them to anything but the
compile-time values.  The array sizes would need to be set before IPFilter
is initialized; the entry limits could be changed at any time (though
reducing them wouldn't automatically remove entries).

A somewhat related issue is whether the NAT and/or state tables are
overload-prone in certain circumstances.  For example, a recent case that
sounded like this trouble seemed to relate to unsuccessful outgoing
attempts to propagate Windows viruses, and the problem was "cured" by
blocking those outgoing connections in m0n0wall.  This suggests that there
may be an excessively long timeout in at least one of the incomplete
connection states.

					Fred Wright