On Thu, 10 Feb 2005, Didier Lebrun wrote:
> At 14:43 08/02/2005 -0800, Fred Wright wrote:
> >On Tue, 8 Feb 2005, Didier Lebrun wrote:
> > > When RFC 1323 extensions are supported by the system, TCP adjusts
> > itself by
> > > calculating the TCP window size as soon as the first ACK comes back... but
> >That's not exactly correct. The default socket buffer size is set
> >independently from any knowledge of the peer's capability, but in the
> >absence of window scaling the usable window is clamped at 65535. That may
> >not even manage to avoid allocating the buffer based on the uselessly
> >larger size.
> I'm not sure I understand what you mean here. Doesn't TCP adjust the socket
> buffer and the TCP RWIN once it has received some ACKs, allowing it to
> calculate the proper TCP RWIN?
Nope. The receive window is never adjusted based on network conditions,
since in general the receiver isn't even aware of the RTT (and it doesn't
necessarily know the bandwidth, either). The sender has a "congestion
window" to limit the amount of outstanding data, but it doesn't affect
the socket buffer size. And attempting to adjust the send window based on
observed delay-bandwidth product is unstable because of the unity-gain
positive feedback loop between window size and queueing delay.
A TCP implementation *might* clamp the socket buffer size to 64K if the
peer refuses window scaling, but I wouldn't count on even that much.
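Since the socket buffer is the knob that actually controls the offered window, here's a minimal Python sketch of setting it; the 256 KB figure is purely illustrative, not a tuned value for any particular link:

```python
import socket

# Illustrative size, chosen only to exceed the 64 KB ceiling that
# applies when RFC 1323 window scaling is not in effect.
RCVBUF = 256 * 1024

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# The buffer must be set *before* connect()/listen(): the window-scale
# factor is negotiated in the SYN and can't be enlarged afterward.
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, RCVBUF)

# The kernel may round or clamp the value (Linux, for instance,
# doubles the requested size for bookkeeping overhead); what matters
# is that the result is well above 64 KB.
actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(actual)
s.close()
```

The exact number `getsockopt` reports is OS-dependent, so treat the printed value as informational rather than a pass/fail figure.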
> > > - set the TCP receive window size to a higher value [max
> > bandwidth
> > > in bytes * max latency in sec]
> >It's actually worse than that. Although that's sufficient to avoid
> >window-limited rates in the absence of packet loss, whenever a packet is
> >dropped it takes *two* RTTs (plus the time to trigger fast retransmit) to
> >get it through, and hence any window size smaller than twice the
> >delay-bandwidth product (and then some) will diminish the effectiveness of
> >fast retransmit.
> You might be right. I noticed some problems in case of packet loss,
> especially with WinXP clients, but couldn't figure them out. I'll have to
> get into the fast retransmit documentation to fully understand this point.
This doesn't directly relate to fast retransmit, except that without it
the timeout-based retransmissions would involve so much delay that the
issue is moot.
Consider a simplified model, where:
1) Every data packet is ACKed by the receiver.
2) The first duplicate ACK triggers "fast retransmit".
3) Exactly one packet is dropped within the period of interest.
Until the dropped packet comes along, the sender sees data packets going
out, and ACKs returning, with the ACKs lagging a full RTT behind the
sends. So far, one RTT*BW of window is sufficient to keep up.
If the dropped packet is sent at time T, then it's MIA at the receiver at
time T+OWT (one-way time), and hence no ACK is generated. If P is the
time between packets, then the next packet is sent at T+P and received at
T+P+OWT. Its ACK is received by the sender at T+P+RTT, but can only
duplicate the last ACK before the dropped packet, due to the "hole" in the
stream. This simple-minded aggressive sender would immediately retransmit
the lost packet (again at T+P+RTT), which would be received at
T+P+RTT+OWT. The receiver would then send an ACK encompassing that packet
and all subsequent packets, which would be received by the sender at
T+P+2*RTT. But to avoid becoming window-limited during this second RTT,
the window size would need to be at least 2*RTT*BW+MSS.
Since the normal delayed ACK strategy only ACKs every other packet (when
data is flowing continuously), another MSS needs to be added to that for
the ACK delay. And a higher duplicate ACK threshold adds additional MSS
units to the requirement.
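Putting the pieces above together, here's the arithmetic as a small sketch. The figures are assumed for illustration (a geostationary-ish 600 ms RTT and a 96 Kbyte/s pipe), not measurements from anyone's link:

```python
# Assumed figures, for illustration only.
RTT = 0.600              # seconds, round-trip time
BW = 96 * 1024           # bytes/second
MSS = 1460               # bytes, typical Ethernet-derived MSS
DUP_ACK_THRESHOLD = 3    # RFC 2581 recommended value

bdp = RTT * BW           # one RTT of data "in the air"

# Surviving one fast-retransmitted loss without a window stall:
# two RTTs of data, plus one MSS per duplicate ACK awaited beyond
# the first, plus one MSS for the base case and one for the
# delayed-ACK lag -- i.e. 2*RTT*BW + (threshold + 1)*MSS.
window = 2 * bdp + (DUP_ACK_THRESHOLD + 1) * MSS

print(int(bdp))      # ~59 KB: enough only for loss-free flow
print(int(window))   # ~121 KB needed to ride out one fast retransmit
```

The point of the exercise is that the naive "bandwidth times latency" figure is roughly half of what's actually needed once a single fast-retransmitted loss is taken into account.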
If more than one packet is dropped within the RTT, performance suffers
badly unless SACK is available, since fast retransmit can't accommodate
more than one loss per round trip.
> > > - set "Max duplicate ACKs" = 2 (3 on Win98 only)
> >Again that has nothing to do with RFC1323, but instead represents the fast
> >retransmit threshold. RFC2581 recommends 3. Lower values reduce the
> >amount of send buffer needed to avoid window stalls after dropped packets,
> >but increase the risk of unnecessary retransmissions. Also note that this
> >only affects *send* performance.
> You're right in principle, but I discussed this point with a sat
> technician, who recommended reducing the retransmit threshold, since 2
> RTTs is already quite long for a sat link, and the risk of packets
> still arriving later than that is pretty low. I don't remember the
> reason he gave me for the Win98 exception.
The tech is confused. The duplicate ACK threshold has nothing to do with
RTTs. ACKs stream in at (typically) half the packet rate of the sends,
and are fully pipelined. The only effect of a higher threshold is to
delay the retransmission by a few *packet times*.
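To make the counting concrete, here's a toy model of the duplicate-ACK counter as seen by a sender; the function name and framing are my own invention, not taken from any real stack:

```python
def fast_retransmit_trigger(acks, threshold=3):
    """Return the index in the cumulative-ACK stream `acks` at which
    fast retransmit fires, i.e. when `threshold` duplicates of the
    same ACK value have arrived, or None if it never fires."""
    dup_count = 0
    last_ack = None
    for i, ack in enumerate(acks):
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                return i
        else:
            # An ACK covering new data resets the duplicate count.
            last_ack = ack
            dup_count = 0
    return None

# Segment 3 is lost, so every later arrival re-ACKs 3.
acks = [1, 2, 3, 3, 3, 3, 3]
print(fast_retransmit_trigger(acks))  # fires at index 5, the 3rd duplicate
```

Note the fencepost discussed below: the trigger at index 5 is the *fourth* consecutive ACK carrying the value 3, but only the *third* duplicate.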
Meanwhile, there are other things that can cause duplicate ACKs. Since
there's no such thing as not having an ACK in a packet (theoretically
there could be, but that was deemed useless for other reasons), *any*
packet that doesn't ACK new data simply repeats the last ACK. This can
include things such as window updates and "reverse" data. Also, packet
reordering can cause duplicate ACKs without actually losing
packets. Thus, setting the threshold too low can cause unnecessary
retransmissions of successfully delivered packets.
While it's not impossible that a different threshold would be better in
some circumstances, I'd be reluctant to take the recommendation of someone
who doesn't even understand the semantics over that of the people who did
extensive testing with network simulations.
Note that there's a "fencepost ambiguity" in the way this parameter is
specified, and that may account for the Win98 difference. As specified by
the RFC, the recommended threshold of 3 means triggering fast RX on the
third *duplicate* ACK, which is actually the *fourth* consecutive ACK
carrying the same acknowledgment number. An implementation that counts
total identical ACKs rather than duplicates would thus need a setting one
higher to get the same behavior.
> > > On FreeBSD, you can adjust a few thing too:
> >But I wouldn't recommend doing this to m0n0wall, since it's rarely a TCP
> >endpoint and often can't afford the RAM.
> > > If you are using FreeBSD's traffic shaping capabilities, you must
> > adjust to
> > > size of the queues too, in order to avoid packets drops when the queue is
> > > full. You can set each download queue to the TCP receive windows size, and
> > > each upload queue to the TCP sendspace. The same for the main pipes
> > > (96Kbytes and 24Kbytes in our case).
> >But the queues aren't what's filling up. The extra data that one has to
> >accommodate is literally "up in the air" (or at least the vacuum). If the
> >traffic shaper is just dealing with packets, it shouldn't care. If it's
> >trying to be clever enough to watch TCP SEQ and ACK numbers but not clever
> >enough to take large RTTs into account, then it's broken. In no case
> >should it ever be necessary to buffer significant data *in the router*.
> >In fact, excessive buffering in routers simply increases the overall
> >delay-bandwidth product (by increasing latency) and thus requires *more*
> >buffering at the endpoints.
> >Theoretically the same argument would apply to the socket buffers, but the
> >problem is that the receiver can't offer window unless it can commit to
> >receiving that amount of data regardless of application behavior. The
> >send-side buffer is needed because it can't be certain that the data has
> >been delivered until it gets the end-to-end acknowledgment.
> My argument might not be relevant for m0n0wall's traffic shaping ? I've not
> studied it enough to tell. On our gateway, we use DUMMYNET + IPFW2 + NATD
> (static firewall) with a principle of pipe sharing, whatever the bandwidth
> is at a given moment, without setting any absolute value for each pipe or
> queue, since I observed that setting an absolute value was increasing the
> latency by approximately +200 to +250 ms. Each client obtains a fair share
> of the whole pipes (upload and download), depending on how many clients are
> using the link simultaneously, with a weight depending on port numbers. So
> we can sometimes have one client using the full link at its best capacity,
> and a whole TCP window can get stuck in any queue before TCP stops
> transmitting more. That's why we need to set a whole TCP window size for
> each queue in order to avoid packet drops. I did some experimental
> observations by setting various values and looking at packet drops, and I
> found that some drops did occur when the queue size was under 75% of the
> TCP window, and disappeared above this value. I supposed the difference
> between 75% and 100% was because [MAX ... * MAX ...] is overestimated.
Unfortunately many traffic shaper implementations introduce substantial
delays, even though there's no reason in principle to do so. All that's
*really* needed is to keep some recent history about packets to make the
throttling decisions. For special test purposes, such as simulating a
satlink, real queueing is indeed needed, but not for mere bandwidth
limiting.
Avoiding packet drops is usually not a proper goal. In the absence of
something smarter like Explicit Congestion Notification, the *only* way a
router can communicate congestion information to the sender is by dropping
packets. It's unfortunate that data has to get thrown away for this, but
that's the way TCP works. In the absence of packet drops, TCP will
continue to make its send window larger and larger, unbounded by anything
but the socket buffer size. But any window larger than "big enough"
contributes nothing to bandwidth, and meanwhile increases the effective
RTT and thus makes TCP respond even more sluggishly to changes in network
conditions. It's quite likely that all you were doing when "optimizing"
the queues to avoid packet drops was to run the congestion window up
against the socket buffer limit.
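To put a rough number on the latency cost of queueing (figures assumed for illustration, using a 96 Kbyte/s pipe and a 24 KB queue): every byte parked in a router queue adds queue_bytes / bandwidth of one-way delay.

```python
# Assumed figures, for illustration only.
BW = 96 * 1024            # bytes/second through the pipe
queue_bytes = 24 * 1024   # bytes sitting in the router's queue

# A full queue delays every packet behind it by this much:
added_delay = queue_bytes / BW            # seconds
print(round(added_delay * 1000))          # -> 250 ms of extra latency
```

Note that even this modest 24 KB queue, once kept full, costs a quarter of a second of added latency, which is in the same ballpark as the +200 to +250 ms increase reported above.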
If the traffic shaper is truly queueing packets for delayed transmission,
then note that large queue sizes will wind up eating up lots of RAM.