Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!decvax!ucbvax!LBL-CSAM.ARPA!van
From: van@LBL-CSAM.ARPA.UUCP
Newsgroups: mod.protocols.tcp-ip
Subject: Re: 4.2/4.3 TCP and long RTTs
Message-ID: <8612081844.AA10642@lbl-csam.arpa>
Date: Mon, 8-Dec-86 13:44:47 EST
Article-I.D.: lbl-csam.8612081844.AA10642
Posted: Mon Dec 8 13:44:47 1986
Date-Received: Wed, 10-Dec-86 02:14:25 EST
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 152
Approved: tcp-ip@sri-nic.arpa

I've been told that my 6th & 7th points (4.3bsd retransmit timers
need some work) were incomprehensible.  That's what you get when
you reply to messages at 3am Sunday morning.  Since the retransmit
timer behavior results in the biggest performance loss (the other
problems affect congestion & stability more than performance),
I'll take a crack at explaining it better.

Attached is a picture of the problem.  It is taken directly from a
trace but the window size has been reduced from 8*MSS to 4*MSS to
simplify the drawing.

Time runs down the page.  The time axis has tick marks at multiples
of the round trip time, R.  The sender is on the left, the receiver
on the right.  Seven segments are sent, labeled A through G.  Two
segments, B and D, get lost or damaged in transit.  A lower case
letter is used for a receiver's ack (e.g., "g" is the ack for all
bytes up to and including the last byte of segment "G").  A list of
all the segments successfully received so far is in square brackets
at the point where each ack is generated.  Holes in the sequence
space are indicated by "-".

All the traffic goes one direction (this was an ftp).  4.3 almost
always sends MSS byte segments and all these were of size MSS
(512B).  Because of the 4.3 delayed ack code, the receiver almost
always reports a full size window (4KB in 4.3, 2KB in this example)
in an ack.  All these acks report a 4 MSS (2KB) window.

All sends are timed.
The retransmit timer is set to 2 times the smoothed round trip time
(TCP_BETA * t_srtt).  The timer is set on each ack that's not a
duplicate of a previous ack (i.e., that changes the "sent but
unacknowledged" pointer, snd_una).  If the timer times out, the
segment starting at snd_una is retransmitted and the timer is
restarted at 2*srtt.  Exactly one segment is retransmitted.
Periodic retransmissions of that segment continue until it is
acked.  When the segment is acked, the retransmit timer is set to
2*srtt and "normal" behavior resumes (see rfc793 if you're not sure
what normal behavior is).

   0-|  A                       (set timer to 2R, send enough
     |  B\                       packets to fill window (4))
     |  C\\
     |  D\\\
     |   \\*\
     |    *\ a [A]              (ack A)
     |      X
     |     / a [A - C]          (save C but can only ack through A)
     |    / /
     |   / /
  1R-|  E /                     (A ack received, set timer to 3R,
     |   \                       ack opens window by 1 so send E)
     |  - \                     (duplicate A ack discarded)
     |     \
     |      \
     |       a [A - C - E]      (save E but can only ack through A)
     |      /
     |     /
     |    /
     |   /
  2R-|  -                       (duplicate A ack discarded)
     |
     |
     |
     |
     |
     |
     |
     |
     |
  3R-|  B                       (timer goes off, rexmit first
     |   \                       unacked segment (B), timer set to 5R)
     |    \
     |     \
     |      \
     |       c [A B C - E]      ("B" fills in sequence space up
     |      /                    through "C", ack C)
     |     /
     |    /
     |   /
  4R-|  F                       (ack of C opens window for 2 more
     |  G\                       segments, timer set to 6R)
     |   \\
     |    \\
     |     \\
     |      \c [A B C - E F]    (missing D, can only ack through C)
     |      /c [A B C - E F G]
     |     //
     |    //
     |   //
  5R-|  -/                      (duplicate acks for C discarded)
     |  -
     |
     |
     |
     |
     |
     |
     |
     |
  6R-|  D                       (timer goes off, rexmit first
     |   \                       unacked segment (D), timer set to 8R)
     |    \
     |     \
     |      \
     |       g [A B C D E F G]  (sequence space complete, ack G)
     |      /
     |     /
     |    /
     |   /
  7R-|

There are two problems here: the gap between 2R & 3R and the fact
that we don't send D at 4R.  The idle time from 2-3 (and from 5-6)
happens because our timer is always 2*R from the last useful ack
and is essentially unrelated to when a segment is originally sent.
(The code wasn't intended to work this way, and on low delay
circuits it works correctly.)
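
The timer rule and the trace above can be replayed with a small
simulation.  This is only a sketch, not the 4.3bsd code: it assumes
one tick per line of the picture (R = 10 ticks), one transmission
per tick, an immediate ack for every arriving segment (no delayed
acks), and that a retransmitted segment is never lost.  The names
simulate, to_receiver, to_sender and lost_first_try are made up for
the illustration; snd_una and snd_nxt follow the usual TCP naming.

```python
R = 10            # diagram lines per round-trip time (assumption)
ONE_WAY = R // 2  # lines for a segment or an ack to cross the path


def simulate(n_segments=7, window=4, lost_first_try=(1, 3), max_ticks=1000):
    """Replay the trace; return (finish_tick, [(tick, segment), ...])."""
    to_receiver = {}   # tick -> segments arriving at the receiver
    to_sender = {}     # tick -> cumulative acks arriving at the sender
    received = set()
    snd_una = 0        # first unacknowledged segment (0 = A, 1 = B, ...)
    snd_nxt = 0        # next never-sent segment
    timer = None       # retransmit deadline, in ticks
    rexmits = []       # (tick, segment) for every timer-driven resend

    for t in range(max_ticks):
        # Receiver: save each arrival, ack the longest hole-free prefix.
        for seg in to_receiver.pop(t, []):
            received.add(seg)
            ack = 0
            while ack in received:
                ack += 1
            to_sender.setdefault(t + ONE_WAY, []).append(ack)

        # Sender: only an ack that advances snd_una restarts the timer;
        # duplicate acks are discarded.
        for ack in to_sender.pop(t, []):
            if ack > snd_una:
                snd_una = ack
                timer = t + 2 * R
        if snd_una == n_segments:
            return t, rexmits

        # Sender: at most one transmission per tick.
        if timer is not None and t >= timer:
            # Timeout: resend exactly one segment and restart the timer.
            rexmits.append((t, snd_una))
            to_receiver.setdefault(t + ONE_WAY, []).append(snd_una)
            timer = t + 2 * R
        elif snd_nxt < min(snd_una + window, n_segments):
            if timer is None:
                timer = t + 2 * R
            if snd_nxt not in lost_first_try:   # first copy may be lost
                to_receiver.setdefault(t + ONE_WAY, []).append(snd_nxt)
            snd_nxt += 1

    raise RuntimeError("transfer did not finish within max_ticks")
```

With the defaults (B and D lost on their first transmission),
simulate() returns (70, [(30, 1), (60, 3)]): the transfer ends at
7R, with B resent at 3R and D at 6R, matching the picture.
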
We should really be retransmitting B 2*R from its first
transmission (i.e., 1 line after the 2R tick mark).  It's not too
hard to show analytically that this (the current 4.3 algorithm)
"feeds forward" (e.g., the recovery for D is moved later in time
and is more likely to conflict with F,G recovery), which is why
throughput degrades much faster than linearly with increasing loss
rate.

You can view the late transmission of D two ways.  It could be
another example of the timer problem.  I.e., we should have
retransmitted D 3 lines after the 2R tick.  We held off sending it
then because we thought the network might be congested and we
wanted to send a minimum amount of data until we got back an
indication (the ack) that the congestion had cleared up.  But we
certainly should have sent D at 4R when we got the "c" ack.

Or you can say that when we get the "c" ack after the
retransmission of B, no packets have been injected into the network
for 2*R.  The ack tells you pretty clearly that the receiver is
missing D.  (Either point of view will do the "right" thing in this
case, but treating a retransmit ack specially buys you a bit in one
other case.)

If the two problems are corrected, the total time drops from 7R to
4R (2R is the total time if no packets are lost).  If we don't do
the send-1-packet-on-rexmit congestion control, the total time
drops to 3R, the minimum possible if one or more packets are
dropped.

Also, this partly illustrates why I thought Craig's measurements
demonstrated a problem in TCP rather than the superiority of RDP.
Even with EACKs, it takes RDP 3R to send the data if the same two
packets are lost, exactly the same time it takes TCP.  I think I
can show that EACKs aren't a big win until the drop rate is >50%,
if TCP is working as well as it can (that's not to say RDP isn't a
win for other reasons).

 - Van