Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!decvax!ucbvax!LBL-CSAM.ARPA!van
From: van@LBL-CSAM.ARPA.UUCP
Newsgroups: mod.protocols.tcp-ip
Subject: Re: 4.2/4.3 TCP and long RTTs
Message-ID: <8612081844.AA10642@lbl-csam.arpa>
Date: Mon, 8-Dec-86 13:44:47 EST
Article-I.D.: lbl-csam.8612081844.AA10642
Posted: Mon Dec  8 13:44:47 1986
Date-Received: Wed, 10-Dec-86 02:14:25 EST
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 152
Approved: tcp-ip@sri-nic.arpa

I've been told that my 6th & 7th points (4.3bsd retransmit timers
need some work) were incomprehensible.  That's what you get
when you reply to messages at 3am Sunday morning.  Since the
retransmit timer behavior results in the biggest performance
loss (the other problems affect congestion & stability more than
performance), I'll take a crack at explaining it better.

Attached is a picture of the problem.  It is taken directly from
a trace but the window size has been reduced from 8*MSS to 4*MSS
to simplify the drawing.  Time runs down the page.  The time axis
has tick marks at multiples of the round trip time, R.  The sender
is on the left, the receiver on the right.  Seven segments are
sent, labeled A through G.  Two segments, B and D, get lost or
damaged in transit.  A lower case letter is used for a receiver's
ack (e.g., "g" is the ack for all bytes up to and including the
last byte of segment "G").  A list of all the segments successfully
received so far is in square brackets at the point where each
ack is generated.  Holes in the sequence space are indicated by "-".

All the traffic goes in one direction (this was an ftp).  4.3 almost
always sends MSS-byte segments and all these were of size MSS (512B).
Because of the 4.3 delayed ack code, the receiver almost always
reports a full size window (4KB in 4.3, 2KB in this example) in an
ack.  All these acks report a 4 MSS (2KB) window.

All sends are timed.  The retransmit timer is set to 2 times the
smoothed round trip time (TCP_BETA * t_srtt).  The timer is set
on each ack that's not a duplicate of a previous ack (i.e., that
changes the "sent but unacknowledged" pointer, snd_una).  If the
timer times out, the segment starting at snd_una is retransmitted
and the timer is restarted at 2*srtt.  Exactly one segment is
retransmitted.  Periodic retransmissions of that segment continue
until it is acked.  When the segment is acked, the retransmit
timer is set to 2*srtt and "normal" behavior resumes (see RFC 793
if you're not sure what normal behavior is).
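
A minimal sketch of the timer rule just described, in Python.  This
is a toy model and not the 4.3BSD source: the class and method names
are made up for illustration, acks are counted in whole segments
rather than bytes, and sequence number wraparound is ignored.

```python
TCP_BETA = 2  # multiplier named in the text: timeout = TCP_BETA * t_srtt


class RexmitTimer:
    """Toy model of the retransmit timer behavior described above."""

    def __init__(self, srtt):
        self.srtt = srtt        # smoothed round trip time
        self.snd_una = 0        # first unacknowledged segment (0 = A, 1 = B, ...)
        self.expires_at = None  # absolute expiry time, None if not running

    def start(self, now):
        self.expires_at = now + TCP_BETA * self.srtt

    def on_ack(self, now, ack):
        """Restart the timer only if the ack advances snd_una."""
        if ack <= self.snd_una:
            return False        # duplicate ack: discarded, timer untouched
        self.snd_una = ack
        self.start(now)
        return True

    def on_tick(self, now):
        """Return the segment to retransmit if the timer expired, else None."""
        if self.expires_at is not None and now >= self.expires_at:
            self.start(now)      # timer restarted at 2*srtt
            return self.snd_una  # exactly one segment, the one at snd_una
        return None
```

Fed the sender-side events of the trace below with srtt = R = 1
(ack for A at 1R, timeout, ack through C at 4R, timeout), the model
gives expiry times of 3R, 5R, 6R and 8R, the timer values noted in
the drawing.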

	  0-| A				(set timer to 2R, send enough
	    | B\			 packets to fill window (4))
	    | C\\
	    | D\\\
	    |  \\*\
	    |   *\ a [A]		(ack A)
	    |     X
	    |    / a [A - C]		(save C but can only ack through A)
	    |   / /
	    |  / /
	 1R-| E /			(A ack received, set timer to 3R,
	    |  \			 ack opens window by 1 so send E)
	    | - \			(duplicate A ack discarded)
	    |    \
	    |     \
	    |      a [A - C - E]	(save E but can only ack through A)
	    |     /
	    |    /
	    |   /
	    |  /
	 2R-| -				(duplicate A ack discarded)
	    | 
	    |
	    |
	    |
	    |
	    |
	    |
	    |
	    |
	 3R-| B				(timer goes off, rexmit first
	    |  \			 unacked segment (B), timer set to 5R)
	    |   \
	    |    \
	    |     \
	    |      c [A B C - E]	("B" fills in sequence space up
	    |     /			 through "C", ack C)
	    |    /
	    |   /
	    |  /
	 4R-| F				(ack of C opens window for 2 more
	    | G\			 segments, timer set to 6R)
	    |  \\
	    |   \\
	    |    \\
	    |     \c [A B C - E F]	(missing D, can only ack through C)
	    |     /c [A B C - E F G]
	    |    //
	    |   //
	    |  //
	 5R-| -/			(duplicate acks for C discarded)
	    | -
	    |
	    |
	    |
	    |
	    |
	    |
	    |
	    |
	 6R-| D				(timer goes off, rexmit first
	    |  \			 unacked segment (D), timer set to 8R)
	    |   \
	    |    \
	    |     \
	    |      g [A B C D E F G]	(sequence space complete, ack G)
	    |     /
	    |    /
	    |   /
	    |  /
	 7R-| 


There are two problems here: the gap between 2R & 3R and the fact
that we don't send D at 4R.  The idle time from 2R to 3R (and from
5R to 6R) happens because our timer is always 2*R from the last useful
ack and is essentially unrelated to when a segment is originally
sent.  (The code wasn't intended to work this way and on low delay
circuits it works correctly.)  We should really be retransmitting
B 2*R from its first transmission (i.e., 1 line after the 2R tick
mark).  It's not too hard to show analytically that this (the
current 4.3 algorithm) "feeds forward" (e.g., the recovery for D
is moved later in time and is more likely to conflict with F,G
recovery) which is why throughput degrades much faster than
linearly with increasing loss rate. 
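
The arithmetic behind that gap can be worked through with the times
read off the drawing.  This is a sketch for this one trace only: it
assumes the drawing's scale of 10 lines per R, takes B and D as first
sent 1 and 3 lines after time 0, and uses made-up function names.
Times are in lines so the sums stay exact.

```python
LINES_PER_R = 10            # the drawing has 10 lines between R tick marks
TIMEOUT = 2 * LINES_PER_R   # retransmit timeout of 2*R, in lines

# When each lost segment was first sent, and when the last ack that
# advanced snd_una arrived before its timeout (read off the drawing).
first_sent = {"B": 1, "D": 3}
last_useful_ack = {"B": 1 * LINES_PER_R,   # "a" arrives at 1R
                   "D": 4 * LINES_PER_R}   # "c" arrives at 4R


def rexmit_from_ack(seg):
    """Retransmit time when the timer runs from the last useful ack."""
    return last_useful_ack[seg] + TIMEOUT


def rexmit_from_send(seg):
    """Retransmit time when the timer runs from the first transmission."""
    return first_sent[seg] + TIMEOUT


def extra_delay(seg):
    """How much later the ack-based timer fires, in lines."""
    return rexmit_from_ack(seg) - rexmit_from_send(seg)


for seg in ("B", "D"):
    print(seg, rexmit_from_ack(seg), rexmit_from_send(seg), extra_delay(seg))
```

For B this gives 30 lines (3R) against 21 lines, a delay of 0.9R.
For D it gives 60 lines (6R) against 23 lines, a delay of 3.7R.  The
second loss is recovered much later than the first, which is the
"feed forward" effect as it shows up in this trace.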

You can view the late transmission of D two ways.  It could be
another example of the timer problem.  I.e., we should have
retransmitted D 3 lines after the 2R tick.  We held off sending
it then because we thought the network might be congested and we
wanted to send a minimum amount of data until we got back an
indication (the ack) that the congestion had cleared up.  But we
certainly should have sent D at 4R when we got the "c" ack. 

Or you can say that when we get the "c" ack after the
retransmission of B, no packets have been injected into the
network for 2*R, so everything sent earlier has either arrived or
been lost.  The ack tells you pretty clearly that the receiver is
missing D.  (Either point of view will do the "right" thing in
this case, but treating a retransmit ack specially buys you a bit
in one other case.)
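
One way to sketch the second point of view in Python.  This is an
illustration of the idea only, not code from 4.3BSD or a worked-out
proposal: the function and argument names are assumptions, and acks
are counted in whole segments (0 = A, 1 = B, and so on).

```python
def segment_to_resend(ack, snd_max, ack_is_for_rexmit):
    """Return the segment to retransmit right away, or None.

    ack               -- number of segments acked in order (the next one expected)
    snd_max           -- number of segments sent so far
    ack_is_for_rexmit -- True if this ack answers a retransmission
    """
    # An ack for a retransmission that still stops short of everything
    # we have sent means the segment it stops at never arrived.
    if ack_is_for_rexmit and ack < snd_max:
        return ack
    return None
```

In the trace, the "c" ack at 4R acks 3 segments (A, B, C) while 5
have been sent (A through E), so this returns 3, i.e. segment D.  The
final "g" ack covers all 7 segments sent and returns None.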

If the two problems are corrected, the total time drops from 7R
to 4R (2R is the total time if no packets are lost).  If we don't
do the send-1-packet-on-rexmit congestion control, the total time
drops to 3R, the minimum possible if one or more packets are
dropped.

Also, this partly illustrates why I thought Craig's measurements
demonstrated a problem in TCP rather than the superiority of RDP.
Even with EACKs, it takes RDP 3R to send the data if the same two
packets are lost, exactly the same time it takes TCP.  I think I
can show that EACKs aren't a big win until the drop rate is >50%,
if TCP is working as well as it can (that's not to say RDP isn't
a win for other reasons). 

  - Van