Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!decvax!ucbvax!LBL-CSAM.ARPA!van
From: van@LBL-CSAM.ARPA.UUCP
Newsgroups: mod.protocols.tcp-ip
Subject: Re: 4.2/4.3 TCP and long RTTs
Message-ID: <8612071344.AA03810@lbl-csam.arpa>
Date: Sun, 7-Dec-86 08:44:03 EST
Article-I.D.: lbl-csam.8612071344.AA03810
Posted: Sun Dec 7 08:44:03 1986
Date-Received: Sun, 7-Dec-86 12:57:03 EST
References: <8612070131.AA01808@lbl-csam.arpa>
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 204
Approved: tcp-ip@sri-nic.arpa

What you observe is probably poor tcp behavior, not antisocial rdp
behavior.  If the link is lossy or the mean round trip time is greater
than 15 seconds, 4.3bsd tcp throughput degrades rapidly.  For long
transfers, a link that gives 2.7KB/s throughput at a 1% loss rate gives
0.07KB/s at a 10% loss rate.  (As appalling as this looks, 4.2bsd,
TOPS-20 tcp, and some other major implementations that I've measured
get worse faster.  The 4.3 behavior was the best of everything I looked
at.)

I know some of the reasons for the degradation.  As one might expect,
the failure seems to be due to the cumulative effect of a number of
small things.  Here's a list, in roughly the order that they might bear
on your experiment.

1. There is a kernel bug that causes IP fragments to be generated, and
ip fragments have only a 7.5s TTL.

In the distribution 4.3bsd, there is a bug in the routine in_localaddr
that makes it say all addresses are "local".  In most cases this makes
tcp use a 1k mss, which results in a lot of ip fragmentation.  On high
loss or long delay circuits, a lot of the tcp traffic gets timed out
and discarded at the destination's ip level.  The bug fix is to change
the line

        if (net == subnetsarelocal ? ia->ia_net : ia->ia_subnet)

in netinet/in.c to

        if (net == (subnetsarelocal ? ia->ia_net : ia->ia_subnet))

I also changed IPFRAGTTL in ip.h to two minutes (from 15 to 240)
because we have more memory than net bandwidth.

2. The retransmit timer is clamped at 30s.

The 4.3 tcp was put together before the arpanet went to hell and has
some optimistic assumptions about time.  Since the retransmit timer is
set to 2*RTT and clamped at 30s, an RTT > 15s is effectively treated as
15s.  (Last week, the mean daytime rtt from LBL to UCB was 17s.)  On a
circuit with a 2min rtt, most packets would be transmitted four times
and the protocol pipelining would be effectively turned off (if 4.3 is
retransmitting, it only sends one segment rather than filling the
window).  When running in this mode, you're very sensitive to loss,
since each dropped packet or ack effectively uses up 4 of your 12
retries.

I would at least change TCPTV_MAX in netinet/tcp_timer.h to a more
realistic value, say 5 minutes (remembering to adjust related timers
like the MSL proportionally).  I changed the TCPT_RANGESET macro to
ignore the maximum value because I couldn't see any justification for a
clamp.
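To make that concrete, the change amounts to a couple of lines in
netinet/tcp_timer.h.  This is from memory rather than a verbatim diff,
so treat it as a sketch and check it against your sources:

        /*
         * Raise the timer ceiling from 30 seconds to 5 minutes
         * (PR_SLOWHZ ticks are half seconds) ...
         */
        #define TCPTV_MAX       (5*60*PR_SLOWHZ)   /* was (30*PR_SLOWHZ) */

        /*
         * ... and make the range-setting macro enforce only the
         * minimum.  The tvmax argument stays so callers don't have to
         * change; it's just ignored.
         */
        #define TCPT_RANGESET(tv, value, tvmin, tvmax) { \
                (tv) = (value); \
                if ((tv) < (tvmin)) \
                        (tv) = (tvmin); \
        }

If you'd rather keep the clamp, at least raise TCPTV_MAX; either way,
remember the related timers that are derived from these constants.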
3. It takes a long time for tcp to learn the rtt.

I've harped on this before.  With the default 4k socket buffers and a
512 byte mss, 4.3 tcp will only try to measure the rtt of every 8th
packet.  It will get a measurement only if that packet and its 7
predecessors are transmitted and acked without error.  Based on trpt
trace data, tcp gets the rtt of only one in every 80 packets on a link
with a 5% drop rate.  Then, because of the gross filtering suggested in
rfc793, only 10% of the new measurement is used.

For a 15s rtt, this means it takes at least 400 packets to get the
estimate from the default 3s to 7.5s (where you stop doing unnecessary
retransmits for segments with average delay) and 1700 packets to get
the estimate to 14s (where you stop unnecessary retransmits due to
variance in the delay).  Also, if the minimum delay is greater than 6s
(2*TCPTV_SRTTDFLT), tcp can never learn the rtt because there will
always be a retransmit canceling the measurement.

There are several things we want to try to improve this situation.  I
won't suggest anything until we've done some experiments.  But the
problem becomes easier to live with if you pick a larger value for
TCPTV_SRTTDFLT, say 6s, and improve the transient response of the srtt
filter (lower TCP_ALPHA to, say, .7).

4. The retransmit backoff is wimpy.

Given that most of the links are congested and exhibit a lot of
variance in delay, you would like the retransmit timer to back off
pretty aggressively, particularly given the lousy rtt estimates.  4.3
backs off linearly most of the time.  The actual intervals, in units of
2*rtt, are:

        1 1 2 4 6 8 10 15 30 30 30 ...

While this is only linear up to 10, the 30s clamp on timers means you
never back off as far as 10 if the mean rtt is >1.5s.  The effect of
this slow backoff is to use up a lot of your potential retries early in
a service interruption.  E.g., a 2 minute outage when you think the rtt
is 3s will cost you 9 of your 12 retries.  If the outage happens while
you were trying to retransmit, you probably won't survive it.

This is another area where we want to do some experiments.  It seems to
me that you want to back off aggressively early on, say 1 4 8 16 ...
for the first part of the table.  It also seems like you want to go
linear or constant at some point; waiting 8192*rtt for the 12th retry
has to be pointless.  The dynamic range depends to some extent on how
good your rtt estimator is and on how robust the retransmit part of
your tcp code is.  Also, based on some modelling of gateway congestion
that I did recently, you don't want the retransmit time to be
deterministic.  Our first cut here will probably look a lot like the
backoff on an ethernet (there's a sketch of the idea after point 5
below).

5. "keepalive" ignores rtt.

If you are setting SO_KEEPALIVE on any of your sockets, the connection
will be aborted if there's no inbound packet for 6 minutes
(TCPTV_MAXIDLE).  With a 2min rtt, that could happen in the worst case
with one dropped packet followed by one dropped ack.  ("Sendmail" sets
keepalive and we were having a lot of problems with this when we first
brought up 4.3.)  A fix is to multiply by t_srtt when setting the
keepalive timer and divide t_idle by t_srtt when comparing against
MAXIDLE.
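Here's the kind of backoff I was getting at in point 4.  The names are
made up and the exact early progression is a tuning question, so take
this as a sketch of the idea rather than a proposed diff; it's written
as an ordinary function so you can play with it outside the kernel:

        #include <stdlib.h>             /* for random() */

        #define BACKOFF_CAP     6       /* stop doubling after 2^6 = 64 */

        /*
         * Retransmit interval for the nretry'th retransmit of a
         * segment.  "base" is the usual 2*rtt value in timer ticks.
         * Double for the first few retries, then go constant, and add
         * up to 50% random slop so the connections that lost packets
         * together don't all retransmit at the same instant and
         * re-congest the gateway that dropped them.
         */
        unsigned long
        rexmt_interval(unsigned long base, int nretry)
        {
                unsigned long limit;

                if (nretry > BACKOFF_CAP)
                        nretry = BACKOFF_CAP;
                limit = base << nretry;    /* 1, 2, 4, ... 64 x 2*rtt */

                return limit + (unsigned long)(random() % (long)(limit / 2 + 1));
        }

The randomization is the important part; the doubling constant and the
cap are knobs to play with.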
6. The initial retransmit of a dropped segment happens, at best, after
3*rtt rather than 2*rtt.

If the delay is large compared to the window, the steady state traffic
looks like a burst of acks interleaved with data, an ~rtt delay, a
burst of acks interleaved with data, and repeat.  4.3 doesn't time
individual segments.  It starts a 2*rtt timer for the first segment,
then, when the first segment is acked, restarts the timer at 2*rtt to
time the next segment.  Since the 2nd segment went out at approximately
the same time as the first, and since the ack for the first segment
took rtt to come back, the retransmit time for the 2nd segment is
3*rtt.  In the usual internet case of 4k windows and an mss of 512, the
probability of a loss taking 3*rtt to detect is 7/8.

The situation is actually worse than this on lossy circuits.  Because
segments are not individually timed, all retransmits will be timed
2*rtt from the last successful transfer (i.e., the last ack that moved
snd_una).  This tends to add the time taken by previous retransmissions
into the retransmission time of the current segment, increasing the
mean rexmit time and, thus, lowering the average throughput.  On a link
with a 5% loss rate, for long transfers, I've measured the mean time to
retransmit a segment as ~10*rtt.

The preceding may not be clear without a picture (it sure took me a
long time to figure out what was going on) but I'll try to give an
example.  Say that the window is 4 segments, the rtt is R, you want to
ship segments A-G, and segments B and D are going to get dropped.  At
time zero you spit out A B C D.  At time R you get back the ack for A,
set the retransmit timer to go off at 3R ("now" + 2*rtt), and spit out
E.  At 3R the timer goes off and you retransmit B.  At 4R you get back
an ack for C, set the retransmit timer to go off at 6R, and transmit
F G.  At 6R the timer goes off and you retransmit D.  [D should have
been retransmitted at 2R.]  Even if we count the retransmit of B as
delaying everything by 2R (in what is essentially a congestion control
measure), there is an extra 2R added to D's retransmit because its
retransmit time is slaved to B's ack.  Also note that the average
throughput has gone from 8 packets in 2R (if no loss) to 8 packets in
7R, a factor of four degradation.

The obvious fix here is to time each segment.  Unfortunately, this
would add 14 bytes to a tcpcb, which would then no longer fit in an
mbuf.  So we're still trying to decide what to do.  It's (barely)
possible to live within the space limitations by, say, timing the first
and last segments and assuming the segments were generated at a uniform
rate.

7. The retransmit policy could be better.

In the preceding example, you might have wondered why F G were shipped
after the ack for C rather than D.  If I'd changed the example so that
C was dropped rather than D, C D E F would have been shipped when the
ack for B came in (unnecessarily resending D and E).  In either case
the behavior is "wrong".  It happens because an ack after a retransmit
is treated the same way as a normal ack: because of data that might be
in transit, you ignore what the ack tells you to send next and just use
it to open the window.  But because the ack after a retransmit comes
3*rtt after the last new data was injected, the two sides are
essentially in sync and the ack usually does tell you what to send
next.

It's pretty clear what the retransmit policy should be.  We haven't
even started looking into the details of implementing that policy in
tcp_input.c & tcp_output.c.  If a grad student would like a real
interesting project ...

------------

There's more but you're probably as tired of reading as I am of
writing.  If none of this helps and if you have any Sun-3s handy, I can
probably send you a copy of my tcp monitor (as long as our lawyers
don't find out).  This is something like "etherfind" except it prints
out timestamps and all the tcp protocol info.  You'll have to agree to
post anything interesting you find out, though...

Good luck.

 - Van
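p.s.  For the curious, the policy change in point 7 probably amounts to
only a few lines in tcp_input.c.  Something like the fragment below,
dropped in after the normal ack processing (the t_rexmt_recent flag is
made up and we haven't tried any of this, so take it as a sketch of the
idea rather than working code):

        /*
         * Hypothetical: t_rexmt_recent would be set whenever the last
         * send was a timeout retransmit.  After the normal ack
         * processing has advanced snd_una, the first ack following a
         * retransmit really does tell you what the other side needs
         * next, so send from there instead of just letting the ack
         * open the window.
         */
        if (tp->t_rexmt_recent) {
                tp->t_rexmt_recent = 0;
                if (SEQ_LT(tp->snd_una, tp->snd_max)) {
                        tp->snd_nxt = tp->snd_una;  /* back up to the hole */
                        (void) tcp_output(tp);
                }
        }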