Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rochester!cornell!uw-beaver!mit-eddie!husc6!think!ames!ucbcad!ucbvax!TOPAZ.RUTGERS.EDU!hedrick
From: hedrick@TOPAZ.RUTGERS.EDU (Charles Hedrick)
Newsgroups: comp.protocols.tcp-ip
Subject: Ethernet meltdowns
Message-ID: <8707081440.AA00679@topaz.rutgers.edu>
Date: Wed, 8-Jul-87 10:40:32 EDT
Article-I.D.: topaz.8707081440.AA00679
Posted: Wed Jul  8 10:40:32 1987
Date-Received: Sat, 11-Jul-87 06:59:46 EDT
Sender: daemon@ucbvax.BERKELEY.EDU
Distribution: world
Organization: The ARPA Internet
Lines: 179

During the last week or so we have run into several oddities on our
Ethernets that I thought might interest this group.  Nothing that will
surprise any veterans, but sometimes war stories are useful to people
trying to figure out what is going on with their own net.

For several months, we have been having mysterious software problems
on one Ethernet.  This is our "miscellaneous" network.  No diskless
Suns.  Several Unix timesharing systems, a few VMS machines, a DEC-20,
and some Xerox Interlisp-D machines.  The problems:
  - every week or so, all of our Bridge terminal servers crashed.
	When it happened, they all crashed at the same time.
  - fairly rarely, a Celerity Unix system would run out of mbufs.
  - a Kinetics Ethernet/Appletalk gateway running the kip 
	code would hang or crash (not sure which) every few days.

We sent a dump of the Bridge crash to Bridge.  Celerity wouldn't talk
to us because we made a few changes to the kernel.  Kinetics swapped
hardware for us, so we knew it wasn't hardware, but we still haven't
figured out how to debug the problem.  (The author of the software
suspects the Ethernet device driver, but it's going to take us months
to learn enough about the infamous Intel Ethernet chip to find a
subtle device-level problem.  Typical known problem: packet sizes that
are a multiple of 18 bytes hang the hardware when the phase of the
moon is wrong.  How's a bunch of poor Unix hackers gonna debug a
system where the critical chip has a 1/4 inch thick bug list, which we
don't have a copy of?)  Anyway, Bridge finally came back with a
response that unfortunately I have only second-hand: "We got a very
high rate of packets from two different Ethernet addresses each
claiming to be the same Internet address.  This shouldn't cause us
problems, but does.  We found the problem, and it will be fixed in the
next release."  They gave us the two Ethernet addresses and the
Internet address.  Two Celerities were claiming to be some other
machine.  So we break out our trusty copy of etherfind.  (This is a
Sun utility that lets you look at packets.  There's a fairly general
way of specifying which ones you want to see, and they will decode the
source, destination, and protocol types for IP.  We've got lots of
Ethernet debugging tools, but this is by far the most useful for this
kind of problem.)  It turns out that the Celerities have the infamous
bug that causes them to get the addresses wrong in ICMP error
messages.  Before proceeding with the war story, let me list the 
classic 4.2 bugs that lead to network problems:

1) Somebody sends to a broadcast address that you don't understand.
There are 6 possible broadcast addresses.  For a subnetted network
128.6.4, they are 255.255.255.255 and 128.6.4.255 (the correct ones by
current standards), 128.6.255.255 (for machines that don't know about
subnetting), and the corresponding ones for machines that use the old
standards: 0.0.0.0, 128.6.4.0, and 128.6.0.0 (all six are collected in
a code sketch after item 3 below).  We have enough of a
combination of software versions that there is no one broadcast
address that all of our machines understand.  So suppose somebody
sends to 128.6.4.255.  Our 4.2 machines, which expect 0.0.0.0 or
128.6.0.0, see this as an attempt to connect to host 255 on the local
subnet.  Since IP forwarding is on by default, they helpfully decide
to forward it.  Thus they issue ARP requests for the address
128.6.4.255.  Presumably nobody responds.  So the net effect is that
each broadcast results in every 4.2 machine on the Ethernet issuing an
ARP request, all at the same time.  This causes massive collisions,
and also every machine has to look at all those ARP requests and
throw them away.  This will tend to cause a momentary pause in normal
processing.

2) Same scenario, but somebody has turned off ipforwarding on all the
4.2 machines.  Alas, this simply causes all the 4.2 machines to issue
ICMP unreachable messages back to the original sender.  This still
results in massive collisions, but at least this time only one machine
(the one that sent the broadcast) has to process the fallout.  That's
if everything works.  Unfortunately, some 4.2 versions have an error
in setting up the headers for the error message.  They forget to
reverse the source and destination, as I recall (there's a sketch of
the intended swap after item 3 below).

3) Somebody sends a broadcast UDP packet, e.g. routed routing
information.  Hosts that are not running routed (or whatever) attempt
to send back ICMP port unreachable.  They are supposed to avoid
doing this for broadcasts, but the test for broadcastedness in udp_usrreq 
doesn't agree with the one in ip_input, so for certain broadcast
addresses, every machine on the network that isn't running the
appropriate daemon will send back an ICMP error.  Again, lots of
collisions.  If you have a few gateways running routed, but most
hosts not running it, you'll have network interference every 30
sec.  Then again, there are those machines where the ICMP messages
have the wrong source and destination address.
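
To make bug 1 concrete, here is a little stand-alone sketch (not
kernel code; every name in it is invented for the illustration) that
just enumerates the six broadcast forms for the subnetted network
128.6.4 used in the example above:

/*
 * Sketch only: the six addresses that some host somewhere will treat
 * as "broadcast" on subnet 128.6.4 (netmask 255.255.255.0).  A 4.2
 * host that recognizes only the old-style forms will try to forward
 * the new-style ones, which is what sets off the ARP flood described
 * in bug 1.
 */
#include <stdio.h>

static unsigned long broadcast_forms[] = {
    0xffffffffUL,       /* 255.255.255.255  limited broadcast (current)  */
    0x800604ffUL,       /* 128.6.4.255      subnet broadcast  (current)  */
    0x8006ffffUL,       /* 128.6.255.255    net broadcast, no subnetting */
    0x00000000UL,       /* 0.0.0.0          old-style limited broadcast  */
    0x80060400UL,       /* 128.6.4.0        old-style subnet broadcast   */
    0x80060000UL,       /* 128.6.0.0        old-style net broadcast      */
};

/* A host is only safe if its "is this a broadcast?" test catches all six. */
static int is_some_broadcast(unsigned long addr)
{
    int i;
    for (i = 0; i < 6; i++)
        if (addr == broadcast_forms[i])
            return 1;
    return 0;
}

int main(void)
{
    printf("128.6.4.255 -> %d\n", is_some_broadcast(0x800604ffUL));  /* 1 */
    printf("128.6.4.17  -> %d\n", is_some_broadcast(0x80060411UL));  /* 0 */
    return 0;
}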
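
And for bug 2, the swap that the broken 4.2 versions forget; again a
sketch with made-up names, not the real icmp_error():

/*
 * Sketch only: when a host reflects an ICMP error, the offending
 * packet's source must become the error's destination, and the
 * error's source must be one of the host's own addresses, never the
 * (possibly broadcast) address the offending packet was sent to.  The
 * struct and function here are stand-ins, not the 4.2 structures.
 */
struct ip_addrs {
    unsigned long ip_src;
    unsigned long ip_dst;
};

static void reflect_icmp_error(const struct ip_addrs *bad,
                               struct ip_addrs *err,
                               unsigned long our_addr)
{
    err->ip_dst = bad->ip_src;   /* send the error back to the sender  */
    err->ip_src = our_addr;      /* and stamp it with our own address, */
                                 /* not the broadcast it arrived on    */
}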

Now back to the war story.  The case I actually saw with etherfind was
caused by routed broadcasts.  Our 2 Celerities would each respond with
ICMP port unreachable.  Unfortunately, they have the bug that caused
the IP addresses in the ICMP error message to be wrong.  I think it
ended up sending packets with source address == the machine that had
sent the routed broadcast, and destination == the broadcast address.
This would explain why our Bridge terminal servers were seeing packets
from two different Ethernet addresses, both claiming to be some other
machine's Internet address.  We had certainly been seeing spotty
network response, and as
far as I can see, it went away when we fixed these problems.  As far
as we know, the Bridge terminal servers and Kinetics gateways have
both stopped crashing, and the Celerities have stopped losing mbufs.
What we suspect is that some obscure case came up that created a
problem more serious than the one we saw with etherfind.  Note that
one of the failure modes is that certain broadcasts can lead to error
messages sent to the broadcast address.  We haven't analysed the code
carefully enough to be sure exactly what conditions trigger it, but we
suspect that the two machines may have gotten into an infinite loop of
error messages.  Since the messages would be broadcasts, everyone on
the network would see them.  This is generally called a "broadcast
storm".  The best guess is that both the Bridge and Kinetics crashes
were caused by subtle bugs in their low-level code that surface under
very heavy broadcast loads.  Probably the Celerity "mbuf leak" is
something similar.  Unfortunately, without a record of the packets on
the network at the exact time of failure, it is impossible to be sure
what was going on.  But Bridge's crash analysis seems to indicate a
broadcast storm involving the Celerities.

The fix to this is to make sure every one of your 4.2 systems has
been made safe in the following fashion (rough C sketches of these
changes follow the list):

 - turn off ipforwarding, in ip_input

 - in the routine ip_forward (in ip_input), very near the beginning
	of the routine, there is a test with lots of conditions,
	that ends up throwing away the packet and exiting.  Add
	"! ipforwarding || " to the beginning of the test.

 - in udp_usrreq, a few pages into the routine, in_pcblookup is
	called to see whether there is a port for the UDP packet.
	If not (it returns NULL), normally icmp_error is called
	to send port unreachable.  However there is a test to
	see whether the packet was sent as a broadcast.  If so,
	it is simply discarded.  That test must agree with the
	test for broadcastedness in ip_input.  This seems to
	differ in various implementations, so I can't tell you
	the code to use.  One common bug is to forget that
	ip_input recognizes 255.255.255.255 as a broadcast
	address.  It normally does this in a completely different
	place than it tests for other broadcast addresses.
	So you may be able to add something like
	"ui->ui_dst.s_addr == -1 || " to the test in udp_usrreq.

These apply to 4.2.  4.3 probably doesn't need them all, and may not
need any of them.

Now for the second war story.  Our computer center recently bought a
few diskless Suns for staff use.  Until then, all diskless Suns had
been on separate Ethernets separated from our other Ethernets by
carefully-designed IP gateways.  However the computer center figured
that a small number of these things wasn't going to kill their
network, so they connected them to their main Ethernet.  On it is a
VAXcluster (2 8650's), a few 780's, some terminal servers and other
random stuff, and level 2 bridges (Applitek broadband Ethernet bridges
and Ungerman-Bass remote bridges) to more or less everywhere else on
campus.  Since they were still setting up the configuration, it isn't
surprising that a diskless Sun 3/50 got turned on before its server
was properly configured to respond.  Nobody thought anything of this.
We first discovered there were problems when we got a call from
somebody in a building half a mile away that his VAX was suddenly not
doing any useful work.  Then we got a call from our branch in Newark
saying the same thing about their VAXes.  Then someone noticed that
the cluster was suddenly very slow.  Well, it turns out that the Suns
were sitting there sending out requests for their server to boot them.
These were broadcast TFTP requests.  Unfortunately, they used a new
broadcast address, which the Wollongong VMS code doesn't understand.
So VMS attempted to forward them.  This means that it issued an ARP request
for the broadcast address.  There is some problem in the Wollongong
TCP that we don't quite understand yet.  It seems that whenever there
are lots of requests to talk to a host that doesn't respond to ARP's,
the whole CPU ends up being used up in issuing ARP's.  For example,
when something goes wrong with our IBM-compatible mainframe (which is
used to handle most of the printer output for the cluster, using Unix
lpd implementations on both systems), the VAX cluster becomes unusable.
As far as we can tell, it is spending all of its time trying to ARP
the mainframe.  In this case, the same phenomenon was triggered
by the attempt to forward broadcast packets.  Since our VMS systems
mostly sit on networks that are connected by level 2 bridges instead
of real IP gateways, broadcasts go throughout the whole campus, and
essentially every VMS system is brought to its knees.  Unfortunately,
there is no way we can fix this.  The Sun broadcast is being issued
by its boot ROM, which is the one piece of software we aren't equipped
to change, and we don't have source to the Wollongong code.  So the
solution for the moment is to put the Suns on a subnet that is
safely isolated behind an IP gateway.  This fixes the problem, because
IP gateways don't pass broadcasts, or they only pass very carefully
selected ones.
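
For what it's worth, the check that avoids this particular ARP
meltdown is tiny.  Without source to the Wollongong code we obviously
can't install anything like it, so treat this purely as a sketch with
invented names:

/*
 * Sketch only: before ARPing for an IP destination, notice that it is
 * a broadcast address and map it straight onto the Ethernet broadcast
 * address (or drop it), instead of queueing ARP requests that nobody
 * will ever answer.
 */
#include <string.h>

static const unsigned char ether_bcast[6] =
    { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

/* Returns 1 if the caller should skip ARP entirely, 0 otherwise. */
static int resolve_without_arp(unsigned long ip_dst, int dst_is_broadcast,
                               unsigned char ether_dst[6])
{
    if (dst_is_broadcast || ip_dst == 0xffffffffUL) {
        memcpy(ether_dst, ether_bcast, 6);   /* no ARP request on the wire */
        return 1;
    }
    return 0;                                /* fall through to normal ARP */
}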