Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!ima!minya!jc
From: jc@minya.UUCP
Newsgroups: comp.mail.misc
Subject: Summary of mail-damage survey.
Message-ID: <425@minya.UUCP>
Date: Sat, 5-Dec-87 09:14:05 EST
Article-I.D.: minya.425
Posted: Sat Dec  5 09:14:05 1987
Date-Received: Thu, 10-Dec-87 02:02:14 EST
Organization: home
Lines: 113

Hello again.  A week or so back I requested info on cases of damage
to files by various mailers.  I got a lot of requests for a summary.
Since the responses have died down, it's about time to summarize.

First, I'd like to thank all the folks who sent entries.  Some of
the mailers out there are truly demented!  I got a lot of yuks from
some of the contributions.

As was to be expected, I got a good collection of flames telling me
why the things I listed were proper.  I don't care about that.  The
point is to make a list of the kinds of damage that might be done.

A lot of us are faced with "How can I get this document to so-and-so 
over on that machine?"  The idea is to get it there, undamaged, in
whatever funny format the word processor uses.  Format translation
is interesting, but it's an unrelated problem, and one that mailers
can't handle correctly anyway.  So the ideal situation would be mailers
that forward files undamaged.  The bits that go in should be the bits
that come out the other end.  We aren't very close to that ideal.

To solve the problems with mailers, it is necessary to run some sort
of encoding program (such as uuencode) on the source file, and then
run the inverse decoding program at the receiving end.  In order to
write such programs, it is helpful if we know just what sort of damage
is possible (not likely, not acceptable, not standard, but possible)
from intervening mailers.
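
For concreteness, here is a minimal sketch of the encode/decode round trip,
using the uuencode line format as implemented by Python's binascii module.
The helper names uu_encode and uu_decode are mine, and the "begin"/"end"
framing lines that the real uuencode program writes are left out.

```python
import binascii

def uu_encode(data):
    """Return uuencoded body lines, 45 input bytes per line."""
    return [binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45)]

def uu_decode(lines):
    """Inverse of uu_encode: turn the body lines back into the original bytes."""
    return b"".join(binascii.a2b_uu(line) for line in lines)

# Every byte value, including NUL, backspace and the 8-bit codes.
payload = bytes(range(256))
lines = uu_encode(payload)
assert uu_decode(lines) == payload
```

The encoded lines hold only printable characters, which is the whole point,
but that is not enough if a mailer on the path rewrites those characters too.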

Anyhow, off the soapbox and on to the list.  Here's what I have now:

	1.	Occurrences of the string "\nFrom " have '>' inserted before
		the 'F'.  This is from the uucp mailer.

	2.	If the string "\n.\n" occurs, the tail end of the file (starting
		at the '.') is discarded.  Some mailers try to prevent this by
		converting the offending string to "\n..\n".  Both uucp mail and
		sendmail are guilty of this one.

	3.	High-order bits are turned off (or set to parity or randomized). 
		This is usually the fault of a serial-port interface.

	4.	Null bytes are dropped.  Also, strings between a null and the next
		CR or NL may be dropped. This often happens as a side-effect of the 
		"standard" null-terminated string representation in C.

	5.	If a backspace occurs, it and the preceding character are deleted.
		This is also usually due to a serial-port interface.

	6.	ASCII tabs are expanded to some number of spaces.  This may be
		done by just about any piece of hardware or software in the path.

	7.	Spaces and tabs may be replaced by a compressed space count.

	8.	Trailing spaces may be deleted from message lines, or added to
		make lines a multiple of some number (usually 4 or 6).  This
		includes padding null lines (which are illegal on some systems).

	9.	Truncation or wrapping of long lines.  For instance, BITNET mail
		is 80 column "PUNCH" files sent to a virtual card reader.

	10.	Silent discarding (or truncating) of mail which is "too long".
		SendMail has a limit (configurable) of message size, which is
		usually something like 100K.  Uucp truncates files to 32K on
		some 16-bit machines. The mail system on [one system] has a
		limit of 200 lines.

	11.	Some mailers add a ^M (CR) to the end of every line; others
		delete ^M before ^J (LF) or wherever it is found.  This is 
		part of the religious debate about whether "lines" should
		be separated by LF or by CR/LF.  Sometimes this conversion
		is actually done by the low-level serial port interface.

	12.	Control chars other than CR, LF, FF, and TAB may be converted
		to '?'.
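
A few of these can be simulated in a handful of lines, which is handy for
checking whether a given file would get through.  This is only a sketch: the
function names and the FILTERS table are mine, each function is a simplified
guess at the behavior described in the numbered item, and real mailers will
differ in the details and may apply several of these in a row.

```python
import re

def escape_from(msg):           # item 1: '>' inserted before "From "
    return msg.replace(b"\nFrom ", b"\n>From ")

def cut_at_lone_dot(msg):       # item 2: everything from the lone '.' is lost
    i = msg.find(b"\n.\n")
    return msg if i < 0 else msg[:i + 1]

def strip_high_bit(msg):        # item 3: high-order bits turned off
    return bytes(b & 0x7F for b in msg)

def drop_nulls(msg):            # item 4: null bytes dropped
    return msg.replace(b"\x00", b"")

def expand_tabs(msg, width=8):  # item 6: tabs expanded (width is an assumption)
    return msg.expandtabs(width)

def strip_trailing_blanks(msg): # item 8: trailing spaces deleted
    return re.sub(rb"[ \t]+$", b"", msg, flags=re.M)

def strip_cr(msg):              # item 11: ^M deleted before ^J
    return msg.replace(b"\r\n", b"\n")

FILTERS = {
    "From-escaping": escape_from,
    "dot truncation": cut_at_lone_dot,
    "high-bit stripping": strip_high_bit,
    "null dropping": drop_nulls,
    "tab expansion": expand_tabs,
    "trailing-blank stripping": strip_trailing_blanks,
    "CR stripping": strip_cr,
}

def damaged_by(msg):
    """Names of the simulated filters that would alter msg."""
    return [name for name, f in FILTERS.items() if f(msg) != msg]

print(damaged_by(b"plain text\nnothing odd\n"))
print(damaged_by(b"Hi\nFrom here\ttab \n.\nlost\x00\xff\r\n"))
```

An encoder's output should come back with an empty list from damaged_by;
anything else means some mailer on the path could still mangle it.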

There was also the interesting comment:

| And don't forget the worst damage of all - ASCII/EBCDIC translation!
| Since there's no one-to-one mapping, and different sites use different
| translation tables, there's no way you can know what the mail will look
| like when it gets through.  Most commonly caught characters are characters
| in ASCII range 5B-5F and 7B-7F.  And, of course, tabs are expanded to
| spaces and formfeeds are usually lost....  

Another writer listed the characters most likely to be corrupted as:
	{}~`[]|^\"

This one is especially interesting, because it invalidates the uuencode
program. That encoding produces characters in the first of these ranges
(5B-5F), and thus uuencoded files may be garbled as they pass through
EBCDIC machines.
It would be interesting to learn just what characters (i.e., hex values)
can be safely transferred through ASCII/EBCDIC interfaces.  An encoding
scheme like uuencode could be written using translation tables, if there
are 64 character codes that can be guaranteed reliable in all ASCII/EBCDIC
interfaces.
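
To show what I mean by an encoding driven by a translation table, here is a
rough sketch.  The 64-character ALPHABET below is purely an assumption for
illustration (digits, letters, '+' and '-'); whether those codes really
survive every ASCII/EBCDIC interface is exactly what I don't know.  The
names encode, decode, ALPHABET and DECODE are mine, and a real tool would
also have to break the output into short lines.

```python
# ASSUMPTION: illustrative alphabet only, not a verified "safe" set.
ALPHABET = b"+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
assert len(set(ALPHABET)) == 64
DECODE = {c: i for i, c in enumerate(ALPHABET)}

def encode(data):
    """Map each group of 3 bytes (24 bits) to 4 table characters."""
    out = bytearray()
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        n = int.from_bytes(chunk.ljust(3, b"\0"), "big")
        quad = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # A short final chunk needs fewer characters: 1 byte -> 2, 2 bytes -> 3.
        out += bytes(quad[:len(chunk) + 1])
    return bytes(out)

def decode(text):
    """Inverse of encode: 4 table characters back to 3 bytes."""
    out = bytearray()
    for i in range(0, len(text), 4):
        quad = text[i:i + 4]
        n = 0
        for c in quad:
            n = (n << 6) | DECODE[c]
        n <<= 6 * (4 - len(quad))        # pad a short final group with zero bits
        out += n.to_bytes(3, "big")[:len(quad) - 1]
    return bytes(out)

payload = bytes(range(256))
assert decode(encode(payload)) == payload
```

Swapping in a different table is a one-line change, so the scheme can follow
whatever set of 64 codes turns out to be trustworthy.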

Can people out there with EBCDIC systems give me some information about
how their translation tables work?  Are there 64 codes that can be trusted
to any ASCII/EBCDIC translator, and that will come out the same when fed
back through any other EBCDIC/ASCII translator?
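
One rough way to poke at the question is to push every printable ASCII
character through one table and back through a different one.  The sketch
below uses cp037 and cp500, two EBCDIC code pages that ship with Python's
codecs, as stand-ins for "different sites' tables".  That choice is an
assumption on my part; the tables actually in use at gateways may disagree
in more places.  The function name survivors is mine.

```python
import string

def survivors(to_ebcdic, from_ebcdic):
    """Printable ASCII characters that come back unchanged after being
    encoded with one EBCDIC table and decoded with another."""
    printable = string.digits + string.ascii_letters + string.punctuation + " "
    return "".join(ch for ch in printable
                   if ch.encode(to_ebcdic).decode(from_ebcdic) == ch)

same = survivors("cp037", "cp037")    # matched tables: nothing should be lost
mixed = survivors("cp037", "cp500")   # mismatched tables
lost = sorted(set(same) - set(mixed))
print(len(mixed), "characters survive; lost:", "".join(lost))
```

Running the same check over the tables from real sites would give a much
better answer than this pair can.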

To end with a bit of levity:

	==> Mailers that let "From:" addresses like "user@host.UUCP",
		"host!user", or "user@host.BITNET" escape on to the Internet
		without fixing the address (e.g., "user@host.UUCP" becomes
		"user%host.UUCP@gateway.do.main").
	==> Prepending "host!" to the From: lines of mail passing
		through the site and going out through UUCP.
  
Maybe I'm being weird, but I really can't see any end user getting 
very excited about such things.

-- 
John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)