Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!ima!minya!jc
From: jc@minya.UUCP
Newsgroups: comp.mail.misc
Subject: Summary of mail-damage survey.
Message-ID: <425@minya.UUCP>
Date: Sat, 5-Dec-87 09:14:05 EST
Article-I.D.: minya.425
Posted: Sat Dec 5 09:14:05 1987
Date-Received: Thu, 10-Dec-87 02:02:14 EST
Organization: home
Lines: 113

Hello again.  A week or so back I requested info on cases of damage
to files by various mailers.  I got a lot of requests for a summary.
Since the responses have died down, it's about time to summarize.

First, I'd like to thank all the folks who sent entries.  Some of the
mailers out there are truly demented!  I got a lot of yuks from some
of the contributions.

As was to be expected, I got a good collection of flames telling me
why the things I listed were proper.  I don't care about that.  The
point is to make a list of the kinds of damage that might be done.
A lot of us are faced with "How can I get this document to so-and-so
over on that machine?"  The idea is to get it there, undamaged, in
whatever funny format the word processor uses.  Format translation is
interesting, but it's an unrelated problem.  This can't be handled
correctly by mailers, anyway, so the ideal situation would be mailers
that forward files undamaged.  The bits that go in should be the bits
that come out the other end.  We aren't very close to that ideal.

To solve the problems with mailers, it is necessary to run some sort
of encoding program (such as uuencode) on the source file, and then
run the inverse decoding program at the receiving end.  In order to
write such programs, it is helpful if we know just what sort of damage
is possible (not likely, not acceptable, not standard, but possible)
from intervening mailers.

Anyhow, off the soapbox and on to the list.  Here's what I have now:

1. Occurrences of the string "\nFrom " have '>' inserted before the
   'F'.  This is from the uucp mailer.

2.
   If the string "\n.\n" occurs, the tail end of the file (starting
   at the '.') is discarded.  Some mailers try to prevent this by
   converting the offending string to "\n..\n".  Both uucp mail and
   sendmail are guilty of this one.

3. High-order bits are turned off (or set to parity or randomized).
   This is usually the fault of a serial-port interface.

4. Null bytes are dropped.  Also, strings between a null and the next
   CR or NL may be dropped.  This often happens as a side-effect of
   the "standard" null-terminated string representation in C.

5. If a backspace occurs, it and the preceding character are deleted.
   This is also usually due to a serial-port interface.

6. ASCII tabs are expanded to some number of spaces.  This may be done
   by just about any piece of hardware or software in the path.

7. Spaces and tabs may be replaced by a compressed space count.

8. Trailing spaces may be deleted from message lines, or added to make
   lines a multiple of some number (usually 4 or 6).  This includes
   padding null lines (which are illegal on some systems).

9. Truncation or wrapping of long lines.  For instance, BITNET mail is
   80 column "PUNCH" files sent to a virtual card reader.

10. Silent discarding (or truncating) of mail which is "too long".
    Sendmail has a limit (configurable) of message size, which is
    usually something like 100K.  Uucp truncates files to 32K on some
    16-bit machines.  The mail system on [one system] has a limit of
    200 lines.

11. Some mailers add a ^M (CR) to the end of every line; others delete
    ^M before ^J (LF) or wherever it is found.  This is part of the
    religious debate about whether "lines" should be separated by LF
    or by CR/LF.  Sometimes this conversion is actually done by the
    low-level serial port interface.

12. Control chars other than CR, LF, FF, and TAB converted to ?.

There was also the interesting comment:

| And don't forget the worst damage of all - ASCII/EBCDIC translation!
| Since there's no one-to-one mapping, and different sites use different
| translation tables, there's no way you can know what the mail will look
| like when it gets through. Most commonly caught characters are characters
| in ASCII range 5B-5F and 7B-7F. And, of course, tabs are expanded to
| spaces and formfeeds are usually lost....

Another writer listed the characters most likely to be corrupted as:

	{}~`[]|^\"

This one is especially interesting, because it invalidates the
uuencode program.  This encoding produces characters in the specified
ranges, and thus uuencoded files may be garbled as they pass through
EBCDIC machines.

It would be interesting to learn just what characters (i.e., hex
values) can be safely transferred through ASCII/EBCDIC interfaces.
An encoding scheme like uuencode could be written using translation
tables, if there are 64 character codes that can be guaranteed
reliable in all ASCII/EBCDIC interfaces.

Can people out there with EBCDIC systems give me some information
about how their translation tables work?  Are there 64 codes that can
be trusted to any ASCII/EBCDIC translators, and will come out the same
when fed to any other EBCDIC/ASCII translator?

To end with a bit of levity:

==> Mailers that let "From:" addresses like "user@host.UUCP",
    "host!user", or "user@host.BITNET" escape on to the Internet
    without fixing the address (e.g., "user@host.UUCP" becomes
    "user%host.UUCP@gateway.do.main").

==> Prepending "host!" to the From: lines of mail passing through the
    site and going out through UUCP.

Maybe I'm being weird, but I really can't see any end user getting
very excited about such things.

-- 
John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)
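
P.S.  For anyone who wants to experiment with the table-driven scheme
asked about above, here is a minimal sketch in Python.  The 64-character
table (letters, digits, '+' and '-') is purely an assumption made for
illustration; whether all of those codes survive every ASCII/EBCDIC
translator is exactly the open question.  The names ALPHABET, encode and
decode are hypothetical as well.  The sketch packs 3 bytes into 4 table
characters, puts the true byte count on the first line, and ignores any
character outside the table when decoding.

```python
# Hypothetical 64-character table. That every ASCII/EBCDIC gateway
# preserves these codes is an assumption, not an established fact.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+-"
)
assert len(ALPHABET) == 64
DECODE = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(data: bytes, width: int = 60) -> str:
    """Encode bytes as short lines of table characters, 6 bits each."""
    padded = data + b"\0" * (-len(data) % 3)   # pad to a multiple of 3
    chars = []
    for i in range(0, len(padded), 3):
        n = int.from_bytes(padded[i:i + 3], "big")
        chars.extend(ALPHABET[(n >> s) & 0x3F] for s in (18, 12, 6, 0))
    body = "".join(chars)
    lines = [body[i:i + width] for i in range(0, len(body), width)]
    # The first line carries the true length so padding can be stripped.
    return "\n".join([str(len(data))] + lines) + "\n"


def decode(text: str) -> bytes:
    """Invert encode(), skipping characters that are not in the table."""
    header, _, rest = text.partition("\n")
    length = int(header.strip())
    symbols = [DECODE[ch] for ch in rest if ch in DECODE]
    out = bytearray()
    for i in range(0, len(symbols) // 4 * 4, 4):
        a, b, c, d = symbols[i:i + 4]
        out += ((a << 18) | (b << 12) | (c << 6) | d).to_bytes(3, "big")
    return bytes(out[:length])
```

Because the output contains no spaces, tabs, dots, control characters
or high-order bits, and the lines are short, items 1 through 9, 11 and
12 of the list should not bite; added CRs and trailing blanks are
skipped on decode.  It does nothing about item 10, and it is only as
good as the assumed table.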