Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site brl-tgr.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!allegra!bellcore!decvax!genrad!panda!talcott!harvard!seismo!brl-tgr!tgr!Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA
From: Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA
Newsgroups: net.mail.headers
Subject: Checksum as a replacement for missing Message-ID.
Message-ID: <9052@brl-tgr.ARPA>
Date: Thu, 7-Mar-85 18:44:37 EST
Article-I.D.: brl-tgr.9052
Posted: Thu Mar 7 18:44:37 1985
Date-Received: Sun, 10-Mar-85 07:19:49 EST
Sender: news@brl-tgr.ARPA
Lines: 92

Checksum as a replacement for missing Message-ID.
------------------------------------------------

The Message-ID is a very useful field for many purposes:

(a) To preserve In-reply-to references between transferred
    messages.

(b) To stop loops by not accepting the same message to the same
    recipient more than once.

(c) To be able to identify that several copies of the same message
    are the same, which will save disk space and provide better
    user functionality in some systems.

The problem is that many messages do not have any Message-ID-s.

I am planning to modify the COM network mail interface to generate
Message-ID-s for messages which lack such ID-s. These generated ID-s
will be used internally in COM and will be affixed to a message if it
is sent out to the networks again, e.g. by a conference/mailing list
residing on a COM system.

The Message-ID should uniquely identify one message, so that all
copies of the same message will get the same Message-ID. Thus, if two
systems independently generate a Message-ID for a message, they should
produce the same Message-ID. To achieve this goal, I suggest
generating the Message-ID as a checksum of the message.

If two systems independently generate a Message-ID for a message,
they should preferably produce the same ID.
Thus, the ID should *not* refer to the host name of the message system
generating the ID, if this is not the system where the message
originated.

Thus, I propose to generate Message-ID-s in the RFC822 format, but
with the host name replaced by the word "CHECKSUM". This will tell
receiving systems that this is a CHECKSUM-ed ID, so that they can
identify it with other CHECKSUM-ed ID-s.

The alternative would be to produce ID-s in the format . However, it
does not seem nice to generate ID-s purporting to come from a host
which did not in reality generate this ID.

Selection of CHECKSUM algorithm:
-------------------------------

The algorithm should uniquely identify a message, with a very low
probability of different messages getting the same ID. On the other
hand, the checksum should not change for common modifications to a
message, like additions of new recipients in the RFC822 header,
different line foldings or conversions of TAB-s to SPACE-s.

The following algorithm is proposed:

The CHECKSUM contains 15 characters, in three groups of five
characters. The first group is computed from the name in the FROM
field, the second group from the value in the DATE field, and the
third group from the textual contents of the message.

Each group should have a checksum algorithm suitable for that group.

For the FROM field, I suggest the following:

(a) Use only the value of the "addr-spec" part of the FROM field
    (delete the "phrase" part and the <>-s, if any).

(b) Upcase the characters A-Z before checksum computation.

(c) Only include characters A-Z and digits 0-9 in checksum
    computation.

(d) Compute the checksum by summation of the characters, with the
    weight 1 for the first character, 2 for the second, 4 for the
    third etc. up to 2^15 for the sixteenth character, then 1 again
    for the seventeenth etc.

(e) Take the remainder of the checksum modulo 2^24. Translate this
    remainder to five characters in a 32-based number system with
    the digits 0..9, A..V.
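
Steps (a) to (e) can be sketched in Python as follows. This is only an
illustration under stated assumptions, not a reference implementation:
the function names are made up for the example, the "value" of a
character is taken to be its ASCII code (the proposal does not say which
character code to use), the weights are assumed to repeat with period 16
(1, 2, 4, ..., 2^15, then 1 again), and the addr-spec extraction handles
only the simple "phrase <addr-spec>" and bare addr-spec forms.

```python
import re

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # 32-based digits 0..9, A..V
ALNUM = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def to_base32(value, width=5):
    """Render a non-negative integer as `width` digits, most significant first."""
    out = []
    for _ in range(width):
        out.append(DIGITS[value % 32])
        value //= 32
    return "".join(reversed(out))


def weighted_sum(text):
    """Steps (d) and (e): weights 1, 2, 4, ... restarting every 16 characters."""
    total = 0
    for i, ch in enumerate(text):
        total += ord(ch) << (i % 16)  # using the ASCII code is an assumption
    return total % (1 << 24)


def from_group(from_field):
    """Five-character group for the FROM field (hypothetical helper)."""
    # (a) keep the addr-spec: the text inside <...> if present, else the field
    match = re.search(r"<([^<>]*)>", from_field)
    addr = match.group(1) if match else from_field
    # (b) upcase, (c) keep only A-Z and 0-9
    cleaned = "".join(c for c in addr.upper() if c in ALNUM)
    return to_base32(weighted_sum(cleaned))
```

With this sketch, from_group("Jacob Palme <jp@qz.example>") and
from_group("JP@QZ.EXAMPLE") give the same group, since the phrase, the
case and the punctuation are all discarded before summation (the address
is invented for the example).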
Rationale: This checksum should be easy to compute on any computer
with 32-bit integer arithmetic.

For the DATE field, I suggest as checksum the number

    (((YEAR+SECOND+MONTH)*31+DAY)*24+HOUR)*60+MINUTE

This number is again translated to a five-character string as
described in (e) above.

For the contents of the message, all non-printable characters,
including tab and space, should be disregarded when computing the
checksum. The checksum is computed using the algorithm in stages (d)
and (e) described above (but not stages (a), (b) and (c)).

Rationale: Disregarding all non-printable characters, including tab
and space, is necessary to ensure that line folding will not change
the checksum.
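
The DATE and contents groups can be sketched the same way. Again this is
a hedged illustration, not a definitive implementation: the function
names are invented, the date formula is used exactly as written above
with a four-digit year assumed, "printable" is taken to mean ASCII codes
33 to 126, the character value is assumed to be the ASCII code, and the
weights are assumed to repeat with period 16.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # 32-based digits 0..9, A..V


def to_base32(value, width=5):
    """Render a non-negative integer as `width` digits, most significant first."""
    out = []
    for _ in range(width):
        out.append(DIGITS[value % 32])
        value //= 32
    return "".join(reversed(out))


def date_group(year, month, day, hour, minute, second):
    """Five-character group for the DATE field (hypothetical helper)."""
    # Formula as given in the proposal; a four-digit year is an assumption.
    n = (((year + second + month) * 31 + day) * 24 + hour) * 60 + minute
    return to_base32(n % (1 << 24))  # stage (e)


def body_group(body):
    """Five-character group for the message contents (hypothetical helper)."""
    # Drop everything outside printable ASCII 33..126, so tabs, spaces and
    # line breaks do not influence the result.
    kept = [c for c in body if 33 <= ord(c) <= 126]
    # Stage (d): weights 1, 2, 4, ... restarting every 16 characters.
    total = sum(ord(c) << (i % 16) for i, c in enumerate(kept))
    return to_base32(total % (1 << 24))  # stage (e)
```

Because whitespace is removed before the weights are assigned,
body_group("a b\tc\n") and body_group("abc") yield the same group, which
is the property the rationale above asks for.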