Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site brl-tgr.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!allegra!bellcore!decvax!genrad!panda!talcott!harvard!seismo!brl-tgr!tgr!Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA
From: Jacob_Palme_QZ%QZCOM.MAILNET@MIT-MULTICS.ARPA
Newsgroups: net.mail.headers
Subject: Checksum as a replacement for missing Message-ID.
Message-ID: <9052@brl-tgr.ARPA>
Date: Thu, 7-Mar-85 18:44:37 EST
Article-I.D.: brl-tgr.9052
Posted: Thu Mar  7 18:44:37 1985
Date-Received: Sun, 10-Mar-85 07:19:49 EST
Sender: news@brl-tgr.ARPA
Lines: 92

Checksum as a replacement for missing Message-ID.
------------------------------------------------

The Message-ID is a very useful field for many purposes:

(a) To preserve In-reply-to references between transferred messages.

(b) To stop loops by not accepting the same message to the same
recipient more than once.

(c) To be able to identify that several copies of the same message
are the same, which will save disk space and provide better user
functionality in some systems.

The problem is that many messages do not have any Message-ID-s.

I am planning to modify the COM network mail interface to generate
Message-ID-s for messages which lack such ID-s. These generated ID-s
will be used internally in COM and will be affixed to a message if it
is sent out to the networks again, e.g. by a conference/mailing list
residing on a COM system.

The Message-ID should uniquely identify one message, so that all copies
of the same message will get the same Message-ID. Thus, if two systems
independently generate a Message-ID for a message, they should produce
the same ID. To achieve this goal, I suggest generating the Message-ID
as a checksum of the message.

For the same reason, the ID should *not* refer to the host name of the
message system generating the ID, unless that is the system where the
message originated. I therefore propose to generate
Message-ID-s of the format  where the host
name in RFC822 is replaced by the word "CHECKSUM". This will tell
receiving systems that this is a CHECKSUM-ed ID, so that they can
identify it with other CHECKSUM-ed ID-s.

The alternative would be to produce ID-s in the format . However, it does not seem nice to generate ID-s
purporting to come from a host which did not in reality generate this
ID.

Selection of CHECKSUM algorithm:
-------------------------------

The algorithm should uniquely identify a message with very low
probability of different messages getting the same ID. On the other
hand, the checksum should not change for common modifications to a
message, like additions of new recipients in the RFC822 header,
different line foldings or conversions of TAB-s to SPACE-s.

The following algorithm is proposed:

The CHECKSUM contains 15 characters, in three groups of five
characters. The first group is computed from the name in the FROM
field, the second group from the value in the DATE field, and the
third group from the textual contents of the message.

Each group should have a checksum algorithm suitable for that group.

For the FROM field, I suggest the following:

(a) Use only the value of the "addr-spec" part of the FROM field
(delete the "phrase" part and the <>-s, if any).

(b) Upcase the characters A-Z before checksum computation.

(c) Only include characters A-Z and digits 0-9 in checksum computation.

(d) Compute the checksum by summation of the characters, with the
weight 1 for the first character, 2 for the second, 4 for the third
etc. up to 2^15 for the sixteenth character, then 1 again for the
seventeenth etc.

(e) Take the remainder of the checksum modulo 2^24. Translate this
remainder to five characters in a base-32 number system with the
digits 0..9, A..V.

Rationale: This checksum should be easy to compute on any computer
with 32-bit word integer arithmetic.
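
As an illustration only, steps (a) to (e) might be sketched in Python as
follows. Several details are assumptions and not part of the proposal:
the value of a character is taken to be its ASCII code, the weights are
taken to repeat with a period of sixteen characters, the five base-32
digits are written most significant first, and the helper names
(weighted_sum, to_base32, from_checksum) are hypothetical.

```python
import string
from email.utils import parseaddr

ALLOWED = set(string.ascii_uppercase + string.digits)
BASE32_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def weighted_sum(chars):
    # Step (d): weights 1, 2, 4, ... repeating every sixteen characters.
    # Using the ASCII code as the value of a character is an assumption.
    return sum(ord(c) << (i % 16) for i, c in enumerate(chars))


def to_base32(n):
    # Step (e): reduce modulo 2^24, then emit five base-32 digits,
    # most significant first (the digit order is an assumption).
    n %= 2 ** 24
    digits = []
    for _ in range(5):
        digits.append(BASE32_DIGITS[n % 32])
        n //= 32
    return "".join(reversed(digits))


def from_checksum(from_field):
    # Step (a): keep only the addr-spec part of the FROM field.
    _, addr_spec = parseaddr(from_field)
    # Steps (b) and (c): upcase, then keep only A-Z and 0-9.
    kept = [c for c in addr_spec.upper() if c in ALLOWED]
    return to_base32(weighted_sum(kept))
```

With this sketch, from_checksum("Jane Doe <jane@example.com>") and
from_checksum("JANE@EXAMPLE.COM") give the same five characters, since
the phrase part, the case and the punctuation are all disregarded.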

For the DATE field, I suggest as checksum the number
(((YEAR+SECOND+MONTH)*31+DAY)*24+HOUR)*60+MINUTE
This number is again translated to a five character string as
described in (e) above.
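
A possible Python sketch of the DATE group is given below. It assumes
a four-digit year, takes the date fields exactly as written in the
header (no time zone normalization), and writes the base-32 digits most
significant first. None of these points is settled by the proposal, and
the function names are hypothetical.

```python
BASE32_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def to_base32(n):
    # As in (e): remainder modulo 2^24, written as five base-32 digits.
    n %= 2 ** 24
    digits = []
    for _ in range(5):
        digits.append(BASE32_DIGITS[n % 32])
        n //= 32
    return "".join(reversed(digits))


def date_checksum(year, month, day, hour, minute, second):
    # The formula from the text, term by term.
    n = (((year + second + month) * 31 + day) * 24 + hour) * 60 + minute
    return to_base32(n)


# The Date field of this article: Thu, 7-Mar-85 18:44:37,
# with the year assumed to be written as 1985.
print(date_checksum(1985, 3, 7, 18, 44, 37))
```

Note that the number can exceed 2^24 (it does for the date above), so
the modulo reduction of stage (e) is needed before the conversion.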

For the contents of the message, all non-printable characters, including
tab and space, should be disregarded when computing the checksum. The
checksum is computed using the algorithm in stages (d) and (e) described
above (but not stages (a), (b) and (c)). Rationale: Disregarding all
non-printable characters, including tab and space, is necessary to
ensure that line folding will not change the checksum.
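
The contents group could be sketched in Python like this. "Printable"
is assumed here to mean the ASCII characters "!" through "~", the value
of a character is assumed to be its ASCII code, and the weights are
assumed to repeat every sixteen characters. The function names are
hypothetical.

```python
BASE32_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def weighted_sum(chars):
    # Stage (d): weights 1, 2, 4, ... repeating every sixteen characters.
    return sum(ord(c) << (i % 16) for i, c in enumerate(chars))


def to_base32(n):
    # Stage (e): remainder modulo 2^24 as five base-32 digits.
    n %= 2 ** 24
    digits = []
    for _ in range(5):
        digits.append(BASE32_DIGITS[n % 32])
        n //= 32
    return "".join(reversed(digits))


def body_checksum(text):
    # Drop space, tab, line ends and every other character outside
    # the assumed printable range. No case folding is done here,
    # because stages (a)-(c) do not apply to the contents.
    kept = [c for c in text if "!" <= c <= "~"]
    return to_base32(weighted_sum(kept))
```

In this sketch body_checksum("hello world") and
body_checksum("hello\n\tworld") are equal, so refolding a line or
converting a TAB to SPACE-s leaves the group unchanged, while a change
of case in the text does alter it.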