Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ll-xn!mit-eddie!genrad!decvax!minow
From: minow@decvax.UUCP (Martin Minow)
Newsgroups: comp.lang.c,comp.std.internat
Subject: ANSI C -- trigraphs and character sets
Message-ID: <106@decvax.UUCP>
Date: Sun, 14-Dec-86 11:15:42 EST
Article-I.D.: decvax.106
Posted: Sun Dec 14 11:15:42 1986
Date-Received: Tue, 16-Dec-86 02:13:03 EST
Lines: 58
Xref: mnetor comp.lang.c:382 comp.std.internat:48

This is one of a collection of comments on the Draft Standard, posted
to comp.lang.c for discussion before I mail a final draft to the ANSI C
committee. Each message discusses one problem I have found with the
Draft Standard that I feel warrants a "no" vote. Note that this message
is my personal opinion, and does not reflect on the opinions of my
employer.

----
Problem:

Page 10, line 1ff. The Standard should recognize the primacy of the
ISO Latin 1 character set.

Page 10, line 34ff. Trigraphs should be deleted from the standard.

----
Motivation:

Page 10, line 1ff. The character set should be defined in terms of
ISO Latin 1 (ISO 8859/1, ANSI X3.134.1, ECMA-94). While other character
sets may be used, they should be defined with reference to this
standard. Latin 1 contains representations for the accented characters
needed for many European languages. These representations do not
conflict with the characters, such as backslash, that are needed for
C syntax.

The standard should permit the use of accented characters (positions
12/0 through 15/15) in variable names, while noting that such use may
be non-portable and without requiring a conforming compiler to support
it. It should also require acceptance of all 255 characters in strings.
(Some existing compilers use the 0x80 bit to mark variable substitution
in the preprocessor.)
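
The column/row notation used above maps to a code value as
16 * column + row, so 12/0 through 15/15 is 0xC0 through 0xFF. The
following is only a minimal sketch of how a compiler might admit those
positions in identifiers. The names latin1_pos and is_latin1_ident_char
are hypothetical and do not come from the Draft Standard. As a further
assumption, the sketch excludes the Latin 1 multiply sign (13/7) and
divide sign (15/7), which lie inside that range but are not letters.

```c
#include <assert.h>
#include <ctype.h>

/* Convert ISO column/row notation (for example 12/0) to a code value. */
static unsigned latin1_pos(unsigned column, unsigned row)
{
    assert(column < 16 && row < 16);
    return column * 16u + row;
}

/* Hypothetical predicate: may byte c appear in an identifier?
 * Assumes the source is Latin 1 and the "C" locale is in effect. */
static int is_latin1_ident_char(unsigned char c)
{
    if (c < 0x80)
        return isalnum(c) || c == '_';   /* ordinary 7-bit C rules */
    if (c == latin1_pos(13, 7) || c == latin1_pos(15, 7))
        return 0;                        /* multiply and divide signs */
    return c >= latin1_pos(12, 0) && c <= latin1_pos(15, 15);
}
```

Under these assumptions, is_latin1_ident_char(0xE9) (small e with
acute accent) is accepted, while backslash and the multiply sign are
rejected.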

A reasonable extension, but not one that I would mandate, would be to
accept the Latin 1 multiply and divide signs as equivalents of '*' and
'/', and the raised dot as equivalent to the period in numeric
quantities.

Page 10, line 34ff. Trigraphs were added to the standard in order to
accommodate European users who currently use the character set
positions occupied by # [ \ ] ^ { | } ~ for national letters. A better
solution is offered by the Latin 1 alphabet, which consists of the
USASCII 7-bit alphabet augmented by a 128-character set containing the
``special'' letters used by most European countries. This standard was
prepared jointly by ANSI, ISO, and CBEMA (the European business
equipment manufacturers).

During the transitional period, users of existing equipment that
supports national letters are better served by implementation-specific
conversion routines that are external to the C language. These would
compose multi-byte sequences into Latin 1 and display Latin 1
characters (using either the representations available on the terminal
or fallback composition sequences). The composition process would thus
be independent of the C language itself, although an implementation
may provide it by means of a #pragma.

Note that the standard does not offer the implementor guidance in
handling programs that mix trigraph sequences and national letters.
As stated, it is clear that the sequence `??/' functions as a
backslash. However, it is not clear how the compiler is to treat an
input character (assuming 7-bit ASCII) in position 5/12 (having decimal
value 92). Is this also a backslash, or is it a national letter (such
as the Swedish capital 'O' with two dots)?

----
Martin Minow
decvax!minow
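
To make the 5/12 ambiguity concrete, here is a minimal sketch of a
trigraph replacement pass over a line of source text. The function name
replace_trigraphs is hypothetical, and the sketch assumes the nine
trigraph sequences of the Draft Standard and a 7-bit ASCII host. It is
an illustration, not the behavior of any particular compiler.

```c
#include <stddef.h>
#include <string.h>

/* Copy src to dst, replacing each trigraph sequence with the single
 * character it stands for. dst must have room for strlen(src) + 1
 * bytes. Returns the length of the result. */
static size_t replace_trigraphs(const char *src, char *dst)
{
    /* Third character of each trigraph, and its replacement. */
    static const char third[] = "=(/)'<!>-";
    static const char repl[]  = "#[\\]^{|}~";
    size_t n = 0;

    while (*src != '\0') {
        const char *hit;
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0'
            && (hit = strchr(third, src[2])) != NULL) {
            dst[n++] = repl[hit - third];   /* three bytes become one */
            src += 3;
        } else {
            dst[n++] = *src++;              /* ordinary character */
        }
    }
    dst[n] = '\0';
    return n;
}
```

After this pass, the byte produced from the backslash trigraph and a
byte with value 92 that was typed directly are identical, so a later
phase of the compiler cannot tell whether the user meant a backslash
or a national letter.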