Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!mit-eddie!genrad!decvax!decwrl!sun!amdcad!amd!pesnta!peora!ucf-cs!novavax!houligan!dave@murphy.UUCP
From: dave@murphy.UUCP
Newsgroups: comp.lang.c
Subject: is it really necessary for character values to be positive?
Message-ID: <39@houligan.UUCP>
Date: Tue, 16-Dec-86 13:29:33 EST
Article-I.D.: houligan.39
Posted: Tue Dec 16 13:29:33 1986
Date-Received: Fri, 19-Dec-86 02:00:06 EST
Organization: Gould Electronics, Ft. Lauderdale, Florida.
Lines: 63
Summary: invent an 8-bit character set and then let some of them be negative
Line eater: fully conforming

I've been thinking about this business with long chars and short chars
and trigraphs and international character sets and such, and I've got a
proposal.

The proposal is this: if someone can come up with an 8-bit character set
that contains all of the necessary characters for the Western languages
(and includes the existing USASCII set as a subset), then let's drop the
requirement that a member of a machine's "natural" character set be
represented as a positive number in a plain char. This will have the
following benefits:

1. Everyone can adopt a character set that will have all of the
   characters that they need, and not have to overload any of the
   USASCII set with other characters. Portability of programs and other
   text files will benefit greatly, and trigraphs will be unnecessary.
   (For many languages, there aren't enough punctuation characters to
   overmap; for example, I think that it takes 17 characters to
   represent all of the possible letter-and-accent combinations in
   French, and that's just for lower case.)

2. The character set will fit into almost everyone's byte size, meaning
   no dramatic increase in the size of text files. (Nearly everyone
   uses at least an 8-bit byte with UN*X; the only exceptions that I
   can think of are the PDP10/20's, which can use 7-bit bytes.)

3. It won't be necessary to raise sizeof(char) from 1. This means that
   programs that use chars for things other than text (yes, there are a
   *lot* of them) won't be disturbed.

4. Each implementation can continue using the signedness for char that
   best fits the architecture. It won't be necessary to force all plain
   chars to unsigned.

The disadvantages that I can see are these:

1. Since some of the char values may be negative, it will not be
   possible to collate chars by simply comparing their values; you have
   to call a collating routine defined for the particular
   implementation. (But some languages don't collate in strict
   alphabetic order, so you'll wind up doing this with any
   international character set.)

2. You will have to use functions to do things like converting a letter
   to upper or lower case; just masking off bits won't work anymore.

3. Some terminals already use the codes > 127 for other purposes. There
   is no easy answer to this problem.

4. The value 255 can't be used as a character code, because when it is
   stored in a signed char it may look like EOF on some systems.

In short, it doesn't look to me like there is any good reason to require
characters to be represented as positive values. Or have I overlooked
something really basic?

---
"I used to be able to sing the blues, but now I have too much money."
-- Bruce Dickinson

Dave Cornutt, Gould Computer Systems, Ft. Lauderdale, FL
UUCP: ...!{sun,pur-ee,brl-bmd,bcopen}!gould!dcornutt
 or ...!{ucf-cs,allegra,codas}!novavax!houligan!dcornutt
ARPA: dcornutt@gswd-vms.arpa (I'm not sure how well this works)

"The opinions expressed herein are not necessarily those of my employer,
not necessarily mine, and probably not necessary."
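
For anyone who wants to poke at the disadvantages listed above, here is a rough C sketch of three of them: the 255-versus-EOF collision, case conversion through a function instead of bit masking, and collation through a library routine. It is only an illustration under assumptions, not part of the proposal itself: the helper names are made up for this example, it assumes 8-bit chars, and the result of the first function depends on whether plain char is signed on your implementation and on the value of EOF.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: nonzero if the byte value 255, once stored in a
 * plain char, compares equal to EOF. This can happen where plain char is
 * signed (the conversion is implementation-defined, commonly yielding -1)
 * and EOF is -1. Where plain char is unsigned it returns 0. */
int byte_255_looks_like_eof(void)
{
    char c = (char)0xFF;
    return c == EOF;
}

/* Keeping the value in an int, the way getchar() hands it back, keeps
 * 255 distinct from EOF, because EOF is required to be negative. */
int int_255_looks_like_eof(void)
{
    int c = 0xFF;
    return c == EOF;
}

/* Hypothetical helper: upper-case a string in place by calling toupper()
 * rather than masking off a bit. The cast to unsigned char matters when
 * plain char is signed: passing a negative value other than EOF to
 * toupper() is undefined. */
void upcase(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}

/* Hypothetical helper: compare two strings with the implementation's
 * collating routine instead of comparing raw char values. In the default
 * "C" locale strcoll() orders strings the same way strcmp() does. */
int compare_text(const char *a, const char *b)
{
    return strcoll(a, b);
}
```

With these, a caller would read input into an int, test it against EOF first, and only then narrow it to a char; upcase() and compare_text() stand in for the "call a function" approach that disadvantages 1 and 2 describe.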