Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84 + RN 4.3; site inset.UUCP
Path: utzoo!watmath!clyde!cbosgd!ihnp4!qantel!dual!lll-crg!seismo!mcvax!ukc!stc!inset!mikeb
From: mikeb@inset.UUCP (Mike Banahan)
Newsgroups: net.misc
Subject: Re: Character sets, sorting etc.
Message-ID: <780@inset.UUCP>
Date: Mon, 4-Nov-85 05:31:39 EST
Article-I.D.: inset.780
Posted: Mon Nov 4 05:31:39 1985
Date-Received: Sun, 10-Nov-85 08:51:49 EST
References: <150@oberon.UUCP>
Reply-To: mikeb@inset.UUCP (Mike Banahan)
Organization: The Instruction Set Ltd., London, UK.
Lines: 34
Xpath: stc stc-a

In article <150@oberon.UUCP> blarson@oberon.UUCP (Bob Larson) writes:
>Sorting order in ASCII really isn't correct either. Do you like all of your
>upper case words coming before your lower case ones? The sorting order
>problem is really one of replacing a case translator with a table lookup.
>Hopefully the table could be made easy to change for working in different
>languages.

How right you are Bob! There's lots to it as well. The sorting problem
is going to be a famous one - UNIX hackers have sort of got used (sorry
about the pun) to making do with ASCII sorting order, but it's completely
unacceptable in a number of environments.

The current proposals for ISO 8859 mean that only English gets even a
poor sorting order from the character encoding - for the other languages
that it is meant to support, such as French, the Scandinavian languages
and so on, it's a non-starter. A whole bunch of accented and further
alphabetic characters are found in the ``top'' 128 character positions,
with absolutely no correlation to their expected sorting position.

Some languages compound this by not being very sure about just what
their collating sequence is: see the item posted by Jaap Akkerhuis which
points out that in Dutch, depending on which of 3 more or less official
alphabets you choose, there may or may not be a ``y''. If there is, it
sorts the same as the character PAIR ``ij''.
So the algorithms can't even work on a character-by-character basis.

Also, my spies tell me that in French, when two words are compared,
accents are ignored unless the words are otherwise identical, in which
case further rules are used to separate the two. Fun stuff, isn't it?
It's going to take some fancy table-driven stuff to make sense of all this!

As for ranges in regular expressions ... I would love to hear how
to make sense of them.
--
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb
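
A minimal sketch of the kind of table-driven comparison discussed above,
in Python. Everything in it is illustrative: the tables, the weights and
the names (collation_key, PRIMARY, ACCENTS, DUTCH_DIGRAPHS) are
hypothetical, not taken from any standard. It builds a three-level key:
letters first, then accents, then case. The Dutch table treats the pair
``ij'' as equal to ``y'', as in the example above. The accent tie-break
here runs left to right, which is an assumption; the real French rules
may well differ.

```python
import string

# Hypothetical primary weights: one per letter, ignoring case and accents.
PRIMARY = {ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)}

# Hypothetical accent table: accented letter -> (base letter, accent rank).
# Rank 0 is reserved for "no accent"; the other ranks are arbitrary.
ACCENTS = {
    "é": ("e", 1), "è": ("e", 2), "ê": ("e", 3),
    "à": ("a", 2), "ô": ("o", 3), "ç": ("c", 4),
}

# Assumed Dutch table: the PAIR "ij" gets the same primary weight as "y".
DUTCH_DIGRAPHS = {"ij": PRIMARY["y"]}


def collation_key(word, digraphs=None):
    """Return a (letters, accents, case) key for use with sorted()."""
    digraphs = {} if digraphs is None else digraphs
    primary, secondary, tertiary = [], [], []
    i = 0
    while i < len(word):
        # Try a two-character collating element before a single character.
        pair = word[i:i + 2].lower()
        if pair in digraphs:
            primary.append(digraphs[pair])
            secondary.append(0)
            tertiary.append(0)
            i += 2
            continue
        ch = word[i]
        lower = ch.lower()
        base, accent = ACCENTS.get(lower, (lower, 0))
        if base in PRIMARY:  # characters with no table entry are skipped
            primary.append(PRIMARY[base])
            secondary.append(accent)
            tertiary.append(1 if ch != lower else 0)  # lower case sorts first
        i += 1
    # Tuples compare level by level: accents only matter when the letters
    # are identical, and case only matters when the accents are too.
    return (tuple(primary), tuple(secondary), tuple(tertiary))


words = ["Zebra", "apple", "Apple", "banana"]
print(sorted(words))                     # plain code-point order
print(sorted(words, key=collation_key))  # table-driven order
```

Passing DUTCH_DIGRAPHS as the second argument makes ``ijs'' and ``ys''
produce the same key; without it they differ. Swapping in a different
table is all it takes to change language, which is the point of doing
this by lookup rather than by character code.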