Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84 + RN 4.3; site inset.UUCP
Path: utzoo!watmath!clyde!cbosgd!ihnp4!qantel!dual!lll-crg!seismo!mcvax!ukc!stc!inset!mikeb
From: mikeb@inset.UUCP (Mike Banahan)
Newsgroups: net.misc
Subject: Re: Character sets, sorting etc.
Message-ID: <780@inset.UUCP>
Date: Mon, 4-Nov-85 05:31:39 EST
Article-I.D.: inset.780
Posted: Mon Nov  4 05:31:39 1985
Date-Received: Sun, 10-Nov-85 08:51:49 EST
References: <150@oberon.UUCP>
Reply-To: mikeb@inset.UUCP (Mike Banahan)
Organization: The Instruction Set Ltd., London, UK.
Lines: 34
Xpath: stc stc-a

In article <150@oberon.UUCP> blarson@oberon.UUCP (Bob Larson) writes:
>Sorting order in ASCII really isn't correct either.  Do you like all of your
>upper case words coming before your lower case ones?  The sorting order
>problem is really one of replacing a case translator with a table lookup.
>Hopefully the table could be made easy to change for working in different
>languages.

How right you are, Bob! There's lots to it as well. The sorting problem is going
to be a famous one - UNIX hackers have sort of got used (sorry about the pun)
to making do with ASCII sorting order, but it's completely unacceptable in a
number of environments. The current proposals for ISO 8859 mean that only
English gets even a poor sorting order from the character encoding - for the
other languages that it is meant to support, such as French, Scandinavian
and so on, it's a non-starter. A whole bunch of accented and further
alphabetic characters are found in the ``top'' 128 character positions,
with absolutely no correlation to their expected sorting position.
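
A minimal sketch of the problem, in Python for brevity. It sorts a few
words first by their ISO 8859-1 code positions and then through a small
lookup table. The table (RANK), the word list and the function names are
all made up for illustration; a real table would need far more entries and
more than one level of comparison.

```python
# Hypothetical collation table: every letter gets a rank, and case or
# accent variants share the rank of their base letter.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}
RANK.update({ch.upper(): i for i, ch in enumerate(ALPHABET)})
for accented, base in {"é": "e", "è": "e", "ê": "e", "à": "a", "ç": "c"}.items():
    RANK[accented] = RANK[base]


def code_position_key(word):
    # Sort key that follows the raw ISO 8859-1 byte values.
    return word.encode("latin-1")


def table_key(word):
    # Sort key that looks each character up in the collation table.
    return [RANK[ch] for ch in word]


WORDS = ["zebra", "été", "eclair", "Apple", "apple"]
by_code_position = sorted(WORDS, key=code_position_key)
by_table = sorted(WORDS, key=table_key)

print(by_code_position)  # upper case first, accented word after "zebra"
print(by_table)          # accented word lands among the other e-words
```

With the code-position key, "été" ends up after "zebra" because the
accented letter lives in the top 128 positions; the table lookup puts it
next to "eclair".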

Some languages confound this by not being very sure about just what
their collating sequence is: see the item posted by Jaap Akkerhuis which
points out that in Dutch, depending on which of 3 more or less official
alphabets you choose, there may or may not be a ``y''. If there is,
it sorts the same as the character PAIR ``ij''. So the algorithms can't
even work on a character-by-character basis. Also, my spies tell me that
in French, when two words are compared, accents are ignored unless the
words are identical without them, in which case further rules are used to
separate the two.
Fun stuff, isn't it?
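
Both quirks can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the function names are invented,
the Dutch key simply treats the pair ``ij'' as the single letter ``y'',
and the French tie-break (plain code order of the accented spelling) is a
placeholder, not the real rule.

```python
import unicodedata


def collating_elements(word):
    # Split a word into sorting units, treating the PAIR "ij" as one
    # unit that sorts the same as "y" (assumed Dutch convention).
    word = word.lower()
    units = []
    i = 0
    while i < len(word):
        if word[i:i + 2] == "ij":
            units.append("y")
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def strip_accents(word):
    # Decompose each character and drop the combining accent marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def french_key(word):
    # First level: the word with accents ignored.
    # Second level: only consulted when the first level ties; using the
    # accented spelling itself is a placeholder for the real rules.
    return (strip_accents(word).lower(), word)


dutch_sorted = sorted(["zon", "ijs", "jas", "xylofoon"], key=collating_elements)
french_sorted = sorted(["côte", "coter", "cote", "coté"], key=french_key)

print(dutch_sorted)
print(french_sorted)
```

Note that neither key can be produced by translating one character at a
time: the Dutch one consumes two characters per unit, and the French one
needs a second pass over the whole word.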

It's going to take some fancy table-driven stuff to make sense of all this!

As for ranges in Regular Expressions ..... I would love to hear how to
make sense of them.
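
To show where the trouble lies, here is a small Python sketch. A range
such as [a-z] is normally read as a span of code positions, so accented
letters fall outside it. One possible reading, and it is only an
assumption, is to expand the range through a collating sequence instead.
COLLATION_ORDER, expand_range and range_class are hypothetical names, and
the order itself is invented for the example.

```python
import re

# Hypothetical collating sequence with accented letters placed next to
# their base letters. A real one would depend on the language.
COLLATION_ORDER = "aàâbcçdeéèêfghiîjklmnoôpqrstuùûvwxyz"


def expand_range(low, high):
    # All letters from low to high inclusive, in collation order.
    start = COLLATION_ORDER.index(low)
    end = COLLATION_ORDER.index(high)
    return COLLATION_ORDER[start:end + 1]


def range_class(low, high):
    # Build an explicit character class equivalent to the collated range.
    return "[" + expand_range(low, high) + "]"


code_position_match = re.fullmatch("[a-z]+", "été")
collated_match = re.fullmatch(range_class("a", "z") + "+", "été")

print(code_position_match)     # None: the accented letters are outside a-z
print(collated_match is not None)
print(expand_range("a", "e"))
```

Even this leaves questions open: with the order above, the range from a to
e stops just short of the accented forms of e, which may or may not be
what anyone wants.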
-- 
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb