Path: utzoo!attcan!uunet!husc6!purdue!decwrl!tle.dec.com!rmeyers
From: rmeyers@tle.dec.com (Randy Meyers 381-2743 ZKO2-3/N30)
Newsgroups: comp.sys.amiga
Subject: Leo's ANSI C Flame
Message-ID: <8806292138.AA22025@decwrl.dec.com>
Date: 29 Jun 88 21:38:30 GMT
Organization: Digital Equipment Corporation
Lines: 301

Leo recently posted an admitted flame about certain optimizations in
Lattice C V4.0 and the general state of ANSI C.  I believe that Leo
may have been misinformed about some of these subjects.

Leo's posting began by complaining about an optimization in the latest
Lattice C compiler.  Lattice C V4.0 will compile x = strlen("abcdefg")
into an instruction that moves seven to x.

Leo complains:

>	You realize, of course, that this kind of optimization falls flat on
>its face if I somehow manage to change the contents of the memory that
>contains "abcdefg".  I could stuff a \0 where the 'd' is, and the program
>would not notice.

You are correct.  However, such a program is clearly poorly written.  Think,
Leo!  Would you really want to support such a program?  Would you be proud
that you had written it?  Do you think that it helps program clarity if
any constant in the program may have its value changed?

Kernighan and Ritchie never guaranteed that string constants were modifiable.
It was an accident of early implementations that string constants could
be modified, and a very few programmers came to rely on it (probably
again initially by accident).  Note that there is no reason to have
modifiable string constants in the language.  Any program that takes
advantage of modifiable string constants can be rewritten to use:

	static char modifiable[] = "abcdefg";

and take no extra time, no extra space, make it clear to the reader that
the value is not a guaranteed constant, and just be better written.

By the way, the ANSI standard does NOT require that string constants
be read only.  It says that "If a program attempts to modify a string
literal..., the results are undefined."  This is standard jargon for
saying some implementations may write-lock constants; some may not.
Your program isn't portable if it depends on this (mis-)feature.  If you
care, only buy compilers from people who agree with your opinions.

>Further, the type returned by strlen() is not *guaranteed* to be an int.
>I could have written one that returns a short; where would that leave you?

Under ANSI C (and Lattice, if it follows the ANSI rules on such things), the
above strlen optimization is only legal if the user is using the real strlen.
The way that the compiler "knows" you are using the real strlen as opposed
to some strlen that you wrote is by the declaration of strlen.  The rule
boils down to "if you got the magic definition of strlen from the proper
include file, the compiler is free to know lots of extra information
about the function and to perform additional optimizations.   If you
provide your own definition of strlen, the compiler must use it and
play dumb."

This magic that happens when you include the proper include file is
this:  ANSI permits a standard include file to contain macros that
are synonyms for standard functions in addition to the normal extern
declarations for the functions.  These macros might generate inline
code instead of calling a function (many pre-ANSI versions of C use
this trick for getc and getchar) or might call a builtin function
that the compiler supports as an extension.

For example, string.h might include:

extern int strlen(const char *);	/* Required by ANSI */
#define strlen(s) _STRLEN(s)		/* Optional, permitted by ANSI */

What the first line does is declare the strlen function that has
always been a part of C.  ANSI C requires every implementation of
C have in its library a function called strlen that does what you
expect.  (The only change here is ANSI C also specifies the
argument type of the function.)

The second line is optional.  It defines a macro to be expanded
when a normal call to strlen is found.  However, the macro does
not do its work by calling the library routine strlen; it uses
the compiler extension _STRLEN to do the work.  The _STRLEN can
either try and determine the result at compile-time, try and generate
code in-line to compute the result, or just give up and call
the library routine.

Note that this is pretty invisible to the programmer.  He only gets
the special _STRLEN function if he includes the proper .h file.  If
the programmer includes the file but does not want to take advantage
of the builtin _STRLEN, he can do:

	#ifdef strlen
	#undef strlen
	#endif

and not be bothered by it.  Even if he doesn't do the #undef, he can
call the library by writing the call as:

	x = (strlen)("abcdefg");

since macros that take arguments do not expand if the next token after
the macro name is not an open parenthesis.  A programmer can even take
the address of the function without worry because of the same rule:

	f = strlen;		/* get a pointer to strlen */

Note that the ANSI standard requires that a programmer be able to
avoid this fancy builtin stuff through the methods I stated.  Although,
in general, programmers need not try to get around this stuff.  Except
for very bizarre programs (like programs that assume that constants
aren't, but are compiled with compilers that assume constants are),
everything works the same.

Part of your argument is the assumption that a programmer is free
to provide his own versions of any standard routine.  This assumption
is in error.  It sometimes works, and it sometimes doesn't.  The ANSI
standard does not really change traditional practice here.

You can provide your own routines if you do not include the standard
header file declaring the function and you make your replacement a
static (non-global) routine.  This has always been true--ANSI doesn't
change it.

What the draft ANSI C standard says about making your own extern function
or variable with the same name as a standard one is "If a program defines
an external identifier with the same name as a reserved external identifier,
even in a semantically equivalent form, the behavior is undefined."  Again,
this is standards jargon saying that it may work, or it may not.  If you
care, only give your money to a compiler writer whose prejudices match
yours.

This is not a change in traditional C practice, although a lot of
misguided people think that this was formally permitted because it does
work much of the time.  Ok, let's assume that you want to write your own
version of strlen that returns a short (assume sizeof (short) is 2)
instead of the standard strlen, which returns unsigned int (assume sizeof
(unsigned int) is four).  You write some test programs; they all work
fine.  Now you write a program that uses your short strlen and calls
printf.  Guess what, unknown to you, the version of printf that comes
with your compiler calls strlen on string arguments in order to determine
the size of buffers it needs.  Assume that printf now gets horribly wrong
answers from your strlen because it picks up two bytes of garbage along
with the two bytes of result.

Maybe you luck out.  Maybe printf doesn't call strlen.  But you can
probably break just about EVERY C implementation by randomly changing
some of the library functions out from underneath it.  (Does printf
depend on puts? calloc? write? ferror? stdout? fprintf?)  Try
it on your favorite C implementation.  Call up the developer.  Tell
him what you find.  You'll probably get some reply like, "Gosh,
you're right.  If you want to rewrite puts, you should also rewrite
printf as well.  Have you looked into buying the source for the
library?  It will make your job easier."

The ANSI standard includes that bit about "semantically equivalent"
to cover two other facts of life.  First, you may think you have
provided a "plug-compatible" version of the routine, but failed in
some needed nuance.  For example, some implementations of malloc
have the property that if you allocate a chunk of memory, free it,
and reallocate it, the original data you stuffed into the memory
will still be there.  I have heard of code that makes use of this
"feature."  Suppose that your malloc doesn't do this, but your
compiler's version of printf requires it.  The other fact of life
is that sometimes several C library functions will end up in the same
module.  Assume that if the linker brings in calloc from the library,
the entry point for malloc is dragged in as well.  If you wanted
to replace malloc with your own routine, but wanted to use the
standard calloc, you will get multiple definitions of malloc
when you link.

All of the above is a fact of life today WITHOUT the ANSI Standard.  The
ANSI Standard actually improves the situation somewhat.  The ANSI
standard does "reserve" the traditional C library names, but it limits
the standard functions to depend only on other standard functions or
on names that begin with an underscore.

When I first got my Amiga and Lattice C V3.10, one of the first programs
I tried to build was Wecker's VT100.  It compiled and loaded without errors,
but it would die horribly just after starting.  I eventually tracked
down the bug.  The Lattice fopen function called another (new to V3.10)
Lattice function called dopen.  Wecker had a dopen function in his program
that did something entirely different.  When fopen called dopen, and
entered the Wecker version, not the Lattice version, the program would
die.

This is a problem that has always haunted C:  no one said you couldn't have
some standard library routine call some non-standard entry point.
The problem doesn't turn up too often because most standard library
functions can be written using only calls to other standard functions
or to system specific functions with really weird names (_WRITE,
SYS$QIO, $#%&*OUT...).  But occasionally the problem occurs.  Under
the ANSI standard, the problem is outlawed.  If Lattice C had been
standard conforming, the VT100 program would have worked.

So, the ANSI standard doesn't make the situation any worse when it
comes to you writing replacements for standard functions, and it
makes the situation better when it comes to making sure that
standard functions don't tromp all over your functions.

>	You further realize, of course, that no respectable programmer would
>ever write:
> 
>	strlen ("abcdefg");
> 
>	But would instead use (if he really *had* to):
> 
>	sizeof ("abcdefg") - 1;
> 
>	If the code is written by d*psh*ts, it is *not* the responsibility
>of the compiler vendor to save their butts.

Leo, write a macro that takes two arguments.  The first argument is
the name of a struct that has two members, len and ptr.  The
second argument to the macro is a pointer to a string.  The macro does
two things: it sets the len member to the length of the string
and the ptr member to the address of the string.  Here's my
answer:

	#define DESC(d, string) (d.len = strlen(string), d.ptr = string)

I actually had to use a similar macro recently.  Look at what happens
when I make a call of the form DESC(d, "abcdefg").  The point here is
that there is no such thing as an optimization for a d*psh*t case.
Experience has shown time and time again that optimizations for what
looks like stupid code are valuable.  Stupid code comes up because
people use macros, because the compiler itself may generate it, or
because powerful optimizations may reduce complex code to a simple
case.  For example:

	register char *p;

	p = "abcdefg";

	/* 100,000 lines of code that don't modify p */

	DESC(d, p);

A reasonably good compiler will prove that p's value has not been changed
since the initial assignment, and will transform the call into
DESC(d, "abcdefg"). With Lattice's strlen optimization, this will boil
down into two moves, instead of a function call and two moves.

>Bloated code is, by and large, the responsibility of the guy who *wrote*
>it.  And if the programmer in question doesn't realize this, then s/he
>has no business writing code for public consumption.
 
As shown above, bloated code is sometimes written by nobody--it just sort
of exists in the code written by the best of us.  If an automatic tool,
like an optimizing compiler, can get rid of it painlessly, it is a great
idea.

>'volatile' is a Good Thing.  Function prototypes are a Good Thing.

I agree.

>#pragma is of questionable value (largely because no one has adequately
>explained to me what it *does*!).

Simple:  pragma is a standard approved way to add extensions to the
language without adding new reserved words.  For example, Lattice uses
it in their standard headers in order to call ROM Kernel routines
directly without going through the stubs.  pragma is intrinsically
non-standard:  the ANSI standard states that it exists, mentions
some of the things that it can be used for, and leaves it alone.
Every compiler is free to develop pragmas and use any syntax that
they want after the word pragma.  A programmer who uses a pragma
should enclose it in #if--#endif:

	#if LATTICE
	#pragma Delete(R0,R1)  /* Means delete source file to MANX */
	#endif

I made up the above example.  Lattice's pragmas don't look that way
and MANX, as far as I know, doesn't have pragmas.

>Enforced parenthetical grouping whether or not it's necessary is Stupid.

Expression control is necessary, but I don't like enforced parentheses
either.  I preferred it when the new unary plus operator controlled
expression evaluation.  However, France threatened to veto the ISO
standard for C unless they got parentheses.  The enforcement only makes
a difference when doing floating point, one's complement math, or checking
for integer overflows.  Since most C implementations (and C programs) use
two's complement integer math with no overflow detection, it isn't a big
thing.

>Making string constants read-only is Stupid.

The ANSI standard doesn't.

>Breaking all the string functions and giving them cryptic names is Stupid.

I agree totally.  But, I don't think that has happened.  The traditional
functions with traditional meanings are around.  Send me mail with what
you think is specifically wrong.

To sum up:  There is a lot of misinformation about ANSI C.  If someone
has told you that all your code will break under ANSI C, either you
are a very poor programmer (and your code breaks every time you move it)
or you are being misinformed.  (The latter is very easy:  the ANSI standard
is written in formal style using certain conventions that make it hard
to decipher.  I have come across lots of misinformation about what
the standard says.)

----------------------------------------
Randy Meyers, not representing Digital Equipment Corporation
	USENET:	{decwrl|decvax|decuac}!tle.dec.com!rmeyers
	ARPA:	rmeyers%tle.dec.com@decwrl.dec.com