Date: Fri, 12-Oct-84 12:21:38 EDT
Posted: Fri Oct 12 12:21:38 1984
Date-Received: Tue, 16-Oct-84 05:37:20 EDT
Organization: Sun Microsystems, Inc.
Xref: sun net.lang.forth:132 net.micro.ti:43
> A faster way to handle returns from any particular word would be to use the
> return procedure built into the microprocessor. Rather than going through
> the bother of storing the pointers, the author argued that treating every
> word as a subroutine would provide faster execution. He then gave some
> comparisons between the usual pointer returns and the subroutine/macro
> routines that he advocated.
> Is there a significant performance improvement to be had by treating FORTH
> words as subroutines? Has anyone in netland considered this possibility?
> Are there presently some implementations of FORTH that already treat its
> words as subroutines (and uses the subroutine return native to the micro)?
This opens up a whole can of worms. There is no simple answer, so here's
the complicated one:
There are at least 5 different ways of "threading" that I know of. Every one
of these schemes has been used before.
1) Indirect threading (ITC) - compile a pointer to a pointer to the code to
execute. This is the common FIG-Forth way. It os compact an makes
it easy to implement DOES> words.
2) Direct Threaded Code (DTC) - compile a pointer to the code to execute.
This is frequently faster than (ITC), but DOES> words are not quite as
easy as with ITC.
3) Token Threaded Code (TTC) - Compile an index into a jump table of pointers
to the code to execute. This has the potential for being very compact,
and is also very easy to dynamically relocate. Usually a little slower
than either ITC or DTC, but not by much.
4) JSR Threaded Code (JTC) - compile a JSR instruction to the code to execute.
May be somewhat faster than 1,2, or 3, but see discussion below.
5) Native Code (NC) - attempt to compile machine code sequences for Forth
words. A good compromise is to compile the sequences for short CODE words
in-line, and to use JSR threading for references to longer words. This
is the way the Princeton VAX Forth works. This is the fastest scheme of
all, but depending on the processor, it may take up a lot of space.
In the case of the VAX, there is hardly any space penalty at all!
The 68000 has only a moderate space penalty; for most 8-bit micros
(with the possible exception of the 6809), the space penalty is severe.
The only way to decide which scheme is faster for a particular processor
it to write down the code to implement the inner interpreter for each
scheme, then count cycles. You MUST include the DOCOLON and EXIT routines
as part of the inner interpreter in order to make a fair comparison.
Ditto for an JMP instructions used to get to NEXT, if applicable.
It is usually okay to neglect to time the run-time code for VARIABLE,
CONSTANT, and DOES>, unless that code is horribly long.
Dynamic word frequencies I have measured for an ITC forth:
(Program was the outer interpreter compiling Forth code)
Relative frequencies of execution:
CODE words: 90%
COLON definitions: 8%
Other (VARIABLE, CONSTANT, DOES>): 2%
What this means is that saving one cycle on the execution time of
CODE words is worth about 10 cycles saved on COLON definitions.
It turns out that for the 68000, DTC is faster than JTC. The reason
is that DTC NEXT for the 68000 does MOVE (IP)+,A0 JMP (A0)
exactly once per CODE word, whereas JTC does JSR ... RTS
which involves pushing and popping a long word on the stack.
The bottom line is that you have to do a lot of work to get the
right answer for an particular processor.
Another thing to consider is the ease of implementing desireable
utilities like a decompiler. Some threading schemes are easier to
decompile than others. I personally find a decompiler to be a very
valuable tool. Other people get along quite well by having convenient
access to the source code. Since I frequently use Forth for debugging
peripheral hardware, in which case I don't have any mass storage on line,
I really use the decompiler a lot.