Xref: utzoo comp.arch:6063 comp.lang.prolog:1180
Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!cornell!uw-beaver!teknowledge-vaxc!sri-unix!quintus!ok
From: ok@quintus.uucp (Richard A. O'Keefe)
Newsgroups: comp.arch,comp.lang.prolog
Subject: Re: Perils of comparison -- an example
Message-ID: <292@quintus.UUCP>
Date: 14 Aug 88 20:41:30 GMT
References: <282@quintus.UUCP> <15221@shemp.CS.UCLA.EDU>
Sender: news@quintus.UUCP
Reply-To: ok@quintus.UUCP (Richard A. O'Keefe)
Organization: Quintus Computer Systems, Inc.
Lines: 41

In article <15221@shemp.CS.UCLA.EDU> casey@cs.ucla.edu.UUCP (Casey Leedom) writes:
>In article <282@quintus.UUCP> ok@quintus () writes:
>> 
>> ... kLI/s are defined solely by that particular benchmark, by the way.
>> Other benchmarks may be "procedure calls per second", but _only_ Naive
>> Reverse gives "logical instructions".
>
>  I believe "kLI/s" is 1000's of Logical Inferences per second (but I may
>be wrong of course).  This is normally abbreviated as kLIPS.  Really fast
>PROLOG machines are rated in MLIPS (10^6 LIPS).

Right, it is "logical _inferences_ per second".  Silly me.

There is a single specific benchmark, called Naive Reverse, which reverses
a 30-element list and in doing so performs 496 procedure calls.  To
determine the kLI/s rating, you run this benchmark N times, for some
large N.  If those N runs take T seconds, you report (496*N)/T as the
LIPS rating.
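
For concreteness, here is the whole thing in Prolog (a sketch, not a
blessed version: the 30-element list is where 496 comes from, namely
31 calls to nrev/2 plus 465 calls to append/3, and the statistics/2
timing calls are as in Quintus Prolog, so adjust them for your system):

    nrev([], []).
    nrev([H|T], R) :- nrev(T, RT), append(RT, [H], R).

    append([], L, L).
    append([H|T], L, [H|R]) :- append(T, L, R).

    list30([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
            16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]).

    % Time N repetitions and report (496*N)/T.
    bench(N, LIPS) :-
        statistics(runtime, [T0|_]),    % CPU milliseconds
        run(N),
        statistics(runtime, [T1|_]),
        T is (T1 - T0) / 1000.0,        % seconds
        LIPS is (496 * N) / T.

    run(0).
    run(N) :- N > 0, list30(L), nrev(L, _), M is N - 1, run(M).

So if, say, N = 10000 repetitions take 50 seconds, you report
496*10000/50 = 99200 LIPS, i.e. about 99 kLIPS.  (A careful measurement
would also subtract the cost of the run/1 loop itself, but that is the
shape of it.)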

When you are benchmarking, it is necessary to be precise about what you
have measured.  Some people have taken any old small program and
reported the number of procedure calls it did per second as LIPS.  It
simply won't *DO*! Procedures can have different numbers of arguments,
and the cost of head unification can range from next to nothing to
exponential in the size of the arguments.
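
A two-line illustration (made-up predicates): a call to either of these
counts as one procedure call, but the heads cost very different amounts
to unify:

    cheap(_, _, _).                % never looks at its arguments
    deep([a,b,c|T], T, f(g(h))).   % walks a list prefix and a nested term

Counting each as one "logical inference" tells you nothing about how
long they take.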

Don't get me wrong:  Naive Reverse is not an especially good benchmark.
(Think about the fact that native code for it fits comfortably into a
68020's on-chip instruction cache...) But using *different* benchmarks
when talking about different machines can't yield better comparisons!

There is a more comprehensive set of micro-benchmarks which was described
in AI Expert last year.  Instead of a single LI/s rating, it would be
better to report an "AIE spectrum".  But even the best micro-benchmarks
don't always predict the performance of real programs well, for reasons
explained in the Smalltalk books, amongst others.

One of the things which makes the DLM article credible is that it reports
figures for several other (small) benchmarks (I surmise that "quickstart"
really meant "quicksort").  I have seen enough papers that report really
high performance where the system described seems never to have run
anything _but_ Naive Reverse.  At least the DLM is realer than that!