Path: utzoo!utgpu!water!watmath!clyde!att!rutgers!ucsd!ucsdhub!hp-sdd!hplabs!hp-sde!hpcea!hpausla!cjh
From: cjh@hpausla.HP.COM (Clifford Heath)
Newsgroups: comp.arch
Subject: Re: block copy & VAX MOVC (was Re: Explanation, please!)
Message-ID: <2220003@hpausla.HP.COM>
Date: 26 Sep 88 07:35:17 GMT
References: 
Organization: HP Australian Software Operation
Lines: 35

I played with Duff's device on an HP 9000/850 (RISC machine), and got
some interesting results.  Duff's is faster than the comparable
non-unrolled loop, but only by about 20-30%.  memcpy was heaps faster,
so I looked at the (memcpy) assembly code using a debugger.  As a result
of this I changed the unrolling factor in Duff's to 4 (not much change),
changed the auto-increment pointer addressing to short offset indexing (using
a pointer adjustment before the loop and a single increment before the
while) and got about 30% more.  The 850 has auto-increment, but it still
takes time that doesn't need to be wasted.  It also has a good global
optimizer, which seemed to do sensible things even for this strange
device.
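
For readers who haven't seen it, here is a rough sketch of the two
variants described above, not the code that was actually timed.  The
names duff_copy8 and duff_copy4 are made up for illustration, and int
elements are an assumption.  The first is the usual 8-way form with
post-increment pointers.  The second is unrolled 4 ways with fixed
offsets and one increment per pass; it uses a signed index where the
post describes adjusting the pointers, so that standard C never forms a
pointer before the start of the array.

```c
#include <stddef.h>

/* Classic Duff's device: 8-way unrolled, post-increment addressing. */
void duff_copy8(int *to, const int *from, size_t count)
{
    if (count == 0)
        return;
    size_t n = (count + 7) / 8;          /* number of passes */
    switch (count % 8) {                 /* jump into the loop body */
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}

/* 4-way unrolled, fixed offsets, a single increment before the while.
 * The index starts negative so that the first (partial) pass lands on
 * element 0 whichever case label is entered. */
void duff_copy4(int *to, const int *from, size_t count)
{
    if (count == 0)
        return;
    size_t n = (count + 3) / 4;
    size_t r = count % 4;
    ptrdiff_t i = -(ptrdiff_t)((4 - r) % 4);   /* adjustment before loop */
    switch (r) {
    case 0: do { to[i]     = from[i];
    case 3:      to[i + 1] = from[i + 1];
    case 2:      to[i + 2] = from[i + 2];
    case 1:      to[i + 3] = from[i + 3];
                 i += 4;                       /* single increment */
            } while (--n > 0);
    }
}
```

Both copy whole ints only, which is the limitation noted further down.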

Duff's was STILL slower than memcpy by about 50%, and couldn't handle
byte-size moves, non-aligned moves, etc.

Duff's is really only a way of saving the code size required to perform
the additional moves left after the unrolled loop has run, which is a
fairly poor excuse for using a device that's so hard to read.  The only
additional benefit is that the extra instructions may be in the I-cache,
which isn't really such a big deal.
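
For comparison, the readable alternative is an unrolled loop followed
by a separate loop for the leftover moves.  This is only a sketch; the
name unrolled_copy, the int element type and the factor of 4 are
assumptions for illustration.  The cost relative to Duff's is the small
extra loop at the end.

```c
#include <stddef.h>

/* Unrolled main loop plus a separate remainder loop. */
void unrolled_copy(int *to, const int *from, size_t count)
{
    size_t i = 0;

    /* Main loop: four moves per pass, fixed offsets from one index. */
    for (; i + 4 <= count; i += 4) {
        to[i]     = from[i];
        to[i + 1] = from[i + 1];
        to[i + 2] = from[i + 2];
        to[i + 3] = from[i + 3];
    }

    /* Remainder: at most three moves left after the unrolled loop. */
    for (; i < count; i++)
        to[i] = from[i];
}
```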

The memcpy on the 850 is quite an astonishing effort, using word moves
with double register 8/16/24 bit shifts for unequally non-aligned moves.
It also has a very small setup time, so that small moves get caught
early and handled quickly.  Congratulations to the coder, a very good
effort.  Before this experiment, I was convinced that C with a good
optimizer could get within 10% of assembly code for anything.  I now
have a convincing counter-example.
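
To give an idea of the double register shift trick, here is a C sketch
of the general technique, not the 850's memcpy itself.  It assumes
32-bit words, big-endian byte numbering within a word, a word-aligned
destination, and a source that starts byte_off bytes (0 to 3) into the
first word of src.  The name shift_copy_words is hypothetical.  Each
output word is assembled from two adjacent source words, so every
source word is loaded only once.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nwords words to dst from a byte stream that begins byte_off
 * bytes into src[0].  When byte_off is nonzero this reads src[0]
 * through src[nwords], i.e. one word more than it writes. */
void shift_copy_words(uint32_t *dst, const uint32_t *src,
                      unsigned byte_off, size_t nwords)
{
    if (nwords == 0)
        return;

    if (byte_off == 0) {                 /* equally aligned: plain moves */
        for (size_t i = 0; i < nwords; i++)
            dst[i] = src[i];
        return;
    }

    unsigned left  = 8 * byte_off;       /* 8, 16 or 24 bits */
    unsigned right = 32 - left;
    uint32_t hi = src[0];                /* word carried between passes */

    for (size_t i = 0; i < nwords; i++) {
        uint32_t lo = src[i + 1];
        dst[i] = (hi << left) | (lo >> right);   /* merge the pair */
        hi = lo;
    }
}
```

With src = { 0x00112233, 0x44556677, 0x8899AABB } and byte_off = 1, the
first two output words are 0x11223344 and 0x55667788.  A real routine
would also have to handle the odd bytes at each end.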

In short, use the system-supplied routines for preference, and if they
prove to be slow, replace them yourself AND SEND THE CODE to the company
that wrote them.  They'll probably be grateful.

Clifford Heath, Hewlett Packard Australian Software Operation.
(UUCP: hplabs!hpfcla!hpausla!cjh, ACSnet: cjh@hpausla.oz)