Path: utzoo!utgpu!water!watmath!clyde!att!ucbvax!decwrl!granite!jmd
From: jmd@granite.dec.com (John Danskin)
Newsgroups: comp.arch
Subject: Re: m88000 benchmarks
Keywords: FFT benchmarks M88K MIPS
Message-ID: <249@granite.dec.com>
Date: 15 Jul 88 17:28:55 GMT
Organization: DEC Technology Development, Palo Alto, CA
Lines: 127
x
x
x
I have screwed around a little bit with Earl's code to try to
answer some of wca@oakhill (William C. Anderson)'s questions
about performance with non unity stride. There is probably a better
answer, but here goes:
>In article <2532@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>>In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (I) wrote:
>
> [FFT inner loop in C:]
>
>>> for (i=0 ; i>> l = i+n2;
>>> xrt = xr[i] - xr[l];
>>> xr[i] = xr[i] + xr[l];
>>> xit = xi[i] - xi[l];
>>> xi[i] = xi[i] + xi[l];
>>> xr[l] = c*xrt - s*xit;
>>> xi[l] = c*xit + s*xrt;
>>> }
>
>Mr. Killian responded with a hand-coded inner loop for the MIPS
>architecture:
>
>> l.s f0, 0(t0) # xr[i]
>>l: l.s f2, 0(t1) # xr[l]
>> l.s f4, 0(t2) # xi[i]
>> sub.s f10, f0, f2 # xrt = xr[i] - xr[l]
>> l.s f6, 0(t3) # xi[l]
>> mul.s f20, f28, f10 # c*xrt
>> sub.s f12, f4, f6 # xit = xi[i] - xi[l]
>> addu t0, 4
>> addu t1, 4
>> mul.s f22, f30, f12 # s*xit
>> add.s f14, f0, f2 # xr[i] + xr[l]
>> addu t2, 4
>> addu t3, 4
>> mul.s f24, f28, f12 # c*xit
>> add.s f16, f4, f6 # xi[i] + xi[l]
>> s.s f14, -4(t0) # xr[i] = xr[i] + xr[l]
>> l.s f0, 0(t0) # xr[i+1]
>> mul.s f26, f30, f10 # s*xrt
>> sub.s f20, f20, f22 # c*xrt - s*xit
>> s.s f16, -4(t2) # xi[i] = xi[i] + xi[l]
>> s.s f20, -4(t1) # xr[l] = c*xrt - s*xit
>> add.s f24, f24, f26 # c*xit + s*xrt
>> bne t0, t4, l
>> s.s f24, -4(t3) # xi[l] = c*xit + s*xrt
>>
>>Summary: 23 instructions, 23 cycles, 10.9 mflops @ 25MHz, and 25
>>native mips. No pipelining within an op unit required (the R3010 has
>>none). Multiply/add/load/store overlap is extensively used.
>
>One problem with this code is that is assumes the "stride" of the loop
>(the varible "n1" in the C code segment above) is unity!
>
>What about the code for the inner loop in the general case? What
>effect does the assumption of non-unity stride have on the MIPS loop
>timing?
# 4 * n1 is in a register named 'n1'
l.s f0, 0(t0) # xr[i]
l: l.s f2, 0(t1) # xr[l]
l.s f4, 0(t2) # xi[i]
sub.s f10, f0, f2 # xrt = xr[i] - xr[l]
l.s f6, 0(t3) # xi[l]
mul.s f20, f28, f10 # c*xrt
sub.s f12, f4, f6 # xit = xi[i] - xi[l]
addu t5, t0, n1
addu t6, t1, n1
mul.s f22, f30, f12 # s*xit
add.s f14, f0, f2 # xr[i] + xr[l]
addu t7, t2, n1
addu t8, t3, n1
mul.s f24, f28, f12 # c*xit
add.s f16, f4, f6 # xi[i] + xi[l]
s.s f14, 0(t0) # xr[i] = xr[i] + xr[l]
l.s f0, 0(t5) # xr[i+1]
mul.s f26, f30, f10 # s*xrt
sub.s f20, f20, f22 # c*xrt - s*xit
s.s f16, 0(t2) # xi[i] = xi[i] + xi[l]
s.s f20, 0(t1) # xr[l] = c*xrt - s*xit
add.s f24, f24, f26 # c*xit + s*xrt
be t5, t4, e
s.s f24, 0(t3) # xi[l] = c*xit + s*xrt
l.s f2, 0(t6) # xr[l]
l.s f4, 0(t7) # xi[i]
sub.s f10, f0, f2 # xrt = xr[i] - xr[l]
l.s f6, 0(t8) # xi[l]
mul.s f20, f28, f10 # c*xrt
sub.s f12, f4, f6 # xit = xi[i] - xi[l]
addu t0, t5, n1
addu t1, t6, n1
mul.s f22, f30, f12 # s*xit
add.s f14, f0, f2 # xr[i] + xr[l]
addu t2, t7, n1
addu t3, t8, n1
mul.s f24, f28, f12 # c*xit
add.s f16, f4, f6 # xi[i] + xi[l]
s.s f14, 0(t5) # xr[i] = xr[i] + xr[l]
l.s f0, 0(t0) # xr[i+1]
mul.s f26, f30, f10 # s*xrt
sub.s f20, f20, f22 # c*xrt - s*xit
s.s f16, 0(t7) # xi[i] = xi[i] + xi[l]
s.s f20, 0(t6) # xr[l] = c*xrt - s*xit
add.s f24, f24, f26 # c*xit + s*xrt
bne t0, t4, l
s.s f24, 0(t3) # xi[l] = c*xit + s*xrt
e:
This code has exactly the same performance as earl's code except that
is also handles the non-unity case. Yet another example of needing lots
more registers to write good code in pipelined machines. Can I have lots
more registers please?
--
John Danskin | decwrl!jmd
DEC Technology Development | (415) 853-6724
100 Hamilton Avenue | My comments are my own.
Palo Alto, CA 94306 | I do not speak for DEC.