Path: utzoo!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ucbvax!decwrl!labrea!rutgers!ucla-cs!admin.cognet.ucla.edu!casey
From: casey@admin.cognet.ucla.edu (Casey Leedom)
Newsgroups: comp.arch
Subject: Re: Help: Hashing on TLB input?
Keywords: TLB, cache, mapping, hashing, collision, thrashing
Message-ID: <16102@shemp.CS.UCLA.EDU>
Date: 21 Sep 88 09:11:11 GMT
References: <3907@psuvax1.cs.psu.edu> <22876@amdcad.AMD.COM> <16891@apple.Apple.COM>
Sender: news@CS.UCLA.EDU
Reply-To: casey@cs.ucla.edu (Casey Leedom)
Organization: UCLA
Lines: 54

In article <16891@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
> []
> >  tim@crackle.amd.com (Tim Olson) writes:
> > .. stuff about how to avoid thrashing the TLB...
> > The most obvious hash is to use the lsb's of the virtual address as an
> > index into the TLB.  This is better than using the msb's, because
> > addresses exhibit the principle of locality, so we want sequential pages
> > to map to different TLB sets.  This scheme can be augmented in many ways.
> 
> It is often not enough to use the LSBs, because then the first page of
> every process would collide, or the heap (allocated to high mem.) would
> collide with the stack (in low mem.), or vice versa, or user and system
> pages would collide. So, the hashing that I've seen exclusive-ors the
> msb's of the page number with the lsb's, sometimes reversing the bits of
> one half or the other to really get them good and random.  

  Another source of problems with simply using the LSBs arises when an
application manipulates two (or more) arrays whose ``corresponding
addresses'' are 2^N apart (where N is the number of LSB bits used to
index the TLB).  When that happens, any linear access of the two (or
more) arrays will cause *MASSIVE* thrashing.
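
  To make the collision concrete, here is a minimal sketch of that kind
of set selection.  The page size and TLB geometry below are made up for
illustration (they're not from any particular machine), and the
colliding separation is measured in pages here:

#include <stdio.h>

#define PAGE_SHIFT  13                  /* assumed 8KB pages            */
#define INDEX_BITS  6                   /* N: assumed 64-set TLB        */
#define NSETS       (1UL << INDEX_BITS)

/* TLB set chosen from the low N bits of the virtual page number. */
static unsigned long
tlb_set(unsigned long vaddr)
{
    return (vaddr >> PAGE_SHIFT) & (NSETS - 1);
}

int
main(void)
{
    unsigned long a = 0x100000UL;                /* hypothetical array A */
    unsigned long b = a + (NSETS << PAGE_SHIFT); /* 2^N pages past A     */

    /* a[i] and b[i] land in the same TLB set for every i, so a linear
     * loop over both arrays evicts one entry to load the other on
     * every single reference. */
    printf("a -> set %lu, b -> set %lu\n", tlb_set(a), tlb_set(b));
    return 0;
}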

  This actually happened to a group trying to do astronomical image
processing at Berkeley using a Sun 3/2XX.  They were copying a 1/4MB
image array 30 times a second as part of their application.  They'd
justified buying the Sun 3/2XX based on their need for speed.  Much to
their dismay, the Sun 3/2XX performed the memory copy about 3 times
slower than the Sun 3/1XX!

  The problem turned out to be that the Sun 3/2XX cache was 64KB with 16
byte lines, indexed by a formula very dependent on the LSBs of the
address.  The copy loop would pick up a 4 byte long word, causing the
source line to be faulted in.  It would write that word into the
destination array, causing the destination line to be faulted into the
same cache slot and dirtied.  On the next read, the same source line
would have to be faulted in again, but the line now sitting in that slot
was dirty, so a 30 cycle write back would go down first ...
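
  A rough reconstruction of that behavior (not the group's actual code;
the cache parameters are just the ones described above) looks like
this, with source and destination placed exactly one cache size apart:

#include <stdlib.h>

#define CACHE_SIZE  (64 * 1024)         /* 64KB direct-mapped           */
#define LINE_SIZE   16                  /* 16 byte lines                */

int
main(void)
{
    /* Source and destination exactly one cache size apart, so every
     * src[i]/dst[i] pair maps to the same cache line slot. */
    char *block = calloc(2, CACHE_SIZE);
    long *src = (long *)block;
    long *dst = (long *)(block + CACHE_SIZE);
    size_t i;

    for (i = 0; i < CACHE_SIZE / sizeof(long); i++) {
        long tmp = src[i];      /* miss: fault the source line in       */
        dst[i] = tmp;           /* miss: evict it, fault the destination
                                 * line into the same slot, dirty it    */
                                /* the next read of src re-faults the
                                 * source line, but first the dirty
                                 * destination line takes its ~30 cycle
                                 * write back                           */
    }

    free(block);
    return 0;
}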

  We finally got Sun 3/1XX performance by offsetting the source and
destination arrays by 24 bytes.  Because the array being copied was much
larger than the cache, we got best performance at:

	    | D - S | % 16 == 8
	&&  | D - S | >= 16
	&&  | D - S | <= 64K - 16

with a sine wave performance curve whose peaks are as above and whose
valleys fall halfway in between, tapering off at the end points (hence
the second and third conditions).
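
  Here is a small check, under the same assumed cache geometry, of why
an extra offset like 24 bytes helps: it keeps corresponding source and
destination words on different cache lines.  This only shows the gross
collision disappearing; it says nothing about the finer
| D - S | % 16 == 8 structure above.

#include <stdio.h>

#define CACHE_SIZE  (64 * 1024)
#define LINE_SIZE   16
#define NLINES      (CACHE_SIZE / LINE_SIZE)

/* Line slot chosen from the low address bits (assumed indexing). */
static unsigned long
line_index(unsigned long addr)
{
    return (addr / LINE_SIZE) % NLINES;
}

int
main(void)
{
    unsigned long src = 0x40000UL;          /* hypothetical base        */
    unsigned long offsets[] = { 0, 24 };    /* without / with the fix   */
    unsigned long w;
    int i;

    for (i = 0; i < 2; i++) {
        unsigned long dst = src + CACHE_SIZE + offsets[i];
        int shared = 0;

        for (w = 0; w < CACHE_SIZE / 4; w++)    /* 64KB of 4 byte words */
            if (line_index(src + 4 * w) == line_index(dst + 4 * w))
                shared++;
        printf("offset %lu: %d colliding word pairs\n", offsets[i], shared);
    }
    return 0;
}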

  I haven't sufficiently analyzed why the 24 byte offset (and other
similar offsets) had the effect they did.  That would require a better
understanding of the exact indexing algorithm employed by the Sun
3/2XX cache.

Casey