Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!lll-crg!ames!ucbcad!ucbvax!YALE.ARPA!LEICHTER-JERRY
From: LEICHTER-JERRY@YALE.ARPA
Newsgroups: mod.computers.vax
Subject: Re: disk compresses
Message-ID: <8612041056.AA17549@ucbvax.Berkeley.EDU>
Date: Thu, 4-Dec-86 05:57:16 EST
Article-I.D.: ucbvax.8612041056.AA17549
Posted: Thu Dec 4 05:57:16 1986
Date-Received: Thu, 4-Dec-86 09:07:33 EST
Sender: daemon@ucbvax.BERKELEY.EDU
Reply-To:
Organization: The ARPA Internet
Lines: 127
Approved: info-vax@sri-kl.arpa

    We have about 12 RA81's in a VAX cluster (8650, 785 and 2 750's).  The
    user population is around 400.  The main programs are large simulations.
    Disk fragmentation has become a problem, requiring disk compresses every
    two weeks.  Weird program behaviour (like taking 3 to 4 times as long to
    run) was observed before one of the compress sessions.  No one is happy
    about losing the entire cluster for most of a day (10 hours+) while the
    compress goes on....

    Is our situation normal?  How often do other sites compress their disks?
    What is the experience of other sites?  Are there general guidelines for
    reducing the frequency of compresses?  What is the optimal way to
    organize disks (i.e. number of user disks, system disks, scratch disks,
    where files are stored, etc.)?  ....

Let's step back and look at how VMS disk space allocation works.  On each
disk, there's a bit map, with one bit per cluster, where a cluster is a group
of c consecutive disk blocks.  (You set c, the cluster size, when you
initialize the disk.)  Bits are clear for unused clusters, set for used ones -
or the other way around; I don't recall, and it doesn't matter for our
purposes.  When an allocation request is made for k blocks, the bit map is
scanned for k/c (rounded up) consecutive free bits, and the corresponding
blocks are used.  When blocks are freed, the corresponding bits are cleared.
Note that there is no record of where the boundaries between previously-
allocated groups of blocks were placed - free segments are implicitly merged.

(Actually, in a cluster things are a bit more complex.  Each cluster member
keeps its own copy of the bitmap.  The copies must obviously be kept in
agreement.  To avoid overhead, each member pre-allocates some number of
clusters, which it can then use for local requests without having to inform
the other members.  If it needs more blocks than it has pre-allocated, it has
to coordinate with the other cluster members; a process called something like
CACHE_SERVER runs on each cluster member and does this coordination.  If a
disk isn't dismounted properly - if the system crashes, for example - the
pre-allocated blocks are lost.  That's what the rebuild operation during
MOUNT is all about - it scans all the file headers and builds an up-to-date
allocation bitmap.)

If you start with an unfragmented disk, allocate a lot of files, and then
delete them all, you will end up with exactly the same free-space
configuration you started with.  You can only get fragmentation if you
interleave the allocations of two sets of files, A and B, and then delete all
the A files while retaining all the B files.  As long as the B files are
there, they will be splitting up - fragmenting - the free space left behind
by the A files.

I've over-simplified by talking about files.  In fact, files often grow
dynamically.  What matters is not files but file extents.  Suppose I open two
files A and B, then alternately write to each of them, allowing each to
extend until the disk is full.  A and B's extents - contiguous groups of
blocks - will alternate.  If I now delete A, the disk will be terribly
fragmented by all the pieces of B.
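To make the mechanism concrete, here is a minimal sketch of that kind of
bitmap allocator, written in Python rather than anything VMS-specific; the
cluster size, the file sizes, and names like allocate and free_extent are all
made up for illustration, not VMS internals.  It interleaves the extents of
two growing files, deletes one of them, and then counts the pieces the free
space has been chopped into.

    # Toy model of a per-cluster allocation bitmap (all names and sizes are
    # illustrative, not VMS internals).  False = free cluster, True = in use.

    CLUSTER_SIZE = 3           # c: blocks per cluster
    DISK_CLUSTERS = 1000       # total clusters on the "disk"

    bitmap = [False] * DISK_CLUSTERS

    def allocate(nblocks):
        """First-fit: find enough consecutive free clusters, mark them used."""
        need = -(-nblocks // CLUSTER_SIZE)      # k/c rounded up
        run = 0
        for i, used in enumerate(bitmap):
            run = 0 if used else run + 1
            if run == need:
                start = i - need + 1
                for j in range(start, i + 1):
                    bitmap[j] = True
                return (start, need)            # one extent: (first cluster, length)
        raise RuntimeError("no contiguous space left")

    def free_extent(extent):
        """Clear the bits; adjacent free runs merge implicitly, as in the bitmap."""
        start, length = extent
        for j in range(start, start + length):
            bitmap[j] = False

    def free_runs():
        """Lengths of the runs of free clusters - a crude measure of fragmentation."""
        runs, run = [], 0
        for used in bitmap + [True]:            # sentinel flushes the final run
            if used:
                if run:
                    runs.append(run)
                run = 0
            else:
                run += 1
        return runs

    # Interleave the extents of two growing files, A and B, until the disk fills...
    a_extents, b_extents = [], []
    try:
        while True:
            a_extents.append(allocate(30))
            b_extents.append(allocate(30))
    except RuntimeError:
        pass

    # ...then delete all of A.  B's surviving extents split the freed space up.
    for ext in a_extents:
        free_extent(ext)

    runs = free_runs()
    print(len(runs), "free extents, largest =", max(runs), "clusters")

With the numbers chosen here, the run ends with half the disk free - but in
fifty separate ten-cluster pieces, none of them any larger than the extents B
left behind.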
Tying all this together, we can set down a few rules to minimize disk
fragmentation:

    1.  Put temporary files and more permanent files on separate disks.
        (Most studies of running systems show that files fall into two
        classes:  those that stick around for a short time, and those
        that stay essentially forever.  The exact values of "short time"
        and "essentially forever" will depend on the particular mix of
        programs.  You probably have a large population of files that
        exist only as long as a typical simulation run, say several
        hours, plus many files, like source files, that stick around
        for weeks to years.)

    2.  Pre-allocate files to your best estimate of their maximum size.
        This will tend to keep them to a small number of large extents.
        (It will also improve performance - extending a file is an
        expensive operation.)

    3.  For files that you can't pre-extend, but know will grow, use a
        large extension quantity.  (This is the number of blocks by
        which RMS will extend the file when it needs to.  See the RMS
        documentation for more information.)

    4.  Similarly, on disks that will contain mainly large files, use a
        large cluster size.  This will gain performance in almost every
        way at a modest cost in wasted disk space.  (The space wasted
        is, on average, (c-1)/2 blocks per file for cluster size c.
        For disks with a lot of small files, this is likely to be a
        problem; for disks with a small number of large files, it is
        usually irrelevant.)

    5.  Keep disks with a lot of allocation and deletion from filling
        up.  Fragmentation increases rapidly for such disks as they
        fill - a similar phenomenon occurs in hash tables.  An active
        disk that's 95% full will be very fragmented; if the same disk
        were 75% full, the merging of free segments would be much, much
        more effective and the fragmentation wouldn't (on average,
        assuming "random" allocations and deletions, not the kind of
        worst-case example I gave above) be bad at all.  As with hash
        tables, the curve gets pretty steep at large "fill factors", so
        if the disks are quite full, every little bit helps.  (The
        sketch below illustrates the effect.)

You may find that you still have to compress the disks, though perhaps not as
frequently.  If you can afford a spare disk, you can avoid taking the whole
cluster down.  (Considering typical salaries, it doesn't take much down time
to cost as much as an extra disk!)  With a spare disk, you can do a
disk-to-disk backup (and compress) of one disk at a time.  Only those users
who use that particular disk lose access during the backup, which is simple
and doesn't take long (an hour and a half at most).  That's what we do here;
I got the idea from the system manager of a large cluster at DEC.  Note that
you should be sure to avoid references to particular physical disk drives;
any given volume will migrate from drive to drive as backups are done.  This
is exactly what concealed device names are for!  You can minimize
inconvenience by not putting users' default directories on the same disk with
those large simulation files.  Since it will be the latter disks that need
the most frequent compression, users would still be able to log in - they
might just not be able to run some programs.  (In any case, this is part of
the policy of separating disks with "permanent" files from those with
"temporary" ones where possible.)
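To put rough numbers on the fill-factor effect in rule 5, here is a second
sketch along the same lines as the toy bitmap above.  Again, everything in it
- 2000 clusters, file sizes of 2 to 20 clusters, 10000 operations - is
invented for illustration, not measured from VMS or an RA81.  It holds a
simulated disk near a target fill factor with random creates and deletes,
then reports how chopped-up the free space ends up and how often a request
could find no contiguous run at all.

    # Toy steady-state experiment (parameters invented for illustration, not
    # measured VMS or RA81 values): hold a simulated disk near a target fill
    # factor with random creates and deletes, then inspect the free space.
    import random

    def simulate(fill_factor, clusters=2000, ops=10000, seed=1):
        rng = random.Random(seed)
        bitmap = [False] * clusters     # False = free cluster, True = in use
        files = []                      # allocated extents: (start, length)
        failures = 0                    # requests that found no contiguous run
        in_use = 0

        def allocate(need):
            run = 0
            for i, used in enumerate(bitmap):
                run = 0 if used else run + 1
                if run == need:
                    start = i - need + 1
                    for j in range(start, i + 1):
                        bitmap[j] = True
                    files.append((start, need))
                    return True
            return False

        for _ in range(ops):
            if in_use < fill_factor * clusters or not files:
                need = rng.randint(2, 20)            # "file" sizes, in clusters
                if allocate(need):
                    in_use += need
                else:
                    failures += 1
            else:
                start, length = files.pop(rng.randrange(len(files)))
                for j in range(start, start + length):
                    bitmap[j] = False                # freeing merges implicitly
                in_use -= length

        # Sizes of the free runs left at the end.
        runs, run = [], 0
        for used in bitmap + [True]:                 # sentinel flushes the last run
            if used:
                if run:
                    runs.append(run)
                run = 0
            else:
                run += 1
        return failures, len(runs), max(runs, default=0)

    for ff in (0.75, 0.95):
        failures, pieces, largest = simulate(ff)
        print(f"{ff:.0%} full: {pieces} free pieces, largest {largest} clusters, "
              f"{failures} requests found no contiguous space")

The absolute figures mean nothing; what matters is the comparison between the
75% and 95% runs - the same comparison as in the hash-table analogy above.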
For this one-disk-at-a-time policy to be effective, you shouldn't have to
compress the system disk, as that DOES require taking the whole cluster down.
This means that you should have no active file creation/deletion on the
system disk - just system files.  That's a good idea anyway.  Also, you won't
be able to do this if you use volume sets; you'd need as many free spindles
as you have disks in the largest volume set.  This is one aspect of the most
serious liability of volume sets - you have to back up whole sets at once.
							-- Jerry
-------