Path: utzoo!utgpu!watmath!clyde!att!ucbvax!CAEN.ENGIN.UMICH.EDU!donp
From: donp@CAEN.ENGIN.UMICH.EDU (Don Peacock)
Newsgroups: comp.sys.apollo
Subject: Re: more SR10 questions
Message-ID: <401809f8c.001766d@caen.engin.umich.edu>
Date: 6 Dec 88 14:13:46 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 93

From: krowitz@richter.MIT.EDU
Subject: Re: more SR10 questions

One way in which backups can be sped up is the method used by Workstation Solutions' backup product. They start clients on several nodes which all feed data back to a server which writes the tape. Since the clients run independently of each other, they can process several disks simultaneously and send the server buffers of data which have already been formatted for the backup tape. The only drawback to this approach is that you wind up with files from multiple disks all interleaved in a single backup file on the tape, rather than in separate backups. It is easier to retrieve files from a backup when you know for certain which tape it is on. Incremental backups, however, are frequently done with several disks all on a single tape, in which case the method used by Workstation Solutions gives the same results a whole lot faster.

As Paul Anderson (pha@caen.engin.umich.edu) stated, we have 450 Apollos and do daily incrementals (around 2 gigs/day) and weekly full backups. We use some home-grown software to keep up with this mess, and I will quickly try to explain how it works. I have left out most of the specifics, but it really does work, and better than I had expected while designing it.

Incrementals:

1) We have a bank of nodes (six DN4000s with 329-MB formatted disks) which are used for storing the incremental trees (more about this later).

2) Each node tries to do an incremental backup every half hour through cron. What it actually tries to do is a cpt of the appropriate trees to an incremental node; this is where the date/time stamp is checked.

3) We have a locking mechanism for limiting the number of nodes cpt'ing to an incremental node at one time (currently this is set at 6). Simple math tells us that this gives us a maximum of 36 concurrent cpt's at any given time. The incremental code also watches the incremental node's disk space and aborts if/when it thinks it can no longer finish and leave a certain amount of disk space on the incremental node (this is a buffer zone which is needed when the incremental node later goes to tape with its data, currently 10 MB). A rough sketch of this locking and space check appears at the end of this post.

4) Currently backup operators dump these incremental nodes to magtape, and by around 10 or 11:00 am we have our incrementals for the day done and on tape. We are currently completing all but 6-10 nodes per day, and those nodes are not getting done due to hardware problems, etc.

5) We can easily monitor which disks have not done their incrementals, because the software mails us a list every morning of the disks that have not done their backups for three consecutive days (this morning there were 11 nodes in this category, for one reason or another). There is a backup person responsible for checking these problem nodes out each morning and responding to the rest of the backup group with the cause and status.

6) We use our full-backup code to clean the incremental disks off each day, automatically deleting the incremental trees once they are safely put to tape and logged. Our logging automatically keeps listings of wbaks and creates the labels for the tapes, so our restore program (rest_req) can easily let a backup operator know which tapes need to be mounted, etc.
7) We have bought a couple of 8mm tape drives and are going to automate our incrementals further, by allowing the backup operators to simply swap tapes once a day for incrementals instead of using 10-20 magtapes each day.

Full backups:

1) A network-wide logging scheme is used (similar to the incrementals) so we can keep track of ANY node that has not had a full backup in the last 7 days.

2) To run a full backup, a backup operator simply crp's onto a node with a magtape drive and runs our backup code. He then simply follows the instructions (i.e. load tape, swap tape, label tape with xxx.xx..., etc.).

3) ALL the logging, etc., is taken care of automatically.

Now that I have tried to explain how we do our backups, I would like to make some comments that don't necessarily relate to backups, but do relate to this newsgroup, and that I feel sure many of you will take issue with.

1) Although wbak and rbak are slow, it is because of what they do and how they intelligently interact with a VERY ROBUST network file system (NOT NFS).

2) The tools for managing a LARGE network of Apollos are NOT there, but the underlying capabilities for creating these tools are available, and they are taken for granted by the majority of those people who constantly flame Apollo for not being a vanilla Unix machine. (Thank GOD it's not, because we couldn't keep 450 vanilla Unix machines happy without at least ten times the effort that it takes us to manage the APOLLOS.)

3) I don't agree with everything Apollo has done over the past couple of years, but I do know that my job is easier because of their capabilities and because of Apollo's efforts not to be pulled backwards into the stone age by a group of people worshipping an operating system that was never intended for anything more than a stand-alone machine. I do like the Unix interface, but this beauty is only skin deep and needs a STRONG underlying structure to give us the ability to manage an entire network as a single machine.

Don Peacock
University of Michigan
donp@caen.engin.umich.edu
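
P.S. For anyone who wants to roll their own version of this, here is a rough sketch of the slot-locking and disk-space check described in item 3 of the incrementals above. It is NOT our production code (that is tied to our site paths, cpt, and our logging); it is just a small Python illustration of the idea. The lock directory, the tree arguments, and the plain copy that stands in for the real cpt are all made up for the example.

    #!/usr/bin/env python3
    # Sketch only: claim one of a fixed number of lock slots on the incremental
    # node, make sure the copy will still leave the buffer zone free, do the
    # copy, and release the slot.  A plain copytree stands in for cpt.

    import os
    import shutil
    import sys

    MAX_SLOTS = 6                 # concurrent copies allowed per incremental node
    BUFFER_ZONE = 10 * 1024**2    # space kept free so the node can later go to tape

    def grab_slot(lockdir):
        """Claim a slot by creating slot.N; mkdir is atomic, so two clients
        can never hold the same slot.  Returns the slot path or None."""
        os.makedirs(lockdir, exist_ok=True)
        for slot in range(MAX_SLOTS):
            path = os.path.join(lockdir, "slot.%d" % slot)
            try:
                os.mkdir(path)
                return path
            except FileExistsError:
                continue          # slot busy, try the next one
        return None

    def tree_size(tree):
        """Rough upper bound on the bytes the copy will need."""
        return sum(os.path.getsize(os.path.join(d, f))
                   for d, _, files in os.walk(tree) for f in files)

    def incremental(src_tree, dest_tree, lockdir):
        slot = grab_slot(lockdir)
        if slot is None:
            print("all %d slots busy, try again next half hour" % MAX_SLOTS)
            return 1
        try:
            dest_parent = os.path.dirname(dest_tree) or "."
            free = shutil.disk_usage(dest_parent).free
            if free - tree_size(src_tree) < BUFFER_ZONE:
                print("would eat into the buffer zone, aborting")
                return 1
            # The real thing copies only the trees whose date/time stamps have
            # changed (that is what cpt does for us); this copies everything.
            shutil.copytree(src_tree, dest_tree, dirs_exist_ok=True)
            return 0
        finally:
            os.rmdir(slot)        # always give the slot back

    if __name__ == "__main__":
        sys.exit(incremental(sys.argv[1], sys.argv[2], sys.argv[3]))

The directory-creation lock is used here simply because it is atomic and needs nothing but the file system, so it works across a network file system without any extra daemon; whatever your real locking looks like, the shape of the logic is the same.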