Path: utzoo!attcan!uunet!munnari!otc!metro!ipso!runx!brucee
From: brucee@runx.ips.oz (Bruce Evans)
Newsgroups: comp.os.minix
Subject: Re: Bugs found installing V1.3
Message-ID: <1648@runx.ips.oz>
Date: 12 Jul 88 21:29:52 GMT
References: <1646@runx.ips.oz>
Reply-To: brucee@runx.OZ (Bruce Evans)
Organization: RUNX Un*x Timeshare. Sydney, Australia.
Lines: 145

I found some more bugs. These bit harder, so I tracked down all the causes
and found some fixes.

New tty
=======

(1) rs_struct[1].rs_base in tty.c init_rs232() is initialised even when
there is no rs_struct[1]. My vid_base was destroyed :-). Make this
conditional:

#if NR_RS_LINES > 1
  rs_struct[1].rs_base = SECONDARY;
#endif

(2) SMALL_STACK in table.c is too small for TTY. The problem shows up
mainly with the F2 dump. Printk uses a lot of stack. My proc_ptr was
destroyed first :-). Increase the size from 256 to 512:

#define TTY_STACK 512	/* SMALL_STACK is just too small */

(3) base in tty.c config_rs232() uses the wrong rs232-line number. The
rs_struct's are not offset by NR_CONS. All this bug does is stop rs232
working.

  base = rs_struct[line].rs_base;

(4) The keyboard initialization is probably not solid. I use a boot program
in which the equivalent of the '=' key can be hit after loading everything
from the disk, and the system can lock up. I think there is no time for the
BIOS to handle the key release, and the keyboard handler doesn't see it
because I reprogrammed the interrupt controller (to undo the misplacement
of the device vectors). My problem was easy to fix by strobing the keyboard
during console initialization. Perhaps there are other devices with
incomplete initializations depending on undocumented states left by the
BIOS (the screen, of course)? In any case, it is WRONG to enable interrupts
before the devices have been initialised (in main.c). I also don't like the
centralization of setting up device vectors.

After fixing all these and merging the local changes which interface to my
debugger (switching to a stand-alone set of i/o routines requires support
from the O/S since too many hardware registers are write-only), the new tty
worked, sort of. I took out the debugging statement which echoes all rs232
output to the screen. It worked as a dumb terminal at 9600 baud on my 386
system. Zmodem output worked, zmodem input didn't. I thought the problem
might be a too-cavalier approach to discarding i/o and tried increasing the
input buffer size, without success.

Since it's clear that the new tty is not ready, I'll go back to my modified
version of the Paradis driver, which also sort of works, but much better. I
still don't think acceptable throughput (== 19200 baud on a 5MHz 8088 with
no load) can be attained without "fixing" the kernel. The problem is that
interrupts are disabled for several milliseconds. I get about 4800 baud
with lots of kernel changes.

(Old) mkfs
==========

My last report complained that getline() in mkfs.c does not check for EOF
properly (this mainly hurts when the compiler uses unsigned characters, but
it is wrong anyway). More seriously, it doesn't check for buffer overflow
and goes crazy when the EOF test fails. I have a line "mkfs /dev/ram 384"
in /etc/rc. After I somehow created an empty file /384, mkfs crashed. I
just deleted /384, but the proper fix is to stop truncating the value
returned by getc(): keep it in an int rather than a char so the EOF test
works, and check the buffer bounds as well.
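
For concreteness, here is a minimal sketch of the kind of reader I have in
mind. The name get_line(), the LINELEN limit and the FILE pointer argument
are illustrative assumptions, not the actual mkfs.c interface; the point is
only that the getc() result lives in an int so the EOF test works, and that
overlong lines cannot overrun the buffer.

#include <stdio.h>

#define LINELEN 256	/* illustrative buffer size, not the mkfs.c value */

int get_line(fp, buf)	/* return line length, or -1 at end of file */
FILE *fp;
char *buf;		/* caller supplies at least LINELEN chars */
{
  int c;		/* int, not char, so EOF is not truncated */
  int n = 0;

  while ((c = getc(fp)) != EOF && c != '\n') {
	if (n < LINELEN - 1) buf[n++] = c;	/* drop excess, don't overflow */
  }
  buf[n] = '\0';
  if (c == EOF && n == 0) return(-1);	/* nothing left to read */
  return(n);
}

The caller of course has to pass a buffer of at least LINELEN characters;
everything else follows from not squeezing getc()'s return value into a
char.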

Divide by zero trap and signals
===============================

Division by zero now gives _repeated_ traps before the divider finally
dies. The V1.3 main.c does this after a divide by zero:

  cause_sig(cur_proc, SIGILL);	/* send signal to current process */
  unready(proc_addr(cur_proc));	/* deschedule current process */

This probably worked in 1.2, but the V1.3 cause_sig() now calls unready(),
resulting in a pick_proc(), so cur_proc usually changes. My fix was just to
save the original cur_proc in a local variable before the cause_sig() call
and pass that saved value to unready(). I'm still worried about this.
First, if cur_proc is a task or server (can't happen :-)) there should be a
panic. Is it safe to go straight to cause_sig() without messaging the
system task? Perhaps the PUBLIC routines in system.c belong in proc.c to
show such safety?

The failure mode for this bug was instructive. Assume no hardware
interrupts to confuse us except clock ticks. Assume we just omit the
unready() in the trap, which is effectively what the bug does, since
unreadying the idle process seems to be harmless.

(1) cause_sig() calls unready(cur_proc), and unready() calls pick_proc().
It may be better for unready() not to call pick_proc(), to avoid situations
like this. Note that ready() does not call pick_proc(). Are there
situations where a newly readied process languishes on the ready queue
while the idle process runs till the next interrupt?

(2) cause_sig() calls inform(), which calls mini_send(), which readies MM.
Then it readies the process to be signalled. This seems wrong(???), since
the signal should be the next thing the process does. I tried removing the
ready(), but since the other signal code apparently doesn't ready the
process to be signalled, the signal was long delayed (till the next update
alarm on my system, perhaps because I have changed the low-level clock code
to call the clock task less).

(3) Now MM and the offending process are ready, but IDLE is running. There
is an unnecessary delay till the next clock tick. There should probably be
a call to pick_proc() just after the mini_send() in inform().

(4) The clock task runs at the next clock tick. When it is unreadied, MM is
readied at last. MM needs to do a core dump for SIGILL, so lots of other
system activity occurs. (Another fishy piece of code is the pick_proc() in
sched() called by the clock task (not now). Since the clock task is
running, it is at the head of the task queue and will be picked again. The
new user is only picked after the clock task blocks.)

(5) When the system waits for i/o, it sees that the process to be signalled
is ready, runs it, and incurs another trap! It is saved from recursion
because the 1st signal eventually completes and kills the process. I always
got exactly 2 traps.

Kernel error numbers and bad message addresses
==============================================

In the profiling support routine monstartup(), sbrk() may be inadvertently
called with a negative argument. (Unlike malloc(), sbrk() can reduce the
break.) I won't fix profil here since it is non-standard, but look at the
effect: the break may get reduced below the global message address, after
which _all_ system calls fail, including the sbrk() to fix the problem and
the exit() to give up! The simplest fix is to SIGKILL any process which
gives a bad message address.

This is also an example of the problem with ambiguous error numbers. The
kernel returns E_BAD_ADDR, which is the same as an unrelated user error
number. The kernel should never return internal error numbers, so these
must either be made unambiguous and turned into user numbers, or converted
before return.

Bruce Evans

Internet: brucee@runx.ips.oz.au    UUCP: uunet!runx.ips.oz.au!brucee