Path: utzoo!attcan!uunet!munnari!otc!metro!ipso!runx!brucee
From: brucee@runx.ips.oz (Bruce Evans)
Newsgroups: comp.os.minix
Subject: Re: Bugs found installing V1.3
Message-ID: <1648@runx.ips.oz>
Date: 12 Jul 88 21:29:52 GMT
References: <1646@runx.ips.oz>
Reply-To: brucee@runx.OZ (Bruce Evans)
Organization: RUNX  Un*x Timeshare. Sydney, Australia.
Lines: 145

I found some more bugs. These bit harder, so I tracked down all the causes
and found some fixes.

New tty
=======

(1) rs_struct[1].rs_base in tty.c init_rs232() is initialised when there is
no rs_struct[1]. My vid_base was destroyed :-). Make this conditional:

#if NR_RS_LINES > 1
  rs_struct[1].rs_base = SECONDARY;
#endif

(2) SMALL_STACK in table.c is too small for TTY. The problem shows up mainly
with the F2 dump. printk() uses a lot of stack. My proc_ptr was
destroyed first :-). Increase the size from 256 to 512:

#define	TTY_STACK	512		/* SMALL_STACK is just too small */

(3) base in tty.c config_rs232() uses the wrong rs232-line number. The
rs_struct's are not offset by NR_CONS. All this bug does is stop rs232
from working.

base = rs_struct[line].rs_base;

(4) The keyboard initialization is probably not solid. I use a boot program
in which the equivalent of the '=' key can be hit after loading everything
from the disk, and the system can lock up. I think there is no time for
the BIOS to handle the key release, and the keyboard handler doesn't
see it because I reprogrammed the interrupt controller (to undo the
misplacement of the device vectors).  My problem was easy to fix by
strobing the keyboard during console initialization.
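
For completeness, the strobe is trivial. This is only a sketch of what went
into my console initialization, not a verbatim patch; the port_in()/port_out()
helpers are assumed to behave like the kernel library routines, and the
constant and function names here are mine:

#define KB_DATA    0x60		/* keyboard data port (scancode) */
#define KB_PORT_B  0x61		/* 8255 port B; bit 7 acknowledges a key */

PRIVATE void kb_strobe()
{
/* Read and discard any scancode the BIOS left pending, then pulse the
 * acknowledge bit, so a key released after boot loading can't wedge
 * the keyboard.
 */
  int code, val;

  port_in(KB_DATA, &code);		/* throw away the pending scancode */
  port_in(KB_PORT_B, &val);
  port_out(KB_PORT_B, val | 0200);	/* strobe the acknowledge bit high */
  port_out(KB_PORT_B, val);		/* and restore port B */
}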

Perhaps there are other devices with incomplete initializations that
depend on undocumented states left by the BIOS (the screen, of
course)? In any case, it is WRONG to enable interrupts before the
devices have been initialised (in main.c). I also don't like the
centralization of setting up device vectors.

After fixing all these and merging the local changes which interface to
my debugger (switching to a stand-alone set of i/o routines requires
support from the O/S since too many hardware registers are write-only),
the new tty worked, sort of. I took out the debugging statement which
echoes all rs232 output to the screen. It worked as a dumb terminal
at 9600 baud on my 386 system. Zmodem output worked, Zmodem input
didn't. I thought the problem might be a too cavalier approach to
discarding i/o and tried increasing the input buffer size, without
success.

Since it's clear that the new tty is not ready, I'll go back to my
modified version of the Paradis driver, which also only sort of works,
but much better. I still don't think acceptable throughput (== 19200 baud
on a 5MHz 8088 with no load) can be attained without "fixing" the
kernel. The problem is that interrupts are disabled for several
milliseconds. I get about 4800 baud with lots of kernel changes.
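
To put numbers on that: at 19200 baud a character arrives roughly every
half millisecond (ten bits per character, about 1920 characters per
second), and the 8250 buffers only a single byte, so any stretch with
interrupts masked for a few milliseconds is guaranteed to lose input.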

(Old) mkfs
==========

My last report complained about getline() in mkfs.c not checking for
EOF properly (this mainly hurts when the compiler uses unsigned
characters, but it is wrong anyway).  More seriously, it doesn't check
for buffer overflow and goes crazy when the EOF test fails. I have a
line "mkfs /dev/ram 384" in /etc/rc. After I somehow created an empty
file /384, mkfs crashed. I just deleted /384, but the proper fix is to
stop truncating the value returned by getc(): keep it in an int rather
than a char variable.
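
The shape of the fix is simple enough. This is only a sketch, not the
real getline() from mkfs.c (the buffer size and names are mine): keep
getc()'s result in an int so EOF survives, and refuse to run off the
end of the buffer:

#include <stdio.h>

#define LINELEN 128		/* illustrative; not the mkfs.c value */

int get_line(fp, buf)
FILE *fp;
char buf[LINELEN];
{
  int c;			/* int, so EOF (-1) is not truncated */
  int n = 0;

  while ((c = getc(fp)) != EOF && c != '\n') {
	if (n < LINELEN - 1) buf[n++] = c;	/* never overrun the buffer */
  }
  buf[n] = '\0';
  return (c == EOF && n == 0) ? -1 : n;		/* -1 means real end of file */
}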

Divide by zero trap and signals
===============================

Division by zero now gives _repeated_ traps before the divider finally
dies.

The V1.3 main.c does this after divide by zero:

cause_sig(cur_proc, SIGILL);    /* send signal to current process */
unready(proc_addr(cur_proc));   /* deschedule current process */

This probably worked in 1.2, but the V1.3 cause_sig() now calls unready(),
resulting in a pick_proc(), so cur_proc usually changes. My fix was just
to save the original cur_proc for unready()ing.
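
In outline (this is the idea, not the exact patch; the temporary variable
is mine):

  register struct proc *rp;

  rp = proc_addr(cur_proc);	/* the process that actually trapped */
  cause_sig(cur_proc, SIGILL);	/* may unready()/pick_proc() and change cur_proc */
  unready(rp);			/* deschedule the saved one, not the new cur_proc */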

I'm still worried about this. First, if cur_proc is a task or server
(can't happen :-)) there should be a panic. Is it safe to go straight
to cause_sig() without messaging the system task? Perhaps the PUBLIC
routines in system.c belong in proc.c to show such safety?

The failure mode for this bug was instructive. Assume no hardware
interrupts to confuse us except clock ticks. Assume we just omit the
unready() in the trap, which is effectively what the bug does since
unreadying the idle process seems to be harmless.

(1) cause_sig() calls unready(cur_proc), unready() calls pick_proc().
It may be better for unready() not to call pick_proc(), to avoid
situations like this. Note that ready() does not call pick_proc().  Are
there situations where a newly readied process languishes on the ready
queue while the idle process runs till the next interrupt?

(2) cause_sig() calls inform() which calls mini_send() which readies
MM. Then it readies the process to be signalled. This seems wrong (???),
since the signal should be the next thing the process does. I tried
removing the ready(), but since the other signal code apparently doesn't
ready the process to be signalled, the signal was long delayed (till the
next update alarm on my system, perhaps because I have changed the
low-level clock code to call the clock task less often).

(3) Now MM and the offending process are ready but IDLE is running.
There is an unnecessary delay till the next clock tick. There should
probably be a call to pick_proc() just after the mini_send() in
inform().
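
Something like this (paraphrased from memory, not the real inform() body;
the message variable is a stand-in):

  /* in inform(), right after the signal message has been built for MM */
  mini_send(HARDWARE, MM_PROC_NR, &m);	/* readies MM ...                 */
  pick_proc();				/* ... so schedule it now, not at */
					/* the next clock tick            */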

(4) The clock task runs at the next clock tick. When it is unreadied,
MM is readied at last.  MM needs to do a core dump for SIGILL, so lots
of other system activity occurs.  (Another fishy piece of code is the
pick_proc() in sched() called by the clock task (not now). Since the
clock task is running, it is at the head of the task queue and will be
picked again.  The new user is only picked after the clock task
blocks.)

(5) When the system waits for i/o, it sees the process to be signalled
is ready, runs it, and incurs another trap! It is saved from recursion
because the 1st signal eventually completes and kills the process. I
always got exactly 2 traps.

Kernel error numbers and bad message addresses
==============================================

In the profiling support routine monstartup(), sbrk() may be
inadvertently called with a negative argument. (Unlike malloc(), sbrk()
can reduce the break.) I won't fix profil here since it is non-standard,
but look at the effect: the break may get reduced below the global
message address, after which _all_ system calls, including sbrk() to
fix the problem, and exit() to give up, fail!

The simplest fix is to SIGKILL any process which gives a bad message
address.
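
Schematically (the test below is only a placeholder for whatever range
check the kernel already does on the message pointer):

  if (!message_addr_ok(caller, m_ptr)) {	/* placeholder for the real check */
	cause_sig(caller, SIGKILL);		/* it can't even exit() cleanly   */
	return(E_BAD_ADDR);			/* the caller never acts on this  */
  }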

This is also an example of the problem with ambiguous error numbers.
The kernel returns E_BAD_ADDR which is the same as an unrelated user
error number. The kernel should never return internal error numbers,
so these must be made unambiguous and turned into user numbers, or
converted before return.
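
The conversion itself could be as dull as a switch at the boundary; the
mapping below is only an illustration, with EFAULT assumed to be the
user-visible equivalent:

  /* just before handing the result of a system call back to the user */
  switch (result) {
  case E_BAD_ADDR:
	result = EFAULT;	/* assumed user-level equivalent */
	break;
  /* ... other kernel-internal codes converted the same way ... */
  }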

Bruce Evans
Internet: brucee@runx.ips.oz.au    UUCP: uunet!runx.ips.oz.au!brucee