March 5, 2011

Serious bug. Seriously.

A couple of days ago I discovered a somewhat serious bug in the kernel that caused it to throw an address error exception. The top of the stack frame at the beginning of the exception handler looked like this:

0015
0000
002f

The stack frames for most of the exception and trap handlers on the M68K contain the saved Status Register and saved Program Counter at the top. If this were the case with the address error exception, the Status Register would be 0x0015 (a reasonable value) and the Program Counter would be 0x0000002f (an invalid value).

I spent the longest time trying to figure out how the kernel was jumping to the address 0x0000002f. I should have instead read the M68000 Programmer's Reference Manual. On page 630 it describes each of the exception stack frames. Guess what? The Address Error Exception stack frame contains the Access Address where the Program Counter would be. This means the kernel never jumped to 0x0000002f at all; some code in the kernel was merely accessing that address. (The 68000 can access only even addresses for word and long accesses, so touching an odd address like this one raises the address error.)
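To make the frame layout concrete, here is a minimal sketch in C of how those three words could be decoded. It is not the kernel's actual code: the function names are made up for illustration, and the bit layout of the first word (read/write flag in bit 4, function code in bits 2-0) is how I understand the plain 68000's address error frame, so check it against the manual before relying on it.

```c
#include <stdint.h>

/*
 * Assumed layout of the first words of a 68000 address error frame,
 * starting at the top of the stack (sp[0]):
 *   sp[0]  access info: bit 4 = R/W (1 = read), bits 2-0 = function code
 *   sp[1]  access address, high word
 *   sp[2]  access address, low word
 * All names below are hypothetical.
 */

/* Join the two address words into the 32-bit access address. */
uint32_t fault_address(const uint16_t *sp)
{
    return ((uint32_t)sp[1] << 16) | sp[2];
}

/* Nonzero if the faulting access was a read. */
int fault_was_read(const uint16_t *sp)
{
    return (sp[0] >> 4) & 1;
}

/* Function code of the faulting access (5 = supervisor data). */
unsigned fault_function_code(const uint16_t *sp)
{
    return sp[0] & 7u;
}

/* Word and long accesses must use even addresses on the 68000. */
int is_misaligned(uint32_t addr)
{
    return (int)(addr & 1u);
}
```

Fed the words 0x0015, 0x0000, 0x002f, this reads as a supervisor-mode data read from the odd address 0x0000002f, which fits code dereferencing a bad pointer rather than a wild jump.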

I finally narrowed it down to a function call to sched_run. This function takes a pointer to a process structure. The sched_run function accesses the "p_deadline" member of the "proc" structure, which resides at offset 0x2e. In one place I called the function without an argument (a major reason to use function prototypes!), so sched_run took whatever value was on the stack at the time. That value on the stack always happened to be 0x00000001, so sched_run tried to access a long-word at address 0x00000001+0x2e, or... 0x0000002f. There you go!
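The arithmetic, and the way a prototype would have caught the mistake, can be sketched like this. It is only an illustration: the macro and the deadline_address helper are hypothetical, and struct proc is left opaque because only the 0x2e offset is known from the above.

```c
#include <stdint.h>

/* Offset of p_deadline inside struct proc, as given in the post. */
#define P_DEADLINE_OFFSET 0x2eu

struct proc;   /* opaque here; the real layout lives in the kernel */

/*
 * With a full prototype in scope, a call written as sched_run() is
 * rejected at compile time ("too few arguments") instead of silently
 * picking up whatever happens to be on the stack. An old-style
 * declaration such as "void sched_run();" would not catch it.
 */
void sched_run(struct proc *p);

/* Address that a read of p->p_deadline touches for a given pointer value. */
uint32_t deadline_address(uint32_t proc_ptr)
{
    return proc_ptr + P_DEADLINE_OFFSET;
}
```

With the garbage value 0x00000001 as the "pointer", deadline_address gives 0x0000002f, the odd address from the stack frame.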

It turns out that the offending call to sched_run was a bug in itself. At the time it was being called, the process was already scheduled to run, so it didn't need to be scheduled to run again. I ended up removing the line completely, which killed two birds with one stone (or with one "dd" command in vim :D).

During this whole ordeal I trekked across many lines of code and found a small handful of other problems as well. Some were potential synchronization issues, and others were just opportunities for code cleanup and simplification, particularly in the timer and system clock routines. For example, I changed the process interval timers (the ITIMER_REAL timer in setitimer(2)) to rely on a separate monotonic clock instead of the system clock. The result is that, even if you change the system clock forward or backward (or use adjtime() to speed up/slow down the system clock temporarily), the real-time interval timers will now run at exactly the same rate and phase. They are not affected by any changes to the system clock.
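Here is a small simulation in C of the idea behind the timer change. It is a sketch under my own assumptions, not the kernel's code: the struct and function names are invented, time is counted in abstract ticks, and the timer's expiry is compared only against the monotonic counter.

```c
#include <stdint.h>

/* Two clocks driven by the same tick (hypothetical names). */
struct clocks {
    int64_t realtime;    /* system clock; may be set forward or backward */
    int64_t monotonic;   /* ticks since boot; only ever advances */
};

/* A periodic interval timer keyed to the monotonic clock. */
struct itimer {
    int64_t expires;     /* monotonic tick of the next expiry */
    int64_t interval;    /* reload value in ticks; 0 disables the timer */
    unsigned fired;      /* number of expiries so far */
};

/* Advance both clocks by one tick and fire the timer if it is due. */
void clock_tick(struct clocks *c, struct itimer *t)
{
    c->realtime++;
    c->monotonic++;
    while (t->interval > 0 && c->monotonic >= t->expires) {
        t->fired++;
        t->expires += t->interval;   /* keeps the original phase */
    }
}

/* Stepping the system clock leaves the monotonic clock, and thus the timer, alone. */
void clock_set_realtime(struct clocks *c, int64_t now)
{
    c->realtime = now;
}
```

Because clock_set_realtime never touches the monotonic counter, a timer with a 10-tick interval keeps firing on monotonic ticks 10, 20, 30 and so on, no matter how the system clock is stepped in between.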

All in all, I believe the scheduler and timer code is now quite robust and stable. Onward to finishing vfork()...