Newsgroups: comp.os.linux
From: pnakada@oracle.com (Paul Nakada)
Subject: linux scheduling
Nntp-Posting-Host: pnakada.us.oracle.com
Organization: Oracle Corporation, Redwood Shores, CA
Distribution: comp
Date: Mon, 15 Jun 1992 21:20:16 GMT
X-Disclaimer: This message was written by an unauthenticated user
              at Oracle Corporation.  The opinions expressed are those
              of the user and not necessarily those of Oracle.


Linux Hackers -

Could someone (linus?) please give a high level description of the
scheduling mechanism in Linux?  

Some things of concern:
  process timeout
  process blocking (for i/o)
  process wake-up  (i/o interrupt)
  process scheduling (description of queues, when / where processes
                      are added to queues)
  what is a jiffy?

how about a little "walk through"
i.e.  

a) a process calls a kernel trap;
b) process enters kernel mode, current processor state saved.
c) pop return address from stack and save
d) kernel initiates disk i/o and places process on i/o queue
e) get next active process
f) push return address on stack and return from handler to new process
g) process times out
  etc etc etc.


why do i ask this?  I've been studying the scheduling code for a
couple days, trying to absorb enough to take a stab at improving
response time under heavy disk i/o.  I can understand the parts as
separate modules, but it's not obvious to me how they work together as
a system.  I imagine it's second nature to Linus and other veteran
kernel hackers, but all I've got is an OS class from a few years back.

Thanks for any help.


- Paul

  
--

Paul Nakada  |  Oracle Corporation  |  pnakada@oracle.com

From: torvalds@klaava.Helsinki.FI (Linus Benedict Torvalds)
Newsgroups: comp.os.linux
Subject: Re: linux scheduling
Date: 15 Jun 92 23:13:13 GMT
Organization: University of Helsinki

In article <PNAKADA.92Jun15132016@pnakada.oracle.com> pnakada@oracle.com 
(Paul Nakada) writes:
>
>Could someone (linus?) please give a high level description of the
>scheduling mechanism in Linux?  

The scheduler in linux is pretty simple, but it does a reasonably good
job of giving good IO response while not being too unfair to cpu-bound
processes.  Simply put:

Linux gets a timer-interrupt at every jiffy (100 Hz), and decrements
current->counter.  If current->counter reaches zero, and the process
isn't running in kernel mode, schedule() is called.  Also, when a
process is woken up and its counter is greater than the current
process's, a flag is set, which forces a re-schedule.
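
In C, the tick handler amounts to something like this (a minimal
sketch only -- the names do_timer(), jiffies and user_mode() are
meant to suggest the real thing, not quote it):

	/* called at every timer interrupt, ie every jiffy (100 Hz) */
	void do_timer(void)
	{
		jiffies++;			/* global jiffy counter */
		if (current->counter > 0) {	/* timeslice left: go on */
			current->counter--;
			return;
		}
		current->counter = 0;
		if (user_mode())	/* never preempt kernel-mode code */
			schedule();
	}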

The scheduling is pretty simple: when schedule() is called, the process
that has the highest non-zero counter and is runnable is run. If no
runnable process exists, task[0] ("swapper") gets to run, but it
currently doesn't do anything. In the future, the swapper task might do
page-aging or similar.

If all runnable processes have a zero counter, the schedule() algorithm
goes through all processes once more, giving them a new counter-value of
"priority+counter/2". It might seem useless to add the counter/2, as
counter is 0, but in fact this is done to /all/ processes, including
sleeping ones, which thereby get higher counter-values for when they
wake up.
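
Spelled out as C, the whole thing is about this much (a sketch in the
style of kernel/sched.c -- treat the details as approximate):

	void schedule(void)
	{
		struct task_struct **p;
		int i, next, c;

		for (;;) {
			c = -1;		/* nothing found yet */
			next = 0;	/* fall back to task[0], "swapper" */
			i = NR_TASKS;
			p = &task[NR_TASKS];
			while (--i) {
				if (!*--p || (*p)->state != TASK_RUNNING)
					continue;
				if ((*p)->counter > c) {
					c = (*p)->counter;
					next = i;
				}
			}
			if (c)	/* found one, or nothing is runnable */
				break;
			/* all runnable counters are zero: recharge every
			   task, sleepers included, so they wake up with
			   higher counters */
			for (p = &task[NR_TASKS-1] ; p > &task[0] ; --p)
				if (*p)
					(*p)->counter = ((*p)->counter >> 1)
							+ (*p)->priority;
		}
		switch_to(next);
	}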

The above algorithm seems to be pretty fair, and works well while being
very easy to understand.  It also gives much snappier response than many
systems I've seen.  And it's all done without any queues at all, which I
find clean and simple. In fact, the schedule() algorithm isn't more than
a couple of lines in C, and going to sleep or waking up is just a matter
of setting the process status and calling schedule(). 

>a) a process calls a kernel trap;
>b) process enters kernel mode, current processor state saved.

In fact, the current processor state is never saved anywhere as in
"normal" unices. A kernel trap under linux is very much like a system
call: registers get saved on the stack as needed, but we don't mess
around with updating task-structures etc.

>c) pop return address from stack and save

No, let it lie on the stack: when the process returns it gets restored.
Why save it anywhere else?

>d) kernel initiates disk i/o and places process on i/o queue
>e) get next active process
>f) push return address on stack and return from handler to new process
>g) process times out
>  etc etc etc.

It's all much simpler than this: the process, running in kernel mode,
does what it wants, and if it wants to sleep, it just does

	current->state = TASK_INTERRUPTIBLE;
	schedule();

assuming it has made sure it will wake up some way (due to signals or
the kernel time-out feature or by putting the task-struct address in a
wait-variable). The other processes run happily in between, and when
something causes the process to wake up, it just continues after the
schedule() call.
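
The wake-up side of the wait-variable convention is just as short.
From memory, the sleep_on()/wake_up() pair is essentially this (a
sketch -- error checking left out):

	void sleep_on(struct task_struct **p)
	{
		struct task_struct *tmp;

		tmp = *p;		/* remember any earlier sleeper */
		*p = current;		/* wait-variable points at us now */
		current->state = TASK_UNINTERRUPTIBLE;
		schedule();		/* returns when we've been woken */
		if (tmp)		/* wake the earlier sleeper too */
			tmp->state = TASK_RUNNING;
	}

	void wake_up(struct task_struct **p)
	{
		if (p && *p) {
			(*p)->state = TASK_RUNNING;
			*p = NULL;
		}
	}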

When it's done all the things the system call (or page-exception or
whatever) required it to do, it just does a return, and it's back in
user mode. 

As to how schedule() switches tasks, it's built into the 386, so it's
actually just a couple of machine-code instructions.  It's slightly
more complicated than just a jump through a task-segment, as it checks
that it's not the same task, and it minimizes 387 state-saving by
testing the "last_task_used_math" variable and doing a clts when
possible, but it's still less than 10 insns.  To understand what
/really/ goes on, you'd better read a 386 book, but it's not too
complicated if you just care about the general idea. 
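
If you don't want to dig out the 386 book just yet, the switch amounts
to roughly the following, written as C for readability
(hardware_task_switch() and clts() here just stand for the ljmp
through the task-segment and the real clts instruction -- the actual
switch_to() is an inline-asm macro):

	if (next != current) {
		current = next;
		/* ljmp through next's TSS selector: the 386 itself saves
		   our registers in the old TSS and loads the new ones */
		hardware_task_switch(next);
		/* back again: if we were the last task to touch the 387,
		   clear TS so we can use it without faulting first */
		if (last_task_used_math == current)
			clts();
	}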

>why do i ask this?  I've been studying the scheduling code for a
>couple days, trying to absorb enough to take a stab at improving
>response time under heavy disk i/o.  I can understand the parts as
>separate modules, but it's not obvious to me how they work together as
>a system.  I imagine it's second nature to Linus and other veteran
>kernel hackers, but all I've got is an OS class from a few years back.

I doubt the scheduler is the problem when it comes to response-time
under heavy disk i/o.  It's probably at least partly because of the way
buffer-cache blocks are locked, which sometimes forces the buffer
algorithms to sleep on them unnecessarily.  Thus if a buffer-request
comes through when blocks are locked due to actual IO, the request
sometimes blocks just to make sure there are no races.
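
The place where this shows up is the wait-on-locked-buffer logic,
which looks more or less like this (a sketch in the style of
fs/buffer.c):

	static inline void wait_on_buffer(struct buffer_head * bh)
	{
		cli();			/* no interrupts while testing b_lock */
		while (bh->b_lock)	/* IO in progress on this block */
			sleep_on(&bh->b_wait);
		sti();
	}

Note that sleeping with interrupts disabled is ok here: the task
switch restores each task's own interrupt flag.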

I'd take a look at fs/buffer.c before doing anything else.  But be
/very/ careful when changing buffer.c: it's very easy to add a
race-condition if you don't think of all the possible things that can
happen due to interrupts etc. while a buffer is being searched for. 

		Linus