linux scheduler alternatives???

Path: gmd.de!newsserver.jvnc.net!howland.reston.ans.net!
europa.eng.gtefsd.com!uunet!news.univie.ac.at!fstgds15.tu-graz.ac.at!
piis10.joanneum.ac.at!zloebl
From: zlo...@piis10.joanneum.ac.at (Klaus ZLOEBL)
Newsgroups: comp.os.linux.development
Subject: linux scheduler alternatives???
Date: 27 Sep 1993 11:19:36 GMT
Organization: Graz University of Technology, Austria
Lines: 22
Message-ID: <286i88INN9rh@fstgds15.tu-graz.ac.at>
NNTP-Posting-Host: piis10.joanneum.ac.at
X-Newsreader: TIN [version 1.1 PL8]

I found some time last weekend and read the khg
after looking around in the kernel sources i had to notice
that i don't fine the place where the timeout timer is set.
So could somebody be so nice and tell me where to find it.

Also i'd like to write a new scheduler (multilevel feedback queue)

is there a new khg ? (i haven't seen one)

What algorithm is use for paging?

I have played with osp (an operating system simulator), but
a real OS would be an even better playground!

Thanks for any hints and pointers
--

Klaus Zloebl          | E-Mail: zlo...@piis10.joanneum.ac.at
Joanneum Research     | PSI   : PSI%2631102911::ZLOEBL
Steyrergasse 17       |
A-8020 Graz           | Phone: ++43/316/8020/243
AUSTRIA               |

Path: gmd.de!newsserver.jvnc.net!howland.reston.ans.net!pipex!sunic!
news.funet.fi!klaava!klaava!not-for-mail
From: torva...@klaava.Helsinki.FI (Linus Torvalds)
Newsgroups: comp.os.linux.development
Subject: Re: linux scheduler alternatives???
Date: 28 Sep 1993 13:18:01 +0200
Organization: University of Helsinki
Lines: 33
Message-ID: <2896h9$ou1@klaava.Helsinki.FI>
References: <286i88INN9rh@fstgds15.tu-graz.ac.at>
NNTP-Posting-Host: klaava.helsinki.fi

In article <286i88INN...@fstgds15.tu-graz.ac.at>,
Klaus ZLOEBL <zlo...@piis10.joanneum.ac.at> wrote:
>I found some time last weekend and read the khg
>after looking around in the kernel sources i had to notice
>that i don't fine the place where the timeout timer is set.
>So could somebody be so nice and tell me where to find it.

It's not there.  Linux just uses the normal timer to keep the heartbeat:
after all, we don't allow just anybody to mess with interrupts, so the
timeout timer (I assume you meant the one that sends an NMI) isn't
needed. 

Actually, I've been thinking of using it for a better kernel profiler,
as the current profiler doesn't work too well in interrupt handlers
(understatement of the week), and one based on NMI would work.  I've
never bothered to find out how it works, though. 

>Also i'd like to write a new scheduler (multilevel feedback queue)

There has been one that used something like this, and reportedly worked
better under heavy load.  I have to admit to liking the current one,
though: simple and reasonably well-behaved. 

>What algorithm is use for paging?

It's essentially a modified "clock" algorithm, which seems to work
reasonably well.  I once tried a NRU-like thing, but it was so good at
finding pages that belonged to programs that were waiting for user input
that I gave up on it.  I could probably have tuned it to give better
performance, but the clock algorithm is reasonably fair, needs no
tuning, and doesn't give *too* bad results. 

			Linus

Path: gmd.de!newsserver.jvnc.net!udel!gatech!swrinde!cs.utexas.edu!
uunet!news2.uunet.ca!spool.mu.edu!sgiblab!idiom.berkeley.ca.us!
moonshot.west.oic.com!moonshot.west.oic.com!not-for-mail
From: dil...@moonshot.west.oic.com (Matthew Dillon)
Newsgroups: comp.os.linux.development
Subject: Re: linux scheduler alternatives??? - MY IDEA
Date: 1 Oct 1993 02:03:19 -0700
Organization: Homeless Electron
Lines: 178
Distribution: world
Message-ID: <28gron$f27@moonshot.west.oic.com>
References: <286i88INN9rh@fstgds15.tu-graz.ac.at>
NNTP-Posting-Host: moonshot.west.oic.com

In article <286i88INN...@fstgds15.tu-graz.ac.at> 
zlo...@piis10.joanneum.ac.at (Klaus ZLOEBL) writes:
>I found some time last weekend and read the khg
>after looking around in the kernel sources i had to notice
>that i don't fine the place where the timeout timer is set.
>So could somebody be so nice and tell me where to find it.
>
>Also i'd like to write a new scheduler (multilevel feedback queue)
>
>is there a new khg ? (i haven't seen one)
>
>What algorithm is use for paging?
>
>I have played with osp (an operating system simulator), but
>a real OS would be an even better playground!
>
>Thanks for any hints and pointers
>--
>
>Klaus Zloebl          | E-Mail: zlo...@piis10.joanneum.ac.at
>Joanneum Research     | PSI   : PSI%2631102911::ZLOEBL
>Steyrergasse 17       |
>A-8020 Graz           | Phone: ++43/316/8020/243
>AUSTRIA               |

    I have some ideas re: scheduling.  Specifically, the scheduler I did
    for a recent embedded OS.

    It's real simple, yet real powerful.

    It requires NO list traversal calculations at any time.  There is
    exactly ONE run queue and ONE wait queue.  It's incredibly easy to
    manage.

    Essentially, it works like this:  Each process is given a fixed 8 bit 
    unsigned priority.  The time-slice the process gets is:

		     pri
	maxTime * ----------
		   sum-pri

    maxTime = 50mS (smaller on faster machines.  The smaller this number,
	      the greater the overhead, but the better the realtime response)
    pri     = priority of currently running task
    sum-pri = aggregate priority of all RUNNABLE tasks

    As you may have figured out, this scheme is really quite awesome in that
    process priorities are all relative and queue management is minimized.
    In my embedded OS, a complete task switch took 50uS on a 10MHz 68000.
    A 386 or 486 is at least 10 times faster (a 486DX2 is approximately
    20 times faster), so I would expect task switch times on the order of
    5-10uS...

    Furthermore, realtime response can be maintained even for low priority
    tasks because they get some amount of time each 50mS period (more on
    that in a moment).  Block/Unblock overhead is also greatly minimized.

    The only parameter the system needs to maintain is <sum-pri>, the
    aggregate sum of the priorities for all runnable tasks.  In otherwords,
    you add a task's priority to <sum-pri> when you put it on the run list,
    subtract it when you take it off, and modify <sum-pri> when a task
    who is on the run list is modified (i.e. renice).

    The implementation is even simpler... there is NO multiplication or
    division actually involved in the task switch interrupt itself, saving
    a lot of time.  Basically, you setup a fixed-rate timer interrupt at
    approximately 100Hz (10mS).  The faster the better... my embedded
    operating system implemented a 2mS time slice.  For me, the overhead was
    around 10uS/2mS (you don't switch tasks every interrupt!), or 0.5%.

    For a 386/486 with its greater interrupt overhead I would expect about
    the same percentage using a 2mS interrupt.

    What you do when the intrrupt occurs is a single subtraction:

	task->counter = task->counter - sum_pri

    when the counter goes negative you switch to the next task.  Note that
    YOU LEAVE THE COUNTER NEGATIVE!  You do not zero it.

    When you switch in the next task in the run queue you perform the 
    following calculation:

	task->counter = task->counter + (task->pri * CONSTANT)

    Where (task->pri * CONSTANT) is PRECALCULATED (only calculated when the
    task's priority actually changes, which is usually never!).  Now comes
    the tricky part.  If the counter is still negative, you skip the task
    and go on to the next one.  If the counter becomes positive you continue
    with the switch in and let it run.  the CONSTANT is directly related
    to the <maxTime> variable and based on the frequency of the timer
    interrupt verses <maxTime>.  It is effectively a constant.

    The end result is that the AVERAGE cpu the task gets is EXACTLY the
    ratio of its priority over the aggregate priority of all runnable
    tasks, NO MATTER WHAT INTERRUPT RATE YOU USE FOR YOUR BASE TIMER.  All
    the interrupt rate of the timer does is provide you with finer 
    granularity allowing you to get better real-time response from tasks
    in a heavily loaded system.

    The system has many advantages:

	* low overhead in a lightly loaded system because the time slices
	  are larger.  For example, two processes of equal priority with
	  <maxTime> set to 50mS will both get 25mS time slices.

	* Minimum time slice in a heavily loaded system.  A task will get
	  a minimum time slice of one interrupt period.  Instead of the
	  time slice getting smaller, the task will start to skip round
	  robin periods when the switch-in addition to task->counter fails
	  to make it positive again.

	* Adjustable real-time response.  Making <maxTime> smaller will 
	  generally increase your real time response at a cost of switch
	  overhead.  I suggest 50mS... That will allow the load factor
	  to get to 25.0 with a 2mS timer interrupt before real time 
	  response starts to suffer.

	* Extremely low overhead queueing and dequeueing operations and no
	  other maintenance required.  Both the run and the wait queues are
	  circular.  When you move a task to the run queue you ALWAYS ADD
	  IT TO THE END OF THE QUEUE, JUST BEHIND THE CURRENTLY RUNNING 
	  TASK.  This is necessary in order to avoid task starvation due to
	  a high rate of deschedule/reschedule operations (i.e. synchronous
	  IPC between two tasks could bring the rest of the system to a 
	  grinding halt).

	  But, the system will still compensate for the scheduling of a high
	  priority task.  It's automatic and its free... the act of scheduling
	  in a high priority task will cause <sum_pri> to increase and thus
	  INSTANTLY give currently running tasks a shorter time slice,
	  thereby getting to the scheduled in high priority task's queue
	  position much faster.  If you need even faster response you can
	  queue it in the front instead of the back, but I do not reccommend
	  it due to the possibility of locking out other tasks in a runaway
	  situation.

    Now, what about VM paging and thrashing?  Well, that's easy.

	* The task is not on the run queue during a pagein operation.

	* You can avoid thrashing by implementing thrashing management
	  IN THE PAGEIN CODE, NOT IN THE SWITCHER CODE!  Thrash management
	  really does belong in the pagein code.  Thrash management 
	  consists simply of holding off the pagein operation for a period
	  of time when thrashing is occuring.  This causes pagein activity
	  to effectively 'burst' (hopefully with sequential block reads)
	  rather then attempt to run multiple pageins for multiple tasks
	  simultaniously (major seeking).

	* This allows you to implement different schemes based on the type
	  of pagein occuring... i.e. file cache verses executable text
	  page, data page, etc...

    What about fast-waits? A fast wait is when you do not bother to
    deschedule a task for an operation that 'blocks'.  Instead,
    you leave the task in the run-queue and 'poll' for completion, skipping
    the task if the poll returns FALSE.  This is EXTREMELY useful for 
    synchronous IPC applications.  By avoiding the four deschedule/reschedule
    calls that would normally occur for a single IPC transaction you get
    much faster synchronous response when the task is able to run the IPC
    and reply the message in less then a single time slice.  Of course,
    linux doesn't have to worry about that ... yet!

    I don't have time to implement this for linux myself, someone else
    care to?

    Next week, class, I'll describe the I/O subsystem for the embedded OS
    I did ;-)

					-Matt


    Matthew Dillon		dil...@moonshot.west.oic.com
    1005 Apollo Way		dil...@overload.berkeley.ca.us
    Incline Village, NV. 89451	ham: KC6LVW (no mail drop)
    USA 			Sandel-Avery Engineering (702)831-8000
    [always include a portion of the original email in any response!]