Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!newshub2.rdc1.sfba.home.com!news.home.com!
newshub1-work.rdc1.sfba.home.com!gehenna.pell.portland.or.us!
nntp-server.caltech.edu!nntp-server.caltech.edu!mail2news96
Newsgroups: mlist.linux.kernel
Date: 	Fri, 4 Jan 2002 18:05:23 +0100 (CET)
From: Ingo Molnar <mi...@elte.hu>
Reply-To: <mi...@elte.hu>
X-To: <linux-ker...@vger.kernel.org>
X-Cc: Linus Torvalds <torva...@transmeta.com>,
        Alan Cox <a...@lxorguk.ukuu.org.uk>, Anton Blanchard <an...@samba.org>
Subject: [patch] O(1) scheduler, 2.4.17-A1, 2.5.2-pre7-A1.
Message-ID: <linux.kernel.Pine.LNX.4.33.0201041743050.8766-100000@localhost.localdomain>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Approved: n...@nntp-server.caltech.edu
Lines: 66


this is the next release of the O(1) scheduler:

	http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-A1.patch
	http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-A1.patch

this release includes fixes and small improvements. (The 2.5.2-A1 patch is
against the 2.5.2-pre7 kernel.) I cannot reproduce any more failures with
this patch, but i couldn't test the vfat lockup problem. The X lockup
problem never occurred on any of my boxes, but it might be fixed by one of
the changes included in this patch nevertheless.

Changes:

 - idle process notification fixes. This fixes the idle=poll breakage
   reported by Anton Blanchard.

 - fix a bug in setscheduler() which crashed if a non-SCHED_OTHER task did
   a setscheduler() call. This fixes the crash reported by Randy Hron. The
   Linux Test Project's syscall tests do not cause a crash anymore.

 - do some more unlikely()/likely() tagging of branches along the hotpath,
   suggested by Jeff Garzik.

 - fix the compile failures in md.c and loop.c and other files, reported
   by many people.

 - fix the too-big-by-one error in the bitmap sizing define, noticed by
   Anton Blanchard.

 - fix a bug in rt_lock() + setscheduler() that had a potential for a
   spinlock lockup.

 - introduce the idle_tick() function, so that idle CPUs can do
   load-balancing as well.

 - do LINUX_VERSION_CODE checking in jffs2 (Jeff Garzik)

 - optimize the big-kernel-lock releasing/acquiring code some more. From
   now on it's absolutely illegal to schedule() from cli()-ed code. (not
   that it was legal.) This moves a few instructions off the scheduler
   hotpath.

 - move the ->need_resched setting into idle_init().

 - do not clear RT tasks in reparent_to_init(). There's nothing bad with
   running RT tasks in the background.

 - RT tasks' priority order was inverted: it should be 0-99, not 99-0.

 - make load-balancing a bit less eager when there are lots of processes
   running: it needs a ~10% imbalance in runqueue lengths to trigger a
   rebalance.

 - (there is a small hack in serial.c in the 2.5.2-pre7 patch, to make it
   compile.)

Comments, bug reports, suggestions are welcome,

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!newshub2.rdc1.sfba.home.com!news.home.com!
newshub1-work.rdc1.sfba.home.com!gehenna.pell.portland.or.us!
nntp-server.caltech.edu!nntp-server.caltech.edu!mail2news96
Newsgroups: mlist.linux.kernel
Date: 	Mon, 7 Jan 2002 20:23:41 +0100 (CET)
From: Ingo Molnar <mi...@elte.hu>
Reply-To: <mi...@elte.hu>
X-To: Linus Torvalds <torva...@transmeta.com>
X-Cc: <linux-ker...@vger.kernel.org>, george anzinger <geo...@mvista.com>,
        Davide Libenzi <davi...@xmailserver.org>
Subject: [patch] O(1) scheduler, -D0, 2.5.2-pre9, 2.4.17
Message-ID: <linux.kernel.Pine.LNX.4.33.0201071952270.11688-100000@localhost.localdomain>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Approved: n...@nntp-server.caltech.edu
Lines: 74


i've uploaded an updated O(1) scheduler patch:

 	http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D0.patch
 	http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-D0.patch

this release uses Linus' idea of merging RT task priorities into the
normal scheduler priority bitspace. This allowed the removal of all the
ugly RT-related special-case code: the RT and non-RT schedulers are united
again. It's all just one kind of task - an RT task is 'just' a task with
lower priority. The RT locking/unlocking code is completely gone.
rt_schedule() is gone. There is only a single rt_task() branch in the
scheduler hotpaths.

I cannot overemphasize the level of cleanups this enabled. Eg. schedule()
itself has become a very simple, 60 lines long function. If compiled with
a gcc 3.1-ish compiler that knows about likely()/unlikely() the schedule()
function has just two taken branches in the hotpath! The rest is
straight fall-through code. Altogether, the cleanups reduced sched.c's
source code size by more than 10%!

to enable the fast searching of the 100 + 40 bits bitmap, i've shifted the
SCHED_OTHER bitspace to 128-167. The RT task queues are in bits 0-99. Bits
100-127 are in essence unused. This way the bit-searching can be done
very quickly for the common (no RT) case, on x86:

  static inline int sched_find_first_zero_bit(char *bitmap)
  {
          unsigned int *b = (unsigned int *)bitmap;
          unsigned int rt;

          rt = b[0] & b[1] & b[2] & b[3];
          if (unlikely(rt != 0xffffffff))
                  return find_first_zero_bit(bitmap, MAX_RT_PRIO);

          if (b[4] != ~0)
                  return ffz(b[4]) + MAX_RT_PRIO;
          return ffz(b[5]) + 32 + MAX_RT_PRIO;
  }

also, the layout of the 'normal' task queues is thus cacheline aligned.
(and even in the RT case the find_first_zero_bit() is hand-optimized
assembly code as well.) There is no measurable difference between the
context-switch times of the -C1 patch and this patch, both do 1.57 usecs
on a 466 MHz Celeron.
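
A portable user-space sketch of the same two-tier search, for illustration
only. It assumes 32-bit bitmap words, a 192-bit bitmap, MAX_RT_PRIO = 128
(the start of the shifted SCHED_OTHER range), and that a cleared bit marks a
queue with runnable tasks (inferred from the search for a zero bit). The
helpers ffz32() and find_first_zero() are hypothetical stand-ins for the
kernel's ffz() and find_first_zero_bit(), not the kernel implementations.

```c
#include <stdint.h>

#define MAX_RT_PRIO 128   /* assumption: first SCHED_OTHER bit (128-167 layout) */

/* Hypothetical stand-in for ffz(): index of the lowest zero bit in w.
 * Returns 32 if w is all ones. */
static int ffz32(uint32_t w)
{
    int i = 0;
    while (w & 1u) {
        w >>= 1;
        i++;
    }
    return i;
}

/* Hypothetical stand-in for find_first_zero_bit(): linear scan of the
 * first nbits bits, returning nbits if none of them is zero. */
static int find_first_zero(const uint32_t *b, int nbits)
{
    for (int i = 0; i < nbits; i++)
        if (!(b[i / 32] & (1u << (i % 32))))
            return i;
    return nbits;
}

static int sched_find_first_zero_bit_sketch(const uint32_t *b)
{
    /* Rare path: a zero bit anywhere in words 0-3 means an RT queue is
     * populated, so do the full scan of the RT range. */
    uint32_t rt = b[0] & b[1] & b[2] & b[3];
    if (rt != 0xffffffffu)
        return find_first_zero(b, MAX_RT_PRIO);

    /* Common path: only the two SCHED_OTHER words are examined. */
    if (b[4] != 0xffffffffu)
        return ffz32(b[4]) + MAX_RT_PRIO;
    return ffz32(b[5]) + 32 + MAX_RT_PRIO;
}
```

With an otherwise all-ones bitmap, clearing bit 130 makes the sketch return
130 via the fast path; additionally clearing bit 7 makes it return 7 via the
RT scan.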

RT tasks can still be made 'global' at any later point, by doing directed
wakeups towards lower priority CPUs. (The wakeup path has a rt_task()
branch already so there would be no wakeup overhead for normal tasks.)

The patch is stable on my boxes, and two alpha-testers reported that this
patch fixes the crashes they saw with earlier patches.

Changelog:

 - export set_user_nice (Jens Axboe)

 - report correctly scaled priorities via /proc. (this unbreaks 'top'
   priority output.)

 - sped up the task-load estimator a bit.

 - cleaned up slip.c's and reiserfs/buffer2.c's scheduler usage.

 - lock both runqueues in init_idle(), this could explain some of the
   boot-time SMP crashes Anton saw.

	Ingo


Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!newshub2.rdc1.sfba.home.com!news.home.com!
newshub1-work.rdc1.sfba.home.com!gehenna.pell.portland.or.us!
nntp-server.caltech.edu!nntp-server.caltech.edu!mail2news96
Newsgroups: mlist.linux.kernel
Date: 	Wed, 9 Jan 2002 19:22:00 +0100 (CET)
From: Ingo Molnar <mi...@elte.hu>
Reply-To: <mi...@elte.hu>
X-To: <linux-ker...@vger.kernel.org>
X-Cc: Linus Torvalds <torva...@transmeta.com>, Mike Kravetz <krav...@US.IBM.COM>,
        Anton Blanchard <an...@samba.org>, george anzinger <geo...@mvista.com>,
        Davide Libenzi <davi...@xmailserver.org>
Subject: [patch] O(1) scheduler, -G1, 2.5.2-pre10, 2.4.17
Message-ID: <linux.kernel.Pine.LNX.4.33.0201091824570.5876-100000@localhost.localdomain>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Approved: n...@nntp-server.caltech.edu
Lines: 115


this is the latest update of the O(1) scheduler:

	http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-pre10-G1.patch
	http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-G1.patch

this patch contains fixes to the scheduler (mainly from Rusty Russell), and
it also contains much reworked load-balancing code, triggered by Mike
Kravetz's analysis/numbers.

the previous load-balancer had a number of childhood problems; the biggest
problem was that it rebalanced runqueues way too often. Also, it sometimes
got into a load-balancing resonance. A usual 'make -j15' kernel compile on
an 8-way box generates about 70 thousand reschedules until it finishes.
Under the stock 2.5.2-pre10 kernel, about 20% of those reschedules were
CPU-unaffine, ie. they scheduled to a task that was load-balanced over
into this queue from another CPU.

With 2.5.2-pre10 + -G1, the number of total 'incorrect' reschedules is
down to 0.6%, and even the majority of those is caused by 'idle-pull'
rebalancing: a situation that inevitably causes an unaffine reschedule.
The number of 'unforced' unaffine reschedules is down to 0.2%.

fairness is equally good with both kernels, both the -G1 and the vanilla
kernel distribute CPU-using processes equally well between CPUs.

the new load balancer in the -G1 patch has the following logic:

there are two kinds of load-balancing activities, 'idle balancing' and
'fairness balancing'.

Idle balancing must happen if any CPU runs out of processes - in this case
it must find some new work or else it will stay idle and the CPU power
goes unused.

Fairness rebalancing must happen to even out the runqueues between CPUs -
to avoid a situation where eg. 5 processes are running on one CPU, and 1
process is running on the other CPU - the processes on CPU#0 will only see
20% of single-CPU performance.  The 'fair' distribution is to run 3
processes on each CPU, so each process will get a fair 33% share of
single-CPU performance.
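
The arithmetic in this example can be checked with a tiny sketch. It assumes
equally weighted, always-runnable processes; share_percent() and
balanced_length() are hypothetical helper names, not scheduler functions.

```c
/* Percentage of one CPU that each process on a runqueue gets, assuming
 * nr_running equally weighted processes that never sleep. */
static int share_percent(int nr_running)
{
    return nr_running > 0 ? 100 / nr_running : 0;
}

/* Longest runqueue length after spreading total processes as evenly as
 * possible over ncpus CPUs (rounded up). */
static int balanced_length(int total, int ncpus)
{
    return (total + ncpus - 1) / ncpus;
}
```

share_percent(5) is 20 for the overloaded CPU, balanced_length(6, 2) is 3,
and share_percent(3) is 33 (integer division) once the queues are evened out.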

Whenever an idle rebalance situation happens, we try to find a new process
for the soon-to-be-idle CPU. The CPU searches all the other CPUs and takes
processes from the CPU that has the longest runqueue. The idle CPU pulls
only a *single* process - this is the minimum we must do to avoid the CPU
going idle.
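
A minimal user-space sketch of the idle-pull step described above. The
runqueue is reduced to a length counter, locking is omitted, and the rule
that a source queue must hold at least two processes (one running plus one
waiting) is an assumption of this sketch, not something stated in the patch
description. All names are hypothetical.

```c
struct runqueue_sketch {
    int nr_running;   /* number of runnable processes on this CPU */
};

/* Return the index of the CPU with the longest runqueue, skipping
 * this_cpu, or -1 if no other CPU has a process to spare. */
static int find_busiest(const struct runqueue_sketch *rq, int ncpus,
                        int this_cpu)
{
    int busiest = -1;
    int max_len = 1;   /* assumption: need more than one process to pull */

    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (cpu == this_cpu)
            continue;
        if (rq[cpu].nr_running > max_len) {
            max_len = rq[cpu].nr_running;
            busiest = cpu;
        }
    }
    return busiest;
}

/* Pull exactly one process towards the soon-to-be-idle CPU.
 * Returns 1 if a process was moved, 0 otherwise. */
static int idle_pull(struct runqueue_sketch *rq, int ncpus, int this_cpu)
{
    int src = find_busiest(rq, ncpus, this_cpu);

    if (src < 0)
        return 0;
    rq[src].nr_running--;
    rq[this_cpu].nr_running++;
    return 1;
}
```

With queue lengths {0, 3, 5, 2} and CPU 0 going idle, the sketch moves one
process from CPU 2, leaving {1, 3, 4, 2}.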

Fairness rebalancing happens at a 250 msec pace: a 'rebalance tick'
happens on every CPU, every 250 msecs. In this case we will rebalance
multiple processes as well if needed. A commonly occurring situation is
that processes rush to a runqueue and go off the runqueue quickly. Such
'fluctuations' of runqueue lengths must not result in unnecessary
rebalancing. Thus the fairness rebalancing code uses a (simple & fast)
method of recording the runqueue length on any particular CPU in the last
rebalancing tick. The balancer takes the shorter runqueue length value of
the 'previous' and 'current' length, discarding the longer one as
statistical fluctuation. This mechanism works pretty well: if a runqueue
is long during a long period of time, then the balancer will 'accept' that
the queue is long and will rebalance it. If the runqueue is only
temporarily long then the load-balancer will not balance it.
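
The 'shorter of previous and current' rule can be sketched as follows. This
is a simplified user-space illustration; the struct layout and the names
rq_sample and sampled_length() are hypothetical, and the real patch may store
and update the previous length differently.

```c
struct rq_sample {
    int nr_running;        /* current runqueue length */
    int prev_nr_running;   /* length recorded at the last rebalance tick */
};

/* Called once per rebalance tick: return the length the balancer should
 * act on (the shorter of the two samples), then record the current
 * length for the next tick. */
static int sampled_length(struct rq_sample *rq)
{
    int cur = rq->nr_running;
    int len = cur < rq->prev_nr_running ? cur : rq->prev_nr_running;

    rq->prev_nr_running = cur;
    return len;
}
```

A queue that jumps from 2 to 10 processes is still treated as length 2 on the
first tick; only if it is still at 10 on the next tick does the balancer see
10 and act on it.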

in essence the fairness rebalancer establishes an 'average runqueue
length' of sorts by sampling the runqueue length - without adding overhead
to the actual runqueue manipulation code (wake_up() & schedule()). There
exist more accurate methods of sampling runqueue length, but the current
method works pretty well already.

[ there is one possible improvement to this logic that i'll add: the
ability of the wakeup code to trigger an idle rebalance. The wakeup code
does not want to trigger a fairness rebalance, the fairness rebalance is
purely timer-driven. ]

anyway, here are some kernel compilation times in seconds, on an 8-way,
Xeon, 700 MHz, 2MB L2 cache box. [lower numbers are better, results are
the best results from 4 successive runs, kernel tree fully cached, exactly
the same kernel tree was compiled under every kernel]:

                                time make -j15 bzImage

2.4.17-vanilla:                 44.6 sec   +- 0.2 sec
2.5.2-pre9-vanilla:             45.3 sec   +- 0.2 sec
2.5.2-pre10-vanilla:            45.4 sec   +- 0.2 sec
2.5.2-pre10-G1:                 43.4 sec   +- 0.2 sec

ie. the -G1 kernel compiles kernels faster than any other kernel i tried.

Changes:

 - Rusty Russell: fix rebalance tick definition if HZ < 100 in UML.

 - Rusty Russell: check for new_mask in set_cpus_allowed(), to be sure.

 - Rusty Russell: clean up rq_ macros so that HT can be done by changing
   just one of the macros.

 - Rusty Russell: remove rq->cpu.

 - Rusty Russell: remove cacheline padding from runqueue_t, it's pointless
   now.

 - Rusty Russell: sched.c comment fixes.

 - increase minimum timeslice length by 10 msec.

 - fix comments in sched.h

	Ingo


Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
newsfeed.direct.ca!look.ca!newsfeed.media.kyoto-u.ac.jp!uio.no!
nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Fri, 11 Jan 2002 01:38:51 +0100 (CET)
From: Ingo Molnar <mi...@elte.hu>
Reply-To: <mi...@elte.hu>
To: Linus Torvalds <torva...@transmeta.com>
Cc: <linux-ker...@vger.kernel.org>
Subject: [patch] O(1) scheduler, -H5
Original-Message-ID: <Pine.LNX.4.33.0201110130290.11478-100000@localhost.localdomain>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Thu, 10 Jan 2002 22:44:29 GMT
Message-ID: <fa.nvkb4lv.1cmsmo6@ifi.uio.no>
Lines: 27


the -H5 patch adds a debugging check:

    http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-pre11-H5.patch

it adds code to catch places that call schedule() from global-cli()
sections. Right now release_kernel_lock() doesn't automatically release the
IRQ lock if there is no kernel lock held. A fair amount of code does this
still, and i think we should fix them in 2.5.

(Such code, while of questionable quality, is safe if it also holds the
big kernel lock, but it's definitely SMP-unsafe if it doesn't hold the bkl -
the BUG() assert only catches the latter case.)

(Andi Kleen noticed this on the first day the patch was released, and
Andrew Morton reminded me today that i forgot to fix it ... :-| )

my systems do not trigger the BUG(), so there cannot be all that much
broken code left.

	Ingo


Path: archiver1.google.com!news1.google.com!sn-xit-02!supernews.com!
news.tele.dk!small.news.tele.dk!193.213.112.26!newsfeed1.ulv.nextra.no!
nextra.com!uninett.no!uio.no!nntp.uio.no!ifi.uio.no!internet-mailinglist
Newsgroups: fa.linux.kernel
Return-Path: <linux-kernel-ow...@vger.kernel.org>
Original-Date: 	Sun, 13 Jan 2002 20:34:39 +0100 (CET)
From: Ingo Molnar <mi...@elte.hu>
Reply-To: <mi...@elte.hu>
To: <linux-ker...@vger.kernel.org>
Cc: Linus Torvalds <torva...@transmeta.com>, Anton Blanchard <an...@samba.org>
Subject: [patch] O(1) scheduler, -H7
Original-Message-ID: <Pine.LNX.4.33.0201131933500.6560-100000@localhost.localdomain>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-ow...@vger.kernel.org
Precedence: bulk
X-Mailing-List: 	linux-kernel@vger.kernel.org
Organization: Internet mailing list
Date: Sun, 13 Jan 2002 17:38:48 GMT
Message-ID: <fa.o6pdg0v.s52613@ifi.uio.no>
Lines: 34


the -H7 patch is available:

    http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-pre11-H7.patch
    http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-H7.patch

there is an important SMP fix in this release, found by Anton Blanchard:
double-spin_unlock()ing triggered oopses on high-end SMP boxes.

stability status: all reported problems were fixed by -H6, the only
problem remaining was Anton's SMP crashes, which should be fixed in this
-H7 patch.

Changes between -H6 and -H7:

- Anton Blanchard: fix double spin_unlock in sched.c. This fixes
  a high-end SMP oops he saw.

- William Lee Irwin III: fix mwave's ->nice code.

- cleanup: mmu_context.h renamed to sched.h, as suggested by
  Richard Henderson.

- added an irqs_enabled() macro to the x86 tree, to simplify irq.c.

Bug reports, comments, suggestions welcome.

	Ingo
