Date: Thu, 15 Mar 2001 02:40:04 +0100
From: Nigel Gamble <ni...@nrg.org>
Reply-To: ni...@nrg.org
Subject: [PATCH for 2.5] preemptible kernel
Message-ID: <Pine.LNX.4.05.10103141653350.3094-100000@cosmic.nrg.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: 35637.anti-phl.bofh.it
Newsgroups: linux.kernel
Organization: linux.*_mail_to_news_unidirectional_gateway
Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod
X-Original-Cc: linux-ker...@vger.kernel.org
X-Original-Date: Wed, 14 Mar 2001 17:25:22 -0800 (PST)
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: Linus Torvalds <torva...@transmeta.com>
Lines: 943

Here is the latest preemptible kernel patch.  It's much cleaner and
smaller than previous versions, so I've appended it to this mail.  This
patch is against 2.4.2, although it's not intended for 2.4.  I'd like
comments from anyone interested in a low-latency Linux kernel solution
for the 2.5 development tree.

Kernel preemption is not allowed while spinlocks are held, which means
that this patch alone cannot guarantee low preemption latencies.  But
as long-held locks (in particular the BKL) are replaced by finer-grained
locks, this patch will enable lower latencies as the kernel also becomes
more scalable on large SMP systems.

Notwithstanding the comments in the Configure.help section for
CONFIG_PREEMPT, I think this patch has a negligible effect on
throughput.  In fact, I got better average results from running 'dbench
16' on a 750MHz PIII with 128MB with kernel preemption turned on
(~30MB/s) than on the plain 2.4.2 kernel (~26MB/s).

(I had to rearrange three header files that are needed in sched.h before
task_struct is defined, but which include inline functions that cannot
now be compiled until after task_struct is defined.  I chose not to
move them into sched.h, like d_path(), as I don't want to make it more
difficult to apply kernel patches to my kernel source tree.)

Nigel Gamble                                    ni...@nrg.org
Mountain View, CA, USA.                         http://www.nrg.org/


Patch

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Message-ID: <m14fHk9-001PKgC@mozart>
From: Rusty Russell <ru...@rustcorp.com.au>
Subject: Re: [PATCH for 2.5] preemptible kernel 
In-Reply-To: Your message of "Wed, 14 Mar 2001 17:25:22 -0800."
             <Pine.LNX.4.05.10103141653350.3094-100000@cosmic.nrg.org> 
Date: Tue, 20 Mar 2001 10:10:04 +0100
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: 83108.anti-phl.bofh.it
Newsgroups: linux.kernel
Organization: linux.*_mail_to_news_unidirectional_gateway
Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod
References: <Pine.LNX.4.05.10103141653350.3094-100000@cosmic.nrg.org>
X-Original-Cc: linux-ker...@vger.kernel.org
X-Original-Date: Tue, 20 Mar 2001 19:43:50 +1100
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: ni...@nrg.org
Lines: 80

In message <Pine.LNX.4.05.10103141653350.3094-100...@cosmic.nrg.org> you write:
> Kernel preemption is not allowed while spinlocks are held, which means
> that this patch alone cannot guarantee low preemption latencies.  But
> as long held locks (in particular the BKL) are replaced by finer-grained
> locks, this patch will enable lower latencies as the kernel also becomes
> more scalable on large SMP systems.

Hi Nigel,

	I can see three problems with this approach, only one of which
is serious.

The first is that code which is already SMP-unsafe is now a problem for
everyone, not just for the 0.1% of machines that are SMP.  I consider
this a good thing for 2.5 though.

The second is that there are "manual" locking schemes which are used
in several places in the kernel which rely on non-preemptability;
de-facto spinlocks if you will.  I consider all these uses flawed: (1)
they are often subtly broken anyway, (2) they make reading those parts
of the code much harder, and (3) they break when things like this are
done.

The third is that preemptibility conflicts with the naive
quiescent-period approach proposed for module unloading in 2.5, and
useful for several other things (eg. hotplugging CPUs).  This method
relies on knowing that when a schedule() has occurred on every CPU, we
know no one is holding certain references.  The simplest example is a
singly linked list: you can traverse it without a lock as long as you
don't sleep, and then someone can unlink a node, and wait for a
schedule on every other CPU before freeing it.  The non-SMP case is a
noop.  See synchronize_kernel() below.

This, too, is soluble, but it means that synchronize_kernel() must
guarantee that each task which was running or preempted in kernel
space when it was called has been non-preemptively scheduled before
synchronize_kernel() can exit.  Icky.

Thoughts?
Rusty.
--
Premature optmztion is rt of all evl. --DK

/* We could keep a schedule count for each CPU and make idle tasks
   schedule (some don't unless need_resched), but this scales quite
   well (eg. 64 processors, average time to wait for first schedule =
   jiffie/64.  Total time for all processors = jiffie/63 + jiffie/62...).

   At 1024 cpus, this is about 7.5 jiffies.  And that assumes no one
   schedules early. --RR */
void synchronize_kernel(void)
{
	unsigned long cpus_allowed, policy, rt_priority;

	/* Save current state */
	cpus_allowed = current->cpus_allowed;
	policy = current->policy;
	rt_priority = current->rt_priority;

	/* Create an unreal time task. */
	current->policy = SCHED_FIFO;
	current->rt_priority = 1001 + sys_sched_get_priority_max(SCHED_FIFO);

	/* Make us schedulable on all CPUs. */
	current->cpus_allowed = (1UL<<smp_num_cpus)-1;
	
	/* Eliminate current cpu, reschedule.  Use 1UL so the shift is done
	   at the width of cpus_allowed (unsigned long), not int. */
	while ((current->cpus_allowed &= ~(1UL << smp_processor_id())) != 0)
		schedule();

	/* Back to normal. */
	current->cpus_allowed = cpus_allowed;
	current->policy = policy;
	current->rt_priority = rt_priority;
}
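
The 7.5-jiffy figure in the comment above can be checked numerically.
This is only a sketch under the comment's own approximation (with k CPUs
still to visit, the expected wait for the next schedule is 1/k jiffies);
the helper name expected_wait_jiffies() is made up for illustration and
is not kernel code.

```c
#include <assert.h>

/* Expected total wait, in jiffies, until every other CPU has been
 * visited: 1/(n-1) + 1/(n-2) + ... + 1/1 for n CPUs.
 * Hypothetical helper for checking the estimate, not kernel code. */
double expected_wait_jiffies(int ncpus)
{
	double total = 0.0;
	int remaining;

	assert(ncpus >= 1);
	/* One term per CPU that still has to be visited. */
	for (remaining = ncpus - 1; remaining >= 1; remaining--)
		total += 1.0 / remaining;
	return total;
}
```

The sum grows logarithmically with the CPU count, which is why the
comment says the scheme scales quite well.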

X-Mailer: exmh version 2.1.1 10/15/1999
From: Keith Owens <k...@ocs.com.au>
Subject: Re: [PATCH for 2.5] preemptible kernel 
In-Reply-To: Your message of "Tue, 20 Mar 2001 19:43:50 +1100."
             <m14fHk9-001PKgC@mozart> 
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Tue, 20 Mar 2001 10:50:03 +0100
Message-ID: <851.985080735@ocs3.ocs-net>
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: 16738.anti-phl.bofh.it
Newsgroups: linux.kernel
Organization: linux.*_mail_to_news_unidirectional_gateway
Path: supernews.google.com!sn-xit-03!supernews.com!hermes.visi.com!
news-out.visi.com!newspump.sol.net!nntp.msen.com!newsxfer.eecs.umich.edu!
news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!newsfeed00.sul.t-online.de!
t-online.de!bofh.it!robomod
References: <m14fHk9-001PKgC@mozart>
X-Original-Cc: ni...@nrg.org, linux-ker...@vger.kernel.org
X-Original-Date: Tue, 20 Mar 2001 20:32:15 +1100
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: Rusty Russell <ru...@rustcorp.com.au>
Lines: 23

On Tue, 20 Mar 2001 19:43:50 +1100, 
Rusty Russell <ru...@rustcorp.com.au> wrote:
>The third is that preemptibility conflicts with the naive
>quiescent-period approach proposed for module unloading in 2.5, and
>useful for several other things (eg. hotplugging CPUs).  This method
>relies on knowing that when a schedule() has occurred on every CPU, we
>know no one is holding certain references.
>
>This, too, is soluble, but it means that synchronize_kernel() must
>guarantee that each task which was running or preempted in kernel
>space when it was called, has been non-preemptively scheduled before
>synchronize_kernel() can exit.  Icky.

The preemption patch only allows preemption from interrupt and only for
a single level of preemption.  That coexists quite happily with
synchronize_kernel() which runs in user context.  Just count user
context schedules (preempt_count == 0), not preemptive schedules.
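
A user-space sketch of what "count only user context schedules" could
look like.  Everything here is an assumption made for illustration: the
array quiescent_count[], the NR_CPUS value and the three helper names do
not come from the patch.  It models only the suggestion in the paragraph
above and ignores the possibility of a preempted task migrating to
another CPU.

```c
#include <assert.h>

#define NR_CPUS 4	/* arbitrary size for the sketch */

/* Per-CPU count of non-preemptive context switches (hypothetical). */
static unsigned long quiescent_count[NR_CPUS];

/* Called on every context switch on 'cpu'.  'preempt_count' belongs to
 * the outgoing task: nonzero means it was preempted in kernel space and
 * may still hold references, so that switch is not counted. */
void note_schedule(int cpu, int preempt_count)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (preempt_count == 0)
		quiescent_count[cpu]++;
}

/* Record the counters at the moment the updater starts waiting. */
void snapshot_counts(unsigned long snap[NR_CPUS])
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		snap[cpu] = quiescent_count[cpu];
}

/* Nonzero once every CPU has made at least one counted switch
 * since the snapshot was taken. */
int all_cpus_quiesced(const unsigned long snap[NR_CPUS])
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (quiescent_count[cpu] == snap[cpu])
			return 0;
	return 1;
}
```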


Date: Wed, 21 Mar 2001 02:00:04 +0100
From: Nigel Gamble <ni...@nrg.org>
Reply-To: ni...@nrg.org
Subject: Re: [PATCH for 2.5] preemptible kernel
In-Reply-To: <851.985080735@ocs3.ocs-net>
Message-ID: <Pine.LNX.4.05.10103201625430.26853-100000@cosmic.nrg.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: 18895.anti-phl.bofh.it
Newsgroups: linux.kernel
Organization: linux.*_mail_to_news_unidirectional_gateway
Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod
References: <851.985080735@ocs3.ocs-net>
X-Original-Cc: Rusty Russell <ru...@rustcorp.com.au>,
	linux-ker...@vger.kernel.org
X-Original-Date: Tue, 20 Mar 2001 16:48:01 -0800 (PST)
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: Keith Owens <k...@ocs.com.au>
Lines: 29

On Tue, 20 Mar 2001, Keith Owens wrote:
> The preemption patch only allows preemption from interrupt and only for
> a single level of preemption.  That coexists quite happily with
> synchronize_kernel() which runs in user context.  Just count user
> context schedules (preempt_count == 0), not preemptive schedules.

I'm not sure what you mean by "only for a single level of preemption."
It's possible for a preempting process to be preempted itself by a
higher priority process, and for that process to be preempted by an even
higher priority one, limited only by the number of processes waiting for
interrupt handlers to make them runnable.  This isn't very likely in
practice (kernel preemptions tend to be rare compared to normal calls to
schedule()), but it could happen in theory.

If you're looking at preempt_schedule(), note the call to ctx_sw_off()
only increments current->preempt_count for the preempted task - the
higher priority preempting task that is about to be scheduled will have
a preempt_count of 0.

Nigel Gamble                                    ni...@nrg.org
Mountain View, CA, USA.                         http://www.nrg.org/

MontaVista Software                             ni...@mvista.com


X-Mailer: exmh version 2.1.1 10/15/1999
From: Keith Owens <k...@ocs.com.au>
Subject: Re: [PATCH for 2.5] preemptible kernel 
In-Reply-To: Your message of "Tue, 20 Mar 2001 16:48:01 -0800."
             <Pine.LNX.4.05.10103201625430.26853-100000@cosmic.nrg.org> 
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Wed, 21 Mar 2001 02:30:04 +0100
Message-ID: <16074.985137800@kao2.melbourne.sgi.com>
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: 30003.anti-phl.bofh.it
Newsgroups: linux.kernel
Organization: linux.*_mail_to_news_unidirectional_gateway
Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod
References: <Pine.LNX.4.05.10103201625430.26853-100000@cosmic.nrg.org>
X-Original-Cc: Rusty Russell <ru...@rustcorp.com.au>,
	linux-ker...@vger.kernel.org
X-Original-Date: Wed, 21 Mar 2001 12:23:20 +1100
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: ni...@nrg.org
Lines: 70

On Tue, 20 Mar 2001 16:48:01 -0800 (PST), 
Nigel Gamble <ni...@nrg.org> wrote:
>On Tue, 20 Mar 2001, Keith Owens wrote:
>> The preemption patch only allows preemption from interrupt and only for
>> a single level of preemption.  That coexists quite happily with
>> synchronize_kernel() which runs in user context.  Just count user
>> context schedules (preempt_count == 0), not preemptive schedules.
>
>If you're looking at preempt_schedule(), note the call to ctx_sw_off()
>only increments current->preempt_count for the preempted task - the
>higher priority preempting task that is about to be scheduled will have
>a preempt_count of 0.

I misread the code, but the idea is still correct.  Add a preemption
depth counter to each cpu, when you schedule and the depth is zero then
you know that the cpu is no longer holding any references to quiesced
structures.

>So, to make sure I understand this, the code to free a node would look
>like:
>
>	prev->next = node->next; /* assumed to be atomic */
>	synchronize_kernel();
>	free(node);
>
>So that any other CPU concurrently traversing the list would see a
>consistent state, either including or not including "node" before the
>call to synchronize_kernel(); but after synchronize_kernel() all other
>CPUs are guaranteed to see a list that no longer includes "node", so it
>is now safe to free it.
>
>It looks like there are also implicit assumptions to this approach, like
>no other CPU is trying to use the same approach simultaneously to free
>"prev".

Not quite.  The idea is that readers can traverse lists without locks,
while code that changes the list needs to take a semaphore first.

Read
	node = node->next;

Update
	down(&list_sem);
	prev->next = node->next;
	synchronize_kernel();
	free(node);
	up(&list_sem);

Because the readers have no locks, other cpus can have references to
the node being freed.  The updating code needs to wait until all other
cpus have gone through at least one schedule to ensure that all
dangling references have been flushed.  Adding preemption complicates
this slightly: we have to distinguish between the bottom level schedule
and higher level schedules for preemptive code.  Only when all
preemptive code on a cpu has ended is it safe to say that there are no
dangling references left on that cpu.
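
The read/update pattern above can be modelled in user space as a
single-threaded simulation.  This is a sketch only: sched_count[],
struct pending_free, cpu_schedule(), unlink_after() and try_free() are
hypothetical names, the semaphore is omitted because there is only one
updater, and polling try_free() stands in for the blocking
synchronize_kernel() call.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 2	/* arbitrary size for the sketch */

struct node {
	int value;
	struct node *next;
};

/* Bumped each time a simulated CPU goes through a bottom-level schedule. */
static unsigned long sched_count[NR_CPUS];

/* An unlinked node waiting for every CPU to schedule once. */
struct pending_free {
	struct node *victim;
	unsigned long snap[NR_CPUS];
};

void cpu_schedule(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	sched_count[cpu]++;
}

/* Phase 1: unlink the node after 'prev' with a single pointer store and
 * remember the schedule counts.  The victim's own next pointer is left
 * intact so a reader already standing on it can keep walking. */
void unlink_after(struct node *prev, struct pending_free *pf)
{
	int cpu;

	pf->victim = prev->next;
	prev->next = pf->victim->next;
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		pf->snap[cpu] = sched_count[cpu];
}

/* Phase 2: free the victim only if every CPU has scheduled since the
 * unlink.  Returns 1 if the node was freed, 0 if it is still too early. */
int try_free(struct pending_free *pf)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (sched_count[cpu] == pf->snap[cpu])
			return 0;
	free(pf->victim);
	pf->victim = NULL;
	return 1;
}
```

The read side needs no code of its own here: it is just
node = node->next, which is the point of the scheme.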

This method is a win for high read, low update lists.  Instead of
penalizing the read code every time on the off chance that somebody
will update the data, speed up the common code and penalize the update
code.  The classic example is module code, it is rarely unloaded but
right now everything that *might* be entering a module has to grab the
module spin lock and update the module use count.  So main line code is
being slowed down all the time.


Date: Wed, 21 Mar 2001 04:50:03 +0100
From: Nigel Gamble <ni...@nrg.org>
Reply-To: ni...@nrg.org
Subject: Re: [PATCH for 2.5] preemptible kernel 
In-Reply-To: <16074.985137800@kao2.melbourne.sgi.com>
Message-ID: <Pine.LNX.4.05.10103201920410.26853-100000@cosmic.nrg.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: 54379.anti-phl.bofh.it
Newsgroups: linux.kernel
Organization: linux.*_mail_to_news_unidirectional_gateway
Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod
References: <16074.985137800@kao2.melbourne.sgi.com>
X-Original-Cc: Rusty Russell <ru...@rustcorp.com.au>,
	linux-ker...@vger.kernel.org
X-Original-Date: Tue, 20 Mar 2001 19:35:17 -0800 (PST)
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: Keith Owens <k...@ocs.com.au>
Lines: 21

On Wed, 21 Mar 2001, Keith Owens wrote:
> I misread the code, but the idea is still correct.  Add a preemption
> depth counter to each cpu, when you schedule and the depth is zero then
> you know that the cpu is no longer holding any references to quiesced
> structures.

A task that has been preempted is on the run queue and can be
rescheduled on a different CPU, so I can't see how a per-CPU counter
would work.  It seems to me that you would need a per run queue
counter, like the example I gave in a previous posting.

Nigel Gamble                                    ni...@nrg.org
Mountain View, CA, USA.                         http://www.nrg.org/

MontaVista Software                             ni...@mvista.com


Message-ID: <3AB860A8.182A10C7@mvista.com>
Date: Wed, 21 Mar 2001 09:30:04 +0100
From: george anzinger <geo...@mvista.com>
Organization: Monta Vista Software
X-Mailer: Mozilla 4.72 [en] (X11; I; Linux 2.2.12-20b i686)
X-Accept-Language: en
MIME-Version: 1.0
Subject: Re: [PATCH for 2.5] preemptible kernel
References: <Pine.LNX.4.05.10103201920410.26853-100000@cosmic.nrg.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: 49545.anti-phl.bofh.it
Newsgroups: linux.kernel
Path: supernews.google.com!sn-xit-02!supernews.com!news.tele.dk!194.25.134.62!
newsfeed00.sul.t-online.de!t-online.de!bofh.it!robomod
X-Original-Cc: Keith Owens <k...@ocs.com.au>,
	Rusty Russell <ru...@rustcorp.com.au>, linux-ker...@vger.kernel.org
X-Original-Date: Wed, 21 Mar 2001 00:04:56 -0800
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: ni...@nrg.org
Lines: 30

Nigel Gamble wrote:
> 
> On Wed, 21 Mar 2001, Keith Owens wrote:
> > I misread the code, but the idea is still correct.  Add a preemption
> > depth counter to each cpu, when you schedule and the depth is zero then
> > you know that the cpu is no longer holding any references to quiesced
> > structures.
> 
> A task that has been preempted is on the run queue and can be
> rescheduled on a different CPU, so I can't see how a per-CPU counter
> would work.  It seems to me that you would need a per run queue
> counter, like the example I gave in a previous posting.

Exactly so.  The method does not depend on the sum of preemption being
zip, but on each potential reader (writers take locks) passing thru a
"sync point".  Your notion of waiting for each task to arrive
"naturally" at schedule() would work.  It is, in fact, overkill as you
could also add arrival at sys call exit as a (the) "sync point".  In
fact, for module unload, isn't this the real "sync point"?  After all, a
module can call schedule, or did I miss a usage counter somewhere?

By the way, there is a paper on this somewhere on the web.  Anyone
remember where?

George

Message-ID: <3AC1BAD3.BBBD97E1@sequent.com>
Date: Wed, 28 Mar 2001 12:30:06 +0200
From: Dipankar Sarma <dipan...@sequent.com>
Organization: IBM Linux Technology Center
X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i686)
X-Accept-Language: en
MIME-Version: 1.0
Subject: Re: [PATCH for 2.5] preemptible kernel
References: <16074.985137800@kao2.melbourne.sgi.com> 
<Pine.LNX.4.05.10103201920410.26853-100000@cosmic.nrg.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: 74434.anti-phl.bofh.it
Newsgroups: linux.kernel
Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod
X-Original-Cc: linux-ker...@vger.kernel.org, mcken...@sequent.com
X-Original-Date: Wed, 28 Mar 2001 15:50:03 +0530
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: ni...@nrg.org
Lines: 51

Nigel Gamble wrote:
> 
> On Wed, 21 Mar 2001, Keith Owens wrote:
> > I misread the code, but the idea is still correct.  Add a preemption
> > depth counter to each cpu, when you schedule and the depth is zero then
> > you know that the cpu is no longer holding any references to quiesced
> > structures.
> 
> A task that has been preempted is on the run queue and can be
> rescheduled on a different CPU, so I can't see how a per-CPU counter
> would work.  It seems to me that you would need a per run queue
> counter, like the example I gave in a previous posting.

Also, a task could be preempted and then rescheduled on the same cpu,
making the depth counter 0 (right?), but it could still be holding
references to data structures to be updated using synchronize_kernel().
There seem to be two approaches to tackle preemption -

1. Disable preemption during the time when references to data
structures updated using such two-phase updates are held.

Pros: easy to implement using a flag (ctx_sw_off() ?)
Cons: not so easy to use since critical sections need to be clearly
identified and interfaces defined.  Also affects preemptive behavior.

2. In synchronize_kernel(), distinguish between "natural" and preemptive
schedules() and ignore preemptive ones.

Pros: easy to use
Cons: not so easy to implement.  Also, a low-priority task that keeps
getting preempted can affect update-side performance significantly.

I intend to experiment with both to understand the impact.

Thanks
Dipankar
-- 
Dipankar Sarma  (dipan...@sequent.com)
IBM Linux Technology Center
IBM Software Lab, Bangalore, India.
Project Page: http://lse.sourceforge.net

Message-ID: <3AC1CF4B.17B29EA4@sequent.com>
Date: Wed, 28 Mar 2001 14:00:07 +0200
From: Dipankar Sarma <dipan...@sequent.com>
Organization: IBM Linux Technology Center
X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i686)
X-Accept-Language: en
MIME-Version: 1.0
Subject: Re: [PATCH for 2.5] preemptible kernel
References: <Pine.LNX.4.05.10103201920410.26853-100000@cosmic.nrg.org> 
<3AB860A8.182A10C7@mvista.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: 93503.anti-phl.bofh.it
Newsgroups: linux.kernel
Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod
X-Original-Cc: linux-ker...@vger.kernel.org, mcken...@sequent.com
X-Original-Date: Wed, 28 Mar 2001 17:17:23 +0530
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: george anzinger <geo...@mvista.com>
Lines: 55

Hi George,

george anzinger wrote:
> 
> Exactly so.  The method does not depend on the sum of preemption being
> zip, but on each potential reader (writers take locks) passing thru a
> "sync point".  Your notion of waiting for each task to arrive
> "naturally" at schedule() would work.  It is, in fact, overkill as you
> could also add arrival at sys call exit as a (the) "sync point".  In
> fact, for module unload, isn't this the real "sync point"?  After all, a
> module can call schedule, or did I miss a usage counter somewhere?

It is certainly possible to implement a synchronize_kernel()-like
primitive for two-phase update using a "sync point".  Waiting for
sys call exit will perhaps work in the module unloading case,
but there could be performance issues if a cpu spends most of
its time in the idle task or in interrupts.  synchronize_kernel()
provides a simple generic way of implementing a two-phase update
without serialization for reading.

I am working on a "sync point" based version of such an approach,
available at http://lse.sourceforge.net/locking/rclock.html.  It
is based on the original DYNIX/ptx stuff that Paul McKenney
developed in the early 90s.  This and synchronize_kernel() are
very similar in approach and each can be implemented using
the other.

As for handling preemption, we can perhaps try 2 things -

1. The read side of the critical section is enclosed in
RC_RDPROTECT()/RC_RDUNPROTECT() which are currently nops.
We can disable/enable preemption using these.

2. Avoid counting preemptive context switches. I am not sure
about this one though.
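
Option 1 could be sketched roughly as follows.  This is an assumption
about how the markers might disable preemption, not the actual
read-copy update code: the task argument, the preempt_count field and
preemption_allowed() are all made up for the sketch, which uses a plain
(non-atomic) counter in a single thread.

```c
#include <assert.h>

/* Stand-in for the kernel's task structure; only the one field this
 * sketch needs. */
struct task {
	int preempt_count;	/* > 0 means "do not preempt this task" */
};

/* Hypothetical definitions.  A counter rather than a flag is used so
 * that read-side sections can nest. */
#define RC_RDPROTECT(t)		((void)((t)->preempt_count++))
#define RC_RDUNPROTECT(t)	((void)((t)->preempt_count--))

/* What an interrupt-return path would check before preempting 't'. */
int preemption_allowed(const struct task *t)
{
	assert(t->preempt_count >= 0);
	return t->preempt_count == 0;
}
```

A reader would bracket its list traversal with the two markers, so a
preemptive schedule can never happen while it holds a reference.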

> 
> By the way, there is a paper on this somewhere on the web.  Anyone
> remember where?

If you are talking about Paul's paper, the link is
http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf.

Thanks
Dipankar
-- 
Dipankar Sarma  (dipan...@sequent.com)
IBM Linux Technology Center
IBM Software Lab, Bangalore, India.
Project Page: http://lse.sourceforge.net

Message-ID: <3AC24EB6.1F0DD551@mvista.com>
Date: Wed, 28 Mar 2001 23:10:03 +0200
From: george anzinger <geo...@mvista.com>
Organization: Monta Vista Software
X-Mailer: Mozilla 4.72 [en] (X11; I; Linux 2.2.12-20b i686)
X-Accept-Language: en
MIME-Version: 1.0
Subject: Re: [PATCH for 2.5] preemptible kernel
References: <16074.985137800@kao2.melbourne.sgi.com> 
<Pine.LNX.4.05.10103201920410.26853-100000@cosmic.nrg.org> <3AC1BAD3.BBBD97E1@sequent.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: 77794.anti-phl.bofh.it
Newsgroups: linux.kernel
Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod
X-Original-Cc: ni...@nrg.org, linux-ker...@vger.kernel.org, mcken...@sequent.com
X-Original-Date: Wed, 28 Mar 2001 12:51:02 -0800
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: Dipankar Sarma <dipan...@sequent.com>
Lines: 55

Dipankar Sarma wrote:
> 
> Nigel Gamble wrote:
> >
> > On Wed, 21 Mar 2001, Keith Owens wrote:
> > > I misread the code, but the idea is still correct.  Add a preemption
> > > depth counter to each cpu, when you schedule and the depth is zero then
> > > you know that the cpu is no longer holding any references to quiesced
> > > structures.
> >
> > A task that has been preempted is on the run queue and can be
> > rescheduled on a different CPU, so I can't see how a per-CPU counter
> > would work.  It seems to me that you would need a per run queue
> > counter, like the example I gave in a previous posting.
> 
> Also, a task could be preempted and then rescheduled on the same cpu
> making the depth counter 0 (right ?), but it could still be holding
> references to data structures to be updated using synchronize_kernel().
> There seems to be two approaches to tackle preemption -
> 
> 1. Disable pre-emption during the time when references to data
> structures updated using such Two-phase updates are held.

Doesn't this fly in the face of the whole Two-phase system?  It seems to
me that the point was to not require any locks.  Preemption disable IS a
lock.  Not as strong as some, but a lock nonetheless.
> 
> Pros: easy to implement using a flag (ctx_sw_off() ?)
> Cons: not so easy to use since critical sections need to be clearly
> identified and interfaces defined. also affects preemptive behavior.
> 
> 2. In synchronize_kernel(), distinguish between "natural" and preemptive
> schedules() and ignore preemptive ones.
> 
> Pros: easy to use
> Cons: Not so easy to implement. Also a low priority task that keeps
> getting preempted often can affect update side performance significantly.

Actually it is fairly easy to distinguish the two (see TASK_PREEMPTED in
state).  Don't you also have to have some sort of task flag that
indicates that the task is one that needs to sync?  Something that gets
set when it enters the area of interest and cleared when it hits the
sync point?  

George

Date: Thu, 29 Mar 2001 11:50:03 +0200
From: Dipankar Sarma <dipan...@sequent.com>
Subject: Re: [PATCH for 2.5] preemptible kernel
Message-ID: <20010329151330.A7361@in.ibm.com>
Reply-To: dipan...@sequent.com
References: <16074.985137800@kao2.melbourne.sgi.com> 
<Pine.LNX.4.05.10103201920410.26853-100000@cosmic.nrg.org> 
<3AC1BAD3.BBBD97E1@sequent.com> <3AC24EB6.1F0DD551@mvista.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0.1i
In-Reply-To: <3AC24EB6.1F0DD551@mvista.com>; from george@mvista.com on Wed, 
Mar 28, 2001 at 12:51:02PM -0800
Sender: robo...@news.nic.it
X-Mailing-List: linux-kernel@vger.kernel.org
Approved: robo...@news.nic.it (1.20)
NNTP-Posting-Host: 52692.anti-phl.bofh.it
Newsgroups: linux.kernel
Organization: linux.*_mail_to_news_unidirectional_gateway
Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod
X-Original-Cc: Dipankar Sarma <dipan...@sequent.com>, ni...@nrg.org,
	linux-ker...@vger.kernel.org, mcken...@sequent.com
X-Original-Date: Thu, 29 Mar 2001 15:13:30 +0530
X-Original-Sender: linux-kernel-ow...@vger.kernel.org
X-Original-To: george anzinger <geo...@mvista.com>
Lines: 75

On Wed, Mar 28, 2001 at 12:51:02PM -0800, george anzinger wrote:
> Dipankar Sarma wrote:
> > 
> > Also, a task could be preempted and then rescheduled on the same cpu
> > making the depth counter 0 (right ?), but it could still be holding
> > references to data structures to be updated using synchronize_kernel().
> > There seems to be two approaches to tackle preemption -
> > 
> > 1. Disable pre-emption during the time when references to data
> > structures updated using such Two-phase updates are held.
> 
> Doesn't this fly in the face of the whole Two-phase system?  It seems to
> me that the point was to not require any locks.  Preemption disable IS a
> lock.  Not as strong as some, but a lock nonetheless.

The point is to avoid acquiring costly locks, so it is a question of
relative cost. Disabling preemption (by an atomic increment) for 
short critical sections may not be as bad as spin-waiting for 
highly contended locks or thrashing lock cachelines.


> > 
> > Pros: easy to implement using a flag (ctx_sw_off() ?)
> > Cons: not so easy to use since critical sections need to be clearly
> > identified and interfaces defined. also affects preemptive behavior.
> > 
> > 2. In synchronize_kernel(), distinguish between "natural" and preemptive
> > schedules() and ignore preemptive ones.
> > 
> > Pros: easy to use
> > Cons: Not so easy to implement. Also a low priority task that keeps
> > getting preempted often can affect update side performance significantly.
> 
> Actually it is fairly easy to distinguish the two (see TASK_PREEMPTED in
> state).  Don't you also have to have some sort of task flag that
> indicates that the task is one that needs to sync?  Something that gets
> set when it enters the area of interest and cleared when it hits the
> sync point?  

Neither of the two two-phase update implementations (synchronize_kernel()
by Rusty and read-copy update by us) monitors the tasks that require
sync for update.  synchronize_kernel() forces a schedule on every
cpu, and read-copy update waits until every cpu goes through
a quiescent state, before updating.  Both approaches will require
significant special handling because they implicitly assume
that a task inside the kernel is bound to the current cpu until it
reaches a quiescent state (like a "normal" context switch).  Since
preempted tasks can run later on any cpu, we will have to keep track
of sync points on a per-task basis, and that will probably require
using a snapshot of the running tasks from the global runqueue.  That
may not be a good thing from a performance standpoint, not to mention
the complexity.
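
To make the per-task idea concrete, here is a small user-space model.
It is a sketch under my own assumptions, not code from either
implementation: the task table, the must_sync flag, the state names and
the three functions are all hypothetical, and a fixed array stands in
for the global runqueue.

```c
#include <assert.h>

#define MAX_TASKS 8	/* arbitrary size for the sketch */

/* Hypothetical task states; IN_USER is 0 so the table starts idle. */
enum where { IN_USER = 0, IN_KERNEL_RUNNING, IN_KERNEL_PREEMPTED };

struct task {
	enum where state;
	int must_sync;	/* set by the snapshot, cleared at a sync point */
};

static struct task tasks[MAX_TASKS];

/* Start of the wait: mark every task that was running or preempted in
 * kernel space, since it may hold a reference.  Returns how many tasks
 * were marked. */
int snapshot_tasks(void)
{
	int i, pending = 0;

	for (i = 0; i < MAX_TASKS; i++) {
		tasks[i].must_sync = (tasks[i].state != IN_USER);
		pending += tasks[i].must_sync;
	}
	return pending;
}

/* Called when task 'i' is switched out.  Only a voluntary ("normal")
 * switch is a sync point; a preemptive one leaves the mark in place,
 * whichever cpu the task later resumes on. */
void task_schedule(int i, int voluntary)
{
	assert(i >= 0 && i < MAX_TASKS);
	if (voluntary)
		tasks[i].must_sync = 0;
	else
		tasks[i].state = IN_KERNEL_PREEMPTED;
}

/* Nonzero once every task marked by the snapshot has passed a sync point. */
int grace_period_done(void)
{
	int i;

	for (i = 0; i < MAX_TASKS; i++)
		if (tasks[i].must_sync)
			return 0;
	return 1;
}
```

The scan over all tasks in snapshot_tasks() and grace_period_done() is
exactly the cost being worried about above.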

Also, in situations where the read-to-write ratio is not heavily
skewed towards reads, or where lots of updates are happening, a very
low-priority task can have a significant impact on performance
by getting preempted all the time.

Thanks
Dipankar
-- 
Dipankar Sarma  (dipan...@sequent.com)
IBM Linux Technology Center
IBM Software Lab, Bangalore, India.
Project Page: http://lse.sourceforge.net