get_vm_area and GFP

From: David Grothe <d...@gcom.com>
Subject: get_vm_area and GFP_KERNEL
Date: 1998/10/22
Message-ID: <fa.dvn9shv.n6ou9q@ifi.uio.no>#1/1
X-Deja-AN: 403997034
Original-Date: Thu, 22 Oct 1998 12:33:27 -0500
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-ID: <362F6C67.96A22EE6@gcom.com>
To: Linux Kernel <linux-ker...@vger.rutgers.edu>, Alan Cox <a...@lxorguk.ukuu.org.uk>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Gcom, Inc
MIME-Version: 1.0
Reply-To: d...@gcom.com
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Hello:

The routine get_vm_area calls kmalloc passing GFP_KERNEL.  Get_vm_area
is called from vremap (2.0 kernels) and ioremap (2.1 kernels).  These
routines are used to map physical memory (as in PCI device registers) to
kernel address space.

The problem:  If these routines are called from a bottom half task queue
routine then kmalloc thinks that it is being called from interrupt level
without GFP_ATOMIC and prints a message to that effect.

Would any harm be done by changing this?

These would be the patches:

Kernel version 2.0.36:

--- vmalloc.c   Wed Jun  3 17:17:50 1998
+++ vmalloc.new.c       Thu Oct 22 12:25:41 1998
@@ -239,7 +239,7 @@
        void *addr;
        struct vm_struct **p, *tmp, *area;

-       area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL);
+       area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_ATOMIC);
        if (!area)
                return NULL;
        addr = (void *) VMALLOC_START;

Kernel version 2.1.120:

--- vmalloc.c   Fri Sep 18 15:43:33 1998
+++ vmalloc.new.c       Thu Oct 22 12:24:35 1998
@@ -155,7 +155,7 @@
        unsigned long addr;
        struct vm_struct **p, *tmp, *area;

-       area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL);
+       area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_ATOMIC);
        if (!area)
                return NULL;
        addr = VMALLOC_START;

To the casual observer, it looks safe.  Will someone make these
modifications to the kernel source?

Thanks,
Dave


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: "Benjamin C.R. LaHaise" <b...@kvack.org>
Subject: Re: get_vm_area and GFP_KERNEL
Date: 1998/10/22
Message-ID: <fa.k0m9mtv.f5i1is@ifi.uio.no>#1/1
X-Deja-AN: 404034403
Original-Date: Thu, 22 Oct 1998 15:41:37 -0400 (EDT)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.95.981022152401.20122A-100000@as200.spellcast.com>
References: <fa.dvn9shv.n6ou9q@ifi.uio.no>
To: David Grothe <d...@gcom.com>
X-Sender: b...@as200.spellcast.com
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Thu, 22 Oct 1998, David Grothe wrote:

...
> The problem:  If these routines are called from a bottom half task queue
> routine then kmalloc thinks that it is being called from interrupt level
> without GFP_ATOMIC and prints a message to that effect.

But who is calling ioremap or vremap from a bottom half?  That isn't
permitted!  Device drivers are supposed to setup their mappings ahead of
time, right?

> Would any harm be done by changing this?

Yes, reliability of vmalloc/ioremap would be degraded, whereas many of the
callers expect the system to make a reasonable attempt to perform the
operation (we all know about the sound & floppy driver transient failures
due to DMA).  If such a hack must be added, it should use something like
the gfp_any macro.

		-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: David Grothe <d...@gcom.com>
Subject: Re: get_vm_area and GFP_KERNEL
Date: 1998/12/30
Message-ID: <fa.dso5pfv.l1c2q5@ifi.uio.no>#1/1
X-Deja-AN: 426982918
Original-Date: Tue, 29 Dec 1998 15:48:01 -0600
Sender: owner-linux-ker...@vger.rutgers.edu
Content-Transfer-Encoding: 7bit
Original-Message-ID: <36894E11.C5334911@gcom.com>
References: <fa.k0m9mtv.f5i1is@ifi.uio.no>
To: "Benjamin C.R. LaHaise" <b...@kvack.org>
Original-References: <Pine.LNX.3.95.981022152401.20122A-100...@as200.spellcast.com>
Content-Type: text/plain; charset=us-ascii
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Gcom, Inc
MIME-Version: 1.0
Reply-To: d...@gcom.com
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

Benjamin C.R. LaHaise wrote:

> On Thu, 22 Oct 1998, David Grothe wrote:
>
> ...
> > The problem:  If these routines are called from a bottom half task queue
> > routine then kmalloc thinks that it is being called from interrupt level
> > without GFP_ATOMIC and prints a message to that effect.
>
> But who is calling ioremap or vremap from a bottom half?  That isn't
> permitted!  Device drivers are supposed to setup their mappings ahead of
> time, right?

Not in my case.  I have drivers that execute queued requests from a bottom
half routine.  The driver accepts configuration information from a user-space
utility that runs long after boot time.  The configuration information
specifies such hardware parameters as IRQ and shared memory address.

I don't understand what the problem is here.  When get_vm_area is called with
GFP_KERNEL and it finds itself "at interrupt level", it simply changes the
priority parameter to GFP_ATOMIC and proceeds with the request.  The
allocation routines seem to have code in them to protect critical sections
from being interrupted.

So why have the restriction on when device memory can be mapped?  Or why not
create a variant of ioremap/vremap with a "priority" parameter so that I can
have some control over this?

Thanks,
Dave

>
>
> > Would any harm be done by changing this?
>
> Yes, reliability of vmalloc/ioremap would be degraded, whereas many of the
> callers expect the system to make a reasonable attempt to perform the
> operation (we all know about the sound & floppy driver transient failures
> due to DMA).  If such a hack must be added, it should use something like
> the gfp_any macro.

If so, then why does get_vm_area simply change  the value of the priority
parameter and proceed with the request?

-- Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: David Grothe <d...@gcom.com>
Subject: get_vm_area needs spin lock
Date: 1998/12/31
Message-ID: <fa.een5qhv.n681ac@ifi.uio.no>#1/1
X-Deja-AN: 427334243
Original-Date: Wed, 30 Dec 1998 10:27:57 -0600
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <368A548D.23BBF8DE@gcom.com>
References: <fa.k0m9mtv.f5i1is@ifi.uio.no>
To: "Benjamin C.R. LaHaise" <b...@kvack.org>
Original-References: <Pine.LNX.3.95.981022152401.20122A-100...@as200.spellcast.com>
Content-Type: multipart/mixed; boundary="------------D18E65256A1F34E347650765"
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Gcom, Inc
MIME-Version: 1.0
Reply-To: d...@gcom.com
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

This is a multi-part message in MIME format.
--------------D18E65256A1F34E347650765
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Kernel hackers:

As a result of some correspondence concerning the routine get_vm_area, I
have been taking a close look at that routine.  It seems to me that in
an SMP environment this routine needs spin lock protection around the
loop that runs down the linked list of areas.  If two or more CPUs are
executing this routine simultaneously the list could get tangled.

I am enclosing a patch for 2.2.0pre1 (mm/vmalloc.c) that I think fixes
this problem.  The patch is enclosed as a MIME attachment to avoid
line-wrapping problems with cutting/pasting into an e-mail.

I put in spin lock protection for all cases of traversing the linked
list.  Someone who knows the code better could probably find shorter
spans of code to protect.

This problem would not be present in a 2.0 kernel since the SMP
technique there is to protect the entire kernel with one lock.

-- Dave
--------------D18E65256A1F34E347650765
Content-Type: text/plain; charset=us-ascii; name="vmalloc.c.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="vmalloc.c.patch"

--- vmalloc.c.old	Fri Nov 20 13:43:19 1998
+++ vmalloc.c	Wed Dec 30 09:50:14 1998
@@ -10,6 +10,7 @@
 #include <asm/uaccess.h>

 static struct vm_struct * vmlist = NULL;
+static spinlock_t	  vmlist_spinlock;

 static inline void free_area_pte(pmd_t * pmd, unsigned long address, unsigned long size)
 {
@@ -152,28 +153,34 @@
 struct vm_struct * get_vm_area(unsigned long size)
 {
 	unsigned long addr;
+	unsigned long save_flags;
 	struct vm_struct **p, *tmp, *area;

-	area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL);
+	area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_ATOMIC);
 	if (!area)
 		return NULL;
 	addr = VMALLOC_START;
+	spin_lock_irqsave(&vmlist_spinlock, save_flags);
 	for (p = &vmlist; (tmp = *p) ; p = &tmp->next) {
 		if (size + addr < (unsigned long) tmp->addr)
 			break;
-		if (addr > VMALLOC_END-size)
+		if (addr > VMALLOC_END-size) {
+			spin_unlock_irqrestore(&vmlist_spinlock, save_flags);
 			return NULL;
+		}
 		addr = tmp->size + (unsigned long) tmp->addr;
 	}
 	area->addr = (void *)addr;
 	area->size = size + PAGE_SIZE;
 	area->next = *p;
 	*p = area;
+	spin_unlock_irqrestore(&vmlist_spinlock, save_flags);
 	return area;
 }

 void vfree(void * addr)
 {
+	unsigned long save_flags;
 	struct vm_struct **p, *tmp;

 	if (!addr)
@@ -182,14 +189,17 @@
 		printk("Trying to vfree() bad address (%p)\n", addr);
 		return;
 	}
+	spin_lock_irqsave(&vmlist_spinlock, save_flags);
 	for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) {
 		if (tmp->addr == addr) {
 			*p = tmp->next;
+			spin_unlock_irqrestore(&vmlist_spinlock, save_flags);
 			vmfree_area_pages(VMALLOC_VMADDR(tmp->addr), tmp->size);
 			kfree(tmp);
 			return;
 		}
 	}
+	spin_unlock_irqrestore(&vmlist_spinlock, save_flags);
 	printk("Trying to vfree() nonexistent vm area (%p)\n", addr);
 }

@@ -217,11 +227,13 @@
 	struct vm_struct *tmp;
 	char *vaddr, *buf_start = buf;
 	unsigned long n;
+	unsigned long save_flags;

 	/* Don't allow overflow */
 	if ((unsigned long) addr + count < count)
 		count = -(unsigned long) addr;

+	spin_lock_irqsave(&vmlist_spinlock, save_flags);
 	for (tmp = vmlist; tmp; tmp = tmp->next) {
 		vaddr = (char *) tmp->addr;
 		if (addr >= vaddr + tmp->size - PAGE_SIZE)
@@ -245,5 +257,6 @@
 		} while (--n > 0);
 	}
 finished:
+	spin_unlock_irqrestore(&vmlist_spinlock, save_flags);
 	return buf - buf_start;
 }

--------------D18E65256A1F34E347650765--

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: torva...@transmeta.com (Linus Torvalds)
Subject: Re: get_vm_area needs spin lock
Date: 1998/12/31
Message-ID: <fa.i1u35ev.1dmee9r@ifi.uio.no>#1/1
X-Deja-AN: 427474451
Original-Date: 31 Dec 1998 06:54:13 GMT
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <76f72l$aqi$1@palladium.transmeta.com>
References: <fa.een5qhv.n681ac@ifi.uio.no>
To: linux-ker...@vger.rutgers.edu
Original-References: <Pine.LNX.3.95.981022152401.20122A-100...@as200.spellcast.com> 
<368A548D.23BBF...@gcom.com>
X-Authentication-Warning: palladium.transmeta.com: bin set sender to 
n...@transmeta.com using -f
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
Organization: Transmeta Corporation, Santa Clara, CA
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

In article <368A548D.23BBF...@gcom.com>, David Grothe  <d...@gcom.com> wrote:
>
>As a result of some correspondence concerning the routine get_vm_area, I
>have been taking a close look at that routine.  It seems to me that in
>an SMP environment this routine needs spin lock protection around the
>loop that runs down the linked list of areas.  If two or more CPUs are
>executing this routine simultaneously the list could get tangled.

This should never be the case: vmalloc() should not be called without
holding the kernel lock. 

Whether that is actually the case right now is not something I have
verified, but I would be surprised if it isn't. As such, 2.1.x is in the
same position as 2.0.x in this regard.

Has there actually been problems that caused you to look at this?

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: get_vm_area needs spin lock
Date: 1999/01/01
Message-ID: <fa.iq0je6v.1dgmph7@ifi.uio.no>#1/1
X-Deja-AN: 427607064
Original-Date: Thu, 31 Dec 1998 18:56:14 +0100 (CET)
Sender: owner-linux-ker...@vger.rutgers.edu
Original-Message-ID: <Pine.LNX.3.96.981231183721.658B-100000@laser.bogus>
References: <fa.een5qhv.n681ac@ifi.uio.no>
To: David Grothe <d...@gcom.com>
X-Sender: and...@laser.bogus
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Orcpt: rfc822;linux-ker...@vger.rutgers.edu
X-PgP-Public-Key-URL: http://e-mind.com/~andrea/aa.asc
Organization: Internet mailing list
MIME-Version: 1.0
Newsgroups: fa.linux.kernel
X-Loop: majord...@vger.rutgers.edu

On Wed, 30 Dec 1998, David Grothe wrote:

> I am enclosing a patch for 2.2.0pre1 (mm/vmalloc.c) that I think fixes
> this problem.  The patch is enclosed as a MIME attachment to avoid

There is no need for irqsave since vmalloc can be used only from a normal
kernel context (noirq and nobh). And there's no reason for GFP_ATOMIC I
think. So I changed a bit your patch this way:

Index: linux/mm/vmalloc.c
diff -u linux/mm/vmalloc.c:1.1.1.2 linux/mm/vmalloc.c:1.1.1.1.2.3
--- linux/mm/vmalloc.c:1.1.1.2	Fri Nov 27 11:19:11 1998
+++ linux/mm/vmalloc.c	Thu Dec 31 18:55:11 1998
@@ -10,6 +10,7 @@
 #include <asm/uaccess.h>
 
 static struct vm_struct * vmlist = NULL;
+static spinlock_t	  vmlist_spinlock;
 
 static inline void free_area_pte(pmd_t * pmd, unsigned long address, unsigned long size)
 {
@@ -158,17 +159,21 @@
 	if (!area)
 		return NULL;
 	addr = VMALLOC_START;
+	spin_lock(&vmlist_spinlock);
 	for (p = &vmlist; (tmp = *p) ; p = &tmp->next) {
 		if (size + addr < (unsigned long) tmp->addr)
 			break;
-		if (addr > VMALLOC_END-size)
+		if (addr > VMALLOC_END-size) {
+			spin_unlock(&vmlist_spinlock);
 			return NULL;
+		}
 		addr = tmp->size + (unsigned long) tmp->addr;
 	}
 	area->addr = (void *)addr;
 	area->size = size + PAGE_SIZE;
 	area->next = *p;
 	*p = area;
+	spin_unlock(&vmlist_spinlock);
 	return area;
 }
 
@@ -182,14 +187,18 @@
 		printk("Trying to vfree() bad address (%p)\n", addr);
 		return;
 	}
+	spin_lock(&vmlist_spinlock);
 	for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) {
 		if (tmp->addr == addr) {
 			*p = tmp->next;
-			vmfree_area_pages(VMALLOC_VMADDR(tmp->addr), tmp->size);
+			spin_unlock(&vmlist_spinlock);
+			vmfree_area_pages(VMALLOC_VMADDR(tmp->addr),
+					  tmp->size - PAGE_SIZE);
 			kfree(tmp);
 			return;
 		}
 	}
+	spin_unlock(&vmlist_spinlock);
 	printk("Trying to vfree() nonexistent vm area (%p)\n", addr);
 }
 
@@ -222,6 +231,7 @@
 	if ((unsigned long) addr + count < count)
 		count = -(unsigned long) addr;
 
+	spin_lock(&vmlist_spinlock);
 	for (tmp = vmlist; tmp; tmp = tmp->next) {
 		vaddr = (char *) tmp->addr;
 		if (addr >= vaddr + tmp->size - PAGE_SIZE)
@@ -245,5 +255,6 @@
 		} while (--n > 0);
 	}
 finished:
+	spin_unlock(&vmlist_spinlock);
 	return buf - buf_start;
 }




Andrea Arcangeli


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/