From: l...@neteng.engr.sgi.com (Larry McVoy)
Subject: Re: Some matters on the new verify_area
Date: 1996/10/11
Message-ID: <53kt8g$juf@fido.asd.sgi.com>#1/1
X-Deja-AN: 188684072
references: <m0vBVQB-0005FdC@lightning.swansea.linux.org.uk>
x-submitted-via: n...@ratatosk.yggdrasil.com (linux.* gateway)
x-hdr-sender: l...@neteng.engr.sgi.com 
organization: Silicon Graphics Inc., Mountain View, CA
x-env-sender: n...@fido.asd.sgi.com
newsgroups: linux.dev.kernel


Alan Cox (a...@lxorguk.ukuu.org.uk) wrote:
: 2.	It causes some extremely hard to solve problems in meeting both written
: and the unwritten API specification of Unix. You get strange things like
: a partial write of data returning EFAULT even though data was written. That
: breaks stuff. In addition subtle bugs in the exception handling will leak
: resources.

This breaks POSIX.1.  See IEEE Std 1003.1-1988, page 114, paragraph 1.
I also think that Alan is probably onto something in the rest of the message
which I didn't quote here.

Linus, any chance of revisiting this issue?
--
---
Larry McVoy     l...@sgi.com     http://reality.sgi.com/lm     (415) 933-1804

From: Linus Torvalds <torva...@cs.Helsinki.FI>
Subject: Re: Some matters on the new verify_area
Date: 1996/10/11
Message-ID: <Pine.LNX.3.91.961011142512.28692A-100000@linux.cs.Helsinki.FI>#1/1
X-Deja-AN: 188713750
references: <53kt8g$juf@fido.asd.sgi.com>
x-submitted-via: n...@ratatosk.yggdrasil.com (linux.* gateway)
content-type: TEXT/PLAIN; charset=US-ASCII
x-hdr-sender: torva...@cs.Helsinki.FI
mime-version: 1.0
x-env-sender: torva...@cs.Helsinki.FI
newsgroups: linux.dev.kernel




On 11 Oct 1996, Larry McVoy wrote:
> 
> Alan Cox (a...@lxorguk.ukuu.org.uk) wrote:
> : 2.	It causes some extremely hard to solve problems in meeting both written
> : and the unwritten API specification of Unix. You get strange things like
> : a partial write of data returning EFAULT even though data was written. That
> : breaks stuff. In addition subtle bugs in the exception handling will leak
> : resources.
> 
> This breaks POSIX.1.  See IEEE Std 1003.1-1988, page 114, paragraph 1.
> I also think that Alan is probably onto something in the rest of the message
> which I didn't quote here.
> 
> Linus, any chance of revisiting this issue?

My next version of this will have a better interface, and that includes
partial reads/writes. The current exception() handling was the "simple" way
to do it: it's ugly and hard to use, but the low-level implementation is very
simple. I have a better interface already, but it's a not-so-small matter of
actually getting it working correctly ;)

With the new interface you can do just:

	bytes_unwritten = copy_to_user(user_buf, kernel_buf, nr);

and it does all the checking and exception handling for you. The hard part is
not actually getting it to work, but to get it to work as quickly as I want
it to ;)

		Linus
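
The return convention described above (the copy reports how many bytes it
could not transfer) can be illustrated in user space. This is only a sketch
under stated assumptions: sim_copy_to_user() and struct sim_user_buf are
hypothetical names invented here, and a "fault" is modeled as running past a
fixed valid length rather than as a real page-fault exception.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a user buffer: only the first `valid` bytes
 * are accessible; anything beyond would fault in a real kernel. */
struct sim_user_buf {
	char   *base;
	size_t  valid;
};

/* Copy up to n bytes from kernel_buf into the simulated user buffer at
 * `offset`. Returns the number of bytes that could NOT be copied, so 0
 * means complete success (the convention the message describes). */
static size_t sim_copy_to_user(struct sim_user_buf *ubuf, size_t offset,
			       const char *kernel_buf, size_t n)
{
	size_t room, copied;

	assert(ubuf != NULL && kernel_buf != NULL);

	/* Bytes still accessible starting at `offset`. */
	room = offset < ubuf->valid ? ubuf->valid - offset : 0;
	copied = n < room ? n : room;

	memcpy(ubuf->base + offset, kernel_buf, copied);
	return n - copied;
}
```

With a 4-byte valid window, asking for 6 bytes returns 2 and leaves the
first 4 bytes in place, so the caller can compute the transferred count as
nr minus the return value.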

From: Linus Torvalds <torva...@cs.helsinki.fi>
Subject: Re: Some matters on the new verify_area
Date: 1996/10/11
Message-ID: <Pine.LNX.3.91.961011185153.3395A-100000@linux.cs.Helsinki.FI>#1/1
X-Deja-AN: 188763427
sender: owner-linux-ker...@vger.rutgers.edu
references: <199610111258.NAA28937@oberon.di.fc.ul.pt>
content-type: TEXT/PLAIN; charset=US-ASCII
x-hdr-sender: torva...@cs.helsinki.fi
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel




On Fri, 11 Oct 1996, Pedro Roque wrote:
> 
> Linus,
> it still doesn't work for the network stack at least...

Actually, it _does_ work for the network stack.

Even just the old "exception()" interface worked fine for the network stack,
although the patches to do so weren't exactly pretty (one reason I decided
the exception handling needs some work is that it was hard to use and I
worried about the compiler doing things to the code that made exceptions
unreliable). 

But yes, the code does need some changes. I did the changes for the TCP send
side in 2.1.3, you can look at what I did there..

> We need to be able to either write the user buffer or reject all of it, in
> a transaction-like way. The problem is that if one gets an exception half-way
> through, some skbs were already sent or part of the data is on send queues...

Exceptions aren't totally asynchronous. In fact, they are totally 
synchronous wrt user level accesses, so the problem isn't all that large.

> Now, the old verify area, unmodified, doesn't work either since the network
> stack can sleep from the verify point till it actually uses the buffer...

Indeed. The optimizations Alan suggested (to keep the old verify_area) simply
do not work in a threaded environment. 

> Is there any way to pin down the mapping on a verify?

Efficiently? No. Believe me, I've been thinking about this, and the only
efficient and thread-safe way to handle this is exceptions. But the bare
exceptions exposed by 2.1.3 are certainly a bit rough around the edges. 

		Linus

From: Pedro Roque <ro...@di.fc.ul.pt>
Subject: Re: Some matters on the new verify_area
Date: 1996/10/11
Message-ID: <199610111613.RAA29632@oberon.di.fc.ul.pt>#1/1
X-Deja-AN: 188908845
sender: owner-linux-ker...@vger.rutgers.edu
x-hdr-sender: ro...@di.fc.ul.pt
references: <199610111258.NAA28937@oberon.di.fc.ul.pt>
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel


>>>>> "Linus" == Linus Torvalds <torva...@cs.Helsinki.FI> writes:

    Linus> On Fri, 11 Oct 1996, Pedro Roque wrote:
    >>  Linus, it still doesn't work for the network stack at least...

    Linus> Actually, it _does_ work for the network stack.

    Linus> Even just the old "exception()" interface worked fine for
    Linus> the network stack, although the patches to do so weren't
    Linus> exactly pretty (one reason I decided the exception handling
    Linus> needs some work is that it was hard to use and I worried
    Linus> about the compiler doing things to the code that made
    Linus> exceptions unreliable).

    Linus> But yes, the code does need some changes. I did the changes
    Linus> for the TCP send side in 2.1.3, you can look at what I did
    Linus> there..

Linus, it is all a question of semantics...

both TCP and UDP sends do:

while(data from user)
{
	copy_from_user...

	send packet to the network
	
	if (maybe)
		sleep();	/* usually waiting for memory */
}

Now what do you want Linux to do if the second or third copy fails?

Just returning -EFAULT seems inappropriate to me.


    >> We need to be able to either write the user buffer or reject
    >> all of it, in a transaction-like way. The problem is that if
    >> one gets an exception half-way through, some skbs were already
    >> sent or part of the data is on send queues...

    Linus> Exceptions aren't totally asynchronous. In fact, they are
    Linus> totally synchronous wrt user level accesses, so the problem
    Linus> isn't all that large.

Rolling back all the processing is usually not possible.
And yes, I tried to think of ways of building skb_queues on send(), before
things get passed to the IP level... it would be extremely slow and
cumbersome.

    >> Now, the old verify area, unmodified, doesn't work either since
    >> the network stack can sleep from the verify point till it
    >> actually uses the buffer...

    Linus> Indeed. The optimizations Alan suggested (to keep the old
    Linus> verify_area) simply do not work in a threaded environment.

    >> Is there any way to pin down the mapping on a verify?

    Linus> Efficiently? No. Believe me, I've been thinking about this,
    Linus> and the only efficient and thread-safe way to handle this
    Linus> is exceptions. But the bare exceptions exposed by 2.1.3 are
    Linus> certainly a bit rough around the edges.

That is not what I am worried about...

Tell us what semantics you want for the case where a read from user space
other than the first of several fails, and we'll code it up...

regards,
  Pedro.

From: Linus Torvalds <torva...@cs.helsinki.fi>
Subject: Re: Some matters on the new verify_area
Date: 1996/10/11
Message-ID: <Pine.LNX.3.91.961011195618.5160A-100000@linux.cs.Helsinki.FI>#1/1
X-Deja-AN: 188932035
sender: owner-linux-ker...@vger.rutgers.edu
references: <199610111613.RAA29632@oberon.di.fc.ul.pt>
content-type: TEXT/PLAIN; charset=US-ASCII
x-hdr-sender: torva...@cs.helsinki.fi
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel




On Fri, 11 Oct 1996, Pedro Roque wrote:
> 
> Linus, it is all a question of semantics...
> 
> both TCP and UDP sends do:
> 
> while(data from user)
> {
> 	copy_from_user...
> 
> 	send packet to the network
> 	
> 	if (maybe)
> 		sleep();	/* usually waiting for memory */
> }
> 
> Now what do you want Linux to do if the second or third copy fails?
> 
> Just returning -EFAULT seems inappropriate to me.

Sure. That's why some mods are required. But most of the code 
_already_ does things like

	if (!written)
		written = error;
	return written;

(so an error like EFAULT is only returned if the _first_ copy fails).

Anyway, it's not hard to do things like that. In the future, if you give 
a buffer that is partially valid, Linux will use up as much as possible 
of the buffer, and return the "used up" portion. Admittedly 2.1.3 doesn't 
do that, but it's not actually very hard to fix (my personal kernel 
already does the appropriate fix-ups).

		Linus
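
The "return what was used up, and report the error only if nothing was"
pattern can be sketched as a chunked send loop. This is a user-space model
under stated assumptions, not the kernel's code: sim_copy_from_user() and
sim_send() are hypothetical names, and the caller's buffer is assumed to be
readable for its first `valid` bytes only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of copy_from_user(): the source is readable only
 * for the first `valid` bytes. Returns the number of bytes NOT copied. */
static size_t sim_copy_from_user(size_t offset, size_t n, size_t valid)
{
	size_t room = offset < valid ? valid - offset : 0;

	return n > room ? n - room : 0;
}

/* Consume `len` bytes in chunks of at most `chunk` bytes. Returns the
 * number of bytes consumed, or -EFAULT only when nothing was consumed. */
static long sim_send(size_t len, size_t chunk, size_t valid)
{
	long written = 0;
	long error = 0;

	assert(chunk > 0);

	while ((size_t)written < len) {
		size_t want = len - (size_t)written;
		size_t uncopied;

		if (want > chunk)
			want = chunk;

		uncopied = sim_copy_from_user((size_t)written, want, valid);
		written += (long)(want - uncopied);	/* keep the partial chunk */
		if (uncopied) {
			error = -EFAULT;
			break;
		}
	}

	if (!written)
		written = error;
	return written;
}
```

In this model a 10-byte send in 4-byte chunks over a buffer that is valid
for 6 bytes returns 6; the same send over a buffer that is not valid at all
returns -EFAULT.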

From: Linus Torvalds <torva...@cs.helsinki.fi>
Subject: Re: Some matters on the new verify_area
Date: 1996/10/12
Message-ID: <Pine.LNX.3.91.961012093405.21523A-100000@linux.cs.Helsinki.FI>#1/1
X-Deja-AN: 188896135
sender: owner-linux-ker...@vger.rutgers.edu
references: <m0vBqxm-0005FeC@lightning.swansea.linux.org.uk>
content-type: TEXT/PLAIN; charset=US-ASCII
x-hdr-sender: torva...@cs.helsinki.fi
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel




On Sat, 12 Oct 1996, Alan Cox wrote:
> 
> This is still not right. A Pipe is guaranteed to atomically write _ALL_ the
> data or none, not half. A short write isn't the correct or POSIX-legal response.

Alan, don't be silly.

Guys, you're being stupid on purpose here, or something. Can't you understand
that THERE IS NO PROBLEM! 

You can make any damn semantics you like with the exception model, you just
have to check the return value of "copy_from_user()" or "copy_to_user()". 
Depending on the return value you can choose to ignore the partial data
you've written or not. For pipes, for example, we already lock the pipe data
over any user-level transaction, so if a pipe write notices "oops, this copy
failed", it is _trivial_ to just undo the write. 

> Consider a UDP send, where we fault half way through sending. At that point
> we have sent half of the IP datagram, but the latter half.

Bite me.

I don't _CARE_. The new way of handling things is a damn lot faster than the
old one, and it doesn't actually break any semantics at all, despite all your
whining. 

Your argument that it makes it easier to send partial IP packets is equally
bogus. We get a RETURN VALUE from the copy, for Christ's sake. If you're so
scared of somebody doing something evil, you can zero the rest of the 
packet and send out incomplete copies. Total code needed:

	bytes_uncopied = copy_and_csum_from_user(buf, ubuf, len,
		&skb->csum, partial_csum);
	if (bytes_uncopied) {
		memset(buf+len-bytes_uncopied, 0, bytes_uncopied);
		skb->csum = csum_partial(buf, len, partial_csum);
		sk->error = EFAULT;	/* tell the user not to do that */
	}

Complex? Nope. Rocket science? Definitely not.

The above is assuming we want to care about the whole problem with only
partial IP packets sent in the first place. Quite frankly, it's not exactly
_our_ problem if there are sites out there that crash when they get too many
partial packets. Any random hacker can put a DOS machine on the net (or get
root access to a Linux machine) and send out partial fragments without any
help from the kernel at all.. You're arguing for "security by making it a bit
harder to do".. 

Also, Alan, you are sadly mistaken if you think you can easily do memory
area lockings. It's a _lot_ more complex than you think, because it's not
enough to lock the actual virtual memory areas (the "struct vm_area_struct"),
you _also_ have to lock any inodes that are associated with a shared virtual
memory area. 

The problem is that UNIX semantics for shared file mappings are _complex_:
it's not enough that the virtual memory area is mapped, because any access
past the end of the file is _also_ a fault. We don't handle it correctly
right now, simply because it's so hard to handle. But with the new exception
handling we _can_ handle things correctly. 

If you wanted to do a locking approach, you'd not only have to lock any
vm_areas that are involved with a transfer from other threads, you'd _also_
have to lock any files that are shared-mapped if you want to get it right. 
The possibilities for problems are endless, and you also have some local
security issues due to it (essentially the same security issues that arise
from mandatory locking and NFSD). 

Trust me, you _cannot_ handle it even reasonably efficiently. You can do a
half-assed job with reasonable overhead (that's what "verify_area()" 
essentially has done before I re-wrote the whole thing), but you simply 
_cannot_ do a good job efficiently.

The new exception handling isn't going away. I get the overhead of doing a
verified copy from user space down to about _five_ machine instructions with
the new exception handling code, and quite frankly, no locking scheme can
even come _close_. Not even a half-assed one, and certainly not a scheme that
takes shared file faults into account. 



		Linus
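
The pipe case mentioned above (undo the write when the copy fails, while
the pipe is locked) might look roughly like the following. It is a minimal
user-space sketch, assuming a flat buffer instead of the kernel's pipe
structure; struct sim_pipe, SIM_PIPE_SIZE and sim_pipe_write() are
hypothetical, and the caller is assumed to hold the pipe lock for the
duration of the call so that no reader can observe the partial data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_PIPE_SIZE 16	/* hypothetical capacity, not the kernel's */

struct sim_pipe {
	char   data[SIM_PIPE_SIZE];
	size_t len;		/* bytes currently queued */
};

/* Append n bytes from src, which is readable only for `valid` bytes.
 * Returns n on success, or 0 if the copy came up short and the write
 * was rolled back. Caller is assumed to hold the pipe lock. */
static size_t sim_pipe_write(struct sim_pipe *p, const char *src,
			     size_t n, size_t valid)
{
	size_t saved_len, copied;

	assert(p != NULL && src != NULL);
	assert(n <= SIM_PIPE_SIZE - p->len);	/* room was checked earlier */

	saved_len = p->len;
	copied = n < valid ? n : valid;		/* model of a short copy */

	memcpy(p->data + p->len, src, copied);
	p->len += copied;

	if (copied != n) {
		p->len = saved_len;		/* undo: forget the partial data */
		return 0;
	}
	return n;
}
```

Restoring the saved length is the entire rollback here, which is the sense
in which the undo is cheap once the object is already locked.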

From: l...@neteng.engr.sgi.com (Larry McVoy)
Subject: Re: Some matters on the new verify_area 
Date: 1996/10/12
Message-ID: <199610120741.AAA15159@neteng.engr.sgi.com>#1/1
X-Deja-AN: 188896394
sender: owner-linux-ker...@vger.rutgers.edu
x-hdr-sender: l...@neteng.engr.sgi.com 
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel


: > This is still not right. A Pipe is guaranteed to atomically write _ALL_ the
: > data or none, not half. A short write isn't the correct or POSIX-legal response.
: 
: You can make any damn semantics you like with the exception model, you just
: have to check the return value of "copy_from_user()" or "copy_to_user()". 

If I understand this correctly (maybe I don't) then it implies that all
I/O gizmos are now of the form

	bytes_uncopied = copy_from_user(args...);
	if (bytes_uncopied && i_care_that_it_didnt_do_it_all) {
		do something to undo it
	}

Which in turn implies that all callers of copy_from_user() are copying into
a buffer that can be undone.  This would seem to be hard, but maybe it isn't.
Certainly in file systems you will have the object (file/pipe/whatever)
locked such that only you can muck with it, so conceivably you can unmuck
it.  

Can anyone think of a case that can't be handled?  Rephrased: Linus thinks 
that all Unix semantics can be handled through his new interface.  As long 
as you can tell how much has been moved, and you can either undo or return
the bytes moved, then I think that we are OK, are we not?

And I assume Linus is going to show us some studly i/o rates (I take it that 
the new thing is much faster - it seems like it is aimed at common usage, 
i.e., reducing copyin/copyout latency?).

--lm

From: Linus Torvalds <torva...@cs.helsinki.fi>
Subject: Re: Some matters on the new verify_area 
Date: 1996/10/12
Message-ID: <Pine.LNX.3.91.961012121223.23937A-100000@linux.cs.Helsinki.FI>
X-Deja-AN: 188913763
sender: owner-linux-ker...@vger.rutgers.edu
references: <199610120741.AAA15159@neteng.engr.sgi.com>
content-type: TEXT/PLAIN; charset=US-ASCII
x-hdr-sender: torva...@cs.helsinki.fi
mime-version: 1.0
x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu
newsgroups: linux.dev.kernel




On Sat, 12 Oct 1996, Larry McVoy wrote:
> 
> If I understand this correctly (maybe I don't) then it implies that all
> I/O gizmos are now of the form
> 
> 	bytes_uncopied = copy_from_user(args...);
> 	if (bytes_uncopied && i_care_that_it_didnt_do_it_all) {
> 		do something to undo it
> 	}

Some are. The hairy ones. But actually I'd expect only very few copies to be
of that type, and most being of the type where you can just do a partial read
or write. There are actually _very_ few places that _require_ an "atomic"
operation. 

For example, the file read code in mm/filemap.c just does

	nr -= copy_to_user(buf, page_cache, nr);
	error = -EFAULT;
	if (!nr)
		break;
	buf += nr;
	pos += nr;
	read += nr;
	...


(and then the return condition from the actual system call is

	if (!read)
		read = error;
	return read;

but that's actually not something new: that has been there before to handle
various other errors, so EFAULT is not a special case at all in this case). 

Note that the UDP datagram example that Alan was worried about was a totally
different matter: there the worry wasn't so much the return value of the
function itself, but Alan worried that we'd send out the first few fragments
of a larger IP packet, and then not send out the rest at all. So he
essentially worried about what showed up on the wire, because some BSD stacks
don't handle partial fragments very well (out-of-memory errors because they
don't time out the fragments?)

> Which in turn implies that all callers of copy_from_user() are copying into
> a buffer that can be undone.  This would seem to be hard, but maybe it isn't.
> Certainly in file systems you will have the object (file/pipe/whatever)
> locked such that only you can muck with it, so conceivably you can unmuck
> it.  

Pipes are really very special cases because they require that a write be
atomic if it is smaller or equal to PIPE_BUF (or whatever the size was
called). In short, POSIX pipes are actually very "non-unixy" (and I can't say
I like the behaviour, but hey, it's not that hard to do). 

For just about anything else (including pipes with larger buffer sizes), it's
completely acceptable to just do a partial read or write. If I remember
correctly, POSIX says that a partial read or write from a file indicates
either EOF or an IO error, and EFAULT is certainly an IO error as far as the
reader/writer is concerned ;)

Again, I'd like to point out that EFAULT is a _programmer_ error, and that we
don't really have to worry about any standards-conforming programs at all. In
some sense any program that _ever_ results in EFAULT or a partial read/write
due to that fault is never a POSIX-conforming program, and as such we could
even just say "to hell with it, let's kill the program outright".  Quite
frankly, I don't think we'd need to undo anything at all for the pipe case
(just do a partial IO operation), and we'd still be "POSIX-conforming". 

EFAULT really _is_ just a "segmentation fault" in a system call. Using
exceptions internally is only the natural way to handle it. 

For example, at least my copy of some old XPG3 thing doesn't even _mention_
EFAULT in the error cases. Likewise, the "POSIX Programmer's Manual" does
actually mention EFAULT and expands its meaning, but it also states that "no
function is actually required to check against this" or something like that
(and it has an empty list of calls that can return it). 

I don't actually have the official POSIX standards, but I suspect the
situation is similar (ie EFAULT is not really considered a "real" error that
can occur in a POSIX-conforming program). Can somebody check?

> And I assume Linus is going to show us some studly i/o rates (I take it that 
> the new thing is much faster - it seems like it is aimed at common usage, 
> i.e., reducing copyin/copyout latency?).

Actually, the studly thing isn't the IO bandwidth, because that is pretty
much always limited by memory copy speeds and/or device limitations. The
_studly_ thing is the latency, because for latency numbers the actual copy
doesn't dwarf the time it takes to check. 

And as you may remember, I actually think latency is at _least_ as important
as throughput. 

		Linus

From: l...@neteng.engr.sgi.com (Larry McVoy)
Subject: Re: Some matters on the new verify_area
Date: 1996/10/12
Message-ID: <53oqt2$l44@fido.asd.sgi.com>#1/1
X-Deja-AN: 189008214
references: <Pine.LNX.3.91.961011195618.5160A-100000@linux.cs.Helsinki.FI> 
<m0vBqxm-0005FeC@lightning.swansea.linux.org.uk>
x-submitted-via: n...@ratatosk.yggdrasil.com (linux.* gateway)
x-hdr-sender: l...@neteng.engr.sgi.com 
organization: Silicon Graphics Inc., Mountain View, CA
x-env-sender: n...@fido.asd.sgi.com
newsgroups: linux.dev.kernel


: > Anyway, it's not hard to do things like that. In the future, if you give 
: > a buffer that is partially valid, Linux will use up as much as possible 
: > of the buffer, and return the "used up" portion. Admittedly 2.1.3 doesn't 
: > do that, but it's not actually very hard to fix (my personal kernel 
: > already does the appropriate fix-ups).

: This is still not right. A Pipe is guaranteed to atomically write _ALL_ the
: data or none, not half. A short write isn't the correct or POSIX-legal response.

I used to think this too (and I implemented all of posix.1 in SunOS, shame
on me :-) but it isn't so.  If you read the spec carefully, what it is 
really saying is that writes of up to PIPEBUF bytes must be atomic.  Anything
after that is up for grabs.  The intent of the spec is to allow multiple
writes to a pipe to get their data in there without it being garbled with
the other guys.  So as long as everyone is <= PIPEBUF, then the order is
undefined but the results are defined: each guy gets their stuff in there,
all of it, or doesn't.  But there is never a half in half out case.

This implies that you cannot context switch or preempt whilst in the 
middle of a copyin/copyout to/from a pipe (actually I don't remember if
the read side has the same semantics but it would be pretty silly if
it didn't - yeah, I can get it in unscrambled but it comes out scrambled.
I don't _think_ so).

As to what happens for sizes > PIPEBUF?  Posix is explicit about not 
defining this.  My personal preference is that the kernel treat each
PIPEBUF chunk individually, resetting its thinking on each boundary.
That would lend reasonable results to applications that want pipe
semantics for nbyte > PIPEBUF.  The rationale for not doing that was
that 4k was big enough, you can always do multiple writes.  Short sighted.

Getting back to the semantics of <= PIPEBUF, which is where all our issues
are:  POSIX didn't consider the EFAULT semantics that Linus has added so
we get no help there.  What they did consider is something quite similar,
signals.  They carefully spelled out what happens if you are doing a write
and you get interrupted in the middle.  I think we should follow those 
semantics.  In that case, POSIX states:  

	If a write() is interrupted by a signal before it writes any data,
	it shall return -1 with errno set to EINTR.

	If a write() is interrupted by a signal after it successfully 
	writes some data, it shall return -1 with errno set to EINTR,
	or it shall return the number of bytes written.  A write() to 
	a pipe or a FIFO shall *never* return with errno set to EINTR 
	if it has transferred any data and nbyte is less than or equal
	to PIPE_BUF.

I think what they are saying is that you can be as sloppy as you like
elsewhere, but the results on pipes have to be exact.  In general,
Linus/Linux has always taken the high moral ground and applied the 
clean semantics to everything.  I don't see why this should be an
exception.

My personal suggestions:

	. for everything except pipes & sockets, return the number of
	  bytes written, no matter what that was.
	
	. for pipes, at least do the right thing for <= PIPEBUF.  If you
	  get a fault (or if for any other reason you transfer < nbyte),
	  undo the entire I/O rather than return < nbyte.  This is easy
	  for applications to handle and is in the spirit of supporting
	  atomic transfers for communication.
	
	. for sockets, treat them like pipes with a PIPE_BUF size == MTU.
	  POSIX doesn't cover sockets (maybe they will/do in a later spec)
	  but I'll bet you a plugged nickel they will have those semantics.

Finally, let's stick together on this one.  I know it is anal in the extreme
but this sort of stuff is the kind of stuff that differentiates a toy from
a real product.  There are billions of dollars of Unix systems sold every
year that depend on POSIX semantics - and spec the conformance as a purchasing
go/no go.
--
---
Larry McVoy     l...@sgi.com     http://reality.sgi.com/lm     (415) 933-1804
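
The first two suggestions above can be written down as a small decision
function. This is a sketch under assumptions, not a statement of what any
kernel does: sim_write_result(), enum sim_obj and SIM_PIPE_BUF are
hypothetical, the 4096 value just echoes the "4k" mentioned in the message,
returning -EFAULT after an undone pipe write is an assumption (the message
only says to undo the I/O), and a zero-byte transfer reports the error
following the "if (!written) written = error" pattern quoted earlier in the
thread. Sockets are not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sim_obj { SIM_FILE, SIM_PIPE };	/* hypothetical object kinds */

#define SIM_PIPE_BUF 4096		/* assumed atomic-write limit */

/* Decide what write() reports when `copied` of `nbyte` bytes were
 * transferred before a fault. Returns a byte count or a negative errno. */
static long sim_write_result(enum sim_obj kind, size_t nbyte, size_t copied)
{
	assert(copied <= nbyte);

	if (copied == nbyte)
		return (long)nbyte;		/* nothing went wrong */

	if (kind == SIM_PIPE && nbyte <= SIM_PIPE_BUF)
		return -EFAULT;			/* all or nothing: caller undoes the copy */

	if (copied == 0)
		return -EFAULT;			/* nothing transferred at all */

	return (long)copied;			/* ordinary short write */
}
```

Under these assumptions a 100-byte file write that faults after 60 bytes
reports 60, while a 512-byte pipe write that faults after 200 bytes reports
the error and leaves the pipe untouched.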