[OT] Comments to WinNT Mag !!
Linux Lists (lists@cyclades.com)
1999-04-30 19:47:03

Hello,

As I don't have the technical expertise to discuss with this gentleman
(although I think somebody _must_), I'm forwarding this URL to the list:

http://www.winntmag.com/Magazine/Article.cfm?ArticleID=5048

I hope you guys can address this article's issues properly. Please contact
the author !!

Regards,
Ivan




Re: [OT] Comments to WinNT Mag !!
Ingo Molnar (mingo@chiara.csoma.elte.hu)
Fri, 30 Apr 1999 23:05:26 +0200 (CEST)

On Fri, 30 Apr 1999, Linux Lists wrote:

> As I don't have the technical expertise to discuss with this gentleman
> (although I think somebody _must_), I'm forwarding this URL to the list:
>
> http://www.winntmag.com/Magazine/Article.cfm?ArticleID=5048
>
> I hope you guys can address this article's issues properly. Please contact
> the author !!

While he (Mark Russinovich) does have valid points (no OS is perfect), he
is exaggerating things a lot at the expense of Linux, to make NT appear in a
better light. He also forgets to mention lots of powerful mechanisms and
features present in the Linux kernel that give it an edge over NT, and he
forgets to mention shortcomings of NT in the same areas. But I think his
main mistake is that he is trying to find the NT API in Linux. No, Linux
is Linux. I'll try to address most of his technical points briefly:

'asynchronous IO'
-----------------

First he claims Linux has only select(), and then he continues to bash
select(). (without providing measurements or benchmark numbers) Then he
says that Linux _does_ have asynchronous IO events implemented in 2.2 but
says that they have 'two major limitations'. Both 'limitations' he
mentions are in fact a pure implementation matter and not a mechanism or
API limitation. Mark also forgot to mention that Linux asynchronous IO is
superior to NT's because we do not have to poll the completion port for
events; we can have the IO event delivered _immediately_ to the target
thread (which is preempted by a signal if it's running). This gives more
flexibility in using asynchronous events. (I have pointed out this
difference to him in private discussions; he left this argument unanswered)
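
(just to make the Linux side concrete: a user-space consumer of the 2.2
mechanism looks roughly like the untested sketch below. The F_SETOWN /
F_SETSIG / O_ASYNC fcntl bits are the interface I'm talking about; the
handler body is only a placeholder.)

/*
 * untested sketch: ask the kernel to deliver a queued real-time signal
 * (carrying the fd in siginfo) the moment the descriptor becomes ready,
 * instead of sitting in select()/poll().
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static void io_ready(int sig, siginfo_t *info, void *ctx)
{
	/* info->si_fd is the descriptor that became ready; just note it
	 * here and do the real work outside the handler */
	(void)sig; (void)ctx; (void)info;
}

static int arm_async(int fd)
{
	struct sigaction sa;

	sa.sa_sigaction = io_ready;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGRTMIN, &sa, NULL) < 0)
		return -1;

	if (fcntl(fd, F_SETOWN, getpid()) < 0)	/* deliver to this process */
		return -1;
	if (fcntl(fd, F_SETSIG, SIGRTMIN) < 0)	/* use a queued RT signal */
		return -1;
	return fcntl(fd, F_SETFL, O_NONBLOCK | O_ASYNC);
}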

'overscheduling'
----------------

Here he again forgets to _prove_ that overscheduling happens in Linux.
Measurements have been done on big busy Linux webservers (much bigger than
the typical 'enterprise' category), and the runqueue length (number of
threads competing for requests) was typically 3-4. Enough said ...

'kernel reentrancy'
-------------------

His example is a clear red herring. If any Linux application is
read()/write() intensive to the page cache, it would do better to use
mmap(). I can understand why Mark did not mention mmap(): NT has a rather
inferior mmap() implementation. (e.g. read()/write() and mmap()-ed
modifications done to the same file are not guaranteed to be data-coherent
by NT ...) His threading point is correct, there is still code left to be
threaded for SMP operation, just as NT has one single big lock in its
networking stack in NT4 SP4. (Only SP5 fixes this, and SP5 is not yet out
of beta.)

'sendfile'
----------

sendfile() is a new system call. The copying problem he noticed is true,
but it's a matter of the networking code, not some conceptual problem with
sendfile(). If the networking code does zero-copy then sendfile() will do
zero-copy as well. (without the user ever noticing) sendfile() will
certainly be further optimized in 2.3.
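
(for reference, using it from an application is one call per file - this
is only an untested sketch, and it assumes your libc already carries a
wrapper for the new system call, otherwise you go through syscall():)

/* rough sketch: push a whole file down a connected socket with
 * sendfile(); error handling kept minimal on purpose */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static int send_whole_file(int sock, const char *path)
{
	struct stat st;
	off_t off = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	while (off < st.st_size)
		if (sendfile(sock, fd, &off, st.st_size - off) < 0)
			break;	/* off has been advanced by what was sent */
	close(fd);
	return off == st.st_size ? 0 : -1;
}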

In private discussions with Mark I have pointed out most of these
counter-arguments, which he unfortunately failed to answer. He also didn't
answer my questions about NT's shortcomings in the above areas. (as
always, seemingly powerful concepts can often open up ugly ratholes)
Different OS, different approach. Let the numbers talk.

-- mingo



Re: [OT] Comments to WinNT Mag !!
Linux Lists (lists@cyclades.com)
Fri, 30 Apr 1999 15:58:20 -0700 (PDT)

On Fri, 30 Apr 1999, Ingo Molnar wrote:
>
> On Fri, 30 Apr 1999, Linux Lists wrote:
>
> > As I don't have the technical expertise to discuss with this gentleman
> > (although I think somebody _must_), I'm forwarding this URL to the list:
> >
> > http://www.winntmag.com/Magazine/Article.cfm?ArticleID=5048
> >
> > I hope you guys can address this article's issues properly. Please contact
> > the author !!
>
> While he (Mark Russinovich) does have valid points (no OS is perfect), he
> is exaggerating things a lot at the expense of Linux, to make NT appear in a
> better light. He also forgets to mention lots of powerful mechanisms and
> features present in the Linux kernel that give it an edge over NT, and he
> forgets to mention shortcomings of NT in the same areas. But I think his
> main mistake is that he is trying to find the NT API in Linux. No, Linux
> is Linux. I'll try to address most of his technical points briefly:
>
> 'asynchronous IO'
> -----------------
>
> First he claims Linux has only select(), and then he continues to bash
> select(). (without providing measurements or benchmark numbers) Then he
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
That was the first thing I noticed in his article. He talks a lot about
benchmarks, about how Linux "isn't there -- yet" ... but he does _not_
provide the benchmark results to back him up. However, I didn't want to
reply to him myself because I wouldn't know how to discuss more specific
issues with him (as you, mingo, have done in this reply). Thanks for
clarifying all these issues to me (and to him too ;) !!

> In private discussions with Mark I have pointed out most of these
> counter-arguments, which he unfortunately failed to answer. He also didn't
> answer my questions about NT's shortcomings in the above areas. (as
> always, seemingly powerful concepts can often open up ugly ratholes)
> Different OS, different approach. Let the numbers talk.

Ok, if you (and others) have already contacted him and shown him these
fallacies, good. However, I think there should be a _public_ place where
this clarification is available. Otherwise, misinformed people who read
this article will have the wrong idea of Linux, and they won't have any
other place that tells them otherwise.

Anyhow, thanks for your helpful reply again !!

Regards,
Ivan



From: BROWN Nick <Nick.BR...@coe.int>
Subject: RE: [OT] Comments to WinNT Mag !!
Date: Sat, 1 May 1999 20:01:20 +0200

> However, I think there should be a _public_ place where this
> clarification is available.

I have corresponded briefly with Mark Russinovich on NT-related issues in
the past, and he has always seemed a very fair-minded individual.  He
certainly isn't in Microsoft's pocket, for example.  I would hope that if
asked, he would contribute some follow-up to this forum.  Of course, he
makes his living from NT - both its strengths and its shortcomings.

Nick Brown, Strasbourg, France (Nick(dot)Brown(at)coe(dot)int)

email address updates : @coe.int replaces  @coe.fr
for more information, http://dct.coe.int/info/emfci001.htm


From: "Shane R. Stixrud" <sh...@souls.net>
Subject: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Date: Sun, 2 May 1999 03:52:39 -0700 (PDT)

I E-mailed Mark Russinovich and copied him on the "[OT] Comments to WinNT
Mag !!" thread and suggested he respond.  He sent me his response and
requested that I forward it on to the list.  Response below.

---------- Forwarded message ----------
Date: Sun, 02 May 1999 06:13:00 -0400
From: Mark Russinovich <m...@sysinternals.com>
To: "Shane R. Stixrud" <sh...@souls.net>
Subject: Re: [OT] Comments to WinNT Mag !! (fwd)


Hi Shane,

Please post my response to the list.

At 01:24 PM 5/1/99, you wrote:
>
>'asynchronous IO'
>-----------------
>
>First he claims Linux has only select(), and then he continues to bash
>select(). (without providing measurements or benchmark numbers) Then he
>says that Linux _does_ have asynchronous IO events implemented in 2.2 but
>says that they have 'two major limitations'. Both 'limitations' he
>mentions are in fact a pure implementation matter and not a mechanism or
>API limitation. Mark also forgot to mention that Linux asynchronous IO is
>superior to NT's because we do not have to poll the completion port for
>events; we can have the IO event delivered _immediately_ to the target
>thread (which is preempted by a signal if it's running). This gives more
>flexibility in using asynchronous events. (I have pointed out this
>difference to him in private discussions; he left this argument unanswered)
>

Completion ports in NT require no polling and no linear searching - that,
and their integration with the scheduler, is their entire reason for
existence. Also, Linux's implementation of asynchronous I/O only applies to
tty devices and to *new connections* on sockets - nothing else. Sure,
asynchronous I/O can be added to the rest of the I/O architecture (all of
the deficiencies I bring up can, and I'm sure will, be addressed). My point
is that it is currently very limited.

>'overscheduling'
>----------------
>
>Here he again forgets to _prove_ that overscheduling happens in Linux.
>Measurements have been done on big busy Linux webservers (much bigger than
>the typical 'enterprise' category), and the runqueue length (number of
>threads competing for requests) was typically 3-4. Enough said ...
>

Under high load environments even the short run-queue lengths you refer to
are enough to degrade performance. And in the environments I'm talking
about where there are several hundred requests being served concurrently,
the run queue lengths for Linux are significantly higher with the
implementation of a one-thread-to-one-client server model.

>'kernel reentrancy'
>-------------------
>
>His example is a clear red herring. If any Linux application is
>read()/write() intensive to the page cache, it would do better to use
>mmap(). I can understand why Mark did not mention mmap(): NT has a rather
>inferior mmap() implementation. (e.g. read()/write() and mmap()-ed
>modifications done to the same file are not guaranteed to be data-coherent
>by NT ...) His threading point is correct, there is still code left to be
>threaded for SMP operation, just as NT has one single big lock in its
>networking stack in NT4 SP4. (Only SP5 fixes this, and SP5 is not yet out
>of beta.)
>

First, serialization of long paths through the kernel degrades
multiprocessor scalability - this is multiprocessing 101.

You mention mmap, and I'm assuming you do so as an alternative to sendfile.
Using mmap to serve files, the following is required:

	- the file is mapped with a call to mmap(). The kernel must manipulate the
page tables of the process performing the map.
	- the process calls writev() to send an HTTP header in one buffer and file
data from the mapped memory. This is another system call and two copies.

There are 1-3 system calls (depending on whether the requested file has
already been mapped, or another file must be unmapped to make room for the
new mapping via mmap), 2 buffer copies, and manipulation of the process
page tables. The process must also manage its own file cache, unmapping and
mapping files as needed. The file system is also performing the same
management of the file system cache. 
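
In code, the sequence above looks roughly like this (an untested sketch;
the HTTP header string is a placeholder, and both error handling and the
application-side mapping cache are left out):

#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

static int serve_with_mmap(int sock, const char *path, const char *http_hdr)
{
	struct stat st;
	struct iovec iov[2];
	void *map;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	/* system call #1: the kernel edits this process' page tables */
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}
	iov[0].iov_base = (void *)http_hdr;	/* prologue: the HTTP header */
	iov[0].iov_len  = strlen(http_hdr);
	iov[1].iov_base = map;			/* file data from the mapping */
	iov[1].iov_len  = st.st_size;
	/* system call #2, plus the buffer copies discussed above */
	writev(sock, iov, 2);
	/* or keep the mapping around in an application-managed cache */
	munmap(map, st.st_size);
	close(fd);
	return 0;
}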

BTW This isn't related to read-only file serving, but Linus admits that
mmap in 2.2 has a flaw where write-backs to a modified file result in two
copies instead of 1. He says that this will probably be fixed in 2.3.x.

On the other hand, sendfile on NT, Solaris, HP/UX and AIX is used as follows:

	- one call to sendfile() is made, and the call specifies buffers that
serve as a prologue (e.g. HTTP header) and epilogue to the file data, in
addition to a file handle. The TCP stack sends the file data directly from
the file system cache as a 0-copy send. The user buffers are also sent with
the file data, and are not copied from user space, but locked into physical
memory for the duration of the send.

This implementation has 0 buffer copies and requires 1 system call to send
an entire HTTP response. There is no manipulation of process address space,
and the server need not manage its own file cache. In addition, the call
can be made asynchronously, where waiting is done on a completion port that
is waiting on new connections and more requests on existing connections.
The asynchronous I/O model in NT extends to all I/O. NT (and Solaris,
HP/UX, AIX) also has another API that Linux doesn't have yet: acceptex
(the name of the NT version). This API is used to simultaneously perform an
accept, the first read(), and getpeername() in one system call. The
advantages should be obvious.
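
For concreteness, the NT call looks roughly like the following sketch
(the socket, file handle and header buffer are placeholders, and the
socket is assumed to already be associated with a completion port when an
OVERLAPPED pointer is passed):

#include <winsock2.h>
#include <mswsock.h>	/* TransmitFile(), TRANSMIT_FILE_BUFFERS */

static BOOL send_response(SOCKET s, HANDLE file,
                          void *hdr, DWORD hdr_len,
                          OVERLAPPED *ov)	/* NULL = synchronous */
{
	TRANSMIT_FILE_BUFFERS tfb;

	tfb.Head       = hdr;		/* prologue, e.g. the HTTP header */
	tfb.HeadLength = hdr_len;
	tfb.Tail       = NULL;		/* no epilogue in this sketch */
	tfb.TailLength = 0;

	/* 0,0 = send the whole file with the default send size; completion
	 * is reported through ov (and thus the completion port) if given */
	return TransmitFile(s, file, 0, 0, ov, &tfb, 0);
}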

As for the Linux implementation of sendfile(), it does not support adding a
header and the Linux TCP stack does not support 0-copy sends. Thus, there
is an extra system call and buffer copy for a write() to send the header,
and an extra buffer copy for sending the file.

>'sendfile'
>----------
>
>sendfile() is a new system call. The copying problem he noticed is true,
>but it's a matter of the networking code, not some conceptual problem with
>sendfile(). If the networking code does zero-copy then sendfile() will do
>zero-copy as well. (without the user ever noticing) sendfile() will
>certainly be further optimized in 2.3.
>

Just to clarify, the Linux TCP/IP stack does not support 0-copy sending.
See tcp_do_sendmsg() in net/ipv4/tcp.c. Note the calls to
xx_copy_from_user() (the copy functions are macros defined in the
architecture-specific include file uaccess.h). 

Like I said, I'm sure that over time the Linux problems will be fixed, but
my article was about the state of Linux *today*, not next year or the year
after.

>In private discussions with Mark I have pointed out most of these
>counter-arguments, which he unfortunately failed to answer. He also didn't
>answer my questions about NT's shortcomings in the above areas. (as
>always, seemingly powerful concepts can often open up ugly ratholes)
>Different OS, different approach. Let the numbers talk.
>

I try to answer all e-mail that raises technical issues. If I failed to
answer yours, Ingo, then it was simply because I was too busy.

-Mark

Mark Russinovich, Ph.D.
NT Internals Columnist, Windows NT Magazine
http://www.winntmag.com



Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Ingo Molnar (mingo@chiara.csoma.elte.hu)
Sun, 2 May 1999 14:12:39 +0200 (CEST)

On Sun, 2 May 1999, Mark Russinovich wrote:

> >First he claims Linux has only select(), and then he continues to bash
> >select(). (without providing measurements or benchmark numbers) Then he
> >says that Linux _does_ have asynchronous IO events implemented in 2.2 but
> >says that they have 'two major limitations'. Both 'limitations' he
> >mentions are in fact a pure implementation matter and not a mechanism or
> >API limitation. Mark also forgot to mention that Linux asynchronous IO is
> >superior to NT's because we do not have to poll the completion port for
> >events; we can have the IO event delivered _immediately_ to the target
> >thread (which is preempted by a signal if it's running). This gives more
> >flexibility in using asynchronous events. (I have pointed out this
> >difference to him in private discussions; he left this argument unanswered)
> >
>
> Completion ports in NT require no polling and no linear searching - that,
> and their integration with the scheduler, is their entire reason for
> existence. [...]

They require a thread to block on completion ports, or to poll the status
of the completion port. NT gives no way to asynchronously send completion
events to a _running_ thread.
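
(the consumer side, from the documented API - an untested sketch, but it
shows the point: the worker thread has to sit inside this call, it is not
interrupted while it is off running other code:)

#include <windows.h>

static DWORD WINAPI worker(LPVOID port)
{
	DWORD nbytes;
	ULONG_PTR key;
	OVERLAPPED *ov;

	for (;;) {
		/* blocks until some I/O on the port completes */
		if (!GetQueuedCompletionStatus((HANDLE)port, &nbytes,
					       &key, &ov, INFINITE))
			continue;
		/* ... process the completed request ... */
	}
}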

> [...] Also, Linux's implementation of asynchronous I/O only applies to
> tty devices and to *new connections* on sockets - nothing else. [...]

Yes, networking is the main user of asynchronous events. Given that
asynchronous IO is rather new under Linux, it was a natural choice.

> Sure asynchronous I/O can be added to the rest of the I/O architecture

No. I personally think that networking is about the only place where this
technique has a long-term future ... do you suggest that any 'enterprise
server' is IO-bound on block devices? But yes, it can be added. (squid for
one could benefit from it, but even squid is typically limited by memory
or disk seek time)

> >Here he again forgets to _prove_ that overscheduling happens in Linux.
> >Measurements have been done on big busy Linux webservers (much bigger than
> >the typical 'enterprise' category), and the runqueue length (number of
> >threads competing for requests) was typically 3-4. Enough said ...
> >
>
> Under high load environments even the short run-queue lengths you refer to
> are enough to degrade performance. And in the environments I'm talking
> about where there are several hundred requests being served concurrently,
> the run queue lengths for Linux are significantly higher with the
> implementation of a one-thread-to-one-client server model.

Do you suggest Dejanews does not work? You are often taking architectural
examples from the NT side, without measuring the Linux side. I actually
have a test-setup here that does 2000 new Apache connections a second
(over a real network), and no, we do not 'overschedule'.

It's often apples to oranges, and I'd really suggest that before you bash
any architectural solution (in _any_ OS) as a 'severe limitation' you had
better be damn sure you are right, or wear asbestos. I hope I'm not
sounding arrogant: _if_ we get into an overscheduling situation on the
networking side we already have plans to address it (with a few lines of
change), but currently it's not necessary. I have seen no request for
discussion from you on linux-kernel about overscheduling.

> >'kernel reentrancy'
> >-------------------
> >
> >His example is a clear red herring. If any Linux application is
> >read()/write() intensive to the page cache, it would do better to use
> >mmap(). I can understand why Mark did not mention mmap(): NT has a rather
> >inferior mmap() implementation. (e.g. read()/write() and mmap()-ed
> >modifications done to the same file are not guaranteed to be data-coherent
> >by NT ...) His threading point is correct, there is still code left to be
> >threaded for SMP operation, just as NT has one single big lock in its
> >networking stack in NT4 SP4. (Only SP5 fixes this, and SP5 is not yet out
> >of beta.)
>
> First, serialization of long paths through the kernel degrade
> multiprocessor scalability - this is multiprocessing 101.

Yes, sure. Do they make an OS 'unable to handle the enterprise category'?
Nope. Just like NT's deficiencies do not necessarily make it incapable. As
I've explained to you, much of the Linux IO path (the interrupt part) goes
under a different lock.

> You mention mmap, and I'm assuming you do so as an alternative to sendfile.

Not at all. You mentioned cached read()/write(), and I just pointed out
that if you do heavy cached read()s and write()s then you are doing the
wrong thing. I've attached a patch from David S. Miller that deserializes
much of the 'heavy' parts of reads and writes in the ext2, pipe, TCP and
AF_UNIX paths. The patch adds 50 new lines. (just so people get a picture
of the magnitude of these 'severe limitations') But yes, Linux still has a
way to go.

> BTW This isn't related to read-only file serving, but Linus admits that
> mmap in 2.2 has a flaw where write-backs to a modified file result in two
> copies instead of 1. He says that this will probably be fixed in 2.3.x.

Yes, this is a known problem. (_this_ is what I consider to be one of the
top Linux problems, not the other ones you mention.)

> This implementation has 0 buffer copies and requires 1 system call to send
> an entire HTTP response. There is no manipulation of process address space,
> and the server need not manage its own file cache. In addition, the call
> can be made asynchronously, where waiting is done on a completion port that
> is waiting on new connections and more requests on existing connections.
> The asynchronous I/O model in NT extends to all I/O. NT (and Solaris,
> HP/UX, AIX) also have another API that Linux doesn't have yet: acceptex
> (the name of the NT version). This API is used to simultaneously perform an
> accept, the first read(), and geetpeer() in one system call. The advantages
> should be obvious.

_please_, could you time NT, Solaris and HP/UX to see how long they take
for a single sendfile() system call, and compare that to Linux null
syscall latencies? The Linux numbers are:

[mingo@moon l]$ ./lat_syscall null
Simple syscall: 0.8403 microseconds

One reason we made syscalls so lightweight is to avoid silly
'multi-purpose' conglomerate system calls like NT has. sendfile() was
added mainly not to avoid system calls, but because of its strong (and
unique) conceptual foundations. Linux syscalls will be even more
lightweight in the future. (I have a prototype patch that makes them cost
0.30 microseconds) Do you see the point? Again an apples-to-oranges
problem.

> As for the Linux implementation of sendfile(), it does not support adding a
> header and the Linux TCP stack does not support 0-copy sends. Thus, there
> is an extra system call and buffer copy for a write() to send the header,
> and an extra buffer copy for sending the file.
[...]

> Just to clarify, the Linux TCP/IP stack does not support 0-copy sending.

Zero-copy has drawbacks too. (latency ones mainly) You seem to be very
much focused on bandwidth, but that's not everything. Could you please
compare the latencies of the Linux and NT TCP stacks? (I have.) Or do you
believe that latency does not matter? But yes, in certain circumstances we
want to have zero-copy. (sendfile is one such example)

> >In private discussions with Mark I have pointed out most of these
> >counter-arguments, which he unfortunately failed to answer. He also didn't
> >answer my questions about NT's shortcomings in the above areas. (as
> >always, seemingly powerful concepts can often open up ugly ratholes)
> >Different OS, different approach. Let the numbers talk.
>
> I try to answer all e-mail that raises technical issues. If I failed to
> answer yours, Ingo, then it was simply because I was too busy.

My major problem with your analysis is that in my opinion you paint a
one-sided picture, with NT always on the 'winner' side and Linux on the
'loser' side. Am I correct in understanding that you consider Linux to be
an inferior design? I think there are two more technical issues you left
unanswered previously:

- CPU-specific optimizations. NT offers one single binary image for all
x86 CPU architectures. (barring the SMP/UP distinction) How do you explain
the speed penalty to your 'enterprise customers'? The same holds for
CPU-specific assembly optimizations.

- NT's 'hidden locks'. Just as the NT4 SP5 beta silently introduced
'deserialization' into the networking code. (and certainly they claimed NT
to be in the 'enterprise category' years before) Are you 100% sure there
are no other NT subsystems left out 'accidentally' that make it incapable
of handling the load of 'enterprise class servers'? How can you be sure
that NT's TCP timers are scalable? You do not seem to _honor_ and balance
the fact that Linux has all its source code out there, and thus yes, all
the mistakes are visible. NT is basically a black box. You quote NT
manuals instead of source code, and then compare that to Linux without
doing head-to-head measurements.

-- mingo



From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Date: Tue, 4 May 1999 02:04:08 +0200 (CEST)

On Sun, 2 May 1999, Shane R. Stixrud wrote:

>Under high load environments even the short run-queue lengths you refer to
>are enough to degrade performance. And in the environments I'm talking

If I understand correctly, here you are complaining about the fact that
Linux takes the list of running processes and scans it at every schedule()
to choose the next process to run.

What I would expect from changing the linked list to a heap, for example,
is a slowdown in the normal case that covers most Linux usage, in order to
gain 1/2% in the unlikely case of a machine that has more than a few
hundred tasks running all the time. Maybe someday I'll try it, but it's
really not an obvious improvement to my eyes.

Remember also that the schedule() frequency will not increase with the
increase in the number of running tasks in the system.
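
Just so we are talking about the same thing, the scan is conceptually
something like this (a toy sketch, _not_ the real 2.2 schedule() code):

/* walk the run queue once, keep the task with the best weight:
 * O(number of runnable tasks), which is what a heap would shorten */
struct task {
	struct task *next_run;
	int counter;		/* remaining timeslice */
	int priority;
};

static struct task *pick_next(struct task *runqueue)
{
	struct task *p, *best = NULL;
	int best_weight = -1;

	for (p = runqueue; p; p = p->next_run) {
		int weight = p->counter + p->priority;	/* toy "goodness" */
		if (weight > best_weight) {
			best_weight = weight;
			best = p;
		}
	}
	return best;
}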

>mapping files as needed. The file system is also performing the same
>management of the file system cache. 

The sentence above makes no sense to me (maybe due to my bad English ;).
It's the filesystem that directly manages the buffer cache while playing
with metadata or doing writes.

>BTW This isn't related to read-only file serving, but Linus admits that
>mmap in 2.2 has a flaw where write-backs to a modified file result in two
>copies instead of 1. He says that this will probably be fixed in 2.3.x.

Not true. Only if the area was already in the page cache are you going to
do two copies. If you write to a file directly without reading it, you'll
copy the data _only_ to the buffer cache. And don't tell me that you would
like raw IO (where you don't copy to the buffer cache at all). Caching
dirty buffers is strictly needed for any kind of usage in order to have
decent performance on _all_ hardware (... excluding ram-disks ;).

But you missed a point in your favour (it seems to me you don't know the
details of the Linux VM that well, otherwise you would have mentioned
this): the double copy also applies to normal writes, not only to writes
to mmapped regions. But again, such a double copy may save us time in
other places of the kernel, and while it's true that the write(2) may be
slower (as just said, it will be slower _only_ if there is a page-cache
page for the written data), the next read will be far faster than having
to read directly from the buffer cache. So if you write once and read
twice, I think you will have just paid off the cost of the double write.

Theoretically we could drop the page-cache page instead of doing the copy,
but since the userspace data was just in the L1/L2 cache it's surely
better to do the copy at write(2) time rather than at some later time, in
my opinion.

I don't know exactly what the plans are for replacing the copy to the page
cache, but I am not too worried about that. Just look at this:

andrea@laser:~$ procinfo    
Linux 2.2.7 (root@laser) (gcc egcs-2.91.60) #60 [laser.(none)]
[WARNING: can't print all IRQs, please recompile me]
Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:        127768      121276        6492       25524       19744       80304
Swap:        72256           0       72256

Bootup: Mon May  3 23:49:12 1999    Load average: 0.00 0.00 0.05 1/38 2697

user  :       0:05:21.71   4.9%  page in :   104101  disk 1:    19654r    5844w
					     ^^^^^^
nice  :       0:00:00.01   0.0%  page out:    18738
					      ^^^^^
system:       0:05:00.32   4.6%  swap in :        1
[..]

See the two underlined lines. The output above is the procinfo for the
machine I am running on now, but it's not a server; now let's look at how
e-mind.com looks:

andrea@penguin:~$ procinfo | grep page
user  :   1d  8:05:53.53   4.0%  page in : 72314934  disk 1: 11007497r 6712837w
					   ^^^^^^^^
nice  :       0:44:20.53   0.1%  page out: 36293162
					   ^^^^^^^^

Consider that e-mind.com does a backup daily to a local HD, and that while
doing the backup it never reads the backup file, so the writes go only
_to_ the buffer cache and we won't do the second copy to the page cache
because the page was not present in the first place.

>This implementation has 0 buffer copies and requires 1 system call to send
>an entire HTTP response. There is no manipulation of process address space,

Don't assume that doing zero copy means an improvement. If you do a zero
copy of 1 gigabyte of data, OK, but if you do a zero copy of 1 kbyte of
data the issue is _much_ different.

I just ran a `netcat localhost chargen >/dev/null' to see how much a zero
copy would help a huge transfer of data. I left it running for some time
and here are the numbers:

root@laser:/home/andrea# readprofile -m /System.old | sort -nr | head -20
  4808 total                                      0.0084
   565 csum_partial                               3.6218
   402 tcp_do_sendmsg                             0.2332
   364 sys_write                                  1.2466
   272 system_call                                4.2500
   234 __generic_copy_from_user                   3.6562
   144 sock_sendmsg                               0.8571
   141 tcp_recvmsg                                0.1034
   106 __release_sock                             0.8548
   104 __strncpy_from_user                        2.8889
    97 csum_partial_copy_generic                  0.4409
    88 do_readv_writev                            0.1897
    84 schedule                                   0.0905
    72 do_bottom_half                             0.4500
    71 ip_queue_xmit                              0.0740
    70 add_timer                                  0.1768
    68 __wake_up                                  0.8500
    66 tcp_rcv_established                        0.0437
    66 synchronize_irq                            3.3000
    66 kmem_cache_alloc                           0.1897

While it's true that one of the most frequently hit functions is
copy_from_user, it's also true that most of the time is spent in
csum_partial and in other mixed overhead, where the small and fast
copy_from_user gets hidden.

Consider that I ran the test in the worst case from a copy_from_user point
of view: no network overhead (loopback), so no IRQ flood and no small
writes.

Just look at what happens running a different network load, again on the
loopback device. I'll use lat_tcp (the TCP benchmark you can find in
lmbench; if I understood it correctly it only does a ping-pong of packets
to measure the latency of the TCP stack, so no contiguous stream of data).
Again no real network load and no disk access: still many, many conditions
_against_ a copy_from_user approach.

  4074 total                                      0.0071
   325 tcp_recvmsg                                0.2383
   158 add_timer                                  0.3990
   153 tcp_do_sendmsg                             0.0887
   144 schedule                                   0.1552
   144 ip_fw_check                                0.1390
   143 tcp_rcv_established                        0.0946
   110 kmalloc                                    0.2723
   105 kfree                                      0.2283
   104 sys_write                                  0.3562
    97 sys_read                                   0.4181
    92 tcp_transmit_skb                           0.0935
    90 __kfree_skb                                0.5357
    86 ip_queue_xmit                              0.0896
    83 tcp_clean_rtx_queue                        0.2767
    83 do_bottom_half                             0.5188
    78 kmem_cache_alloc                           0.2241
    74 __wake_up                                  0.9250
    71 skb_clone                                  0.4671
    70 kmem_cache_free                            0.1733
    68 schedule_timeout                           0.4857
    68 kfree_skbmem                               1.0625
    66 ip_local_deliver                           0.1130
[..] after a lot
    26 synchronize_bh                             0.3250
    26 loopback_xmit                              0.1327
    25 __tcp_select_window                        0.1420
    24 __generic_copy_from_user                   0.3750
	 ^^^^^^^^^^^^^^^^^^^^^^
    23 tcp_v4_send_check                          0.2300
    22 ipfw_output_check                          0.1719
    18 tcp_ack_saw_tstamp                         0.0643

So definitely the zero copy in the TCP stack does not look like an issue
to me, at least with a web load. An FTP load may be a bit different, but I
still think it would not be an issue considering that you won't be running
on a loopback but on a busy network card, and you'll also do some access
to disk etc. etc.

And all this without considering the overhead of the code to do zero copy
and the fact that such code would increase the complexity of the TCP stack
a lot, nor that we could have bugs and go slower in the interactive case
(like the ping-pong of lmbench) because of that overhead.

>Like I said, I'm sure that over time the Linux problems will be fixed, but
>my article was about the state of Linux *today*, not next year or the year
>after.

I instead think that the 0-copy in the TCP stack is a red herring too. And
even if it were an obvious improvement, it's not obvious that we want to
increase the complexity of the code to achieve it.

Comments?

Andrea Arcangeli



Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Alan Cox (alan@lxorguk.ukuu.org.uk)
Tue, 4 May 1999 12:04:39 +0100 (BST)

> existence. Also, Linux's implementation of asynchronous I/O only applies to
> tty devices and to *new connections* on sockets - nothing else. Sure

Wrong

Why do people even bother playing along with him 8)

> addition to a file handle. The TCP stack sends the file data directly from
> the file system cache as a 0-copy send. The user buffers are also sent with
> the file data, and are not copied from user space, but locked into physical
> memory for the duration of the send.

BTW I hope Solaris didn't do that; there is a classic sendfile
machine-destroying attack where you use a lot of slow connections to jam a
machine up with locked-down pages. Fun for all the family.

Similarly most of his other arguments are based on highly theoretical views
of computing. One thing writing a real OS instead of writing about it teaches
people is that 99% of OS theory is complete and utter crud.

Zero copy is a good example. For many things zero copy actually reduces
performance, especially on SMP machines, due to the amount of
memory-handling work involved in the page locking.

That is why many OSes only do sendfile()-based zero copy.

Alan



From: Andrea Arcangeli <and...@e-mind.com>
Subject: Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Date: Tue, 4 May 1999 15:08:01 +0200 (CEST)

On Tue, 4 May 1999, Alan Cox wrote:

>Similarly most of his other arguments are based on highly theoretical views
>of computing. One thing writing a real OS instead of writing about it teaches 
>people is that 99% of OS theory is complete and utter crud.

Agreed!! ;))

Andrea Arcangeli



Re: [OT] Comments to WinNT Mag !!
Alan Cox (alan@lxorguk.ukuu.org.uk)
Tue, 4 May 1999 17:25:17 +0100 (BST)

> the past, and he has always seemed a very fair-minded individual. He
> certainly isn't in Microsoft's pocket, for example. I would hope that if
> asked, he would contribute some follow-up to this forum. Of course, he
> makes his living from NT - both its strengths and its shortcomings.

I've had several discussions with him. He avoids questions about points that
don't fit his pet theory of OS design, he criticises anything that doesn't
follow his theory, and he often doesn't seem to understand it - e.g. he
didn't understand the kernel lock in 2.0.x SMP.

I'm not sure he's so much pro-NT as pro his pet theory, and NT happens to
match it.



From: "Mark H. Wood" <mw...@IUPUI.Edu>
Subject: Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Date: Tue, 4 May 1999 12:23:50 -0500 (EST)

On Sun, 2 May 1999, Ingo Molnar wrote:
> On Sun, 2 May 1999, Mark Russinovich wrote:
> > Completion ports in NT require no polling and no linear searching - that,
> > and their integration with the scheduler, is their entire reason for
> > existence. [...]
> 
> they require a thread to block on completion ports, or to poll the status
> of the completion port. NT gives no way to asynchronously send completion
> events to a _running_ thread.

Ugh.  I liked the VMS model here.  When you queue an I/O request, one of
the things you can attach to it is the address of a procedure. When the
request completes, the kernel creates a temporary thread to execute the
I/O rundown code, and part of that rundown is to call the procedure you
supplied.  Your procedure would typically move something from a wait queue
to a work queue, or flip a bit in a bitmask, or link a buffer onto the
free chain, or whatever it takes to indicate that your regular thread(s)
should do whatever you want done when the I/O has completed.  When you
return, the rundown thread tidies up and destroys itself.  (Of course, if
you never return, or you try to do huge amounts of processing in your
rundown procedure, your program won't work very well.  Don't do that.
Keep it short and simple.)

-- 
Mark H. Wood, Lead System Programmer   mw...@IUPUI.Edu
Specializing in unusual perspectives for more than twenty years.



From: kuz...@ms2.inr.ac.ru
Subject: Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Date: Tue, 4 May 1999 20:37:25 +0400 (MSK DST)

Hello!

For God's sake, could someone explain to me what the difference is
between our sendfile() and a plain write() from an mmap()ed region?


The only difference which I see now is that sendfile() ALLOWS
zero-copy to be done NOT WORSE than a usual write(), right?

Summing up, sendfile() without zero-copy is pure cheating;
if we added it to the API it means that we plan to implement zero copy
one day. 8)

BTW, is it really true that NT's transmitfile() does zero copy?
I strongly suspect it does not.

Alexey


Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Alan Cox (alan@lxorguk.ukuu.org.uk)
Tue, 4 May 1999 19:01:21 +0100 (BST)

> For God's sake, could someone explain to me what the difference is
> between our sendfile() and a plain write() from an mmap()ed region?

You have to take the overhead of mapping the entire file, and of TLB
shootdowns while setting up the VM, with mmap but not with sendfile().

> Summing up, sendfile() without zero-copy is pure cheating;
> if we added it to the API it means that we plan to implement zero copy
> one day. 8)

Yep

> BTW, is it really true that NT's transmitfile() does zero copy?
> I strongly suspect it does not.

NT5 beta claims to



From: David Miller <da...@twiddle.net>
Subject: Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Date: Tue, 4 May 1999 15:45:40 -0700

   From: a...@lxorguk.ukuu.org.uk (Alan Cox)
   Date: 	Tue, 4 May 1999 19:01:21 +0100 (BST)

   > BTW, is it really true that NT's transmitfile() does zero copy?  I
   > strongly suspect it does not.

   NT5 beta claims to

They can avoid the extraneous copy, but what they cannot do with most
PC networking cards is avoid touching the data since most cards do not
provide a hardware checksumming facility.

Most of this would suggest that their existing architecture passes
mbuf-chain-like buffers to the networking drivers in NT, or some other
kind of scatter-gather-list-like scheme.  This is the only way they
could do zero-copy without driver updates from all the networking card
vendors.

Later,
David S. Miller
da...@redhat.com


From: Richard Gooch <rgo...@atnf.csiro.au>
Subject: Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Date: Wed, 5 May 1999 08:37:33 +1000

Mark H. Wood writes:
> On Sun, 2 May 1999, Ingo Molnar wrote:
> > On Sun, 2 May 1999, Mark Russinovich wrote:
> > > Completion ports in NT require no polling and no linear searching - that,
> > > and their integration with the scheduler, is their entire reason for
> > > existence. [...]
> > 
> > they require a thread to block on completion ports, or to poll the status
> > of the completion port. NT gives no way to asynchronously send completion
> > events to a _running_ thread.
>
> Ugh.  I liked the VMS model here.  When you queue an I/O request,
> one of the things you can attach to it is the address of a
> procedure. When the request completes, the kernel creates a
> temporary thread to execute the I/O rundown code, and part of that
> rundown is to call the procedure you supplied.  Your procedure would
> typically move something from a wait queue to a work queue, or flip
> a bit in a bitmask, or link a buffer onto the free chain, or
> whatever it takes to indicate that your regular thread(s) should do
> whatever you want done when the I/O has completed.  When you return,
> the rundown thread tidies up and destroys itself.  (Of course, if
> you never return, or you try to do huge amounts of processing in
> your rundown procedure, your program won't work very well.  Don't do
> that.  Keep it short and simple.)

What was the cost of creating the "temporary thread"? Anyway, we can
do much the same thing with signals, except we don't need to create a
temporary thread.
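
Something like the following, roughly (an untested sketch using the
posix.4 aio and queued-signal interfaces; how complete the Linux
implementation of those interfaces is today is a separate question):

/* queue an aio_read() and attach a real-time signal that fires when it
 * completes; the handler plays the role of the VMS rundown procedure.
 * needs -lrt (or whatever provides the posix.4 aio calls). */
#include <aio.h>
#include <signal.h>
#include <string.h>

static void rundown(int sig, siginfo_t *si, void *ctx)
{
	struct aiocb *req = si->si_value.sival_ptr;
	/* keep it short: just flag the request as done for the main loop */
	(void)sig; (void)ctx; (void)req;
}

static int queue_read(int fd, void *buf, size_t len, struct aiocb *req)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = rundown;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGRTMIN, &sa, NULL) < 0)
		return -1;

	memset(req, 0, sizeof(*req));
	req->aio_fildes = fd;
	req->aio_buf = buf;
	req->aio_nbytes = len;
	req->aio_sigevent.sigev_notify = SIGEV_SIGNAL;
	req->aio_sigevent.sigev_signo = SIGRTMIN;
	req->aio_sigevent.sigev_value.sival_ptr = req;
	return aio_read(req);
}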

				Regards,

					Richard....


From: "Stephen C. Tweedie" <s...@redhat.com>
Subject: Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Date: Thu, 6 May 1999 18:52:47 +0100 (BST)

Hi,

On Wed, 5 May 1999 08:37:33 +1000, Richard Gooch <rgo...@atnf.csiro.au>
said:

>> Ugh.  I liked the VMS model here.  When you queue an I/O request,
>> one of the things you can attach to it is the address of a
>> procedure. When the request completes, the kernel creates a
>> temporary thread to execute the I/O rundown code, and part of that
>> rundown is to call the procedure you supplied.  

> What was the cost of creating the "temporary thread"? 

There isn't one: the AST is scheduled in the context of the calling
process/thread.  AST delivery is an integral part of the scheduler, much
like signal delivery is on Unix.

> Anyway, we can do much the same thing with signals, except we don't
> need to create a temporary thread.

Yes --- ASTs are very similar to posix.4 queued signals.

--Stephen
