[OT] Comments to WinNT Mag !!
Linux Lists (lists@cyclades.com)
1999-04-30 19:47:03
Hello,
As I don't have the technical expertise to discuss with this gentleman
(although I think somebody _must_), I'm forwarding this URL to the list:
http://www.winntmag.com/Magazine/Article.cfm?ArticleID=5048
I hope you guys can address this article's issues properly. Please contact
the author !!
Regards,
Ivan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
Re: [OT] Comments to WinNT Mag !!
Ingo Molnar (mingo@chiara.csoma.elte.hu)
Fri, 30 Apr 1999 23:05:26 +0200 (CEST)
> As I don't have the technical expertise to discuss with this gentleman
> (although I think somebody _must_), I'm forwarding this URL to the list:
>
> http://www.winntmag.com/Magazine/Article.cfm?ArticleID=5048
>
> I hope you guys can address this article's issues properly. Please contact
> the author !!
While he (Mark Russinovich) does have valid points (no OS is perfect), he
is exaggerating things a lot at the expense of Linux, to make NT appear in a
better light. He also forgets to mention lots of powerful mechanisms and
features present in the Linux kernel that give it an edge over NT, and he
forgets to mention shortcomings of NT in the same areas. But I think his
main mistake is that he is trying to find the NT API in Linux. No, Linux
is Linux. I'll try to address most of his technical points briefly:
'asynchronous I/O'
------------------
first he claims Linux has only select(), and then he continues to bash
select() (without providing measurements or benchmark numbers). Then he
says that Linux _does_ have asynchronous I/O events implemented in 2.2, but
says that they have 'two major limitations'. Both 'limitations' he
mentions are in fact a pure implementation matter, not a mechanism or
API limitation. Mark also forgot to mention that Linux asynchronous I/O is
superior to NT's because we do not have to poll the completion port for
events; we can have the I/O event delivered _immediately_ to the target
thread (which is preempted by a signal if it's running). This gives more
flexibility in using asynchronous events. (I have pointed out this
difference to him in private discussions; he left this argument unanswered.)
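(For concreteness: the delivery mechanism described above is the SIGIO/O_ASYNC
facility. A minimal sketch of how a server might arm it on a listening socket
follows; the port number and handler body are illustrative only, not from the
thread, and error handling is omitted.)

/* Sketch: signal-driven I/O notification on a listening socket.
 * The kernel raises SIGIO at the owning process as soon as a new
 * connection arrives, instead of the process blocking on a poll object.
 * Illustrative only; error handling omitted. */
#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

static volatile sig_atomic_t io_pending;

static void sigio_handler(int sig)
{
    (void)sig;
    io_pending = 1;        /* delivered even while the thread is running */
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                 /* hypothetical port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);

    signal(SIGIO, sigio_handler);
    fcntl(fd, F_SETOWN, getpid());               /* route SIGIO here */
    /* on older headers O_ASYNC may be spelled FASYNC */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);

    for (;;) {
        /* ... do useful work; no need to sit in select() ... */
        if (io_pending) {
            io_pending = 0;
            int conn;
            while ((conn = accept(fd, NULL, NULL)) >= 0)
                close(conn);                     /* hand off the connection */
        }
        pause();                                 /* illustrative idle wait */
    }
}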
'overscheduling'
----------------
here he again forgets to _prove_ that overscheduling happens in Linux.
Measurements have been done on big busy Linux webservers (much bigger than
the typical 'enterprise' category), and the runqueue length (number of
threads competing for requests) was typically 3-4. Enough said ...
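(Background for the run-queue figure above: the 2.2 scheduler walks the list of
runnable tasks on every schedule() and picks the best candidate, so its cost
scales with the number of runnable tasks, not with the number of open
connections. Below is a deliberately simplified userspace illustration of that
linear scan; the field names and weighting are invented for the sketch and it
is not the actual kernel/sched.c code.)

/* Toy model of the linear run-queue scan: each scheduling decision walks
 * the runnable tasks and picks the one with the highest "goodness". */
#include <stddef.h>

struct task {
    struct task *next;   /* run-queue linkage   */
    int counter;         /* remaining timeslice */
    int priority;        /* static priority     */
};

static int goodness(const struct task *t)
{
    /* prefer tasks that still have timeslice left */
    return t->counter ? t->counter + t->priority : 0;
}

/* O(n) in runnable tasks: cheap when the run queue is 3-4 entries long,
 * as in the web-server measurements quoted above. */
struct task *pick_next(struct task *runqueue)
{
    struct task *best = NULL;
    int best_weight = -1;

    for (struct task *t = runqueue; t != NULL; t = t->next) {
        int w = goodness(t);
        if (w > best_weight) {
            best_weight = w;
            best = t;
        }
    }
    return best;
}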
'kernel reentrancy'
-------------------
his example is a clear red herring. If any Linux application is
read()/write() intensive to the page cache, it would be better off using
mmap(). I can understand why Mark did not mention mmap(): NT has a rather
inferior mmap() implementation. (e.g. read()/write() and mmap()-ed
modifications done to the same file are not guaranteed to be data-coherent
by NT ...)
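(A short sketch of the mmap() alternative being suggested; the filename is
hypothetical and error handling is trimmed. The mapping is backed directly by
the page cache, so repeated access does not copy the data through a user
buffer on every call the way read() does.)

/* Sketch: scanning a file through the page cache via mmap() instead of
 * repeated read() calls. Filename is illustrative; error handling trimmed. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    const char *path = "/var/www/index.html";    /* hypothetical file */
    int fd = open(path, O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    /* The pages of the mapping are the page-cache pages themselves. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED)
        return 1;

    long lines = 0;                               /* touch the data in place */
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n')
            lines++;
    printf("%ld lines\n", lines);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}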
His threading point is correct, there is still code left to be threaded
for SMP operation. Just as NT has one single big lock in its networking
stack in NT4 SP4. (Only SP5 has fixed this, and SP5 is not yet out of
beta.)
'sendfile'
----------
sendfile() is a new system call. The copying problem he noticed is true,
but it's a matter of the networking code, not some conceptual problem with
sendfile(). If the networking code does zero-copy then sendfile() will do
zero-copy as well (without the user ever noticing). sendfile() will
certainly be further optimized in 2.3.
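(For reference, the call itself: one sendfile() moves file data from the page
cache onto a connected socket in a single system call. A hedged sketch with
illustrative names and no error handling; on 2.2-era libcs the wrapper may not
exist yet and the call may have to be made via syscall(2).)

/* Sketch of serving a whole file with sendfile(2). Whether the kernel
 * copies or maps the pages is the networking code's business, exactly as
 * noted above -- the caller's side does not change. */
#include <sys/types.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t send_whole_file(int sock_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    struct stat st;
    off_t offset = 0;
    ssize_t sent = -1;

    if (file_fd >= 0 && fstat(file_fd, &st) == 0)
        sent = sendfile(sock_fd, file_fd, &offset, st.st_size);  /* out, in */

    if (file_fd >= 0)
        close(file_fd);
    return sent;
}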
in private discussions with Mark I have pointed out most of these
counter-arguments, which he unfortunately failed to answer. He also didn't
answer my questions about NT's shortcomings in the above areas. (As
always, seemingly powerful concepts can often open up ugly ratholes.)
Different OS, different approach. Let the numbers talk.
-- mingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
Re: [OT] Comments to WinNT Mag !!
Linux Lists (lists@cyclades.com)
Fri, 30 Apr 1999 15:58:20 -0700 (PDT)
> in private discussions with Mark I have pointed out most of these
> counter-arguments, which he unfortunately failed to answer. He also didn't
> answer my questions about NT's shortcomings in the above areas. (As
> always, seemingly powerful concepts can often open up ugly ratholes.)
> Different OS, different approach. Let the numbers talk.
Ok, if you (and others) have already contacted him and shown him these
fallacies, good. However, I think there should be a _public_ place where
this clarification is available. Otherwise, misinformed people who read
this article will get the wrong idea about Linux, and they won't have any
other place that tells them otherwise.
Anyhow, thanks again for your helpful reply !!
Regards,
Ivan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
RE: [OT] Comments to WinNT Mag !!
BROWN Nick (Nick.BR...@coe.int)
Sat, 1 May 1999 20:01:20 +0200
> However, I think there should be a _public_ place where this
> clarification is available.
I have corresponded briefly with Mark Russinovich on NT-related issues in
the past, and he has always seemed a very fair-minded individual. He
certainly isn't in Microsoft's pocket, for example. I would hope that, if
asked, he would contribute some follow-up to this forum. Of course, he
makes his living from NT - both its strengths and its shortcomings.
Nick Brown, Strasbourg, France (Nick(dot)Brown(at)coe(dot)int)
email address updates : @coe.int replaces @coe.fr
for more information, http://dct.coe.int/info/emfci001.htm
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
From: "Shane R. Stixrud" <sh...@souls.net> Subject: Mark Russinovich's reponse Was: [OT] Comments to WinNT Mag !! (fwd) Date: 1999/05/02 Message-ID: <fa.j7hkodv.1c0gabc@ifi.uio.no> X-Deja-AN: 473138027 Original-Date: Sun, 2 May 1999 03:52:39 -0700 (PDT) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.3.96.990502034629.15430A-100000@souls.net> To: linux-ker...@vger.rutgers.edu Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu I E-mailed Mark Russinovich and copied him on the "[OT] Comments to WinNT Mag !!" thread and suggested he respond. He sent me his response and requested that I forward it onto the list. Response below. ---------- Forwarded message ---------- Date: Sun, 02 May 1999 06:13:00 -0400 From: Mark Russinovich <m...@sysinternals.com> To: "Shane R. Stixrud" <sh...@souls.net> Subject: Re: [OT] Comments to WinNT Mag !! (fwd) Hi Shane, Please post my response to the list. At 01:24 PM 5/1/99 , you wrote: > >'asynchron IO' >-------------- > >first he claims Linux has only select(), and then he continues to bash >select(). (without providing measurements or benchmark numbers) Then he >says that Linux _does_ have asy nchron IO events implemented in 2.2 but >says that they have 'two major limitations'. Both 'limitations' he >mentions are in fact a pure implementation matter and not a mechanism or >API limitation. Mark also forgot to mention that Linux asynchron IO is >superior to NT because we do not have to poll the completion port for >events, we can have the IO event delivered _immediately_ to the target >thread (which is preempted by a signal if it's running). This gives more >flexibility of using asynchron events. (i have pointed out this difference >to him in private discussions, he left this argument unanswered) > Completion ports in NT require no polling and no linear searching - that, and their integration with the scheduler, is their entire reason for existence. Also, Linux's implementation of asynchronous I/O only applies to tty devices and to *new connections* on sockets - nothing else. Sure asynchronous I/O can be added to the rest of the I/O architecture (all of the deficiencies I bring up can, and I'm sure will, be addressed). My point is that it is currently very limited. >'overscheduling' >---------------- > >here he again forgets to _prove_ that overscheduling happens in Linux. >Measurements have been done on big busy Linux webservers (much bigger than >the typical 'enterprise' category), and the runqueue lenghth (number of >threads competing for requests) was 3-4 typically. Enuff said ... > Under high load environments even the short run-queue lengths you refer to are enough to degrade performance. And in the environments I'm talking about where there are several hundred requests being served concurrently, the run queue lengths for Linux are significantly higher with the implementation of a one-thread-to-one-client server model. >'kernel reentrancy' >------------------- > >his example is a clear red herring. If any Linux application is >read()/write() intensive to the page cache, it should better use mmap(). I >can understand Mark did not mention mmap(), NT has a rather inferior >mmap() implementation. (eg. read()/write() and mmap()-ed modifications >done to the same file are not guaranteed to be data-coherent by NT ...) 
>His threading point is correct, there is still code left to be threaded >for SMP operation. Just as NT has one single big lock in it's networking >stack in NT4 SP4. (only SP5 has fixed this, which is not yet out of the >beta status.) > First, serialization of long paths through the kernel degrade multiprocessor scalability - this is multiprocessing 101. You mention mmap, and I'm assuming you do so as an alternative to sendfile. Using mmap to serve files, the following is required: - the file is mapped with a call to mmap(). The kernel must manipulate the page tables of the process performing the map. - the process calls writev() to send an HTTP header in one buffer and file data from the mapped memory. This is another system call and two copies. There are 1-3 system calls (depending on whether the requested file has already been mapped, or another file must be unmapped to make room for the new mapping via mmap) , 2 buffer copies, and manipulation of the process page tables. The process must also manage its own file cache, unmapping and mapping files as needed. The file system is also performing the same management of the file system cache. BTW This isn't related to read-only file serving, but Linus admits that mmap in 2.2 has a flaw where write-backs to a modified file result in two copies instead of 1. He says that this will probably be fixed in 2.3.x. On the other hand, Sendfile on NT, Solaris, HP/UX and AIX are used as follows: - one call to sendfile() is made, and the call specifies buffers that serve as a prologue (e.g HTTP header) and epilogue to the file data, in addition to a file handle. The TCP stack sends the file data directly from the file system cache as a 0-copy send. The user buffers are also sent with the file data, and are not copied from user space, but locked into physical memory for the duration of the send. This implementation has 0 buffer copies and requires 1 system call to send an entire HTTP response. There is no manipulation of process address space, and the server need not manage its own file cache. In addition, the call can be made asynchronously, where waiting is done on a completion port that is waiting on new connections and more requests on existing connections. The asynchronous I/O model in NT extends to all I/O. NT (and Solaris, HP/UX, AIX) also have another API that Linux doesn't have yet: acceptex (the name of the NT version). This API is used to simultaneously perform an accept, the first read(), and geetpeer() in one system call. The advantages should be obvious. As for the Linux implementation of sendfile(), it does not support adding a header and the Linux TCP stack does not support 0-copy sends. Thus, there is an extra system call and buffer copy for a write() to send the header, and an extra buffer copy for sending the file. >'sendfile' >---------- > >sendfile() is a new system call. The copying problem he noticed is true, >but it's a matter of the networking code, not some conceptual problem with >sendfile(). If the networking code does zero-copy then sendfile() will do >zero-copy as well. (without the user ever noticing) sendfile() will >certainly be further optimized in 2.3. > Just to clarify, the Linux TCP/IP stack does not support 0-copy sending. See tcp_do_sendmsg() in net/ipv4/tcp.c. Note the calls to xx_copy_from_user() (the copy functions are macros defined in the architecture-specific include file uaccess.h). 
Like I said, I'm sure that over time the Linux problems will be fixed, but my article was about the state of Linux *today*, not next year or the year after. >in private discussions with Mark i have pointed out most of these >counter-arguments, which he unfortunately failed to answer. He also didnt >answer my questions about NT's shortcomings in the above areas. (as >always, seemingly powerful concepts can often open up ugly ratholes) >Different OS, different approach. Let the numbers talk. > I try to answer all e-mail that raise technical issues. If I failed to answer yours, Ingo, then it was simply because I was too busy. -Mark Mark Russinovich, Ph.D. NT Internals Columnist, Windows NT Magazine http://www.winntmag.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
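(To make the comparison in the exchange above concrete, here is a rough sketch
of the mmap()/writev() serving sequence Mark describes, with the per-request
system calls marked. The helper name and header format are illustrative, not
from either correspondent; error handling is trimmed.)

/* Sketch: sending an HTTP header plus a mapped file with writev(), the
 * multi-step path discussed above. Illustrative only. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/uio.h>

ssize_t serve_with_mmap(int sock_fd, const char *path)
{
    int fd = open(path, O_RDONLY);                     /* system call 1 */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return -1;

    /* System call 2: the mapping also touches the process page tables. */
    void *body = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (body == MAP_FAILED) {
        close(fd);
        return -1;
    }

    char header[128];
    int hlen = snprintf(header, sizeof(header),
                        "HTTP/1.0 200 OK\r\nContent-Length: %ld\r\n\r\n",
                        (long)st.st_size);

    struct iovec iov[2];
    iov[0].iov_base = header;                          /* prologue          */
    iov[0].iov_len  = (size_t)hlen;
    iov[1].iov_base = body;                            /* mapped file data  */
    iov[1].iov_len  = (size_t)st.st_size;

    /* System call 3: the data is copied out of the header buffer and the
     * mapping by the stack; this is where a zero-copy sendfile() saves work. */
    ssize_t sent = writev(sock_fd, iov, 2);

    munmap(body, st.st_size);
    close(fd);
    return sent;
}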
Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Ingo Molnar (mingo@chiara.csoma.elte.hu)
Sun, 2 May 1999 14:12:39 +0200 (CEST)
On Sun, 2 May 1999, Mark Russinovich wrote:
> >first he claims Linux has only select(), and then he continues to bash
> >select() (without providing measurements or benchmark numbers). Then he
> >says that Linux _does_ have asynchronous I/O events implemented in 2.2, but
> >says that they have 'two major limitations'. Both 'limitations' he
> >mentions are in fact a pure implementation matter, not a mechanism or
> >API limitation. Mark also forgot to mention that Linux asynchronous I/O is
> >superior to NT's because we do not have to poll the completion port for
> >events; we can have the I/O event delivered _immediately_ to the target
> >thread (which is preempted by a signal if it's running). This gives more
> >flexibility in using asynchronous events. (I have pointed out this
> >difference to him in private discussions; he left this argument unanswered.)
> >
>
> Completion ports in NT require no polling and no linear searching - that,
> and their integration with the scheduler, is their entire reason for
> existence. [...]
they require a thread to block on completion ports, or to poll the status
of the completion port. NT gives no way to asynchronously send completion
events to a _running_ thread.
> [...] Also, Linux's implementation of asynchronous I/O only applies to
> tty devices and to *new connections* on sockets - nothing else. [...]
yes, networking is the main user of asynchronous events. Given that
asynchronous IO is rather new under Linux, it was a natural choice.
> Sure asynchronous I/O can be added to the rest of the I/O architecture
no. I personally think that networking is about the only place where this
technique has a long-term future ... do you suggest that any 'enterprise
server' is I/O-bound on block devices? But yes, it can be added. (Squid, for
one, could benefit from it, but even Squid is typically memory or disk seek
time limited.)
> >here he again forgets to _prove_ that overscheduling happens in Linux.
> >Measurements have been done on big busy Linux webservers (much bigger than
> >the typical 'enterprise' category), and the runqueue length (number of
> >threads competing for requests) was typically 3-4. Enough said ...
> >
>
> Under high load environments even the short run-queue lengths you refer to
> are enough to degrade performance. And in the environments I'm talking
> about where there are several hundred requests being served concurrently,
> the run queue lengths for Linux are significantly higher with the
> implementation of a one-thread-to-one-client server model.
do you suggest Dejanews does not work? You are often taking architectural
examples from the NT side without measuring the Linux side. I actually
have a test setup here that does 2000 new Apache connections a second
(over a real network), and no, we do not 'overschedule'.
It's often apples to oranges, and I'd really suggest that before you
bash any architectural solution (in _any_ OS) as a 'severe limitation' you
had better be damn sure you are right, or wear asbestos. I hope I'm not
sounding arrogant; _if_ we get into an overscheduling situation on the
networking side we already have plans to address it (with a few lines of
change), but currently it's not necessary. I have seen no request for
discussion from you on linux-kernel about overscheduling.
> >'kernel reentrancy'
> >-------------------
> >
> >his example is a clear red herring. If any Linux application is
> >read()/write() intensive to the page cache, it would be better off using
> >mmap(). I can understand why Mark did not mention mmap(): NT has a rather
> >inferior mmap() implementation. (e.g. read()/write() and mmap()-ed
> >modifications done to the same file are not guaranteed to be data-coherent
> >by NT ...)
> >His threading point is correct, there is still code left to be threaded
> >for SMP operation. Just as NT has one single big lock in its networking
> >stack in NT4 SP4. (Only SP5 has fixed this, and SP5 is not yet out of
> >beta.)
>
> First, serialization of long paths through the kernel degrades
> multiprocessor scalability - this is multiprocessing 101.
yes, sure. Do they make an OS 'unable to handle the enterprise category'?
Nope. Just like NT's deficiencies do not necessarily make it incapable. As
I've explained to you, much of the Linux I/O path (the interrupt part) runs
under a different lock.
> You mention mmap, and I'm assuming you do so as an alternative to sendfile.
not at all. You mentioned cached read()/write(), and I just pointed out
that if you do heavy cached read()s and write()s then you are doing the
wrong thing. I've attached a patch from David S. Miller that deserializes
much of the 'heavy' parts of reads and writes in the ext2, pipe, TCP and
AF_UNIX paths. The patch adds 50 new lines (just so people get a picture of
the magnitude of these 'severe limitations'). But yes, Linux still has a
way to go.
> BTW This isn't related to read-only file serving, but Linus admits that
> mmap in 2.2 has a flaw where write-backs to a modified file result in two
> copies instead of 1. He says that this will probably be fixed in 2.3.x.
yes, this is a known problem. (_This_ is what I consider to be one of the
top Linux problems, not the other ones you mention.)
> This implementation has 0 buffer copies and requires 1 system call to send
> an entire HTTP response. There is no manipulation of process address space,
> and the server need not manage its own file cache. In addition, the call
> can be made asynchronously, where waiting is done on a completion port that
> is waiting on new connections and more requests on existing connections.
> The asynchronous I/O model in NT extends to all I/O. NT (and Solaris,
> HP/UX, AIX) also have another API that Linux doesn't have yet: acceptex
> (the name of the NT version). This API is used to simultaneously perform an
> accept, the first read(), and geetpeer() in one system call. The advantages
> should be obvious.
_please_, could you time how long NT, Solaris and HP/UX take for a
single sendfile() system call, and compare that to Linux's null syscall
latency? The Linux numbers are:
[mingo@moon l]$ ./lat_syscall null
Simple syscall: 0.8403 microseconds
one reason we made syscalls so lightweight is to avoid silly
'multi-purpose' conglomerate system calls like NT has. sendfile() was
added mainly not to avoid system calls being made, but because of its
strong (and unique) conceptual foundations. Linux syscalls will be even
more lightweight in the future. (I have a prototype patch that makes them
cost 0.30 microseconds.) Do you see the point? Again, an apples-to-oranges
problem.
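(For anyone who wants to reproduce the figure above: lmbench's lat_syscall
essentially times a trivial system call in a tight loop. A hand-rolled
equivalent, which is only a rough sketch and not the lmbench source, would be:)

/* Rough equivalent of `lat_syscall null`: time a cheap system call
 * (getppid) in a loop and report the per-call cost in microseconds. */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
    const long iterations = 1000000;
    struct timeval start, end;
    long i;

    gettimeofday(&start, NULL);
    for (i = 0; i < iterations; i++)
        getppid();                    /* about as close to a null syscall
                                         as portable code can get */
    gettimeofday(&end, NULL);

    double usec = (end.tv_sec - start.tv_sec) * 1e6 +
                  (end.tv_usec - start.tv_usec);
    printf("Simple syscall: %.4f microseconds\n", usec / iterations);
    return 0;
}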
> As for the Linux implementation of sendfile(), it does not support adding a
> header and the Linux TCP stack does not support 0-copy sends. Thus, there
> is an extra system call and buffer copy for a write() to send the header,
> and an extra buffer copy for sending the file.
[...]
> Just to clarify, the Linux TCP/IP stack does not support 0-copy sending.
zero-copy has drawbacks too (mainly latency ones). You seem to be very
much focused on bandwidth, but that's not everything. Could you please
compare the latencies of the Linux and NT TCP stacks? (I have.) Or do you
believe that latency does not matter? But yes, in certain circumstances we
want zero-copy. (sendfile is one such example.)
> >in private discussions with Mark I have pointed out most of these
> >counter-arguments, which he unfortunately failed to answer. He also didn't
> >answer my questions about NT's shortcomings in the above areas. (As
> >always, seemingly powerful concepts can often open up ugly ratholes.)
> >Different OS, different approach. Let the numbers talk.
>
> I try to answer all e-mail that raises technical issues. If I failed to
> answer yours, Ingo, then it was simply because I was too busy.
my major problem with your analysis is that in my opinion you paint a
one-sided picture, NT always on the 'winner' side and Linux on the
'loser' side. Am I correct to understand that you consider Linux to be an
inferior design? I think there are two more technical issues you left
unanswered previously:
- CPU-specific optimizations. NT offers one single binary image for all
x86 CPU architectures (barring the SMP/UP distinction). How do you explain
the speed penalty to your 'enterprise customers'? The same holds for
CPU-specific assembly optimizations.
- NT's 'hidden locks'. NT4 SP5 beta introduced 'deserialization'
silently into the networking code (and certainly they claimed NT to be in
the 'enterprise category' years before). Are you 100% sure there are no
other NT subsystems 'accidentally' left serialized that make it incapable
of handling the load of 'enterprise class servers'? How can you be sure
that NT's TCP timers are scalable? You do not seem to _honor_ and balance
the fact that Linux has all its source code out there, and thus, yes, all
the mistakes are visible. NT is basically a black box. You quote NT
manuals instead of source code. Then you compare that to Linux without
doing head-to-head measurements.
-- mingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Andrea Arcangeli (and...@e-mind.com)
Tue, 4 May 1999 02:04:08 +0200 (CEST)
On Sun, 2 May 1999, Shane R. Stixrud wrote:
>Under high load environments even the short run-queue lengths you refer to
>are enough to degrade performance. And in the environments I'm talking
If I understand correctly, here you are complaining about the fact that
Linux keeps a list of the runnable processes and scans it at every
schedule() to choose the next process to run. What I would expect from
changing the linked list to a heap, for example, is a slowdown in the
normal case of most Linux usage, to gain 1/2% in the unlikely case of a
machine that has more than some hundreds of tasks running all the time.
Maybe someday I'll try it, but it's really not an obvious improvement to my
eyes. Remember also that the schedule() frequency will not increase with
the increase of the number of running tasks in the system.
>mapping files as needed. The file system is also performing the same
>management of the file system cache.
The sentence above makes no sense to me (maybe due to my bad English ;).
It's the filesystem that manages the buffer cache directly while playing
with metadata or for writes.
>BTW This isn't related to read-only file serving, but Linus admits that
>mmap in 2.2 has a flaw where write-backs to a modified file result in two
>copies instead of 1. He says that this will probably be fixed in 2.3.x.
Not true. Only if the area was already in the page cache are you going to
do two copies. If you write to a file directly without reading it, you'll
copy the file _only_ to the buffer cache. And don't tell me that you would
like raw I/O (where you don't want to copy to the buffer cache). Caching
dirty buffers is strictly needed for any kind of usage in order to have
decent performance on _all_ hardware (... excluding ram-disks ;).
But you missed a bit in your favour (it seems to me you don't know the
details of the Linux VM so well, otherwise you would have just mentioned
this): the double copy also applies to normal writes, not only to writes to
mmapped regions. But again, such a double copy may save us time in other
places of the kernel, and while it's true that the write(2) may be slower
(as just said, it will be slower _only_ if there is a page cache for the
written data), the next read will be far faster than having to read
directly from the buffer cache. So if you write one time and then read two
times, I think you will have just paid back the cost of the double write.
Theoretically we could drop the page cache instead of doing the copy, but
since the userspace data was just in the L1/L2 cache, for sure it's better
to do the copy at write(2) time than after some time, according to me.
I don't know exactly the plans for replacing the copy to the page cache,
but I am not too worried about that. Just look at this:
andrea@laser:~$ procinfo
Linux 2.2.7 (root@laser) (gcc egcs-2.91.60) #60 [laser.(none)]
[WARNING: can't print all IRQs, please recompile me]
Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:        127768      121276        6492       25524       19744       80304
Swap:        72256           0       72256
Bootup: Mon May  3 23:49:12 1999    Load average: 0.00 0.00 0.05 1/38 2697
user  :       0:05:21.71   4.9%  page in :   104101  disk 1:    19654r    5844w
                                             ^^^^^^
nice  :       0:00:00.01   0.0%  page out:    18738
                                             ^^^^^
system:       0:05:00.32   4.6%  swap in :        1
[..]
See the two underlined lines. The above is the procinfo for the machine I
am running now, but it's not a server; now I'll look at how e-mind.com
looks:
andrea@penguin:~$ procinfo | grep page
user  :    1d  8:05:53.53   4.0%  page in : 72314934  disk 1: 11007497r 6712837w
                                            ^^^^^^^^
nice  :       0:44:20.53   0.1%  page out: 36293162
                                           ^^^^^^^^
Consider that e-mind.com does a backup daily to a local HD, and consider
that while doing the backup it never reads the backup file, so the write
will be only _to_ the buffer cache and we won't do the second copy to the
page cache, because it was not present in the first place.
>This implementation has 0 buffer copies and requires 1 system call to send
>an entire HTTP response. There is no manipulation of process address space,
Don't think that doing zero copy automatically means an improvement. If you
do a zero copy of 1 gigabyte of data, OK, but if you do a zero copy of 1
kbyte of data the issue is quite _much_ different. I did now a `netcat
localhost chargen >/dev/null' just to see how much zero copy would help a
huge transfer of data. I left it running for some time and here are the
numbers:
root@laser:/home/andrea# readprofile -m /System.old | sort -nr | head -20
  4808 total                        0.0084
   565 csum_partial                 3.6218
   402 tcp_do_sendmsg               0.2332
   364 sys_write                    1.2466
   272 system_call                  4.2500
   234 __generic_copy_from_user     3.6562
   144 sock_sendmsg                 0.8571
   141 tcp_recvmsg                  0.1034
   106 __release_sock               0.8548
   104 __strncpy_from_user          2.8889
    97 csum_partial_copy_generic    0.4409
    88 do_readv_writev              0.1897
    84 schedule                     0.0905
    72 do_bottom_half               0.4500
    71 ip_queue_xmit                0.0740
    70 add_timer                    0.1768
    68 __wake_up                    0.8500
    66 tcp_rcv_established          0.0437
    66 synchronize_irq              3.3000
    66 kmem_cache_alloc             0.1897
While it's true that one of the most frequented pieces of code is
copy_from_user, it's also true that most of the time is spent in
csum_partial and in other mixed overhead where the small and fast
copy_from_user gets hidden. Consider that I ran the test in the worst case
from a copy_from_user point of view: no network overhead (loopback), so no
irq flood, and no small writes.
Just look at what happens running a different network load, again on the
loopback device. I'll use lat_tcp (a TCP benchmark you can find in lmbench;
if I understood correctly it does only a ping-pong of packets, to measure
the latency of the TCP stack, so no contiguous stream of data). Again no
real network load, no disk access - still many, many conditions _against_ a
copy_from_user approach.
  4074 total                        0.0071
   325 tcp_recvmsg                  0.2383
   158 add_timer                    0.3990
   153 tcp_do_sendmsg               0.0887
   144 schedule                     0.1552
   144 ip_fw_check                  0.1390
   143 tcp_rcv_established          0.0946
   110 kmalloc                      0.2723
   105 kfree                        0.2283
   104 sys_write                    0.3562
    97 sys_read                     0.4181
    92 tcp_transmit_skb             0.0935
    90 __kfree_skb                  0.5357
    86 ip_queue_xmit                0.0896
    83 tcp_clean_rtx_queue          0.2767
    83 do_bottom_half               0.5188
    78 kmem_cache_alloc             0.2241
    74 __wake_up                    0.9250
    71 skb_clone                    0.4671
    70 kmem_cache_free              0.1733
    68 schedule_timeout             0.4857
    68 kfree_skbmem                 1.0625
    66 ip_local_deliver             0.1130
[..] after a lot
    26 synchronize_bh               0.3250
    26 loopback_xmit                0.1327
    25 __tcp_select_window          0.1420
    24 __generic_copy_from_user     0.3750
       ^^^^^^^^^^^^^^^^^^^^^^^^
    23 tcp_v4_send_check            0.2300
    22 ipfw_output_check            0.1719
    18 tcp_ack_saw_tstamp           0.0643
So definitely the zero copy in the TCP stack looks like a non-issue to me,
at least with a web load. An ftp load may be a bit different, but I still
think it would not be an issue, considering that you won't run on a
loopback but on a busy network card, and you'll "also" do some access to
disk etc. etc. All this without considering the overhead of the code to do
zero copy, and the fact that such code would increase the complexity of the
TCP stack a lot. And without considering that we could have bugs and go
slower in the interactive case (like the ping-pong of lmbench) because of
the overhead.
>Like I said, I'm sure that over time the Linux problems will be fixed, but
>my article was about the state of Linux *today*, not next year or the year
>after.
I instead think that the 0-copy in the TCP stack is a red herring too. And
even if it would be an obvious improvement, it's not obvious that we want
to increase the complexity of the code to achieve it.
Comments?
Andrea Arcangeli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Alan Cox (alan@lxorguk.ukuu.org.uk)
Tue, 4 May 1999 12:04:39 +0100 (BST)
Wrong
Why do people even bother playing along with him 8)
> addition to a file handle. The TCP stack sends the file data directly from
> the file system cache as a 0-copy send. The user buffers are also sent with
> the file data, and are not copied from user space, but locked into physical
> memory for the duration of the send.
BTW, I hope Solaris doesn't do that; there is a classic sendfile
machine-destroying attack where you use a lot of slow connections to jam a
machine up with locked-down pages. Fun for all the family.
Similarly most of his other arguments are based on highly theoretical views
of computing. One thing writing a real OS instead of writing about it teaches
people is that 99% of OS theory is complete and utter crud.
Zero copy is a good example. For many things zero copy actually reduces
performance, especially on SMP machines, due to the amount of
memory-handling work on the page locking.
That is why many OSes only do sendfile()-based zero copy.
Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Andrea Arcangeli (and...@e-mind.com)
Tue, 4 May 1999 15:08:01 +0200 (CEST)
On Tue, 4 May 1999, Alan Cox wrote:
>Similarly most of his other arguments are based on highly theoretical views
>of computing. One thing writing a real OS instead of writing about it teaches
>people is that 99% of OS theory is complete and utter crud.
Agreed!! ;))
Andrea Arcangeli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
Re: [OT] Comments to WinNT Mag !!
Alan Cox (alan@lxorguk.ukuu.org.uk)
Tue, 4 May 1999 17:25:17 +0100 (BST)
I've had several discussions with him. He avoids questions about points
that don't fit his pet theory of OS design, he criticises anything that
doesn't follow his theory, and he often doesn't seem to understand it -
e.g. he didn't understand the kernel lock in 2.0.x SMP.
I'm not sure he's so much pro-NT as pro his pet theory, and NT happens to
match it.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
From: "Mark H. Wood" <mw...@IUPUI.Edu> Subject: Re: Mark Russinovich's reponse Was: [OT] Comments to WinNT Mag !! (fwd) Date: 1999/05/04 Message-ID: <fa.jdosuhv.o0ad87@ifi.uio.no>#1/1 X-Deja-AN: 474056607 Original-Date: Tue, 4 May 1999 12:23:50 -0500 (EST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <Pine.LNX.4.05.9905041213300.16180-100000@mhw.ULib.IUPUI.Edu> References: <fa.ni9v06v.1jk0iot@ifi.uio.no> To: unlisted-recipients:; (no To-header on input) X-Sender: mw...@mhw.ULib.IUPUI.Edu Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu On Sun, 2 May 1999, Ingo Molnar wrote: > On Sun, 2 May 1999, Mark Russinovich wrote: > > Completion ports in NT require no polling and no linear searching - that, > > and their integration with the scheduler, is their entire reason for > > existence. [...] > > they require a thread to block on completion ports, or to poll the status > of the completion port. NT gives no way to asynchronously send completion > events to a _running_ thread. Ugh. I liked the VMS model here. When you queue an I/O request, one of the things you can attach to it is the address of a procedure. When the request completes, the kernel creates a temporary thread to execute the I/O rundown code, and part of that rundown is to call the procedure you supplied. Your procedure would typically move something from a wait queue to a work queue, or flip a bit in a bitmask, or link a buffer onto the free chain, or whatever it takes to indicate that your regular thread(s) should do whatever you want done when the I/O has completed. When you return, the rundown thread tidies up and destroys itself. (Of course, if you never return, or you try to do huge amounts of processing in your rundown procedure, your program won't work very well. Don't do that. Keep it short and simple.) -- Mark H. Wood, Lead System Programmer mw...@IUPUI.Edu Specializing in unusual perspectives for more than twenty years. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
kuz...@ms2.inr.ac.ru
Tue, 4 May 1999 20:37:25 +0400 (MSK DST)
Hello!
For God's sake, could someone explain to me what the difference is between
our sendfile() and a plain write() from an mmap()ed region?
The only difference which I see now is that sendfile() ALLOWS us to make
zero-copy NOT WORSE than a usual write(), right?
Resuming, sendfile() without zero-copy is pure cheating, if we added it to
API it means that we plan to implement zero copy one day. 8)
BTW is it really true that NT transmitfile() does zero copy?
I strongly suspect it does not.
Alexey
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Alan Cox (alan@lxorguk.ukuu.org.uk)
Tue, 4 May 1999 19:01:21 +0100 (BST)
You have to take the overhead of mapping the entire file, and of TLB
shootdowns while setting up the VM, with mmap but not with sendfile().
> Resuming, sendfile() without zero-copy is pure cheating,
> if we added it to API it means that we plan to implement zero copy
> one day. 8)
Yep
> BTW is it really true that NT transmitfile() does zero copy?
> I strongly suspect it does not.
NT5 beta claims to
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
David Miller (da...@twiddle.net)
Tue, 4 May 1999 15:45:40 -0700
   From: a...@lxorguk.ukuu.org.uk (Alan Cox)
   Date: Tue, 4 May 1999 19:01:21 +0100 (BST)
   > BTW is it really true that NT transmitfile() does zero copy?
   > I strongly suspect it does not.
   NT5 beta claims to
They can avoid the extraneous copy, but what they cannot do with most PC
networking cards is avoid touching the data, since most cards do not
provide a hardware checksumming facility.
Most of this would suggest that their existing architecture passes
mbuf-chain-like buffers to the networking drivers in NT, or some other kind
of scatter-gather-list-like scheme. This is the only way they could do
zero-copy without driver updates from all the networking card vendors.
Later,
David S. Miller
da...@redhat.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
Re: Mark Russinovich's response Was: [OT] Comments to WinNT Mag !! (fwd)
Richard Gooch (rgo...@atnf.csiro.au)
Wed, 5 May 1999 08:37:33 +1000
Mark H. Wood writes:
> On Sun, 2 May 1999, Ingo Molnar wrote:
> > On Sun, 2 May 1999, Mark Russinovich wrote:
> > > Completion ports in NT require no polling and no linear searching - that,
> > > and their integration with the scheduler, is their entire reason for
> > > existence. [...]
> >
> > they require a thread to block on completion ports, or to poll the status
> > of the completion port. NT gives no way to asynchronously send completion
> > events to a _running_ thread.
>
> Ugh. I liked the VMS model here. When you queue an I/O request,
> one of the things you can attach to it is the address of a
> procedure. When the request completes, the kernel creates a
> temporary thread to execute the I/O rundown code, and part of that
> rundown is to call the procedure you supplied. Your procedure would
> typically move something from a wait queue to a work queue, or flip
> a bit in a bitmask, or link a buffer onto the free chain, or
> whatever it takes to indicate that your regular thread(s) should do
> whatever you want done when the I/O has completed. When you return,
> the rundown thread tidies up and destroys itself. (Of course, if
> you never return, or you try to do huge amounts of processing in
> your rundown procedure, your program won't work very well. Don't do
> that. Keep it short and simple.)
What was the cost of creating the "temporary thread"? Anyway, we can do
much the same thing with signals, except we don't need to create a
temporary thread.
Regards,
Richard....
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
From: "Stephen C. Tweedie" <s...@redhat.com> Subject: Re: Mark Russinovich's reponse Was: [OT] Comments to WinNT Mag !! (fwd) Date: 1999/05/06 Message-ID: <fa.iphodiv.1322j1s@ifi.uio.no>#1/1 X-Deja-AN: 474905354 Original-Date: Thu, 6 May 1999 18:52:47 +0100 (BST) Sender: owner-linux-ker...@vger.rutgers.edu Original-Message-ID: <14129.55023.267354.584313@dukat.scot.redhat.com> Content-Transfer-Encoding: 7bit References: <fa.i1p5a4v.jlg7ik@ifi.uio.no> To: Richard Gooch <rgo...@atnf.csiro.au> Original-References: <Pine.LNX.3.96.990502130955.21826D-200...@chiara.csoma.elte.hu> <Pine.LNX.4.05.9905041213300.16180-100...@mhw.ULib.IUPUI.Edu> <199905042237.IAA08...@vindaloo.atnf.CSIRO.AU> Content-Type: text/plain; charset=us-ascii X-Orcpt: rfc822;linux-kernel-outgoing-dig Organization: Internet mailing list MIME-Version: 1.0 Newsgroups: fa.linux.kernel X-Loop: majord...@vger.rutgers.edu Hi, On Wed, 5 May 1999 08:37:33 +1000, Richard Gooch <rgo...@atnf.csiro.au> said: >> Ugh. I liked the VMS model here. When you queue an I/O request, >> one of the things you can attach to it is the address of a >> procedure. When the request completes, the kernel creates a >> temporary thread to execute the I/O rundown code, and part of that >> rundown is to call the procedure you supplied. > What was the cost of creating the "temporary thread"? There isn't one: the AST is scheduled in the context of the calling process/thread. AST delivery is an integral part of the scheduler, much like signal delivery is on Unix. > Anyway, we can do much the same thing with signals, except we don't > need to create a temporary thread. Yes --- ASTs are very similar to posix.4 queued signals. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/