The Big *BSD Interview

By Eugenia Loli-Queru
OSNews.com

October 8, 2001

Matt Dillon, not the famous actor but the FreeBSD kernel/VM hacker also well known for writing the Dice C compiler for the Amiga, is here with us today for an in-depth interview about everything regarding FreeBSD 5.0. This is the OS that all the techie people are waiting for and that is being presented as the most advanced, technically speaking, free OS of today. Additionally, we include two mini interviews with Theo de Raadt, the OpenBSD founder, and Jun-ichiro "itojun" Hagino from the NetBSD Core Team.

Matt Dillon, VM and kernel FreeBSD team

1. What goodies is BSD 5.0 going to bring us?

Matt Dillon: There are at least a dozen projects going on in parallel and we have ripped up and replaced a great deal of code all over the kernel. So much so that we recently decided to extend the 5.0 release date another year to give the many projects a chance to get out of development mode and into stabilization mode. Due to this a good chunk of -current's features are also being slowly MFC'd into -stable. Not KSEs or SMPng though. I think Julian's comment in his KSE commit message was something on the order of "X-MFC after: ha ha ha ha" :-)

KSEs and SMPng are the most visible projects (SMPng is being spearheaded by John Baldwin), but we have also done a great deal of work on the network stack, checksum offloading, network drivers (especially GigE drivers), devfs (which is now the default in -current), crypto-quality random number generation, scalability, and new machine ports. We are actively porting to IA64, PowerPC, and Sparc64, and even though the Alpha is dying away our already-operational (in -stable) Alpha port is being actively developed to ensure that our code remains 64-bit clean and to provide ground-breaking work for the other ports.

A good chunk of the work has already been MFC'd to -stable. For example, -stable can push 900+ MBits (120 MBytes/sec 'netstat -in 1') over a TCP connection on a DELL2550 with GigE using *normal* sized frames (mtu 1500). That's full saturation.
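As a rough cross-check of the "full saturation" figure, the sketch below estimates the theoretical ceiling for TCP payload throughput on gigabit Ethernet with a 1500-byte MTU. The function name is hypothetical, and the calculation assumes standard Ethernet framing overhead and IP and TCP headers without options; it says nothing about what 'netstat -in 1' actually counts.

```python
# Per-frame Ethernet overhead in bytes: preamble + start delimiter (8),
# header (14), frame check sequence (4), inter-frame gap (12).
ETH_OVERHEAD = 8 + 14 + 4 + 12

# IPv4 and TCP headers, assuming no options on either (an assumption).
IP_TCP_HEADERS = 20 + 20


def tcp_goodput_mbits(link_mbits, mtu):
    """Upper bound on TCP payload throughput for a given link speed and MTU."""
    payload = mtu - IP_TCP_HEADERS   # application bytes carried per frame
    on_wire = mtu + ETH_OVERHEAD     # bytes of link time consumed per frame
    return link_mbits * payload / on_wire


ceiling = tcp_goodput_mbits(1000, 1500)
print(f"{ceiling:.1f} Mbit/s, {ceiling / 8:.1f} MBytes/s")
```

Under these assumptions the ceiling comes out a little under 950 Mbit/s, so a measured rate above 900 Mbit/s leaves very little headroom on the link.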

A great deal of filesystem work has also been completed. Filesystem snapshot support (for UFS+Softupdates) is progressing nicely and may prove to be one of the most important new filesystem features in the 5.x system once it stabilizes. There are also two new native features in UFS that have stabilized and have in fact been MFC'd to -stable: dirpref and dirhash. dirpref comes directly from Grigoriy Orlov of OpenBSD and reworks the way directories are laid out on disk, resulting in huge directory and file stat/open/create/remove performance gains. Some filesystem operations have improved by over 60x (6000%), and many of the common ones have improved over 400%. dirhash is a very low-overhead in-kernel whole-directory hashing mechanism that radically improves the performance of directory operations.
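To illustrate why hashing a whole directory helps, here is a minimal sketch that contrasts a linear scan of directory entries with a lookup through a hash index built once over the directory. It is a toy model, not the FreeBSD dirhash code: the entry layout, the function names, and the use of a Python dict as the hash are all assumptions made for illustration.

```python
def linear_lookup(entries, name):
    """Scan entries in directory order; return (inode, names compared)."""
    for compared, (entry_name, inode) in enumerate(entries, start=1):
        if entry_name == name:
            return inode, compared
    return None, len(entries)


def build_index(entries):
    """Hash every name in the directory once, mapping name -> slot number."""
    return {entry_name: slot for slot, (entry_name, _) in enumerate(entries)}


def hashed_lookup(entries, index, name):
    """Find an entry through the index instead of scanning the directory."""
    slot = index.get(name)
    return None if slot is None else entries[slot][1]


# A hypothetical directory with 10,000 files.
entries = [(f"file{i:05d}", 1000 + i) for i in range(10000)]
index = build_index(entries)

inode, compared = linear_lookup(entries, "file09999")
print(inode, compared)                              # worst case: every name compared
print(hashed_lookup(entries, index, "file09999"))   # same inode, one hash probe
```

The linear scan grows with the size of the directory, while the indexed lookup stays roughly constant, which is the kind of gap that matters most for very large directories.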

There is much more. Some things, like dirhash and dirpref, can be easily MFC'd to -stable. Others, such as softupdates' snapshot support, probably won't be due to major dependencies on other current-only projects.

2. Which is the one feature that you would like most to add to the BSD kernel?

Matt Dillon: I would like to see a native process and device-level descriptor migration capability. This isn't a new idea (few ideas in computer architecture can ever be called 'new'), but it is an idea whose time has come. I would like to see an ability to migrate processes as well as their device-level state and couple that with external rerouting. So an I/O descriptor representing a TCP connection could be migrated entirely off the original machine, for example.

Process migration is a good basis to support Q.O.S. and maintenance issues on today's platforms. As computing hardware becomes more powerful and we run more services (and more connections, and more users) on any given box, the ability to migrate everything off a box in order to take it down for maintenance without users noticing that you are doing it has become the ultimate IT grail. The reason is simple: if you use a modern machine to its fullest potential and the system crashes, you are potentially interrupting thousands of users rather than simply dozens. The concept of the 'maintenance-window' introduces the same problem, even with load balancing and connection management and distribution technologies. The 'maintenance-window' concept is rapidly becoming unacceptable in today's full-on world.

In short, process migration would allow the open-source community to begin to provide Q.O.S. levels that only mainframes can provide today. And if you didn't hear me mention so-called 'clustering' solutions currently available from unnamed vendors, it's because they can't actually deliver these things -- not true Q.O.S. That's my opinion, anyway. Using a cluster to hide the fact that the underlying systems crash regularly is an extremely dangerous way to manage a computing environment.

I'm going to cheat a bit and also give you my #2 feature-wish: I want native filesystem replication. I don't care a whit about common server-based disk store: you don't get reliability or scalability that way. I want to see distributed (replicated, not partitioned) filesystems that are transactionally coherent, to go along with the process-migration of course :-)

3. Soft-Update seems to be one step further than Journaling, it is the "modern" way of doing journaling, and FreeBSD has that feature. However, do you have plans to add to the FreeBSD fs some of the features found on XFS or JFS systems?

Matt Dillon: I would not characterize soft-updates as being a step ahead of journaling. At least not meta-data journaling. It's just another way of doing things. Even though soft-updates can theoretically perform better than even meta-data journaling the plain fact of the matter is that linear disk bandwidth has at least 25x the throughput of a random seek/write. So journaling meta-data has a fairly small performance impact if you can asynchronize everything *else*. Softupdates works extremely well for UFS but the softupdates concept can break down with other filesystems - it could very well be impossible to implement softupdates-like operation on a filesystem which implements directories as BTrees or hashes, for example. On the other hand softupdates can commit meta-data operations out of order while still maintaining filesystem integrity, and it can do it in an infinitely fine-grained fashion which naturally leads to better parallelism. Journaled filesystems typically can't do that. So the usefulness of the theory depends heavily on what your goals are. For general purpose work both theories work equally well.

Most filesystem-specific 'super' features are highly specialized and not actually useful in the vast majority of system installations. XFS has data zoning features and (at least under IRIX) the ability to guarantee data stream latency and bandwidth. I can count the number of applications that actually need those features on one hand with a few fingers cut off. XFS's major advantage, as with all journaled filesystems, is instant crash recovery. All else being equal this is a journaled filesystem's biggest advantage for general purpose computing but, even so, supplying the proper options to newfs when creating a UFS filesystem can drop fsck times by an order of magnitude on large filesystems. People using UFS are not really at that much of a disadvantage. You can't provide any sort of Q.O.S. if you depend on crash recovery being fast. Q.O.S. means having redundant hardware at the very least. I can't comment on JFS, I've never used it.

All of the BSD camps make stability priority #1 and performance priority #2. Performance and fast crash recovery are completely irrelevant if the filesystem corrupts the data or causes a crash! This is especially true as HD capacities increase and filesystems become larger. I have never quite understood why the Linux community gets so revved up by the huge number of filesystems they support. As if the sheer number combine together to provide a more effective system! You don't get reliability, performance, and long term stability by playing with filesystems, you get it by choosing or focusing on one or two filesystems that deliver those characteristics. Depending on filesystem-specific 'super' features makes code non-portable and is not usually a good idea.

In any case, most BSD developers are happy with UFS. Oh, when I say UFS I really mean UFS+FFS or UFS+FFS+SOFTUPDATES. UFS is not the ancient creaking beast that some people have stereotyped it as. The basic theory and structure was sound and is still sound to this very day. Over the years we've fixed bugs (what few bugs we find), added capability support, better caching, reorganized the layout in a backwards-compatible fashion, re-introduced reblocking (basically on-the-fly defragmentation), softupdates, snapshot support, etc etc etc.

4. After the open source bubble burst recently, a lot of companies ceased support and stopped contributing code to both Linux and BSD. How has this affected BSD development?

Matt Dillon: It creates a short term disruption for the people involved in regards to their ability to contribute but I do not believe company layoffs will have any effect on the open-source movement itself or on Linux and BSD development in the long term. The biggest contributors to open-source are not staple employees of a company who are hired specifically to interact with the open-source community. They are people who have a real interest and love of open-source who happen to be working at a company in a leverageable position.

While there have been BSD-related layoffs, it's nothing that was unexpected and has had much less of an impact on us than I'm sure the huge number of linux-centric companies going bust has had on the Linux psyche. All I can say is: It ain't our (the open-source community) fault. Most of the linux-centric companies were leeching off the linux name, and those that weren't didn't fail because they were using Linux, they failed because they didn't have a business model with a chance in hell of (ever) going profitable. Open-source operates behind the scenes far more than it operates in the public eye, and it's hard to sell support to hackers who actually have *fun* trying to figure out a problem. In some respects Linux and the BSDs are poor commercialization candidates because they are *too* good... that they simply do not require the level of support that something like Windows-NT or Oracle might require in a back-office setting.

Open source has created far more disruption and change in commercial interests than the other way around. I think it has been for the better, though I'm sure many commercial entities (such as MS) aren't too happy about being forced to be more honest with their customers. (hmm... actually I think they still haven't learned, and look at the effect. MS has gotten its fingers burned so many times in their dirty war against open-source that even long-time commercial partners don't believe what they say any more!).

5. How do you feel about Linux getting most of the attention over the last couple of years and being able to move a bit faster into the desktop arena? Is the desktop market of any interest at all to the FreeBSD people?

Matt Dillon: I find it to be an interesting exercise in social engineering, economics, and psychology. Oh, you want to know what I *really* think?

I think the biggest winner here is open-source. A great deal of what people label as 'Linux' isn't actually Linux. It's open-source that compiles just as easily on FreeBSD (*without* linux emulation) as it does on Linux. Take GNOME and KDE for example. No linux emulation necessary there! The areas where FreeBSD has problems are almost entirely relegated to commercial binary-only distributions. Now, that said, Linux is certainly the largest driver of interest that leads to the development of many of these projects. I don't think we would have GNOME or KDE without Linux. As a driver of interest Linux has earned its place at the top of the heap.

In regards to the desktop... well, I'm not sure exactly what you are asking. Both Linux and FreeBSD are in the same boat there... the only way to drive desktop acceptance is to ship machines pre-installed with the OS (whatever OS) and preconfigured with a desktop so when you turn the thing on, you are ready to rock. The only way to do that is for the PC vendors to pre-install Linux (or FreeBSD, or whatever).

Other than that common issue, there really is no difference between FreeBSD and Linux in regards to the desktop. Oh, we could integrate the sound a little better and it would be nice to get a native OpenGL implementation working, but everything else is already there, because both platforms are running the same GUI software.

6. Please explain to us what SMPng (next-generation symmetric multi-processing) and KSE (kernel scheduler entities) are, which are features to be found on the BSD-5-Current.

Matt Dillon: SMPng is FreeBSD's fine-grained mutex, interrupt threading, and Giant-removal implementation. Potentially kernel pre-emption is also part of the equation but the jury is still out on that. The purpose is to be able to have several mainline processes and/or interrupts operating in kernel mode simultaneously. This is the primary scalability issue in any SMP system. The work being done here is roughly comparable to the SMP work being done in Linux. Linux is about a year ahead of us but both Linux and the BSDs have a great deal of work to do to catch up with Solaris.

KSE is a totally new (but old idea) way of implementing userland threads. The idea here is twofold: (1) to remove any requirement that userland code understand which system calls might block and which system calls might not block, and (2) to do all primary thread scheduling and switching in userland, where any given cpu can switch between threads with approximately the same overhead as a userland subroutine call.

With KSEs if a userland process makes a system call which blocks, the kernel will detach the kernel context (which is now blocked) and return directly to the user mode scheduler using an 'upcall'. The userland scheduler can then immediately switch to another thread. Another system call will be given a new, fresh, KSE to play with. The blocked kernel context runs completely asynchronously from the userland process until it finishes and can potentially run concurrently with other detached KSEs for the same process. When a KSE completes the kernel notifies the userland scheduler allowing the userland scheduler to reschedule the 'blocked' thread which is now 'returning' from the system call that originally blocked.
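The flow described above can be modelled in a few lines. The sketch below is a toy simulation, not FreeBSD code: userland threads are Python generators, a blocking system call is modelled as a fixed number of scheduler ticks, and the names `thread`, `run`, and `io_ticks` are hypothetical. It only shows the control flow: when one thread blocks, the scheduler immediately runs another, and the blocked thread resumes once its call completes.

```python
from collections import deque


def thread(name, script, log):
    """A userland thread; 'script' is a list of steps, each 'cpu' or 'io'."""
    for step in script:
        if step == "io":
            # A system call that blocks: control returns to the scheduler.
            yield "blocked"
            log.append(f"{name} returns from syscall")
        else:
            log.append(f"{name} computes")
            yield "ready"


def run(threads, io_ticks=2):
    """Userland scheduler loop; returns the number of ticks it ran for."""
    runnable = deque(threads)
    blocked = []  # (wake_tick, thread) pairs: detached, blocked kernel contexts
    tick = 0
    while runnable or blocked:
        # Completion notifications: finished calls make their threads runnable.
        for entry in [e for e in blocked if e[0] <= tick]:
            blocked.remove(entry)
            runnable.append(entry[1])
        if runnable:
            current = runnable.popleft()
            state = next(current, "done")
            if state == "blocked":
                # Models the 'upcall': park this context, run something else.
                blocked.append((tick + io_ticks, current))
            elif state == "ready":
                runnable.append(current)
        tick += 1
    return tick


log = []
run([thread("A", ["io", "cpu"], log), thread("B", ["cpu", "cpu"], log)])
print(log)
```

In this run thread B does all of its computing while A is still parked in its blocking call, and A picks up again only after the simulated call completes.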

The essential difference between KSEs and both select/kqueue-based threads and rfork based threads is that with KSEs you get all the parallelism of the SMP box and all the power of a userland-only context switch between threads (read: *very* fast switch times) without *any* of the kernel overhead. A program can literally be running thousands of threads with no significant kernel overhead. Only blocked system calls eat kernel resources. In addition to this, we can manage kernel resources in the face of thousands of threads by limiting the 'pool' of KSEs we assign to any given process or user or whatever. So if 500 of those 1000 threads block in a syscall we just get a little less cpu-efficient and don't blow out kernel memory.

Currently FreeBSD can use both select/kqueue and rfork (linux-style) threading. KSEs bring us to the next level.

7. From the technical point of view, how would you rate the Linux 2.4 kernel compared to BSD's?

Matt Dillon: I don't know enough about recent linux kernels to be able to rate them, nor would it be P.C. I do follow the VM work being done in Linux and in particular Rik van Riel's work. I think Linux is going through a somewhat painful transition as it moves away from a Wild-West/Darwinist development methodology into something a bit more thoughtful. I will admit to wanting to take a clue-bat to some of the people arguing against Rik's VM work who simply do not understand the difference between optimizing a few nanoseconds out of a routine that is rarely called versus spending a few extra cpu cycles to choose the best pages to recycle in order to avoid disk I/O that would cost tens of millions of cpu cycles later on. It is an attitude I had when I was maybe 16 years old... that every clock cycle matters no matter how it's spent. Bull!

8. How is the "relationship" between the FreeBSD programmers and the OpenBSD/NetBSD ones? Do you share code and opinions and chat regularly? Or are all these BSD projects completely independent of each other?

Matt Dillon: The BSD groups are like high school social circles. No, really! That's the best analogy I can think of! Many developers focus on just their little clique but a good chunk run in multiple circles. There are developers that maintain the same driver code across several BSD distributions. There are developers who focus their work in one BSD distribution but have ties to developers in others. If the work is interesting enough, such as the 'dirpref' work, developers that focus on coding in other BSD distributions will pick up the patch set and bring it in. That is how FreeBSD got the dirpref code. Kirk imported it from OpenBSD into FreeBSD-current and I MFC'd it to -stable after it had been proven out in -current.

In many respects this development methodology gives us the best of both worlds. Developers are free to focus on the distribution they are most familiar with and if the work is interesting enough it gets several eyes from the other distributions who not only port the code in, but also review it. Testing can wind up occurring in all the distributions simultaneously and with something like 'dirpref', if someone finds a bug it will almost certainly wind up being fixed in the other distributions within a few days. Security bugs are independently verified but often the fix is common to all the BSDs and no duplicate work need occur. There is constant borrowing going on between the BSDs and even between BSD and Linux, especially in regards to driver code.

9. What is your opinion on .NET, and do you think it is possible that .NET will change the OS "map" as we know it?

Matt Dillon: I believe .NET is Vapor. It's a marketing term dreamed up by Microsoft that will magically morph into whatever Microsoft eventually winds up delivering. MS announces grandiose ideas with cute catch phrases all the time, and as with any good vapor there is always some basis in truth (if only a little pinprick). The reality is a little different though... remember, these are the people that hyped windows-ME up the wazoo and all we got out of it was a speech-synthesized windows installation wizard! These are the people that called NT the unix-killer and told people it was as reliable as UNIX. .NOT is probably a more descriptive term for .NET. My guess is that it will turn into Microsoft-proprietary rent-a-service glue, and that it will introduce an order of magnitude more security issues than IIS.

10. Some say that FreeBSD has the best VM ever [ http://www.daemonnews.org/200001/freebsd_vm.html ], when compared to any other Operating System. Do you think that there is still space for improvement and are there still features to be added?

Matt Dillon: I think we made great progress stabilizing the VM system and working out performance issues related to machine scaling in the -4.x series of FreeBSD releases. The machines have proven to be great workhorses in a wide range of applications and are able to provide the long term stability and performance required by their users. Generally speaking, the technology behind the VM system is quite sound and does not need much more in the way of improvement. Obviously in -5.x we will be multi-threading pieces of it for SMPng, but the core algorithms appear to extend cleanly to MP and 64 bit platforms and we do not expect to have to make any fundamental changes. There is always room for improvement, of course! While we are likely to stand pat with the VM core in early 5.x releases, there is a great deal of work planned to improve the I/O and buffer cache subsystems a little later on. My personal goal is to eventually remove the buffer cache entirely or at least morph it into nothing more complex than an I/O staging subsystem.

Itojun, NetBSD Core Team

1. How is the "relationship" between the NetBSD programmers and the OpenBSD/FreeBSD ones? Do you share code and opinions and chat regularly? Or are all these BSD projects completely independent of each other?

Itojun: Yes we do chat with each other and share code/opinions. Some of the developers do have commit access (can modify source code tree) for multiple BSDs.

2. Do you incorporate code into NetBSD from OpenBSD or FreeBSD when important changes are made to these OSes?

Itojun: Yes, but depending on the characteristics of the changes. If it is a one-line change for a security issue, we'd integrate it right away. If it is a big feature addition, we review it carefully and sometimes do integrate the changes, sometimes do not (we get similar changes from others, we implement it ourselves, or integrate it with a lot of improvements).

3. What goodies is the next version of NetBSD scheduled to bring us?

Itojun: SMP (for multiple platforms!) and fine-grained thread support are the biggest targets we are attacking. More platform support, of course.

4. NetBSD's goal is to port the OS to as many platforms as it can. Which platforms does NetBSD still need to be ported to, and is it a priority to do so?

Itojun: Sony PlayStation2 (port exists, needs integration).

Theo de Raadt, OpenBSD Founder

1. How is the "relationship" between the OpenBSD programmers and the FreeBSD/NetBSD ones? Do you share code and opinions and chat regularly? Or are all these BSD projects completely independent of each other?

Theo de Raadt: There are no formal relationships of any kind. That said, since it is a free world, there are numerous developers who do talk to their counterparts in the other group. Even when that does not happen, public mailing lists and the mainstay product of our projects -- source code -- is completely visible. What more could one want?

2. Do you incorporate code into OpenBSD from NetBSD or FreeBSD when important changes are made to these OSes?

Theo de Raadt: Sure, why wouldn't we?

3. What goodies is the next version of OpenBSD scheduled to bring us?

Theo de Raadt: First off, I should reiterate what I have been saying for 5 years: OpenBSD development is not revolutionary, but evolutionary. That means that between one release and another, not a lot of big things happen, but instead we should view it as a series of about 10,000 - 20,000 small changes. Over a series of OpenBSD releases, this amounts to a very big deal. Any release from 2 years back feels very different from the current codebase we have, but actually labelling the big changes between two consecutive releases is very difficult. Thousands of these changes are bug fixes, minor conformance improvements... things which I would argue matter MUCH MORE than "new features".

That said, this next release has one big thing that people are waiting to try out: We have written a whole new packet filter / nat engine, and fully integrated it into the system. People who are used to ipf will find that pf is much like ipf, but has some improvements which we have always wanted to make (and which the old ipf license had blocked us from doing).

The alpha port has been significantly improved to support many of the higher end models (kind of funny considering the entire platform is now end of lifed...), and we will be releasing our first ultrasparc beta.

Other than that and the thousands of little fixes and improvements everywhere, and probably a bunch of other things I have already forgotten,

4. OpenBSD's goal is to bring ultimate security to a server. By patching the holes and only accepting proven software, do you think that you slow your development down, keeping you from implementing something new at the OS level and releasing it quickly?

Theo de Raadt: No, I think it does not affect our release schedule or development process.