Design Notes on Asynchronous I/O (aio) for Linux
-----------------------------------------------

Date: 1/30/2002

- Based on Benjamin LaHaise's in kernel aio patches 
  http://www.kernel.org/pub/linux/kernel/people/bcrl/aio
  http://www.kvack.org/~blah/aio
  http://people.redhat.com/bcrl/aio
  (Last version up at the point of writing this: aio-0.3.7)
- Refers to some other earlier aio implementations, interfaces on some
  other operating systems, and DAFS specifications for context/comparison.
- Includes several inputs and/or review comments from 
	Ben LaHaise (overall)
	Andi Kleen (experiences from an alternative aio design)
	Al Viro (i/o multiplexor namespace)
	John Myers (concurrency control, prioritized event delivery)

  but any errors/inaccuracies are mine, and feedback or further inputs 
  are more than welcome.

  Regards 
  Suparna Bhattacharya (suparna@in.ibm.com)
  Linux Technology Centre
  IBM Software Labs, India

Contents:
----------
1. Motivation
  1.1 Where aio could be used
  1.2 Things that aio helps with
  1.3 Alternatives to aio
2. Design Philosophy and Interface Design
  2.1 System and interface design philosophy
  2.2 Approaches for implementing aio
  2.3 Extent of true async behaviour - queue depth/throttle points
  2.4 Sequencing of aio operations
  2.5 Completion/readiness notification
  2.6 Wakeup policies for event notification
  2.7 Other goals
3. Interfaces
  3.1 Interfaces provided by this implementation
  3.2 Extending the interfaces
4. Design Internals
  4.1 Low level primitives
  4.2 Generic async event handling pieces
  4.3 In-kernel interfaces
  4.4 Async poll
  4.5 Raw disk aio
  4.6 Filesystem/buffered aio
  4.7 [Placeholder for network aio]
  4.8 Extending aio to other operations (todo)
5. Placeholder for performance characteristics
6. Todo Items/Pending Issues

------------------------------------------------------------------

1. Motivation

Asynchronous i/o  overlaps application processing with i/o operations
for improved utilization of CPU and devices, and improved application 
performance, in a dynamic/adaptive manner, especially under high loads 
involving large numbers of i/o operations.

1.1 Where aio could be used:

Application performance and scalable connection management:
(a) Communications aio:
  Web Servers, Proxy servers, LDAP servers, X-server
(b) Disk/File aio:
  Databases, I/O intensive applications
(c) Combination
  Streaming content servers (video/audio/web/ftp)
  (transferring/serving data/files directly between disk and network)

Note:
The POSIX spec has examples of using aio in a journalization model, a data 
acquisition model and in supercomputing applications. It mentions that
supercomputing and database architectures may often have specialized h/w
that can provide true asynchrony underlying the logical aio interface. 
Aio enables an application to keep a device busy (e.g. raw i/o), potentially 
improving throughput. While maximum gains are likely to be for unbuffered 
i/o case, aio should be supported by all types of files and devices in 
the same standard manner.

Besides, being able to do things like hands off zero-copy async sendfile 
can be quite useful for streaming content servers.

1.2 Things that aio helps with:

- Ability for a thread to initiate operations or trigger actions  
  without having to wait for them to complete. 
- Ability to queue up batches of operations and later issue a single wait
  for completion of any of the operations or at least a certain number of
  operations (Note: Currently only "at least one" is supported).
- Multiplexing large numbers of connections or input sources in a scalable
  manner, typically into an event driven service model.
  [This can significantly reduce the cost of idle connections, which 
   could be important in protocols like IMAP or IRC for example where 
   connections may be idle for most of the time]
- Flexible/dynamic concurrency control tuning and load balancing.
- Performance implications
  (a) Application thread gets to utilize its CPU time better 
  (b) Avoids overhead of extra threads (8KB per kernel thread in linux)
  (c) System throughput helped by reducing context switches (since each wait
  makes a thread run for less than its time-slice)
- Ability to perform true zero-copy network i/o on arbitrary user buffers
  Currently sendfile or an in-kernel server is the only clean way to use the
  zero-copy networking features of 2.4. The async i/o api would enable
  extending this to arbitrary user buffers.
  [Note: Standard O_NONBLOCK doesn't help as the API doesn't take the buffer
  away from the user. As a result the kernel can avoid a copy only if it
  MMU write protects the buffer and relies on COW to avoid overwrites while 
  the buffer is in use. This would be rather expensive due to the TLB
  flushing requirements, especially as it involves IPIs on SMP. ]

1.2.1 Other expected features (aka POSIX):
- Support for synchronous polling as well as asynchronous notification 
  (signals/callbacks) of completion status, with ability to correlate 
  event(s) with the i/o request(s).
  [Note: Right now async notification is not performed by the core kernel aio
  implementation, but delivered via glibc userspace threads which wait for 
  events and then signal the application. 
  TBD: There are some suggestions for a direct signal delivery mechanism from 
  the kernel for aio requests to avoid the pthreads overhead for some users 
  of POSIX aio which use SIGEV_SIGNAL and do not link with the pthreads 
  library. Possibly a SIGEV_EVENT opcode could be introduced to make the
  native API closer to a POSIX extension.
  ]
- Allow multiple outstanding aio's to the same open instance and to multiple 
  open instances (sequence might be affected by synchronized data integrity
  requirements or priorities) 
  [Note: Currently there are firmer guarantees on ordering for sockets by the 
  in-kernel aio, while for file/disk aio barrier operations may need 
  to be added in the future]
- Option to wait for notification of aio and non-aio events through a single
  interface
  [TBD: Ties in with Ben's recent idea of implementing userland wait queues ]
- Support for cancellation of outstanding i/o requests
  [Note: Not implemented as yet but in plan (just done for aio_poll, others
  to follow). Cancellation can by its very nature only be best effort]
- Specification of relative priorities of aio requests (optional)
  [Note: Not implemented as yet. Should be linked to the new priority based 
   disk i/o scheduler when that happens]

1.2.2 Also Desirable:
- Ability to drive certain sequences of related async operations/transfers 
  in one shot from an application e.g. zero-copy async transfers across 
  devices (zero-copy sendfile aio)


1.3 Alternatives to aio 

1. Using more threads (has its costs) 
	- static committed resource overhead per thread
	- potentially more context switches
2. Communications aio alternatives - /dev/*poll 
	- specialized device node based interface for registration and
	  notifications of events
	- suitable for readiness notification on sockets, but not for 
	  driving i/o.
3. Real-time signals 
	- only a notification mechanism 
	- requires fcntl (F_SETSIG) for edge triggered readiness notification 
	  enablement or aio interfaces (aio_sigevent settings: SIGEV_SIGNAL) 
	  for completion notification enablement through RT signals.
	- the mechanism has potential overflow issues (when signal queue 
	  limits are hit) where signals could get lost, especially with 
	  fasync route (which tends to generate a signal for every event 
	  rather than aggregate for an fd) and needs to be supplemented with 
	  some other form of polling over the sigtimedwait interface.
	  The only way to tune the queue lengths is via sysctl.
	- relatively heavy when it comes to large numbers of events
	  (btw, signal delivery with signal handlers is costly and not very 
	  suitable for this purpose because of the complications of locking 
	  against them in user space; so the sigtimedwait sort of
	  interface is preferable) 
	- there are some other problems with flexibility in setting the 
	  recipient of the signal via F_SETOWN (per process queues) which
	  hinders concurrency.
 [Question to Ponder: More efficient implementation and extensions to RT signal
 interfaces, or have a different interface altogether ? ]

 Please refer to www.kegel.com/c10k.html for a far more detailed coverage of
 these mechanisms, and how they can be used by applications.
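
 To make the one-signal-per-event cost concrete, here is a minimal sketch
 (not part of the aio patches) that queues a few RT signals to the calling
 process and reaps them with sigtimedwait(). It uses only standard POSIX
 calls; the name rtsig_reap_demo and the choice of SIGRTMIN as the event
 signal are assumptions made for this illustration, and a single-threaded
 caller is assumed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/*
 * Queue 'nevents' RT signals to ourselves (one per simulated event) and
 * reap them synchronously with sigtimedwait().  Returns the number of
 * signals reaped, or -1 on error.  The sum of the payloads goes to *sum.
 * (Hypothetical helper, assumes a single-threaded process.)
 */
int rtsig_reap_demo(int nevents, long *sum)
{
	sigset_t set;
	siginfo_t info;
	struct timespec zero = { 0, 0 };	/* poll, do not block */
	int i, reaped = 0;

	assert(nevents >= 0 && sum != NULL);
	*sum = 0;

	/* Block the signal so it stays queued instead of being delivered. */
	sigemptyset(&set);
	sigaddset(&set, SIGRTMIN);
	if (sigprocmask(SIG_BLOCK, &set, NULL) < 0)
		return -1;

	for (i = 0; i < nevents; i++) {
		union sigval v;

		v.sival_int = i + 1;
		/* sigqueue() fails with EAGAIN once the queue limit is hit. */
		if (sigqueue(getpid(), SIGRTMIN, v) < 0)
			return -1;
	}

	/* Each call returns a single signal, so N events need N calls. */
	while (sigtimedwait(&set, &info, &zero) == SIGRTMIN) {
		*sum += info.si_value.sival_int;
		reaped++;
	}
	return reaped;
}
```

 Every event costs one sigtimedwait() call here, which is the overhead that
 an interface returning an array of events is meant to avoid.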

Reasons for preferring aio:
- Desirable to have a unified approach, rather than multiple isolated 
  mechanisms if it can be done efficiently 
- Multiplexing across different kinds of operations and sources
- Clear cut well-known system call interface preferable to more indirect
  interfaces
- Driving optimizations from low level/core primitives can be more efficient 
  and beneficial across multiple subsystems 
- Can separate the core event completion queue and notification mechanisms for
  flexibility and efficiency. (Can have tunable wakeup semantics, tunable 
  queue lengths, more efficient event ring buffer implementation)
  Note: There are synchronization concerns when the two are not
  unified from a caller's perspective though, so the interfaces need to
  be designed with that in mind.


2. Design Philosophy and Interface Design

2.1 System and Interface design philosophy:
Alternatives:
a. Entire system built on an asynchronous model, all the way through
  (e.g NT i/o subsystem). So most operations can be invoked in sync or async
  mode (sub-options of the same operation specific interface). 
  Internally, the sync mode = async mode + wait for completion.
b. Async operations are initiated through a separate interface, and could
  follow a separate path from the synchronous operations, to a degree
  (use common code, and low down things may be truly async and common
  for both, but at the higher level the paths could be different)
 
The POSIX aio interface is aligned with (b). This is the approach that the
Linux implementation takes. Submission of all async i/o ops happens 
through a single call with different command options, and data used for 
representing different operations.
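
A minimal sketch of this style of interface follows. It is not the actual
submission call from the patches: the demo_* names, the opcode enum and the
in-memory "device" are all assumptions made for illustration. It shows one
submission entry point selecting the operation by opcode, and a synchronous
wrapper built as "submit + wait for completion".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum demo_op { DEMO_READ, DEMO_WRITE };	/* hypothetical opcodes */

struct demo_cmd {
	enum demo_op op;	/* which operation to perform */
	char *buf;		/* caller's buffer */
	size_t len;		/* bytes to transfer */
	size_t off;		/* offset into the backing store */
	int done;		/* set once the command has completed */
	long result;		/* bytes transferred, or -1 on error */
};

#define DEMO_MAXQ 8
static char demo_store[64];		/* stands in for a file or device */
static struct demo_cmd *demo_queue[DEMO_MAXQ];
static int demo_queued;

/* Single entry point for all operations: queue the command and return. */
int demo_submit(struct demo_cmd *cmd)
{
	assert(cmd != NULL);
	if (demo_queued == DEMO_MAXQ)
		return -1;
	cmd->done = 0;
	demo_queue[demo_queued++] = cmd;
	return 0;
}

/* Stand-in for the device making progress: complete the oldest command. */
static void demo_complete_one(void)
{
	struct demo_cmd *cmd = demo_queue[0];

	demo_queued--;
	memmove(&demo_queue[0], &demo_queue[1],
		(size_t)demo_queued * sizeof(demo_queue[0]));
	if (cmd->off > sizeof(demo_store) ||
	    cmd->len > sizeof(demo_store) - cmd->off) {
		cmd->result = -1;
	} else {
		if (cmd->op == DEMO_READ)
			memcpy(cmd->buf, demo_store + cmd->off, cmd->len);
		else
			memcpy(demo_store + cmd->off, cmd->buf, cmd->len);
		cmd->result = (long)cmd->len;
	}
	cmd->done = 1;
}

/* Wait for one command; a real waiter would sleep instead of driving i/o. */
void demo_wait(struct demo_cmd *cmd)
{
	while (!cmd->done && demo_queued > 0)
		demo_complete_one();
}

/* The synchronous form is just async submission followed by a wait. */
long demo_sync_write(const char *buf, size_t len, size_t off)
{
	struct demo_cmd cmd = { DEMO_WRITE, (char *)buf, len, off, 0, 0 };

	if (demo_submit(&cmd) < 0)
		return -1;
	demo_wait(&cmd);
	return cmd.result;
}
```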

Advantages:
- No change in existing sync interfaces (can't afford to do that anyway)
- Less impact on existing sync i/o path. This code does not have the overhead
  of maintaining async state (can use the stack), and can stay simple.

Disadvantages:
- Need to introduce interfaces or cmd structures for each operation 
  that can be async. (A little akin to an ioctl style approach)
- Different code paths implies some amount of duplication/maintenance 
  concerns. Can be minimized by using as much common code as possible.

2.2 Approaches for implementing aio

2.2.1 Alternative ways of driving the operation to completion 

1. Using threads to make things _look_ async to the application
  a. User level threads 
	- glibc approach (one user thread per operation ?)
	  poor scalability, performance
  b. Pool of threads 
	- have a pool of threads servicing an aio request queue for the
	  task - tradeoff between degree of concurrency/utilization and 
	  resource consumption.
2. Hybrid approach (SGI kaio uses this)
  - If the underlying operation is async in nature, initiate it right away
    (better utilization of underlying device), and just handle waiting for 
    completion via thread pool (could become a serialization point depending
    on load and number of threads) unless operation completes right away in a 
    non-blocking manner.
  - If underlying operation is sync, then initiate it via the thread pool
  Note:
  - SGI kaio has internal async i/o initiation interfaces for raw i/o and
    generic read. 
  - SGI kaio has these slave threads in the context of the aio task => at
    least one per task 
  - SGI kaio slave threads perform a blocking wait for the operation 
    just dequeued to complete before checking for completion of the next 
    operation in the queue => number of slave threads determines the degree of
    asynchrony. 
3. Implement a true async state machine for each type of aio operation.
   (i.e a sequence of non-blocking steps, continuation driven by IRQ and event 
   threads, based on low level primitives designed for this purpose)
  - Relatively harder to get right, and harder to debug, but provides 
    more flexibility, and greater asynchrony

This aio implementation takes approach 3 (with some caveats, as we shall see
later).
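
The shape of approach 3 can be sketched as below. This is only an
illustration under assumed names (sm_req, sm_step, sm_drive) and assumed
states; the real patches drive their steps from IRQ context and event
threads rather than from a loop like sm_drive(). The point is that each
step does a bounded amount of work and returns, so requests interleave
instead of one request blocking the others.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stages of one aio request. */
enum sm_state { SM_START, SM_MAPPED, SM_ISSUED, SM_DONE };

struct sm_req {
	enum sm_state state;
	int steps_run;		/* number of steps executed for this request */
};

/*
 * Run one non-blocking step.  Returns 1 if the request needs another
 * step, 0 once it has completed.  A real step would return early and be
 * re-driven later instead of sleeping.
 */
int sm_step(struct sm_req *req)
{
	assert(req != NULL);
	req->steps_run++;
	switch (req->state) {
	case SM_START:			/* e.g. block mapping looked up */
		req->state = SM_MAPPED;
		return 1;
	case SM_MAPPED:			/* e.g. i/o handed to the device */
		req->state = SM_ISSUED;
		return 1;
	case SM_ISSUED:			/* e.g. device reported completion */
		req->state = SM_DONE;
		return 0;
	case SM_DONE:
		break;
	}
	return 0;
}

/*
 * Stand-in for the event thread: give every unfinished request one step
 * per pass.  Returns the number of passes made over the request array.
 */
int sm_drive(struct sm_req *reqs, size_t n)
{
	int passes = 0, pending;
	size_t i;

	do {
		pending = 0;
		for (i = 0; i < n; i++)
			if (reqs[i].state != SM_DONE && sm_step(&reqs[i]))
				pending = 1;
		passes++;
	} while (pending);
	return passes;
}
```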

Andi Kleen had experimented with a new raw i/o device which would be
truly async from the application's perspective until it had to block waiting 
for request queue slots. Instead of using a thread for completion as in
approach 2 above, it sent RT signals to the application to signal i/o 
completion. The experience indicated that RT signals didn't seem to be very
suitable and synchronization was rather complicated. There also were
problems with flow control with the elevator (application blocking on
request queue slots and plugging issues).

A paper from Univ of Wisconsin-Madison talks about a block async i/o 
implementation called BAIO. This scheme uses one slave thread per task
similar to the SGI kaio approach, but in this case the BAIO service thread
checks for completion in a non-blocking manner (it gets notified of i/o
completion by the device driver) and in turn notifies the application. 
BAIO does not have to deal with synchronous underlying operations 
(doesn't access filesystems, as it only intends to expose a low level 
disk access mechanism enabling customized user level filesystems), and 
hence its async state machine is simple.

2.2.1.1 Optimization/Fast-path for non-blocking case

In case an operation can complete in a non-blocking manner via the 
normal path, the additional async state path can be avoided. An F_ATOMIC 
flag check has been introduced down the sync i/o path to check for this, 
thus providing a fast path for aio. This idea comes from TUX.
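
The idea can be sketched as follows. This is not the F_ATOMIC code from the
patches; the fp_* names and the "cached data" stand-in for a non-blocking
completion are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct fp_file {
	const char *cached;	/* data already in memory, or NULL */
	size_t cached_len;
};

enum fp_path { FP_FAST, FP_ASYNC };

/* Non-blocking attempt: succeeds only if no waiting would be needed. */
static long fp_try_read(const struct fp_file *f, char *buf, size_t len)
{
	if (f->cached == NULL)
		return -EAGAIN;		/* would have to wait for the device */
	if (len > f->cached_len)
		len = f->cached_len;
	memcpy(buf, f->cached, len);
	return (long)len;
}

/*
 * Submit a read.  If the normal path can finish without blocking, the
 * result is available at once and no async state is set up; otherwise
 * the request would be handed to the async state machine (not shown).
 */
enum fp_path fp_submit_read(const struct fp_file *f, char *buf, size_t len,
			    long *result)
{
	long ret;

	assert(f != NULL && buf != NULL && result != NULL);
	ret = fp_try_read(f, buf, len);
	if (ret != -EAGAIN) {
		*result = ret;
		return FP_FAST;
	}
	*result = 0;			/* completion is reported later */
	return FP_ASYNC;
}
```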

2.2.2 Handling User Space Data Transfer

With asynchronous i/o, steps of the operation aren't guaranteed to execute
in the caller's context. Hence transfers/copies to/from user space need to be
handled carefully. Most of this discussion is relevant for buffered i/o,
since direct i/o avoids user/kernel space data copies.

In a thread pool approach, if a per-task thread pool is used, then such 
transfers can happen in the context of one of these threads. Typically 
the copy_to_user operations required to read transferred data into user 
space buffers after i/o completion would be handled by these aio threads.
Both SGI kaio and BAIO rely on per-task service threads for this purpose.

It may be possible to pass down all the user space data for the operation
when initiating i/o while in the caller's context without blocking, though
this is inherently likely to use extra kernel space memory. The same is
true on the way up on i/o completion, where it may be possible to continue
holding on to the in-kernel buffers until the caller actually gathers 
completion data, so that copy into user space can happen in the caller's
context. However this again holds up additional memory resources which may
not be suitable especially for large data transfers.
[BTW, on windows NT, iirc this sort of stuff happens through APCs or 
asynchronous procedure calls, in a very crude sense somewhat like softirqs 
running in the context of a specified task]

Instead, an approach similar to that taken with direct i/o has been adopted,
where the user space buffers are represented in terms of physical memory
descriptors (a list of tuples of the form <page, offset, len>), called kvecs,
rather than by virtual address, so that they are uniformly accessible in any 
process context. This required new in-kernel *kvec* interfaces which operate on 
this form of i/o currency or memory descriptors. Each entry/tuple in the kvec is
called a kveclet, and represents a contiguous area of physical memory. A
virtual address range or iovec (in the case of readv/writev) would map to a set
of such tuples which makes up a kvec. 
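
As a rough illustration of how a virtual address range breaks up into
<page, offset, len> tuples, consider the sketch below. It is not the kvec
code from the patches: the struct and function names are hypothetical, a
4096 byte page size is assumed, and a page number stands in for the pinned
page pointer that a real kveclet would carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KV_PAGE_SIZE 4096u	/* assumed page size for this sketch */

struct kveclet_demo {
	uintptr_t page;		/* page number (stand-in for a page pointer) */
	unsigned offset;	/* byte offset within the page */
	unsigned len;		/* bytes covered within the page */
};

/*
 * Split the range [addr, addr + len) into one tuple per page touched.
 * Returns the number of tuples written, or -1 if 'max' is too small.
 */
int kvec_map_demo(uintptr_t addr, size_t len, struct kveclet_demo *vec,
		  int max)
{
	int n = 0;

	assert(vec != NULL && max >= 0);
	while (len > 0) {
		unsigned off = (unsigned)(addr % KV_PAGE_SIZE);
		size_t chunk = KV_PAGE_SIZE - off;	/* room left in page */

		if (chunk > len)
			chunk = len;
		if (n == max)
			return -1;
		vec[n].page = addr / KV_PAGE_SIZE;
		vec[n].offset = off;
		vec[n].len = (unsigned)chunk;
		n++;
		addr += chunk;
		len -= chunk;
	}
	return n;
}
```

Only the first and last tuples can be partial pages; everything in between
covers a full page starting at offset 0.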

Note: 
-----
  This fits in very nicely with the current multi-page bio implementation
  which also uses a similar vector representation, and also with the 
  zero-copy network code implementation. Ben has submitted some patches to 
  make this all a common data structure. 
  TBD: Some simple changes are needed in the multi-page bio code to get this      
  to work properly without requiring a copy of the descriptors.
  There is a discussion on various alternative representations that have 
  been considered in the past in sec 1.2.2 of:
	http://lse.sourceforge.net/io/bionotes.txt

The only possible drawback is that this approach does keep the user pages 
pinned in memory all the while until the i/o completes. However, it neatly
avoids the per-task service thread requirement of other aio implementations.

2.3 Extent of true async behaviour - Queue depth/Throttle points

There has been some discussion about the extent to which asynchronous 
behaviour should be supported in case the operation has to wait for some 
resource to become available (typically memory, or request queue slots). 
There obviously has to be some kind of throttling of requests by the system
beyond which it cannot take in any more asynchronous io for processing.
In such cases, it should return an error (as it does for non-blocking i/o)
indicating temporary resource unavailability (-EAGAIN), rather than block
waiting for resource (or could there be value in the latter option ?).
It seems appropriate for these bounds to be determined by the aio queue depth 
and associated resource limits, rather than by other system resources (though
the allowable queue depth could be related to general resource availability).
This would mean that ideally, when one initiates an async i/o
operation, the operation gets queued without blocking anywhere, or returns
an error in the event it hits the aio resource limits.
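
The intended behaviour at the throttle point can be sketched as below. The
aioq_* names are hypothetical and the sketch only counts requests; it is
meant to show the "fail with -EAGAIN rather than block" rule, not how the
patches account for their resources.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct aioq_demo {
	int depth;	/* maximum number of requests allowed in flight */
	int inflight;	/* requests submitted and not yet completed */
};

/* Queue a request, or fail at once if the queue depth is exhausted. */
int aioq_submit(struct aioq_demo *q)
{
	assert(q != NULL);
	if (q->inflight >= q->depth)
		return -EAGAIN;		/* caller retries later, never sleeps */
	q->inflight++;
	return 0;
}

/* A completion frees up room for one more submission. */
void aioq_complete(struct aioq_demo *q)
{
	assert(q != NULL && q->inflight > 0);
	q->inflight--;
}
```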

[Note/TBD: This is the intended direction, but this aspect of the code is
still under construction and is not complete. Currently async raw aio would 
probably block if it needs to wait for request queue slots. Async file i/o 
attempts to avoid blocking the app due to sub i/os for bmap kind of operations 
but it currently could block waiting for the inode semaphore. The long term 
direction is to convert this wait to an async state driven mechanism. The 
async state machine also has to be extended to the waits for bmap operations 
which has so far only been pushed out of the app's context to that of the 
event thread that drives the next step of the state machine (which means 
that it could block keventd temporarily).]


2.4 Sequencing of aio operations

Specifying serialization restrictions or relative priorities:
- posix_synchronized_io (for multiple requests to the same fd)
  says that reads should see data written by requests preceding it - enforces
  ordering to that extent, if specified.
- aio_req_prio (not supported in the current implementation as yet)
  app can indicate some requests are lower priority than others, so the system
  can optimize system throughput and latency of other requests at the cost
  of latency of such requests.

Some Notes: 
  - This feature should get linked to the priority based i/o scheduler when
    that goes in, in order to make sure that the i/os really get scheduled
    as per the priorities.
  - The priority of a request is specified relative to (and is lower than)
    the process priority, so it can't starve other process's requests etc when
    passed down to the i/o scheduler for example. Besides the i/o scheduler
    would also have some kind of aging scheme of its own, or translate
    priorities to deadlines or latency estimates to handle things fairly.  
  - [TBD: Priorities typically indicate hints or expectations unlike 
     i/o barriers or synchronized i/o reqmts for strict ordering (except 
     possibly for real time applications ?) ]
  - POSIX says that the same priority requests to a character device should 
    be handled fifo. 
  - As John Myers suggested, considering priorities on the event delivery
    path in itself may be useful even without control on i/o scheduling. This
    aspect could possibly be implemented early, since it would be needed in 
    any case in the complete implementation to make sure that priorities are
    respected all the way through from initiation to completion processing.
    (See point 5 on prioritized event delivery under Sec 2.5)
  - To account for priorities at the intermediate steps in the async
    state machine, multiple priority task queues could be used instead of
    a single task queue to drive the steps.
   
Beyond these restrictions and hints, sequencing is up to the system, with
dual goals: 
- Maximize throughput (global decision)
- Minimize latency (local, for a request) 

There are inherent tradeoffs between the above, though improving system 
throughput could help with average latency, provided pipeline startup time 
isn't significant. A balanced objective could be to maximize throughput within 
reasonable latency bounds.

Since each operation may involve several steps which could potentially
run into temporary resource contention or availability delay points, the
sequence in which operations complete, or even reach the target device are
affected by system scheduling decisions in terms of resource acquisition 
at each of these stages.

Note/TBD: Since the current implementation uses event threads to drive 
stages of the async state machine, in situations where a sub-step isn't 
completely non-blocking (as desired), then the implementation ends up 
causing some degree of serialization, or rather further accentuating the 
order in which the requests reached the sub-step. This may seem
reasonable and possibly even beneficial for operations that are likely 
to contend for the same resources (e.g requests to the same device), 
but not optimal for requests that can proceed in a relatively independent 
fashion. The eventual objective is to make sure that sub-steps are 
indeed non-blocking, and there is a plan to introduce some debugging aids 
to help enforce this. As discussed in Section 2.3, things like bmap, 
wait for request, and inode semaphore acquisition are still to be converted 
to non-blocking steps (currently a todo).

2.5 Completion/Readiness notification:

Comment: Readiness notification can be treated as a completion of an 
asynchronous operation to await readiness.

POSIX aio provides for waiting for completion of a particular request, or
for an array of requests, either by means of polling, or asynchronously
through signals. On some operating systems, there is a notion
of an I/O Completion port (IOCP), which provides a flexible and scalable way
of grouping completion events. One can associate multiple file descriptors 
with such a completion port, so that all completion events for requests on
those files are sent to the completion port. The application can thus issue
a wait on the completion port in order to get notified of any completion
event for that group. The level of concurrency can be increased simply by
increasing the number of threads waiting on the completion port. There are
also certain additional concurrency control features that can be associated
with IOCPs (as on NT), where the system decides how many threads to 
wakeup when completion events occur, depending on the concurrency limits
set for the queue, and the actual number of runnable threads at that moment.
Keeping the number of runnable threads constant in this manner protects 
against blocking due to page faults and other operations that cannot be 
performed asynchronously.

On a similar note, the DAFS api spec incorporates completion groups for 
handling async i/o completion, the design being motivated by VI completion 
queues, NT IOCPs and the Solaris aiowait interfaces. Association of an 
i/o with a completion group (NULL would imply the default completion queue) 
happens at the time of i/o submission which lets the provider know where 
to place the event when it completes, in contrast to the aio_suspend style
interface, which specifies the grouping only when waiting on completion.

This implementation for Linux makes use of a similar notion to provide
support for completion queues. There are api's to setup and destroy such
completion queues, specifying the maximum queue lengths that a queue is
configured for. Every asynchronous i/o request is associated with a completion 
queue when it is submitted (like the DAFS interfaces), and an application 
can issue a wait on a given queue to be notified of a completion event for 
any request associated with that queue.
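
A minimal user-space sketch of such a completion queue is given below. It
is not the kernel's event ring: the cq_* names and the slot reservation
scheme are assumptions made for illustration. It shows the three pieces
described above: creating a queue with a maximum length, associating each
request with the queue at submission time, and reaping completion events
for any request on that queue.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct cq_event {
	void *data;		/* caller's cookie identifying the request */
	long res;		/* result of the operation */
};

struct cq_demo {
	struct cq_event *ring;
	unsigned size;		/* maximum events the queue is set up for */
	unsigned head;		/* index of the oldest queued event */
	unsigned count;		/* events queued and not yet reaped */
	unsigned reserved;	/* in-flight requests plus queued events */
};

struct cq_demo *cq_create(unsigned maxevents)
{
	struct cq_demo *q;

	if (maxevents == 0)
		return NULL;
	q = calloc(1, sizeof(*q));
	if (q == NULL)
		return NULL;
	q->ring = calloc(maxevents, sizeof(*q->ring));
	if (q->ring == NULL) {
		free(q);
		return NULL;
	}
	q->size = maxevents;
	return q;
}

void cq_destroy(struct cq_demo *q)
{
	if (q != NULL) {
		free(q->ring);
		free(q);
	}
}

/* Associate a new request with the queue; its event slot is reserved. */
int cq_submit(struct cq_demo *q)
{
	assert(q != NULL);
	if (q->reserved == q->size)
		return -EAGAIN;
	q->reserved++;
	return 0;
}

/* Called when a submitted request finishes: post its completion event. */
void cq_complete(struct cq_demo *q, void *data, long res)
{
	struct cq_event *ev;

	assert(q != NULL && q->count < q->reserved);
	ev = &q->ring[(q->head + q->count) % q->size];
	ev->data = data;
	ev->res = res;
	q->count++;
}

/* Reap up to 'max' events without blocking; returns the number copied. */
int cq_getevents(struct cq_demo *q, struct cq_event *out, int max)
{
	int n = 0;

	assert(q != NULL && out != NULL);
	while (n < max && q->count > 0) {
		out[n++] = q->ring[q->head];
		q->head = (q->head + 1) % q->size;
		q->count--;
		q->reserved--;
	}
	return n;
}
```

Reserving the slot at submission time is one way of making sure that a
completion always has somewhere to go; events are returned in completion
order, not submission order.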

BSD kqueue (Jonathan Lemon) provides a very generic method for registering 
for and handling notification of events or conditions based on the concept
of filters of different types. This covers a wide range of conditions 
including file/socket readiness notification (as in poll), directory/file
(vnode) change notifications, process create/exit/stop notifications, signal
notification, timer notification and also aio completion notification
(via SIGEV_KEVENT). The kqueue is equivalent to a completion queue, and 
the interface allows one to both register for events and wait for (and
pick up) any events on the queue within the same call. It is rather flexible
in terms of providing for various kinds of event registration/notification 
requirements, e.g oneshot or everytime, temporary disabling, clearing 
state if transitions need to be notified, and it supports both edge and 
level triggered types of filters.

2.5.1 Some Requirements which are addressed:

1. Efficient for large numbers of events and connections
- The interface to register events to wait for should be separate from 
  the interface used to actually poll/wait for the registered events to 
  complete (unlike traditional poll/select), so that registrations can 
  hold across multiple poll waits with minimum user-kernel transfers.
  (It is better to handle this at interface definition level than 
   through some kind of an internal poll cache)

  The i/o submission routine takes a completion queue as a parameter,
  which associates/registers the events with a given completion group/queue. 
  The application can issue multiple waits on the completion queue using a 
  separate interface.

- Ability to reap many events together (unlike current sigtimedwait
  and sigwaitinfo interfaces)

  The interface used to wait for and retrieve events, can return an
  array of completed events rather than just a single event.

- Scalable/tunable queue limits - at least have a limit per queue rather
  than system wide limits

  Queue limits can be specified when creating a completion group.
  TBD: A control interface for changing queue parameters/limits (e.g
  io_queue_grow) might be useful

- Room for more flexible/tunable wakeup semantics for better concurrency
  control

  Since the core event queue can be separated from the notification mechanism, 
  the design allows one to provide for alternative wakeup semantics
  to optimize concurrency and reduce redundant or under-utilized context
  switches. Implementing these might require some additional parameters or
  interfaces to be defined. BTW, it is desirable to provide a unified interface
  for notification and event retrieval to a caller, to avoid synchronization
  complexities, even if the core policies are separable underneath in-kernel.

  [See the discussion in Sec 2.6 on wakeup policies for a more
  detailed discussion on this]

2. Enable flexible grouping of operations 	
- Flexible grouping at the time of i/o submission 
  (different operations on the same fd can belong to different groups,
  operations on different fds can belong to the same group)

- Ability to wait for at least a specified number of operations from 
  a specified group to complete (at least N vs at least 1 helps with 
  batching on the way up, so that the application can perform its post
  processing activities in a batch, without redundant context switches)

  The DAFS api supports such a notion, both in its cg_batch_wait interface
  which returns when either N events have completed, or with less than N
  events in case of a timeout, and also in the form of a num_completions 
  hint at the time of i/o submission. The latter is a hint that gets sent
  out to the server as a characteristic of the completion queue or session,
  so the server can use this hint to batch its responses accordingly.
  Knowing that the caller is interested only in batch completions helps
  with appropriate optimizations.

  Note: The Linux aio implementation today only supports "at least one" 
  and not "at least N" (e.g the aio_nwait interface on AIX). 

  The tradeoffs between responsiveness and fairness issues tend to
  get amplified when considering "at least N" type of semantics, 
  and this is one of the main concerns in supporting it. 
  [See discussion on wakeup policies later]

- Support dynamic additions to the group rather than a static or one time
  list passed through a single call 

  Multiple i/o submissions can specify the same completion group, enabling
  events to be added to the group.

  [Question: Is the option of the completion group being different from the 
  submission batch/group (i.e. per iocb grouping field) useful to have ?
  POSIX allows this]
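
  For the "at least N" wait discussed under this point, the wakeup decision
  itself is simple to state. The helper below is a hypothetical sketch (the
  name and parameters are not from DAFS or from the Linux patches) of when
  a waiter should be handed events, and how many, given a minimum count and
  a timeout in the style of cg_batch_wait.

```c
#include <assert.h>

/*
 * Decide how many events a waiter should be handed.
 *   avail     - events currently on the queue
 *   min_nr    - the waiter asked for at least this many
 *   max_nr    - room in the waiter's event array
 *   timed_out - non-zero once the waiter's timeout has expired
 * Returns 0 if the waiter should keep sleeping (or if it timed out with
 * nothing queued), otherwise the number of events to deliver.
 */
int batch_wait_count(int avail, int min_nr, int max_nr, int timed_out)
{
	assert(avail >= 0 && min_nr >= 1 && max_nr >= min_nr);
	if (avail < min_nr && !timed_out)
		return 0;		/* not enough yet, stay asleep */
	return avail < max_nr ? avail : max_nr;
}
```

  With min_nr set to 1 this reduces to the "at least one" behaviour that is
  supported today.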

3. Should also be able to wait for a specific operation to complete (without
   being very inefficient about it)

  One could either have low overhead group setup/teardown so such an operation 
  may be assigned a group of its own (costs can be amortized across multiple
  such operations by reusing the same group if possible) or provide an 
  interface to wait for a specific operation to complete.

  The latter would be more useful, though it requires a per-request wait queue 
  or something similar. The current implementation has a syscall interface 
  defined for this (io_wait), which hasn't been coded up as yet. The plan is 
  to use hashed wait queues to conserve on space. 

  There are also some semantics issues in terms of possibilities of another
  waiter on the queue picking up the corresponding completion event for this
  operation. To address this, the io_wait interface might be modified to
  include an argument for the returned event.

  BTW, there is an option of dealing with this using the group primitives 
  either in user space, or even in kernel by waiting in a loop for any event 
  in the group until the desired event occurs, but this could involve some 
  extra interim wakeups / context switches under the covers, and a user
  level event distribution mechanism for the other events picked up in the
  meantime.
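
  A sketch of that last option, under assumed names (ws_group, ws_stash,
  ws_wait_for) and with request ids standing in for real events: the waiter
  keeps pulling events from the group until its own request shows up, and
  parks everything else for later distribution.

```c
#include <assert.h>
#include <stddef.h>

#define WS_MAX 16

struct ws_group {
	int events[WS_MAX];	/* completed request ids (>= 0), oldest first */
	int head, count;
};

struct ws_stash {
	int ids[WS_MAX];	/* events picked up on behalf of other waiters */
	int count;
};

/* Pop the next completion from the group; returns -1 if none is queued. */
static int ws_get_event(struct ws_group *g)
{
	int id;

	if (g->count == 0)
		return -1;
	id = g->events[g->head];
	g->head = (g->head + 1) % WS_MAX;
	g->count--;
	return id;
}

/*
 * Wait for request 'target' using only the group primitive.  Every other
 * event reaped on the way is parked in 'stash'.  Returns 1 if the target
 * completed, 0 if the group ran dry first (a real waiter would go back to
 * sleep at that point rather than return).
 */
int ws_wait_for(struct ws_group *g, int target, struct ws_stash *stash)
{
	int id;

	assert(g != NULL && stash != NULL && target >= 0);
	while ((id = ws_get_event(g)) >= 0) {
		if (id == target)
			return 1;
		assert(stash->count < WS_MAX);
		stash->ids[stash->count++] = id;
	}
	return 0;
}
```

  The stash is exactly the "user level event distribution mechanism" that
  this approach forces on the application.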

4. Enable Flexible distribution of responsibility across multiple 
   threads/components

  Different threads can handle submission for different operations,
  and another pool of threads could wait on completion.
  The degree of concurrency can be improved simply by increasing threads
  in the pool that wait for and process completion of operations for 
  that group.

5. Support for Prioritized Event Delivery

   This involves the basic infrastructure to be able to accord higher
   priority to the delivery of certain completion events over others,
   (e.g. depending on the request priority settings of the corresponding 
   request), i.e. if multiple completion events have arrived on the
   queue, then the events for higher priorities should be picked up
   first by the application.  

   TBD/Todo:
   One way of implementing this would be to have separate queues
   for different priorities and attempt to build an aggregate (virtual)
   queue. There are some design issues to be considered here as in any 
   scheduling logic, and this needs to be looked at in totality in 
   conjunction with some of the other requirements. For example, things 
   like aging of events on the queue, could get a little complex to do.

   One of the approaches under consideration is to try to handle the 
   interpretation of priorities in userspace, leaving some such decisions
   to the application. It is the application which decides the limits 
   for each of the queues, so the kernel avoids having to handle that 
   or balance space across the queues. Only kernel support for making 
   a multiplexed wait on a group of completion queues possible
   might suffice to get this to work. Ben has in mind a rather generic
   way of doing this (across not just completion queues, but also possibly
   across other sorts of waits) by providing primitives that expose the 
   richness of the kernel's wait queue interfaces directly to userspace.

   The idea is that something like the following would become possible:
 
 	user_wait_queue_t wait;
 	int ret;
 
 	add_wait_queue(high_pri_ctx, wait);
 	add_wait_queue(low_pri_ctx, wait);
 
 	ret = process_wait();	/* call it schedule() if you want */
 
 	while (vsys_getevents(high_pri_ctx...) > 0)
 		...
 	...
 
   i.e., a very similar interface to what the kernel uses which can be mixed 
   and matched across the different kinds of things that need to be waited 
   upon (locks, io completion, etc). Such a mechanism can also be used for 
   building the more complex locks that glibc needs to provide 
   efficiently without sacrificing a rich and simple interface.

   Notice, that for true aio_req_prio, the kernel would have to
   be aware of completion queue priorities, but that it may still be
   possible for the order in which events are picked up (across the
   queues) to be handled by the application.

   BTW, another possibility is to maintain a userland queue (or set of 
   queues, for each priority), into which events get drawn in whenever 
   events are requested and then later distributed/picked up by the 
   application's threads. One of the tricky issues with such multi-level 
   queues is handling flow control, which is not very appealing.

   (Interestingly Viro's suggested interface (3.2) also deals with composite 
    queues. Just one level of aggregation suffices for the prioritized
    delivery requirements, while Viro's interface supports multiple
    levels of aggregation. )
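
   As a rough illustration of leaving priority interpretation to the
   application, the following user-space sketch drains a high priority
   queue before a low priority one. The sketch_* names and the fixed-size
   queues are hypothetical stand-ins for completion contexts and are not
   part of the aio patches; the multiplexed wait itself is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_QLEN 8

/* Hypothetical stand-in for one completion queue (oldest event first). */
struct sketch_queue {
	int	events[SKETCH_QLEN];
	size_t	head, count;
};

/* Queue an event; returns -1 if the queue is full. */
int sketch_push(struct sketch_queue *q, int ev)
{
	if (q->count == SKETCH_QLEN)
		return -1;
	q->events[(q->head + q->count) % SKETCH_QLEN] = ev;
	q->count++;
	return 0;
}

/* Take the oldest event; returns 1 if one was available, else 0. */
int sketch_pop(struct sketch_queue *q, int *ev)
{
	if (q->count == 0)
		return 0;
	*ev = q->events[q->head];
	q->head = (q->head + 1) % SKETCH_QLEN;
	q->count--;
	return 1;
}

/*
 * Pick up to max events, always preferring the high priority queue.
 * Returns the number of events stored in out[].
 */
size_t sketch_pick_prioritized(struct sketch_queue *high,
			       struct sketch_queue *low,
			       int *out, size_t max)
{
	size_t n = 0;

	assert(high != NULL && low != NULL);
	while (n < max &&
	       (sketch_pop(high, &out[n]) || sketch_pop(low, &out[n])))
		n++;
	return n;
}
```

   Note that strict preference like this can starve the low priority
   queue; aging, as mentioned above, would need extra bookkeeping.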


2.6 Wakeup Policies for Event Notification

2.6.1 The wakeup policy used in this implementation

The design is geared towards minimizing the latency of completion 
processing, which is directly related to the responsiveness of the 
(server) application to events. Ensuring fairness (or even starvation 
avoidance) is not expected to be an issue with the expected application 
model of symmetric server threads (i.e. threads which take the same actions 
on completion of given events), except in so far as it affects load 
balancing which in turn could affect latency. 

[TBD: I'm not sure of this, but starvation may be an issue when it
comes to non-symmetric threads, where the event is a readiness
indicator which the thread uses to decide on availability of space 
in order to push its data or something of that sort.]

The wakeup policy in the current implementation is to wake up a thread
on the completion queue whenever an i/o completes. Whichever thread picks up
the event first (this could even be a new caller who wasn't already waiting 
on the queue) gets it, and if no events are available for a thread to pick
up, it goes back to sleep again. By ensuring that the thread which gets to
the event earliest picks it up, this keeps the latency minimal. Also, in view
of better cache utilization, the wait queue mechanism is LIFO by default. 
(A new exclusive LIFO wakeup option has been introduced for this purpose.)
Making the wakeups exclusive reduces some contention and spurious wakeups.

When events are not coming in rapidly enough for all the threads to get 
enough events that they can use up their full time slice processing, there 
is a likelihood of some contention and redundant or rather underutilized 
context switches. (While it just might happen that a thread gets deprived 
of events as other threads keep picking them up instead, as discussed, that 
may not be significant, and probably just an indicator that the number of 
threads could be reduced. )

In the situation when there are a lot of events in the queue, then every
thread tries to pick up as many events as it can (up to the number specified
by the caller), but one at a time. The latter aspect (of holding the lock
across the acquisition of only a single event at a time) helps with some amount
of load balancing (for event distribution, or completion work) on SMP when 
these threads are running in parallel on multiple CPUs.
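
The one-event-per-lock-hold pickup can be sketched in user space as
follows. This is only an illustration under assumptions: the sketch_*
names are hypothetical, a C11 atomic_flag stands in for the queue
spinlock, and sleeping/wakeup is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_RING 16

/* Hypothetical user-space model of a completion queue. */
struct sketch_cq {
	atomic_flag	lock;		/* stands in for the queue spinlock */
	int		ring[SKETCH_RING];
	size_t		head, count;
};

void sketch_cq_init(struct sketch_cq *cq)
{
	atomic_flag_clear(&cq->lock);
	cq->head = cq->count = 0;
}

static void sketch_lock(struct sketch_cq *cq)
{
	while (atomic_flag_test_and_set_explicit(&cq->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void sketch_unlock(struct sketch_cq *cq)
{
	atomic_flag_clear_explicit(&cq->lock, memory_order_release);
}

/* Completion side: queue one event; returns 0 if the ring is full. */
int sketch_cq_post(struct sketch_cq *cq, int ev)
{
	int ok = 0;

	sketch_lock(cq);
	if (cq->count < SKETCH_RING) {
		cq->ring[(cq->head + cq->count) % SKETCH_RING] = ev;
		cq->count++;
		ok = 1;
	}
	sketch_unlock(cq);
	return ok;
}

/*
 * Pick up to max events. The lock is dropped after every single event,
 * so other threads can interleave their own pickups.
 */
size_t sketch_cq_pick(struct sketch_cq *cq, int *out, size_t max)
{
	size_t n = 0;

	assert(out != NULL);
	while (n < max) {
		int got = 0;

		sketch_lock(cq);
		if (cq->count > 0) {
			out[n] = cq->ring[cq->head];
			cq->head = (cq->head + 1) % SKETCH_RING;
			cq->count--;
			got = 1;
		}
		sketch_unlock(cq);
		if (!got)
			break;	/* queue drained */
		n++;
	}
	return n;
}
```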

2.6.2 TBD: Note on at least N semantics:

In some situations where an application is interested in batch results
and overall throughput (vs responsiveness to individual events), an "at
least N" kind of wakeup semantics, vs "at least one" can help amortize the cost
of a wakeup/context switch across multiple completions. (This is better than
just a time based sleep which doesn't have any correlation with the i/o
or event completion rates - one could have too many events building up or
perhaps too little depending on the load). This makes sense when the amount
of post-processing on receipt of an event is very small and the resulting 
latency is tolerable (combination of timeout +  N lets one specify the
bounds), so the application would rather receive notifications in batches.

Things get a little tricky when trying to define the policies for "at
least N" when multiple threads are involved, possibly with different values
of N (though that is not a typical situation), in terms of event distribution,
simply because the tradeoffs between latency and fairness tend to widen in
this case. 

A natural extension of the current scheme to an at least N scheme 
would be to wake up only waiters whose "N-value" is matched or exceeded by 
the number of events available, and then have them try to pick up their N
events in one shot (i.e. atomically) if available or go back to sleep. If a
thread finds more events available after it picks up its N events, or after it
times out, then just as before it keeps picking up as many events as it can 
(up to the specified limit) but one at a time. This helps reduce the load 
balancing vs batching conflict (the policy is batch up to N, balance beyond 
that). 
[TBD: Implementing the "check for N and wakeup" scheme above correctly in
 the presence of exclusive waits may require support in the wait queue
 wakeup logic to account for the status returned by a wait queue function
 to decide if the entry should be treated as done/woken up. The approach
 would be that the earliest waiter whose conditions are satisfied would
 get woken up]
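
A small sketch of the "check for N and wakeup" decision, assuming the
waiters are kept in arrival order. The structure and function names are
hypothetical; the real change would live in the wait queue wakeup logic
and would also have to deal with exclusive waits and timeouts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for a thread sleeping on the completion queue. */
struct sketch_waiter {
	unsigned	min_nr;	/* the waiter's "N-value" */
	int		woken;	/* already chosen by an earlier wakeup */
};

/*
 * Wake the earliest waiter whose N-value is satisfied by the number of
 * events currently available. Returns its index, or -1 if no sleeping
 * waiter qualifies (everyone keeps sleeping).
 */
int sketch_wake_at_least_n(struct sketch_waiter *w, size_t nr_waiters,
			   unsigned available)
{
	size_t i;

	assert(w != NULL || nr_waiters == 0);
	for (i = 0; i < nr_waiters; i++) {
		if (w[i].woken)
			continue;
		if (w[i].min_nr <= available) {
			w[i].woken = 1;
			return (int)i;
		}
	}
	return -1;
}
```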

Obviously the possibility of starvation is relatively more glaring 
in this case, than with at-least-one, e.g. consider the case when 2N-1 events 
are just picked up by one thread, while the other thread is idle, and the 
2Nth event comes in just then. As mentioned earlier, starvation is not an 
issue in itself, but the load balancing implication is worth keeping in 
mind. The maximum number of events requested in one shot and the timeout 
provide the bounds on this sort of a thing from an application's 
perspective. (BTW, The DAFS cg_batch_wait interface is "exactly N", which 
is one other way of handling this; actually it is exactly N or less on a
timeout)

Notice that trying to implement at-least-N semantics purely in user space 
above at-least-one primitives with multiple waits has latency issues in the
multiple waiters case (besides the extra wakeups/context switches). In the
worst case, with m threads, the latency for actual completion processing
(where completion processing happens in batches of N events), could be 
delayed up to the arrival of the m*N-1 th event.

Remark: "At least N" is still a TBD. 

2.6.3 TBD: Load/Queue Length based wakeup semantics:

This is another option, from a networking analogy, where the system could
tune the N-value for wakeup on a queue, based on event rates or space 
available to queue more requests. This is however based on the expectation
that completion processing would trigger a fresh batch of aio requests on the
queue.

Note: Being able to wait on a specific aio, or a submit and wait
for all the submitted events to complete (the way it is supported in 
BSD kqueues) are other interfaces that could potentially reduce 
the number of context switches, and are useful in some situations
(not implemented as yet).


2.6.4 TBD/Future: Per Completion Queue Concurrency Control

There have been some thoughts about achieving IOCP concurrency control 
via associated scheduling group definitions, independently of aio completion 
queue semantics, so an application could possibly choose to use both aio 
and scheduling groups together. 

This might make sense because the system has no persistent association of 
the completion queue with threads that aren't waiting on that queue.
Implicit grouping (e.g. association of a thread with the last ctx it
invoked io_getevents on) is possible, but does make some assumptions 
(even if these might reflect the most typical cases) on the way the 
application threads handle completions and IOCP waits. 

On the other hand as John Myers indicated, a pure scheduling group feature 
that only looks at wakeups, without knowledge of the reason for the wakeup 
(the ability to distinguish between more events/work coming in which can 
be handled by any one from the set of threads, or indicating completion 
of synchronous actions meant for a specific thread) may not be able to 
take the kind of more informed decisions that a more tightly coupled 
feature or abstraction that operates at a slightly higher level can.

One way to solve this would be for the scheduling group implementation
(if and when it is implemented) to also allow for (in-kernel) priority 
indicators for waiters (or the wait queues, whichever seems appropriate), 
so that it can handle such decisions. Components like aio could take care of 
setting up such priorities as it sees fit (e.g accord lower priorities for 
the completion wait queue waits) to cause the desired behaviour.


2.7 Other Goals

- Support POSIX as well as completion port style interfaces/wrappers
  The base kernel interfaces are designed to provide the minimum native support
  required for the library to implement both styles of interfaces.
- Low overhead for kernel and user 
  [Potential todos: Possibly via an mmaped ring buffer, vsyscalls]
- Extensible to newer operations  
  e.g. aio_sendfile, aio_readv/writev + anything else that seems
  useful in the future (semaphores, notifications etc ?)

3. Interfaces 

3.1 The interfaces provided by this implementation

The interfaces are based on a new set of system calls for aio:

- Create/Setup a new completion context/queue. This completion context 
  can only be shared across tasks that share the same mm (i.e. threads).
	__io_setup(int maxevents, io_context_t *ctxp)

- Submit an aio operation. The iocb describes the kind of operation and
  the associated parameters. The completion queue to associate the
  operations with is specified too.
	__io_submit(io_context_t ctx, int nr, struct iocb **iocbs)

- Retrieve completion events for operations associated with the completion
  queue. If no events are present, then wait for up to the timeout for at
  least one event to arrive.
	__io_getevents(io_context_t ctx, int nr, struct io_event *events,
		struct timespec *timeout)

- Wait up to the timeout for the i/o described by the specific iocb to complete.
  [Ques: Should this interface be modified to retrieve the event as well ?]
	__io_wait(io_context_t ctx, struct iocb *iocb, 
		struct timespec *timeout)

- Cancel the operation described by the specified iocb. 
 	__io_cancel(io_context_t ctx, void *iocb)

- Teardown/Destroy a completion context/queue (happens by default upon
  process/mm exit)
  Pending requests would be cancelled if possible, and the resources would
  get cleaned up when all in-flight requests get completed/cancelled.
  Naturally any unclaimed events would automatically be lost.
	__io_destroy(io_context_t ctx)

The library interface that a user sees is built on top of the above system
calls. It also provides a mechanism to associate callback routines with the 
iocb's which are invoked in user space as part of an event processing loop 
when the corresponding event arrives. There are helper routines (io_prep_read,
io_prep_write, io_prep_poll, etc) which can be used for filling up an iocb 
structure for a given operation, before passing it down through io_submit.

Please refer to the aio man pages for details on the interfaces and how
to use them [Todo: Reference to man pages from Ben]

POSIX aio is implemented in user space library code over these basic
system calls. This involves some amount of book-keeping and extra threads
to handle some of the notification work (e.g. SIGEV_NOTIFY is handled by
sending the notification signals to the kernel from the user space thread).
(Note: The plan is to add support for direct signal delivery from the 
kernel for aio requests in which case this dependence on pthreads would
change)

3.2 Extending the Interfaces

Alternatives to using system calls for some of the aio interfaces,
particularly the event polling/retrieval pieces include implementing a 
pseudo device driver interface (like the /dev/poll and /dev/epoll 
approaches), or a pseudo file system interface. A system call approach 
appears to be a more direct and clear-cut interface than any specialized 
device driver ioctl or read/write operations approach, which was one of the 
reasons why the possibility of a /dev/aio was abandoned during aio 
development.

TBD/Future:
The filesystem namespace based approach that Al Viro has suggested for
i/o multiplexors (for flexible and scalable event polling/notification), 
provides for some interesting features over aio completion queues 
like naming, sharing (across processes rather than just threads),
access control, persistence, and hierarchical grouping (i.e. more than 
just a single level of grouping). The model uses AF_UNIX socket 
sendmsg/recvmsg calls with specific datagram formats (SCM_RIGHTS datagrams)
on the namespace objects instead of any new apis for registration and polling
of events. The interface is defined so that recvmsg gets a set of new
open descriptors for each of the underlying channels with events. This
makes it feasible to share event registrations across processes, since the
fd used to register the event needn't be available when the event is 
picked up. 

However, it still would make sense to have a separate mechanism for 
async i/o and associated notifications. Possibly if something like the 
above is implemented, one could consider ways of associating aio completion
queues with it, if that fits semantically, or move things like async
poll out of aio in there. Most of the aio operations (other
than async poll today, and possibly aio_sendfile later) involve user
space buffers, so sharing across processes may not make much sense, except
perhaps in the case of shared memory buffers.

4. Design Internals

4.1 Low Level Primitives :

4.1.1 wait_queue functions

This primitive is based on an extension to the existing wait queue scheme.
The idea is that both asynchronous and synchronous waiters just use the 
same wait queue associated with any given data structure transparent 
to the caller of wakeup. (This avoids the need to attach new notify/fasync
sort of structures for every relevant operation/data structure involved 
in the async state machine)

To support asynchronous waiters, the wait queue entry structure now 
contains a function pointer for the callback to be invoked for async 
notification. The default action, in case such a callback is not specified, 
is to assume that the entry corresponds to a synchronous waiter (as before) 
and to wake it up accordingly. The callback runs with interrupts disabled 
and with the internal wait queue spinlock held, so the amount of work done 
in the callback is expected to be very restricted. Additional spinlocks 
should be avoided. The right thing to do if more processing is required 
is to queue up a work-to-do action to be run in the context of an event 
thread (see next section). Extreme caution is recommended in using wait 
queue callbacks as it is rather prone to races if not used with care.

There is a routine to atomically check for a condition and add a wait queue
entry if the condition is not met (add_wait_queue_cond). The check for the 
condition happens with the internal wait queue spin lock held. This avoids 
missing events between the check and addition to the wait queue, which could 
be fatal for the async state machine. The standard way of handling the 
possibility of missed events with synchronous waiters was to add the wait 
queue entry before performing the check for the condition, and to just
silently remove the entry thereafter if the condition has already been 
met. However in the case of async waiters where the follow on action happens 
in the wait queue function, this could lead to duplicate event detection, 
which could be a problem if the follow on action is not defined to be 
idempotent. The add_wait_queue_cond() feature helps guard against this.
[Note: An associated implication of this is that checks for wait_queue_active
outside of the internal wait queue lock are no longer appropriate, as it 
could lead to a missed event]
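
The atomic check-and-add can be illustrated with the user-space sketch
below. It is not the add_wait_queue_cond() code from the patches: the
sketch_* names, the atomic_flag used as the wait queue lock, and the
convention of passing the condition as a flag pointer are all
assumptions made for the illustration. The point shown is that the
signaller changes the condition and runs callbacks under the same lock,
so a waiter can neither miss the event nor see it twice.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_wait_entry {
	struct sketch_wait_entry *next;
	void	(*func)(struct sketch_wait_entry *self); /* async callback */
	int	fired;
};

struct sketch_wait_queue {
	atomic_flag		 lock;	/* stands in for the wq spinlock */
	struct sketch_wait_entry *first;
};

void sketch_wq_init(struct sketch_wait_queue *q)
{
	atomic_flag_clear(&q->lock);
	q->first = NULL;
}

static void sketch_wq_lock(struct sketch_wait_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock,
						 memory_order_acquire))
		;
}

static void sketch_wq_unlock(struct sketch_wait_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/*
 * Queue the entry only if the condition is not yet met; the check and
 * the add happen under one lock hold. Returns 1 if the condition was
 * already met (entry not queued), 0 if the entry was queued.
 */
int sketch_add_wait_queue_cond(struct sketch_wait_queue *q,
			       struct sketch_wait_entry *e,
			       const int *cond_flag)
{
	int already_met;

	assert(e->func != NULL);
	sketch_wq_lock(q);
	already_met = (*cond_flag != 0);
	if (!already_met) {
		e->next = q->first;
		q->first = e;
	}
	sketch_wq_unlock(q);
	return already_met;
}

/* Signal side: set the condition and run the callbacks under the lock. */
void sketch_signal(struct sketch_wait_queue *q, int *cond_flag)
{
	struct sketch_wait_entry *e;

	sketch_wq_lock(q);
	*cond_flag = 1;
	while ((e = q->first) != NULL) {
		q->first = e->next;	/* dequeue before calling back */
		e->next = NULL;
		e->func(e);
	}
	sketch_wq_unlock(q);
}

/* Example callback: just counts how often the entry was signalled. */
void sketch_mark_fired(struct sketch_wait_entry *self)
{
	self->fired++;
}
```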

The wait queue callback should check for the occurrence of the condition
before taking action just as in typical condition wait/signal scenarios.
Notice that the callback is responsible for pulling the entry off the wait
queue once it has been successfully signalled, unlike the synchronous case
where queueing and dequeueing happens in the same context.

4.1.2 Work-to-dos (wtd) for async state machine

Work-to-dos provide the basic abstraction for representing actions for 
driving the async state machine through all the steps needed to complete
an async i/o operation. The design of work-to-dos in this aio implementation
is based on suggestions from Jeff Merkey for implementing such async state
machines, and is modelled on the approach taken in Ingo Molnar's 
implementation of TUX.

As mentioned in the previous section, because of restricted conditions 
under which wait queue functions are called, it isn't always possible to 
drive steps of the async state machine purely through wait queue functions. 
Instead the wait queue function in turn could queue a work-to-do action 
to be invoked in a more suitable context, typically by a system worker thread.
This is achieved using the task-queue primitives on Linux. Currently aio
just uses the same task queue which is serviced by keventd (i.e. the context 
task queue). In the future this could possibly be handled by a pool of 
dedicated aio system worker threads. [TBD: Also, priorities may be supported
by having multiple task queues of different priority]


struct wtd_stack {
	void	(*fn)(void *data);  /* action */
	void	*data; /* context data for the action */
};

struct worktodo {
	wait_queue_t		wait;  /* this gets linked to the wait
					queue for the event which is
					expected to trigger/schedule this
					wtd */
	struct tq_struct	tq;    /* this gets linked to the task
					  queue on which the wtd has to be
					  scheduled (context_tq today) */

	void			*data;	/* for use by the wtd_ primitives */

	/* The stack of actions */
	int			sp;  
	struct wtd_stack	stack[3];
};


A typical pattern for the asynchronous version corresponding to a 
synchronous operation consisting of a set of non-blocking steps with 
synchronous waits between steps, could be something like the following:
(Let's label this pattern as A)
- Initiate step 1, and register an async waiter or callback
- Async waiter completes and queues a work-to-do for the next step
- The work-to-do initiates step 2 when it gets serviced, and registers an
  async waiter or callback to catch completion
- Async waiter/callback initiates step 3 ... 

.. and so on till step n.

Of course, there are other possible patterns, e.g where the operation can be
split off into multiple independent sub-steps which can be initiated at the
same time, and then use callbacks/async waiters to collect/consolidate the
results and if required queue a work-to-do action after that to drive the
follow up action. (Let's label this pattern B)

The work-to-do structure is designed so that state information can be passed
along from one step to the next (unlike synchronous operations, state 
can't be carried over on stack in this case). There is also support for 
stacking actions within the same work-to-do structure. This feature has
been used in the network aio implementation (which is currently under a
revamp) to enable calling routines to stack their post completion actions 
(and associated data) before invoking a routine that might involve an async 
wait. For example, consider a nested construct of the form:

func1()
{
	func2();
	post_process1();
}

func2()
{
	func3();
	post_process2();
}

func3()
{
	process;
	wait for completion;
	post_process3();
}

The asynchronous version of the above could have the following pattern, 
assuming that a worktodo structure is shared/passed on in some manner down 
the levels of nesting:

- func1 initializes the worktodo with the action post_process1(), before
  calling func2
- func2 pushes the action post_process2() on the worktodo stack before
  calling func3
- func3 pushes the action post_process3() on the worktodo stack
- func3 then replaces its synchronous wait by setting up an asynchronous 
  waiter which would schedule the worktodo sequence
- the worktodo sequence simply pops each action by turn and executes it 
  to achieve the desired effect. 
(Let's label this pattern C)
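
Pattern C can be sketched in user space roughly as below. The layout
mirrors the wtd_stack/worktodo structures shown earlier, but the
sketch_* names, the push/run helpers and the trace used to record the
order of execution are hypothetical and only meant to show the
stacking idea; the wait queue and task queue linkage is left out.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_WTD_DEPTH 3

struct sketch_wtd_stack {
	void	(*fn)(void *data);	/* action */
	void	*data;			/* context data for the action */
};

struct sketch_worktodo {
	int			sp;	/* number of stacked actions */
	struct sketch_wtd_stack	stack[SKETCH_WTD_DEPTH];
};

/* Stack a post-completion action; returns -1 if the stack is full. */
int sketch_wtd_push(struct sketch_worktodo *wtd,
		    void (*fn)(void *data), void *data)
{
	assert(fn != NULL);
	if (wtd->sp >= SKETCH_WTD_DEPTH)
		return -1;
	wtd->stack[wtd->sp].fn = fn;
	wtd->stack[wtd->sp].data = data;
	wtd->sp++;
	return 0;
}

/*
 * Run once the awaited event has fired: pop and execute each action,
 * most recently pushed first. A loop is used rather than recursion.
 */
void sketch_wtd_run(struct sketch_worktodo *wtd)
{
	while (wtd->sp > 0) {
		struct sketch_wtd_stack *top = &wtd->stack[--wtd->sp];

		top->fn(top->data);
	}
}

/* Example action: append a one-character tag to a trace buffer. */
struct sketch_trace {
	char	buf[8];
	size_t	len;
};

struct sketch_step {
	struct sketch_trace	*trace;
	char			tag;
};

void sketch_post_process(void *data)
{
	struct sketch_step *s = data;

	if (s->trace->len < sizeof(s->trace->buf) - 1)
		s->trace->buf[s->trace->len++] = s->tag;
}
```

With the actions for post_process1, post_process2 and post_process3
pushed in that order, running the worktodo executes them as 3, 2, 1,
which matches the order of the synchronous nested version.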

Some caution is needed when using the async waiter + work-to-do combination,
e.g maintaining the 1-1 association with an event and the queueing of the
worktodo, and guards against duplication or event misses (as discussed in the
previous section). Also, one needs to be very careful about recursions in
the chained operations (can't have stack overflows in the kernel).

4.2 Generic async event handling pieces

4.2.1 The completion queue

The in kernel representation of the completion queue structure (kioctx),
contains a list of in-use (active) and free requests (where each request
is the in-kernel iocb representation, i.e. kiocb), and also a circular
ring buffer, where completion events are queued up as they arrive, and
picked up in FIFO order. There is a per kioctx wait queue which is used to 
wait for events on that queue. The reference count of a kioctx is incremented
when it is in use (i.e. when there are pending requests).

A completion queue is associated with the mm struct for the concerned task,
thus threads which share the same address space also share completion
queues. The ctx_id is unique per-mm. The completion queues for a given
address space are linked together with the list grounded in the mm struct.
On process exit (i.e. when the mm users count goes to zero), the completion 
queue is released (the actual free could happen a little later depending on 
the reference count, i.e. in case the kioctx is in use).

The ring buffer is designed to be virtually contiguous, so if necessary
(i.e. if the higher order page allocation needed to accommodate the specified
number of events fails) it may be vmalloc'ed. The requests/kiocbs are 
also preallocated when the kioctx is created, but these needn't be contiguous
and are allocated from slab.


4.2.2 I/O Submission, Completion and Event Pickup

New requests can be submitted only if there is enough space left in the
ring buffer to accommodate completion events for all pending requests as 
well as the new one.

The io_submit interface invokes the corresponding async file op based on
the operation code specified in the iocb. The file descriptor reference
count is incremented to protect against the case where the process exits and 
closes the file while i/o is still in progress. In such a scenario the 
request, file descriptor and the kioctx state are not freed immediately, 
but in a deferred manner as and when the completions (or cancellation 
possibly once that is supported) happen, and it is safe to do so. 

When the operation completes, the corresponding completion path (via
async waiters or worktodos) invokes aio_complete, which takes care of
queuing the completion status/event at the end of the ring buffer, 
waking up any threads that may be waiting for events on the queue, 
releasing the request and other related cleanups (e.g decrementing the
file descriptor reference count).

When the io_getevents interface is invoked for harvesting events, it picks
up completion events available in the circular ring buffer (i.e. from
the head of queue), or waits for events to come in, depending on the wakeup
and event distribution policies discussed in Sec 2.6.
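
The space accounting implied by the submission rule can be sketched as
follows. The counters and sketch_* function names are hypothetical and
only model the bookkeeping; they do not reflect the actual kioctx
fields. The idea shown is that each accepted request reserves room for
its completion event, so aio_complete never finds the ring full.

```c
#include <assert.h>

/* Hypothetical bookkeeping for one completion context. */
struct sketch_ctx {
	unsigned ring_slots;	/* capacity of the event ring */
	unsigned queued;	/* completion events not yet picked up */
	unsigned pending;	/* submitted requests still in flight */
};

/* Accept a new request only if its completion event is sure to fit. */
int sketch_submit(struct sketch_ctx *c)
{
	if (c->queued + c->pending + 1 > c->ring_slots)
		return -1;	/* no room reserved, reject the request */
	c->pending++;
	return 0;
}

/* Completion path: the slot was reserved at submit time. */
void sketch_complete(struct sketch_ctx *c)
{
	assert(c->pending > 0);
	c->pending--;
	c->queued++;
}

/* Event pickup: frees ring space; returns the number of events taken. */
unsigned sketch_getevents(struct sketch_ctx *c, unsigned max)
{
	unsigned n = c->queued < max ? c->queued : max;

	c->queued -= n;
	return n;
}
```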

4.2.3 TBD: User space memory mapping of the Ring Buffer

The design allows for the possibility of modifying the implementation to
allow for the events ring buffer to be mapped in user space, if that helps
with performance (avoiding some memory copies and system call overheads).
The current implementation prepares for avoiding the complexities of 
user-kernel locking in such a case by making sure that only one side 
updates any field (basically head and tail of the ring buffer), and also
banking on the assumption that reading an old value won't cause any real
harm. The Kernel/Producer updates the tail and the User/Consumer updates 
the head.  If the User sees an old value of tail, it may not see some just 
arrived events, which is similar to the case when the events haven't 
arrived, and so harmless. If the Kernel sees an old value of Head, then it 
may think there isn't enough space in the queue and will try again later. 
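
A minimal single-producer/single-consumer sketch of the "only one side
updates each index" idea, using C11 atomics in user space. The names,
the ring size and the choice of memory orderings are assumptions made
for the illustration and are not taken from the implementation; one
slot is kept empty to tell a full ring from an empty one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_NR 4	/* ring slots; at most SKETCH_NR - 1 are in use */

struct sketch_ring {
	_Atomic unsigned head;	/* written only by the consumer (user) */
	_Atomic unsigned tail;	/* written only by the producer (kernel) */
	int		 events[SKETCH_NR];
};

/* Producer: returns 0 if the ring looks full (possibly a stale head). */
int sketch_ring_put(struct sketch_ring *r, int ev)
{
	unsigned tail, head, next;

	assert(r != NULL);
	tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	head = atomic_load_explicit(&r->head, memory_order_acquire);
	next = (tail + 1) % SKETCH_NR;
	if (next == head)
		return 0;	/* harmless: caller retries later */
	r->events[tail] = ev;
	/* publish the slot only after its contents are written */
	atomic_store_explicit(&r->tail, next, memory_order_release);
	return 1;
}

/* Consumer: returns 0 if the ring looks empty (possibly a stale tail). */
int sketch_ring_get(struct sketch_ring *r, int *ev)
{
	unsigned head, tail;

	assert(r != NULL && ev != NULL);
	head = atomic_load_explicit(&r->head, memory_order_relaxed);
	tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	if (head == tail)
		return 0;	/* harmless: same as "not arrived yet" */
	*ev = r->events[head];
	atomic_store_explicit(&r->head, (head + 1) % SKETCH_NR,
			      memory_order_release);
	return 1;
}
```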

TBD: As Andi Kleen observed, schemes like this could be rather fragile and 
hard to change as past experience with such optimizations in the networking 
code have indicated where proper spin locks had to be added eventually. 
So we need to understand how significant a performance benefit is achieved 
by moving to a user space mapped ring buffer to decide if it is worth it.

4.3 In-kernel interfaces 

4.3.1 Operations 

The in-kernel interfaces that were added for aio implementation include
the following:

- New filesystem operations (fops) to support asynchronous read/write
  functions (f_op->aio_read/write)
- Several helper routines for manipulating and operating on kvecs, the common 
  i/o currency discussed in Sec 2.2.2 (e.g. mapping user space buffers to
  kvecs, copying data to/from/across kvecs i.e. *kvec_dst* routines)
- New filesystem read/write operations (fop->kvec_read/write) which 
  operate directly on kvec_cb's in asynchronous mode. These are the 
  operations that have to be defined for different file types, e.g 
  raw aio, buffered filesystem aio and network aio. 

The f_op->read/write operations are expected to be changed to support an
F_ATOMIC flag which can be used to service an aio operation synchronously
if it can be done in a non-blocking manner. This provides a fast path that
avoids some of the overheads of async state machine when the operation can
complete without blocking/waiting at all. Currently, F_ATOMIC is implemented
via f_op->new_read/write for compatibility reasons. 
[Todo: The plan is to add f_ops->flags_supported to enable read/write 
 operations to be converted wholesale without requiring additional code to
 check for supported operations in all callees.]

The generic f_op->aio_read/write operations, first attempt the non-blocking
synchronous path described above, and take the async route only if it fails
(i.e. returns an error indicating that the operation might block). In that
case they convert the user virtual address range to a kvec and then invoke 
the appropriate async kvec fops. Notice that this mechanism should be 
extendable to readv/writev in a relatively lightweight manner (compared to 
kiobufs), though aio readv/writev is still a Todo right now.
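
The "try the non-blocking path first, go async only if it would block"
flow can be sketched as below. Everything here is a hypothetical
user-space model: the sketch_* names, the use of -EAGAIN as the "might
block" indication and the two sample operations are assumptions, and
the mapping of the user buffer to a kvec is only hinted at in a
comment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sketch_result { SKETCH_DONE_SYNC, SKETCH_QUEUED_ASYNC, SKETCH_FAILED };

struct sketch_file {
	/* non-blocking attempt: bytes read, or -EAGAIN if it would block */
	long	(*read_atomic)(struct sketch_file *f, char *buf, size_t len);
	/* start the async operation; completion is reported later */
	long	(*kvec_read)(struct sketch_file *f, char *buf, size_t len);
	int	data_cached;
	int	async_queued;
};

enum sketch_result sketch_aio_read(struct sketch_file *f, char *buf,
				   size_t len, long *res)
{
	long ret;

	assert(res != NULL);
	ret = f->read_atomic(f, buf, len);	/* fast path */
	if (ret != -EAGAIN) {
		*res = ret;
		return ret < 0 ? SKETCH_FAILED : SKETCH_DONE_SYNC;
	}
	/* slow path: this is where the buffer would be mapped to a kvec */
	ret = f->kvec_read(f, buf, len);
	if (ret < 0) {
		*res = ret;
		return SKETCH_FAILED;
	}
	return SKETCH_QUEUED_ASYNC;
}

/* Sample fast path: succeeds only if the data is already cached. */
long sketch_read_atomic(struct sketch_file *f, char *buf, size_t len)
{
	size_t i;

	if (!f->data_cached)
		return -EAGAIN;
	for (i = 0; i < len; i++)
		buf[i] = 'x';
	return (long)len;
}

/* Sample async path: just records that a request was queued. */
long sketch_kvec_read(struct sketch_file *f, char *buf, size_t len)
{
	(void)buf;
	(void)len;
	f->async_queued++;
	return 0;
}
```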

4.3.2 The i/o Container Data structure, kvec_cb

The i/o unit which is passed around to the kvec fops is the kvec_cb structure.
This contains a pointer to the kvec array discussed earlier plus associated 
callback state (i.e. callback routine and data pointer) for i/o completion.

struct kveclet {
	struct page	*page;
	unsigned	offset;
	unsigned	length;
};

struct kvec {
	unsigned	max_nr;
	unsigned	nr;
	struct kveclet	veclet[0];
};

struct kvec_cb {
	struct kvec	*vec;
	void		(*fn)(void *data, struct kvec *vec, ssize_t res);
	void		*data;
};

struct kvec_cb_list {
	struct list_head	list;
	struct kvec_cb		cb;
};

The callback routine would typically be set to invoke aio_complete for
performing completion notification. For a compound operation like aio_sendfile,
which involves two i/os (input on one fd and output to the other), the 
callback could be used for driving the next stage of processing, i.e. to 
initiate the second i/o.

[TBD: With this framework, callback chaining is not inherently supported.
 Intermediate layers could save pointers to higher layer callbacks as part
 of their callback data, and thus implement chaining themselves, but a 
 standard mechanism would be preferable. ]

The *kvec_dst* helper routines which are used for retrieving or transferring
data from/to kvecs are designed to accept as argument a context structure
(kvec_dst) to maintain state related to the remaining portions to transfer. 
Since a kvec contains fragments of non-uniform size, locating the portion 
to transfer given the offset in number of bytes from the start of the kvec 
is not a single step calculation, so it's more efficient to maintain this 
information as part of the context structure. These routines also take 
care of performing temporary kmaps of veclets for memory copy operations,
as needed. 
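
The value of keeping a cursor can be seen in the following user-space
sketch. It is not the kvec_dst code: fragments here are plain memory
rather than (page, offset) pairs, so no kmap is involved, and all
sketch_* names are hypothetical. Each call resumes where the previous
one stopped instead of rescanning the fragments from the start.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a veclet: a fragment of arbitrary length. */
struct sketch_veclet {
	char	*base;
	size_t	length;
};

struct sketch_kvec {
	size_t			nr;
	struct sketch_veclet	*veclet;
};

/* Cursor remembering where the previous transfer stopped. */
struct sketch_kvec_dst {
	const struct sketch_kvec *vec;
	size_t	idx;	/* current fragment */
	size_t	off;	/* bytes already consumed in that fragment */
};

void sketch_dst_init(struct sketch_kvec_dst *d,
		     const struct sketch_kvec *vec)
{
	d->vec = vec;
	d->idx = 0;
	d->off = 0;
}

/*
 * Copy up to len bytes from the kvec into buf, starting at the cursor
 * and advancing it. Returns the number of bytes copied.
 */
size_t sketch_copy_from_kvec(struct sketch_kvec_dst *d, char *buf,
			     size_t len)
{
	size_t done = 0;

	assert(d->vec != NULL);
	while (done < len && d->idx < d->vec->nr) {
		const struct sketch_veclet *v = &d->vec->veclet[d->idx];
		size_t chunk = v->length - d->off;

		if (chunk > len - done)
			chunk = len - done;
		memcpy(buf + done, v->base + d->off, chunk);
		done += chunk;
		d->off += chunk;
		if (d->off == v->length) {	/* fragment exhausted */
			d->idx++;
			d->off = 0;
		}
	}
	return done;
}
```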

The map_user_kvec() routine is used to map a user space buffer to a kvec
structure (it allocates the required number of veclet entries). It also
takes care of bringing in the corresponding physical pages if they are swapped 
out. It increases the reference count of the page, essentially pinning it 
in memory for the duration of the io. 
(TBD/Check: Where does unmap_kvec happen ?)


4.4 Async poll

Async poll enables applications to make use of the advantages of aio 
completion queues for readiness notification, avoiding some of the 
scalability limitations and quirks of traditional poll/select. Instead of
passing in an array of <fd, event> pairs, one prepares iocbs corresponding
to each such <fd, event> pair, and then submits these iocbs using io_submit
associating them with a completion queue. Notifications can now be obtained
by waiting for events on the completion queue. Unlike select/poll, one does
not need to rebuild the event set for every iteration of the event loop; the
application just has to resubmit iocbs for the events it has already reaped, 
in case it needs to include them in the set again for the next poll wait.

The implementation is a simple extension of the existing poll/select code,
which associates an iocb with a poll table structure and replaces the 
synchronous wait on a poll table entry by an asynchronous completion 
sequence (using a wait queue function + worktodo construct) that issues 
aio_complete for the corresponding iocb thus effecting the notification. 


4.5 Raw-disk aio

The internal async kvec f_ops for raw disk i/o are implemented along the 
lines of pattern B discussed in Sec 4.1.2. The common raw_rw_kvec routine
invokes brw_kvec_async, which shoots out all the i/o pieces to the low level
block layer, and sets up the block i/o completion callbacks to take care 
of invoking the kvec_cb callback when all the pieces are done. The kvec_cb
callback takes care of issuing aio_complete for completion notification.
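
The "fire the callback when the last piece finishes" part of pattern B
can be sketched as follows. The structure is loosely modelled on the
idea of a control block that embeds the completion callback and an
outstanding-i/o count, but the sketch_* names and fields are
hypothetical, the callback signature is simplified, and a plain counter
is used where the kernel would need an atomic decrement since pieces
can complete concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified completion callback (no kvec argument in this sketch). */
struct sketch_kvec_cb {
	void	(*fn)(void *data, long res);
	void	*data;
};

struct sketch_brw_cb {
	struct sketch_kvec_cb	cb;
	unsigned		io_count;	/* pieces still outstanding */
	long			done_bytes;
	long			error;		/* first error, 0 if none */
};

void sketch_brw_start(struct sketch_brw_cb *b, unsigned pieces,
		      void (*fn)(void *data, long res), void *data)
{
	b->cb.fn = fn;
	b->cb.data = data;
	b->io_count = pieces;
	b->done_bytes = 0;
	b->error = 0;
}

/* Called once for every low level i/o piece that finishes. */
void sketch_piece_done(struct sketch_brw_cb *b, long bytes, long err)
{
	assert(b->io_count > 0);
	if (err) {
		if (!b->error)
			b->error = err;
	} else {
		b->done_bytes += bytes;
	}
	if (--b->io_count == 0)		/* last piece: notify once */
		b->cb.fn(b->cb.data, b->error ? b->error : b->done_bytes);
}

/* Example callback: records the result (stands in for aio_complete). */
struct sketch_outcome {
	int	calls;
	long	res;
};

void sketch_record_result(void *data, long res)
{
	struct sketch_outcome *o = data;

	o->calls++;
	o->res = res;
}
```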

TBD/Todo:
There is one problem with the implementation today, in that if the 
submit_bh/bio operation used by brw_kvec_async blocks waiting for 
request queue slots to become free, then it blocks the caller, so the 
operation wouldn't be truly async in that case. Fixing this is one of the
items in the current Todo list. For example, instead of the synchronous 
request wait, a non-blocking option supplemented with an async waiter for 
request queue slots, which in turn drives the corresponding i/o 
once requests are available, using state machine steps along the lines 
employed for file i/o, could be considered. 
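
One possible shape for such a non-blocking option is sketched below as a
user space model. It is purely illustrative of the idea in this Todo and
does not describe existing code: toy_req_queue, toy_waiter,
toy_get_slot_or_wait and toy_put_slot are hypothetical names, and locking
is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_WAITERS 8

/* Hypothetical async waiter: run fn(data) once a slot is available. */
struct toy_waiter {
    void (*fn)(void *data);
    void *data;
};

/* Hypothetical request queue: free slot count plus a FIFO of waiters. */
struct toy_req_queue {
    int free_slots;
    struct toy_waiter waiters[TOY_MAX_WAITERS];
    size_t head, count;
};

/* Non-blocking slot acquisition. Returns 1 if a slot was taken at once,
 * 0 if the waiter was queued and will be called back later. */
static int toy_get_slot_or_wait(struct toy_req_queue *q, struct toy_waiter w)
{
    if (q->free_slots > 0) {
        q->free_slots--;
        return 1;
    }
    assert(q->count < TOY_MAX_WAITERS);
    q->waiters[(q->head + q->count) % TOY_MAX_WAITERS] = w;
    q->count++;
    return 0;
}

/* Release a slot: hand it straight to the oldest waiter, if any. */
static void toy_put_slot(struct toy_req_queue *q)
{
    if (q->count > 0) {
        struct toy_waiter w = q->waiters[q->head];

        q->head = (q->head + 1) % TOY_MAX_WAITERS;
        q->count--;
        w.fn(w.data);   /* the waiter now owns the slot and drives its i/o */
        return;
    }
    q->free_slots++;
}

/* Example waiter action: counts how many deferred i/os were driven. */
static void toy_issue_deferred_io(void *data)
{
    (*(int *)data)++;
}
```

The submitter never sleeps in this model: it either gets a slot
immediately or returns after queueing a waiter, and the deferred i/o is
issued from the context that frees the slot.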

[Note/Todo: In the aio patches based off 2.4, brw_kvec_async sets up
buffer heads and keeps track of the io_count and the list of bhs (in 
a brw_cb structure, which also embeds the kvec_cb structure) in order 
to determine when all the pieces are done. In 2.5, it would allocate 
a bio struct to represent the entire i/o unless the size exceeds the 
maximum request size allowed, in which case multiple bios may need to be 
allocated. The bio struct could be set up to directly point to the veclet 
list in the kvec, avoiding the need to copy/translate descriptors in the 
process]

4.6 Filesystem/buffered aio

The generic file kvec f_ops (generic_file_kvec_read/write) for buffered
i/o on filesystems employ a state machine that can be considered close to
pattern A (with a mix of pattern B) discussed in Sec 4.1.2. The state 
information required through all the iterative steps of this state machine 
is maintained in an iodesc structure that is set up at the beginning, and 
passed along as context data for the worktodo actions. 

The operation first maps the page cache pages corresponding to the specified
range. These would form the source/target of the i/o operation. It maintains
a list of these pages, as well as the kvec information representing the user
buffer from/to which the transfer has to happen, as part of the iodesc 
structure, together with pointers or state information describing how much
of the transfer has completed (i/o to/from the page cache pages, and the
memcopy to/from the user buffer veclets). In case of read, the post
processing action for completion of i/o on a particular page would involve
copying the data into the user space buffer, while for write, the copy
from the user space buffer to the page happens early before committing the
writeout of the page (i.e. between prepare_write and commit_write).

Notice that the potential blocking points down the typical read/write path 
involve:

(a) Waiting to acquire locks on the concerned pages (page cache pages 
    corresponding to the range where i/o is requested) before starting i/o
(b) Waiting for the io to complete:
    - for read, this involves waiting for the page locks again 
      (indicative that the page lock has been released after i/o 
      completion), and then checking if the page is now uptodate
    - for write (O_SYNC case), this involves waiting for the page
      buffers, i.e. waiting for the writeout to complete. 
      [TBD: Currently it's really only O_DSYNC, and not meta-data sync that's
       affected ]

Each of these waits has been converted to an async wait (wtd_wait_on_page
and wtd_wait_on_buffer) operation, that triggers the next step of the i/o
(i.e. as in pattern A). Notice that this becomes multi-step when the 
i/o involves multiple pages and any of the lock acquisitions is expected to 
require a wait. Some speedup is achieved by initiating as much work as 
possible, e.g. initiating as many readpage operations as possible early on 
the read path, and initiating all the writeouts together down the write path 
before waiting for completion of any (this is where the resemblance to 
pattern B comes in).
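
The read side of this state machine can be modelled in user space as
follows. This is a simplified sketch under stated assumptions, not the
generic_file_kvec_read code: toy_iodesc is a hypothetical stand-in for the
iodesc, the 8 byte page size is chosen only to keep the example small,
and the async wait is simulated by calling toy_page_io_done directly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 8     /* tiny pages keep the example readable */
#define TOY_MAX_PAGES 4

/* Hypothetical stand-in for the iodesc: all state the steps need. */
struct toy_iodesc {
    char   pages[TOY_MAX_PAGES][TOY_PAGE_SIZE]; /* page cache pages */
    int    uptodate[TOY_MAX_PAGES];             /* set when page i/o ends */
    size_t npages;
    size_t next;        /* next page whose data must reach the user buffer */
    char  *ubuf;        /* destination user buffer */
    int    completed;   /* set where aio_complete would be issued */
};

static void toy_read_start(struct toy_iodesc *d, size_t npages, char *ubuf)
{
    assert(npages <= TOY_MAX_PAGES);
    memset(d, 0, sizeof(*d));
    d->npages = npages;
    d->ubuf = ubuf;
    /* A real implementation would initiate readpage on every page here,
     * before waiting on any of them (the pattern B flavour). */
}

/* One step of the state machine: copy out every page that is ready, in
 * order, and stop at the first page that would still require a wait. */
static void toy_read_step(struct toy_iodesc *d)
{
    while (d->next < d->npages && d->uptodate[d->next]) {
        memcpy(d->ubuf + d->next * TOY_PAGE_SIZE, d->pages[d->next],
               TOY_PAGE_SIZE);
        d->next++;
    }
    if (d->next == d->npages)
        d->completed = 1;
    /* Otherwise the step would register an async wait on page d->next
     * and return; the wakeup calls toy_read_step() again (pattern A). */
}

/* Simulated i/o completion for one page: fill it and wake the waiter. */
static void toy_page_io_done(struct toy_iodesc *d, size_t i, char fill)
{
    assert(i < d->npages);
    memset(d->pages[i], fill, TOY_PAGE_SIZE);
    d->uptodate[i] = 1;
    if (i == d->next)       /* only the page being waited on drives a step */
        toy_read_step(d);
}
```

If page 1 completes before page 0, nothing is copied until page 0 is
done; the step triggered by page 0 then copies both pages and moves on
to wait for page 2.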

Currently the filesystems modified to support aio include ext2, ext3 and 
nfs (nfs kvec f_ops internally make use of the generic_file_kvec* operations, 
after calling nfs_revalidate_inode).

[Note/Todo: There is still some work to do to make the steps non-blocking.
 The bmap/extent determination operations performed by the filesystem 
 are blocking, and the acquisition of the inode semaphore also needs to 
 be converted to a wtd based operation]

4.7 Network aio

[Todo: To be added later since the code is under a rewrite - pattern C 
in 4.1.2 ? ]

4.8 Extending aio to other operations (e.g. sendfile)

[Todo/Plan: 
 The idea here is to make use of the kvec callbacks to kick the operation
 into the next state, i.e. on completion of input from the source fd, trigger
 the i/o to the output fd. 
]

5. Performance Characteristics

[Todo: Research/Inputs required]

6. Todo Items/Pending Issues 

- aio fsync 
- aio sendfile 
- direct aio path (reorder vfs paths to have a single rw_kvec interface from
  fs when it really needs to do i/o)
- aio readv/writev
- i/o cancellation implementation (best effort; cancel i/os on process exit ?)
- io_wait implementation (needs hashed waitqueues)
- check for any races in current filesystem implementation (?)
- implementations for other filesystems 
- network aio rewrite
- in-kernel signal delivery mechanism for aio requests
- making sub-tasks truly async (waiting for request slots, bmap calls)
- debugging aids to help detect drivers which aren't totally async (e.g. 
  use semaphores - need to check which) or other sub-tasks which aren't 
  truly async
- flow control in aio (address write throttling issue)
- implementing io_queue_grow (changing queue lengths)
- mmapped ring buffer (Could lockless approaches be more fragile than
  we foresee now ? Is it worth it ? How much does it save ?)
- kernel memory pinning issue (pinning user buffers too early ? may be
  able to improve this with cross-memory descriptors once aio flow control
  is in place)
- explore at-least-N
- explore io_submit_wait
- aio request priorities (get the basic scheme in place, later relate it
  to the priority based i/o scheduler when that happens)
- user space grouping of multiple completion queues (handling priorities,
  concurrency control etc; expose wait-queue primitives to userspace)
- interfacing with generic event namespace (pollfs) approach  (viro's idea)

7. References/Related patches:

1. Dan Kegel's c10k site: (http://www.kegel.com/c10k.html)
   Talks about the /dev/epoll patch, RT signals, Signal-per-fd approach,
   BSD kqueues and lots of links and discussions on various programming 
   models for handling large numbers of clients/connections, with
   comparative studies. 
2. NT I/O completion ports, Solaris and AIX aio, POSIX aio specs 
3. SGI's kaio implementation - (http://oss.sgi.com/projects/kaio)
4. Block Asynchronous I/O: A Flexible Infrastructure for User Level Filesystems
   - Muthian Sivathanu, Venkateshwaran Venkataramani, and Remzi H. 
     Arpaci-Dusseau, Univ of Wisconsin-Madison 
   (http://www.cs.wisc.edu/~muthian/baio-paper.pdf)
5. The Direct Access File System Protocol & API Specifications 
   - DAFS Collaborative (http://www.dafscollaborative.org)
6. 2.5 block i/o design notes - (http://lse.sourceforge.net/io/bionotes.txt)