Andres Kroonmaa wrote:
> I don't know the details. I imagine that signal queueing to the proc
> is very lightweight. Its the setting up and dequeueing that burns the
> benefits imho. Mostly because its done 1 FD at a time. I can't imagine
> any benefits compared to devpoll, unless you need to handle IO from
> signal handler.
Dequeueing is also relatively lightweight: there is no schedule point if you do it non-blocking or if there are events pending, and the scheme can very easily be extended to return all pending events at once.
However, independent measurements have shown that the difference in CPU burn between returning one event at a time and all at once is quite small for a normal but fairly well optimized application. Exactly why, I can't say, as I don't know the details of the measurements that were made.
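For concreteness, dequeueing RT-signal events looks roughly like this (a minimal sketch, not Squid code; it assumes the descriptors were set up with O_ASYNC plus fcntl(F_SETSIG) and that the signal is blocked so events queue up):

/* Minimal sketch: drain all queued RT-signal I/O events without
 * blocking.  Assumes the sockets were set up with O_ASYNC and
 * fcntl(fd, F_SETSIG, SIGRTMIN), and that SIGRTMIN is blocked
 * (e.g. with sigprocmask) so the kernel queues one siginfo per
 * ready descriptor instead of delivering it asynchronously. */
#define _GNU_SOURCE             /* for si_fd on some libcs */
#include <signal.h>
#include <stdio.h>
#include <time.h>

static void drain_events(void)
{
    sigset_t set;
    siginfo_t info;
    struct timespec zero = { 0, 0 };    /* poll the queue, never sleep */

    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);

    /* sigtimedwait() dequeues one queued signal per call; with a zero
     * timeout it returns immediately once the queue is empty, so there
     * is no schedule point while events are pending. */
    while (sigtimedwait(&set, &info, &zero) > 0) {
        printf("I/O possible on fd %d (band %ld)\n",
               info.si_fd, (long) info.si_band);
        /* handle the descriptor here */
    }
}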
To get some perspective on the time involved here: on Linux a syscall measures at slightly more than one order of magnitude extra overhead compared to a plain function call, or about 0.7 usec on my laptop. This is exactly (within +-5%) the time of a 1 KB memcpy.
(Pentium III mobile, 450 MHz, 256 KB cache)
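For anyone who wants to reproduce that kind of number, a crude harness along these lines is enough (a hypothetical sketch; getppid() stands in for a cheap syscall, and the results of course vary with machine, kernel and compiler):

/* Crude micro-benchmark sketch: compare the cost of a trivial syscall
 * (getppid) against a 1 KB memcpy.  Numbers are machine dependent. */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERATIONS 1000000

static double now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    static char src[1024], dst[1024];
    double t0, t1;
    int i;

    t0 = now_usec();
    for (i = 0; i < ITERATIONS; i++)
        getppid();                      /* about the cheapest syscall */
    t1 = now_usec();
    printf("syscall:   %.3f usec each\n", (t1 - t0) / ITERATIONS);

    t0 = now_usec();
    for (i = 0; i < ITERATIONS; i++) {
        src[0] = (char) i;              /* keep the copy from being elided */
        memcpy(dst, src, sizeof(dst));
    }
    t1 = now_usec();
    printf("1K memcpy: %.3f usec each (%d)\n",
           (t1 - t0) / ITERATIONS, dst[0]);
    return 0;
}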
It does seem to be true that the faster the CPU, the higher the proportional syscall overhead, but I do not have any conclusive measurements on this (too many variables differ between my two computers; different kernel versions and compilers, for one thing).
> Not sure what you mean by light context switches. Perhaps you make
> distinction between CPU protmode change, kernel doing queueing and
> kernel going through the scheduling stuff.
I do.
> This all is sensed only when syscall rate is very high, when code
> leaves process very often doing only very little work at a time in
> either kernel or userspace. We should stay longer in userspace,
> preparing several sockets for IO, and then stay longer in kernel,
> handling IO.
This reasoning I do not buy. The "overhead" of the syscalls should be linear in the amount of processing you are doing: there are no fewer syscalls per unit of work done at low rates than at high rates. But I do buy that the more optimized the code is, the more noticeable the per-syscall overhead becomes, unless the syscall usage is optimized to the same level.
> Imagine we had to loop through all FD's in poll_array and poll()
> each FD individually. This is where we are today with IO queueing.
Which would cost us about 7 ms of raw syscall overhead for 10K filedescriptors on my laptop, discounting the work poll() itself needs to do. In non-blocking mode that work should be roughly linear in the number of filedescriptors, except for the non-linearity in filedescriptor numbers which causes a poll of a single high filedescriptor to burn considerably more CPU than a low one, making poll() a very poor example due to its bad design.
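Spelled out, the two schemes compared here look roughly like this (a sketch only; the point is purely the number of kernel crossings):

/* Sketch of the two schemes.  poll_one_at_a_time() pays the raw
 * syscall overhead once per descriptor; poll_batched() pays it once
 * per scan.  At ~0.7 usec per syscall, 10000 individual calls add
 * roughly 7 ms before poll() does any real work. */
#include <poll.h>

static int poll_one_at_a_time(struct pollfd *fds, int nfds)
{
    int i, nready = 0;

    for (i = 0; i < nfds; i++) {
        struct pollfd one = fds[i];
        if (poll(&one, 1, 0) > 0) {     /* one syscall per fd */
            fds[i].revents = one.revents;
            nready++;
        }
    }
    return nready;
}

static int poll_batched(struct pollfd *fds, int nfds)
{
    return poll(fds, (nfds_t) nfds, 0); /* one syscall for all fds */
}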
> Yes, LIO. Problem is that most current LIO implementations are done
> in library by use of aio calls per FD. And as aio is typically done
> with thread per FD, this is unacceptable. Its important that kernel
> level syscall was there to take a list of commands, and equally return
> a list of results, not necessarily same set as requests were.
Varies. However, LIO shares the same poor notification mechanisms as AIO, and on top of that it mostly serializes processing, making it a major latency pig even if you get the CPU overhead down.
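For reference, submitting a list through the POSIX listio interface looks roughly like this (a sketch, error handling omitted; whether lio_listio() really batches in the kernel or degrades to per-FD aio threads in the library is exactly the problem described above):

/* Sketch of POSIX listio usage: submit a list of reads in one call.
 * The control blocks must stay valid until completion, so they are
 * static here.  Completion is reported through the same (poor) AIO
 * notification mechanisms criticised above. */
#include <aio.h>
#include <string.h>

#define NREQ 4

static struct aiocb cbs[NREQ];
static struct aiocb *list[NREQ];

static int submit_reads(int fds[NREQ], char *bufs[NREQ], size_t len)
{
    int i;

    memset(cbs, 0, sizeof(cbs));
    for (i = 0; i < NREQ; i++) {
        cbs[i].aio_fildes = fds[i];
        cbs[i].aio_buf = bufs[i];
        cbs[i].aio_nbytes = len;
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }

    /* LIO_NOWAIT returns as soon as the list is queued. */
    return lio_listio(LIO_NOWAIT, list, NREQ, NULL);
}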
> both, combined. Most of the time you poll just as a means to know
> when IO doesn't block. If you can enqueue the IO to the kernel and
> read results when it either completes or times out, you don't really
> need poll.
True, and that is what the eventio framework mimics.
> ok. I should have realised that..
> Btw, why is close callback registration separated from the call?
> To follow existing code style more closely?
To make things easier. The callback means "call me when this handle has been closed, one way or another", and is registered when the filehandle is created. It should be seen as a "filehandle gone" notification rather than a "close notification". Generally there is no need for a close notification unless you want to be absolutely sure the data has been successfully sent, which is rarely the case.
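In rough terms the shape is something like this (the names and signatures below are only illustrative, not the actual eventio API):

/* Illustrative sketch only; names and signatures are hypothetical and
 * not the actual eventio API.  The point is that the "filehandle gone"
 * callback travels with the handle from creation. */
typedef struct io_handle io_handle;

typedef void (*io_closed_cb)(io_handle *h, void *data);

/* The callback is registered when the handle is created ... */
io_handle *io_open(const char *path, int flags,
                   io_closed_cb closed, void *closed_data);

/* ... so the close call itself needs no callback argument; "closed"
 * fires whenever the handle goes away, for whatever reason. */
void io_close(io_handle *h);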
> Hmm, I assumed that you cannot call read/write unless filehandle is
> provided by initial callback. Do you mean we can?
Currently not, but the framework can easily be extended to allow this if it should become required; it is planned for. Actually, I thought I had included this, but looking back at eventio I now notice I explicitly excluded this use for now, most likely for cleanliness and ease of implementation.
> dunno, maybe this packing is a job for actual io-model behind the api.
> It just seems to me that it would be nice if we can pass array of
> such IOCB's to the api.
Kernel or userspace API?
For the userspace API it does not make much sense, as there is generally only one operation known at a time.
For the kernel API it makes sense, to cut down on the per-syscall overhead when that overhead becomes a major problem: build a queue of I/O requests and send them to the kernel in batches, which the kernel then splits up again for parallel processing, returning batches of completed events.
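As a sketch of the shape such an interface could take (the structures and entry points below are hypothetical, not an existing kernel API):

/* Hypothetical sketch of a batching kernel I/O interface: userspace
 * queues request blocks and submits them in one call; the kernel
 * processes them in parallel and hands back completions in batches.
 * Nothing below is an existing syscall. */
#include <stddef.h>

enum io_op { IO_READ, IO_WRITE };

struct io_request {
    int         fd;
    enum io_op  op;
    void       *buf;
    size_t      len;
    void       *user_data;      /* returned untouched with the result */
};

struct io_result {
    void       *user_data;
    long        result;         /* bytes transferred or -errno */
};

/* Submit 'nreq' requests in one crossing into the kernel. */
int io_submit_batch(struct io_request *reqs, int nreq);

/* Reap up to 'max' completed operations in one crossing back. */
int io_reap_batch(struct io_result *results, int max);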
> pthreads. all DNS lookups could be done by one worker-thread, which
> might setup and handle multiple dnsservers, whatever. ACL-checks
> that eat CPU could also be done on separate cpu-threads.
> All this can only make sense if message-passing is very lightweight
> and does not require thread/context switching per action. Its about
> pipelining. Same or separate worker-thread could handle redirectors,
> messagepassing to/from them.
Inter-thread communication unfortunately isn't very lightweight if you need synchronisation between the threads. Because of this I prefer to distribute all processing across all CPUs, where all processing of a single request is contained on the same CPU/thread. Or, put in other words, one main thread per processing unit (i.e. CPU, though the definition is becoming a bit blurrier these days).
There are a few areas that need to be synchronised, mainly the different caches we have:
* disk
* memory
* DNS
Of these, the disk cache and index are the biggest challenge, but it should be doable, I think.
> Suppose that we read from /dev/poll 10 results that notify about 10
> accepted sockets. We can now check all ACL's in main thread, or we
> can queue all 10 to separate thread, freeing main thread. If we also
> had 20 sockets ready for write, we could either handle them in main
> thread, or enqueue-append (gather) to a special worker-thread.
Or you could have N main threads, each caring for its own filedescriptors, and a supervisor directing the main threads to even out the load between them. That requires far less synchronisation and inter-thread communication, and also allows the work done by the main threads to scale.
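Roughly along these lines (a hypothetical pthreads sketch, not Squid code; each main thread polls only the descriptors it owns):

/* Sketch of "one main thread per CPU": each thread owns its own
 * descriptor set and runs its own event loop, so the only inter-thread
 * traffic is the supervisor occasionally handing a new connection to
 * the least loaded thread. */
#include <poll.h>
#include <pthread.h>

#define NTHREADS 2
#define MAXFDS   1024

struct main_thread {
    pthread_t     tid;
    struct pollfd fds[MAXFDS];  /* owned by this thread only */
    int           nfds;
    int           load;         /* read by the supervisor */
};

static struct main_thread threads[NTHREADS];

static void *main_loop(void *arg)
{
    struct main_thread *self = arg;

    for (;;) {
        if (poll(self->fds, (nfds_t) self->nfds, 1000) > 0) {
            /* handle I/O on our own descriptors; no locking needed
             * because nothing else touches them */
        }
        self->load = self->nfds;        /* crude load indicator */
    }
    return NULL;
}

int main(void)
{
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i].tid, NULL, main_loop, &threads[i]);

    /* A supervisor would accept() here and hand each new fd to the
     * thread reporting the smallest load. */
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i].tid, NULL);
    return 0;
}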
> Yes. Just seems that it would be easier to start using threads
> for limited tasks, at least meanwhile.
I'm not so sure about this. Squid very much assumes that only a single thread at a time is using the memory, with I/O operations being synchronous but non-blocking. This is one of the reasons why eventio will take time to implement, as it defines all I/O as asynchronous; whether it is non-blocking or not is up to the I/O backend.
What I can tell is that your idea of worker threads fits very well into the eventio API, so once eventio is in place things like this can very easily be tried.
To set things straight: when I talk about eventio I mostly mean the API, not the current implementation. The implementation that exists is an eventio implementation suitable for most "I/O possible" notification mechanisms (select, poll, devpoll, Linux RT signals, ...), with a backend for poll implemented, but the scope is not limited to such mechanisms. The intention is that there will later be other eventio implementations, for example for Windows NT completion ports, AIO or other "different" mechanisms.
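One way to picture the API/implementation split is a table of backend entry points behind the common calls (illustrative only; the names and layout are not the actual eventio structures):

/* Illustrative only: the calling API stays the same while different
 * backends (poll, /dev/poll, RT signals, NT completion ports, AIO, ...)
 * plug in behind it.  Names and layout here are hypothetical. */
#include <stddef.h>
#include <sys/types.h>

typedef struct ioengine ioengine;

struct ioengine {
    const char *name;
    int  (*init)(void);
    void (*read)(int fd, char *buf, size_t len,
                 void (*done)(int fd, char *buf, ssize_t n, void *data),
                 void *data);
    void (*write)(int fd, const char *buf, size_t len,
                  void (*done)(int fd, ssize_t n, void *data),
                  void *data);
    void (*run)(int msec);      /* dispatch completed events */
};

/* A poll backend and a completion-port backend would both fill in this
 * table; callers only ever see the table. */
extern ioengine eventio_poll_engine;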
-- Henrik